You, too, can help make Syft better! As you’re probably aware, Syft is a software composition analysis tool that scans a number of sources, such as container images and the local filesystem, to find software packages. Syft detects packages from things like source code and package manager metadata, but also from arbitrary files it encounters, such as executable binaries. Today we’re going to talk about how some of Syft’s binary detection works and how easy it is to improve.

Just recently, we were made aware of this vulnerability and it seemed like something we’d want to surface in Syft’s companion tool, Grype… but Fluent Bit wasn’t something that Syft was already detecting. Let’s look at how we added support for it!

Syft binary matching

Before we get into the details, it’s important to understand how Syft’s binary detection works today: Syft scans a filesystem, and a binary cataloger looks for files matching a particular name pattern and uses a regular expression to find a version string in the binary. Although this isn’t the only thing Syft does, this has proven to be a simple pattern that works fairly well for finding information about arbitrary binaries, such as the Fluent Bit binary we’re interested in.

In order to add support for additional binary types in Syft, the basic process is this:

  1. Find a binary
  2. Add a matching rule
  3. Add tests

Getting started

Starting with a local fork of the Syft repository, let’s work in the binary cataloger’s test-fixtures directory:

$ cd syft/pkg/cataloger/binary/test-fixtures

Here you’ll find a Makefile and a config.yaml. These are the main things of importance — and you can run make help to see extra commands available.

The first thing we need to do is find somewhere to get one of the binaries from — we need something to test that we’re actually detecting the right thing! The best way to do this is using a publicly available container image. Once we know an image to use, the Makefile has some utilities to make the next steps fairly straightforward.

After a short search online, we found that there is indeed a public docker image with exactly what we were looking for: https://hub.docker.com/r/fluent/fluent-bit. Although we could pick just about any version, we somewhat arbitrarily chose this one. We can use more than one, but for now we’re just going to use this as a starting point.

Adding a reference to the binary

After finding an image, we need to identify the particular binary file to look at. Luckily, the Fluent Bit documentation gave a pretty good pointer – this was part of the docker command the documentation said to run: /fluent-bit/bin/fluent-bit! It may take a little more sleuthing to figure out what file(s) within the image we need; often you can run an image with a shell to figure this out... but chances are, if you can run the command with a --version flag and get the version printed out, we can figure out how to find it in the binary.

For now, let’s continue on with this binary. We need to add an entry that describes where to find the file in question in the syft/pkg/cataloger/binary/test-fixtures/config.yaml:

  - version: 3.0.2
    images:
      - ref: fluent/fluent-bit:3.0.2-amd64@sha256:7e6fe8efd51dda0739e355f58bf5e3b1623cbf2d4a23c06c7a365d9553e2d242
        platform: linux/amd64
    paths:
      - /fluent-bit/bin/fluent-bit

There are lots of examples in that file already, and hopefully the fields are straightforward, but note the version: this is what we’ve ascertained should be reported, and later steps use it (for example, as the string to search for in the binary). Also, we’ve included the full sha256 hash, so even if the tags change, we’ll get the expected image. Then just run make:

$ make

go run ./manager download  --skip-if-covered-by-snippet
...
fluent-bit@3.0.2
  ✔  pull image fluent/fluent-bit:3.0.2-amd64@sha256:7e6fe8efd51dda0739e355f58bf5e3b1623cbf2d4a23c06c7a365d9553e2d242 (linux/amd64)
  ✔  extract /fluent-bit/bin/fluent-bit

This pulled the image locally and extracted the file we told it to…but so far we haven’t really done much that you couldn’t do with standard container tools.

Finding the version

Now we need to figure out what type of expression should reliably find the version. There are a number of binary inspection tools that can make this easier, and perhaps you have some favorites; by all means use those! But we’re going to stick with the tools at hand. Let’s take a look at what the binary has matching the version we indicated earlier by running make add-snippet:

$ make add-snippet

go run ./manager add-snippet
running: ./capture-snippet.sh classifiers/bin/fluent-bit/3.0.2/linux-amd64/fluent-bit 3.0.2 --search-for 3\.0\.2 --group fluent-bit --length 100 --prefix-length 20
Using binary file:      classifiers/bin/fluent-bit/3.0.2/linux-amd64/fluent-bit
Searching for pattern:  3\.0\.2
Capture length:         120 bytes
Capture prefix length:  20 bytes
Multiple string matches found in the binary:

1) 3.0.2
2) 3.0.2
3) CONNECT {"verbose":false,"pedantic":false,"ssl_required":false,"name":"fluent-bit","lang":"c","version":"3.0.2"}

Please select a match: 

Follow the prompts to inspect the different sections of the binary. Each of these looks like it could be usable, but we want one that is simple to match across different versions. The third match is JSON, whose fields could get reordered. The second is a string containing only 3.0.2. The first is similar, a string containing only the version, <NULL>3.0.2<NULL>, but it also has %sFluent Bit nearby. This looks promising! Let’s capture this snippet by following the prompts:

Please select a match: 1

006804fc: 2525 2e25 6973 0a00 252a 733e 2074 7970  %%.%is..%*s> typ
0068050c: 653a 2000 332e 302e 3200 2573 466c 7565  e: .3.0.2.%sFlue
0068051c: 6e74 2042 6974 2076 2573 2573 0a00 2a20  nt Bit v%s%s..* 
0068052c: 6874 7470 733a 2f2f 666c 7565 6e74 6269  https://fluentbi
0068053c: 742e 696f 0a0a 0069 6e76 616c 6964 2063  t.io...invalid c
0068054c: 7573 746f 6d20 706c 7567 696e 2027 2573  ustom plugin '%s
0068055c: 2700 696e 7661 6c69 6420 696e 7075 7420  '.invalid input 
0068056c: 706c 7567 696e 2027                      plugin '

Does this snippet capture what you need? (Y/n/q) y
wrote snippet to "classifiers/snippets/fluent-bit/3.0.2/linux-amd64/fluent-bit"

How can we tell where the NULL terminators are? Looking at the readable text on the right, we see .3.0.2., and the same bytes are displayed in the same position on the left: 00 332e 302e 3200. The text column shows a dot both for a real dot (2e) and for any unprintable byte, so it’s the 00 on either side of the version that tells us those are NULL characters; after doing quite a lot of these expressions, you learn to spot them. This is the hardest part, believe me! But if you’re still following along, let’s wrap this up by putting everything we’ve found together in a rule.

Adding a rule to Syft

Edit the syft/pkg/cataloger/binary/classifiers.go and add an entry for this binary:

                {
                        Class:    "fluent-bit-binary",
                        FileGlob: "**/fluent-bit",
                        EvidenceMatcher: FileContentsVersionMatcher(
                                // [NUL]3.0.2[NUL]%sFluent Bit
                                `\x00(?P<version>[0-9]+\.[0-9]+\.[0-9]+)\x00%sFluent Bit`,
                        ),
                        Package: "fluent-bit",
                        PURL:    mustPURL("pkg:github/fluent/fluent-bit@version"),
                        CPEs:    singleCPE("cpe:2.3:a:treasuredata:fluent_bit:*:*:*:*:*:*:*:*"),
                },

We’ve put the information we know about this binary in the entry: the FileGlob should find the file, as we’ve seen earlier, and the FileContentsVersionMatcher takes a regular expression to extract the version. I also looked up the format for the CPE and PURL this package should use and included them here.
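
The PURL in the entry ends in a literal version placeholder. As a sketch of what the end result should look like once a version has been captured, the hypothetical fillVersion helper below swaps that placeholder for a concrete value. How Syft resolves the placeholder internally is not covered in this post, so treat the mechanism here as an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// fillVersion swaps a trailing "@version" placeholder for the captured
// version. This is an illustrative assumption, not Syft's own code.
func fillVersion(purlTemplate, version string) string {
	const placeholder = "@version"
	if !strings.HasSuffix(purlTemplate, placeholder) {
		return purlTemplate
	}
	return strings.TrimSuffix(purlTemplate, placeholder) + "@" + version
}

func main() {
	fmt.Println(fillVersion("pkg:github/fluent/fluent-bit@version", "3.0.2"))
}
```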

Once we’ve added this, you can test it out right away by running your modified Syft code from the base directory of your git clone:

$ go run ./cmd/syft fluent/fluent-bit:3.0.2-amd64

 ✔ Pulled image                    
 ✔ Loaded image                                                                                                                                                          fluent/fluent-bit:3.0.2-amd64
 ✔ Parsed image                                                                                                                sha256:2007231667469ee1d653bdad65e55cc5f300985f10d7c4dffd6de0a5e76ff078
 ✔ Cataloged contents                                                                                                                 d3a6e4b5bc02c65caa673a2eb3508385ab27bb22252fa684061643dbedabf9c7
   ├── ✔ Packages                        [39 packages]  
   ├── ✔ File digests                    [1,771 files]  
   ├── ✔ File metadata                   [1,771 locations]  
   └── ✔ Executables                     [313 executables]  
NAME              VERSION                  TYPE     
base-files        11.1+deb11u9             deb       
ca-certificates   20210119                 deb       
fluent-bit        3.0.2                    binary    
libatomic1        10.2.1-6                 deb       
...

Great! It worked! If we try this out on some different versions, it looks like 3.0.1-amd64 works as well, but this definitely did not work for 2.2.1-arm64 or 2.1.10. Repeating the process for those images shows two differences: the arm64 versions have a couple of extra NULL characters, and the older versions don’t have the %s part. So we just need to make our expression a bit more flexible. Eventually, this expression seemed to do the trick for the images I tried: \x00(?P<version>[0-9]+\.[0-9]+\.[0-9]+)\x00[^\d]*Fluent.
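
Written with its backslash escapes as a Go raw string, the final expression can be tried against a few hand-made samples. The sample byte strings below are synthetic assumptions modelled on the variations just described (extra NULL bytes, no %s); they are not bytes taken from the real binaries, and extractVersion is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// fluentBit is the expression from the text, written as a Go raw string.
var fluentBit = regexp.MustCompile(`\x00(?P<version>[0-9]+\.[0-9]+\.[0-9]+)\x00[^\d]*Fluent`)

// extractVersion returns the captured version, or "" when nothing matches.
func extractVersion(data []byte) string {
	m := fluentBit.FindSubmatch(data)
	if m == nil {
		return ""
	}
	return string(m[fluentBit.SubexpIndex("version")])
}

func main() {
	// Synthetic samples (assumptions), not bytes from real binaries.
	samples := []struct {
		label string
		data  string
	}{
		{"3.0.x layout", "e: \x003.0.2\x00%sFluent Bit v%s%s\n"},
		{"extra NUL bytes", "\x002.2.1\x00\x00\x00%sFluent Bit v%s%s\n"},
		{"no %s prefix", "\x002.1.10\x00Fluent Bit v%s\n"},
	}
	for _, s := range samples {
		fmt.Printf("%-16s -> %q\n", s.label, extractVersion([]byte(s.data)))
	}
}
```

The [^\d]* part is what absorbs the differences: it accepts any run of non-digit bytes between the closing NULL and the project name, but refuses to skip over another number on the way.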

We could have made this simpler and just looked for <NULL><version><NULL>, but there are quite a few strings in the various binaries that match this pattern, and we want to try our best to find the one that looks like the specific version string we want. When we looked at the bytes across a number of versions, both the version and the name of the project showed up together like this. Having done a number of these classifiers in the past, I can say this is a fairly common thing to look for.
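
To make the difference concrete, this Go sketch runs both a bare NULL-delimited version pattern and the name-anchored one over the same data. The data is synthetic (an assumption standing in for a binary that bundles other version strings), and the versions helper is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// loose matches any NUL-delimited x.y.z string.
	loose = regexp.MustCompile(`\x00(?P<version>[0-9]+\.[0-9]+\.[0-9]+)\x00`)
	// anchored additionally requires the project name to follow.
	anchored = regexp.MustCompile(`\x00(?P<version>[0-9]+\.[0-9]+\.[0-9]+)\x00[^\d]*Fluent`)
)

// versions lists every version the expression captures in data.
func versions(re *regexp.Regexp, data []byte) []string {
	var out []string
	idx := re.SubexpIndex("version")
	for _, m := range re.FindAllSubmatch(data, -1) {
		out = append(out, string(m[idx]))
	}
	return out
}

func main() {
	// Synthetic data (an assumption): a bundled library version, the
	// project's own version, and one more unrelated version string.
	data := []byte("\x001.2.11\x00zlib\x00\x003.0.2\x00%sFluent Bit v%s\x00\x002.1.0\x00")
	fmt.Println("loose:   ", versions(loose, data))
	fmt.Println("anchored:", versions(anchored, data))
}
```

On this sample the loose pattern returns three candidates, with an unrelated one first, while the anchored pattern returns only the version that sits next to the project name.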

Testing

Since we already captured a test snippet, the last thing to do is add a test. If you recall, when we used the add-snippet command, it told us: 

wrote snippet to "classifiers/snippets/fluent-bit/3.0.2/linux-amd64/fluent-bit"

This is what we’re going to want to reference. So let’s add a test case to the very large Test_Cataloger_PositiveCases test in syft/pkg/cataloger/binary/classifier_cataloger_test.go:

                {
                        logicalFixture: "fluent-bit/3.0.2/linux-amd64",
                        expected: pkg.Package{
                                Name:      "fluent-bit",
                                Version:   "3.0.2",
                                Type:      "binary",
                                PURL:      "pkg:github/fluent/fluent-bit@3.0.2",
                                Locations: locations("fluent-bit"),
                                Metadata:  metadata("fluent-bit-binary"),
                        },
                },

Wrapping up

Now that we have 1) identified a binary, 2) added a rule to Syft, and 3) added a test case with a small snippet, we’re done coding! Submit a pull request and sit back, knowing you’ve made the world a better place!