You might be in for a bit of a surprise when running the latest version of Grype: potential vulnerabilities you may have become accustomed to seeing are no longer there! Keep calm. This is a good thing, because we made your life easier! Today, we released an improvement to Grype that is the culmination of months of work and testing, and it will dramatically improve the results you see. In fact, some ecosystems can see up to an 80% reduction in false positives! If you’re reading this, you may have used Grype in the past and seen things you weren’t expecting, or you may just be curious to see how we’ve achieved an improvement like this. Let’s dig in.
The surprising source of false positives
The process of scanning for vulnerabilities involves several different factors, but, without a doubt, one of the most important is for Grype to have accurate data: both when identifying software artifacts and also when applying vulnerability data against those artifacts. To address the latter, Anchore provides a database (GrypeDB), which aggregates multiple data sources that Grype uses to assess whether components are vulnerable or not. This data includes the GitHub Advisory Database and the National Vulnerability Database (NVD), along with several other more specific data sources like those provided by Debian, Red Hat, Alpine, and more.
Once Grype has a set of artifacts identified, vulnerability matching can take place. This matching works well, but it inevitably results in some vulnerabilities being incorrectly excluded (false negatives) or incorrectly included (false positives). False results are not great either way, and false positives certainly account for a number of the issues reported against Grype over the years.
One of the biggest problems we’ve encountered is that the data sources used to build the Grype database use different identifiers. For example, the GitHub Advisory Database uses data that includes a package’s ecosystem, name, and version, while NVD uses the Common Platform Enumeration (CPE). These identifiers come with trade-offs, the most important of which is how accurately a package can be matched against a vulnerability record. In particular, the GitHub Advisory Database data is partitioned by ecosystems such as npm or Python, whereas the NVD data does not generally have this distinction. The result is a situation where a Python package named “foo” might match vulnerabilities reported against a different “foo” in another ecosystem. A closer look at reports from the community makes it apparent that the most common cause of reported false positives is CPE matching.
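The name collision described above can be illustrated with a small sketch. This is a deliberately simplified model and not Grype's matching code: the Package and Advisory types, both matching functions, and the EXAMPLE-* identifiers are hypothetical, and version ranges are ignored entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Package:
    ecosystem: str
    name: str


@dataclass(frozen=True)
class Advisory:
    vuln_id: str    # made-up identifier, for illustration only
    name: str
    ecosystem: str  # the ecosystem the advisory was actually written for


def match_by_name(pkg: Package, advisories: List[Advisory]) -> List[str]:
    """Name-only matching, standing in for data with no ecosystem distinction."""
    return [a.vuln_id for a in advisories if a.name == pkg.name]


def match_by_ecosystem(pkg: Package, advisories: List[Advisory]) -> List[str]:
    """Ecosystem-scoped matching, standing in for advisory data partitioned by ecosystem."""
    return [
        a.vuln_id
        for a in advisories
        if a.name == pkg.name and a.ecosystem == pkg.ecosystem
    ]


ADVISORIES = [
    Advisory("EXAMPLE-0001", "foo", "npm"),
    Advisory("EXAMPLE-0002", "foo", "python"),
]
python_foo = Package(ecosystem="python", name="foo")

# Name-only matching also reports the npm advisory: a false positive.
name_only = match_by_name(python_foo, ADVISORIES)
# Ecosystem-scoped matching reports only the Python advisory.
scoped = match_by_ecosystem(python_foo, ADVISORIES)
```

In this toy model the only difference between the two functions is the ecosystem comparison, which is exactly the information name-only data lacks.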
Focusing on the negative
After experimenting with a number of options for improving vulnerability matching, ultimately one of the simplest solutions proved most effective: stop matching with CPEs.
The first question you might ask is: won’t this result in a lot of false negatives? And, secondly, if we’re not matching against CPEs, what are we matching against? Grype has already been using GitHub Advisory Database data for vulnerability matching, so we simply leaned into this. Thankfully, we already have a way to test that this change isn’t resulting in a significant change in false negatives: the Grype quality gate.
One of the things we’ve put in place for Grype is a quality gate, which uses manually labeled vulnerability information to validate that a change in Grype hasn’t significantly affected the vulnerability match results. Every pull request and push to main runs the quality gate, which compares the previously released version of Grype against the newly introduced changes to ensure the matching hasn’t become worse. In our set of test data, we have been able to reduce false positive matches by 2,000+, while only seeing 11 false negatives.
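To make the idea of a label-based gate concrete, here is a minimal sketch of how such a comparison could work. It is an assumption-laden illustration and not the actual Grype quality gate: the score and gate_passes functions, the pass/fail rule, and the sample data are all hypothetical.

```python
from typing import Dict, Set, Tuple

Match = Tuple[str, str]  # (package name, vulnerability id)


def score(results: Set[Match], labels: Dict[Match, bool]) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) for one set of scan results.

    labels maps a match to True when a person confirmed it is a real
    vulnerability, and to False when it was judged a false positive.
    """
    false_positives = sum(1 for m in results if labels.get(m) is False)
    false_negatives = sum(
        1 for m, is_real in labels.items() if is_real and m not in results
    )
    return false_positives, false_negatives


def gate_passes(
    baseline: Set[Match],
    candidate: Set[Match],
    labels: Dict[Match, bool],
    max_new_false_negatives: int = 0,
) -> bool:
    """Hypothetical rule: no extra false positives, bounded extra false negatives."""
    base_fp, base_fn = score(baseline, labels)
    cand_fp, cand_fn = score(candidate, labels)
    return cand_fp <= base_fp and (cand_fn - base_fn) <= max_new_false_negatives


# Made-up labeled data: one known false positive, two confirmed vulnerabilities.
LABELS = {
    ("foo", "EXAMPLE-0001"): False,
    ("foo", "EXAMPLE-0002"): True,
    ("bar", "EXAMPLE-0003"): True,
}
baseline_results = set(LABELS)                  # reports everything
candidate_results = {("foo", "EXAMPLE-0002")}   # drops the FP, but misses one real match
```

With this data the candidate removes the false positive at the cost of one false negative, so whether it passes depends on how many new false negatives the gate tolerates.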
Instead of focusing on how we reduce the false positives, we can now focus on a much smaller set of false negatives to see why they were missed. In our sample data set, this is due to 11 Java JARs that don’t have Maven group, artifact, or version information, which brings up the next area of improvement: Java artifact identification.
When we first explored the option to stop CPE matching, there were a lot more than 11 false negatives, but it was still a manageable number: fewer than 200 false negatives are a lot easier to handle than thousands of false positives. Focusing on these, we found that almost all of them were cases where Java JARs were not being identified properly, so we improved this, too. Today, it’s still not perfect, the main reason being that some JARs simply don’t have enough information to identify accurately without using some sort of external data (and we have some ideas for handling these cases, too). However, the majority of JARs do have enough information to be identified accurately. To make sure we weren’t regressing on this front, we downloaded gigabytes (25+ GB) of JARs, scanned them, and validated that we are finding the right information to extract the correct names and versions from these JARs. Much of this information ends up being included in the labeled vulnerability data we use to test every commit to Grype.
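As a rough illustration of what “enough information” can mean, JARs built with Maven commonly embed a pom.properties file under META-INF/maven/ that records the group, artifact, and version. The sketch below reads it with Python's standard zipfile module. It is not how Grype identifies JARs; the function names are hypothetical, and a real implementation would need more sources than this one file.

```python
import io
import zipfile
from typing import Dict, Optional

REQUIRED_KEYS = ("groupId", "artifactId", "version")


def read_maven_coordinates(jar_bytes: bytes) -> Optional[Dict[str, str]]:
    """Return the Maven coordinates found in an embedded pom.properties, or None."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        for entry in jar.namelist():
            if not (entry.startswith("META-INF/maven/") and entry.endswith("/pom.properties")):
                continue
            props: Dict[str, str] = {}
            text = jar.read(entry).decode("utf-8", errors="replace")
            for line in text.splitlines():
                line = line.strip()
                # Skip blanks, comments, and anything that is not key=value.
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, value = line.split("=", 1)
                props[key.strip()] = value.strip()
            if all(k in props for k in REQUIRED_KEYS):
                return {k: props[k] for k in REQUIRED_KEYS}
    return None  # not enough information in the archive itself


def make_demo_jar(files: Dict[str, str]) -> bytes:
    """Build a tiny in-memory JAR (a ZIP archive) for demonstration purposes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as jar:
        for path, content in files.items():
            jar.writestr(path, content)
    return buffer.getvalue()


identified = read_maven_coordinates(make_demo_jar({
    "META-INF/maven/org.example/demo/pom.properties":
        "#Generated by Maven\ngroupId=org.example\nartifactId=demo\nversion=1.2.3\n",
}))
unidentified = read_maven_coordinates(make_demo_jar({
    "META-INF/MANIFEST.MF": "Manifest-Version: 1.0\n",
}))
```

The second demo archive carries only a bare manifest, which models the kind of JAR that cannot be identified without external data.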
This change doesn’t mean all CPE matching is turned off by default, however. There are some types of artifacts for which Grype still needs to use CPE matching. Binaries, for example, are not present in the GitHub Advisory Database, and Alpine only provides entries for things that are fixed, so we need to continue using CPE matching to determine the vulnerabilities before querying for fix information there. But for ecosystems supported by the GitHub Advisory Database, we can confidently use this data and prevent the plethora of false positives associated with CPE matching.
GitHub + Grype for the win
The next question you might ask is: how is the GitHub Advisory Database better? There are many reasons that the GitHub data is great, but the things that are most important for Grype are data quality, updatability, and community involvement.
The GitHub Advisory Database is already a well-curated, machine-readable collection of vulnerability data. A surprising amount of the public vulnerability data that exists isn’t very machine readable or high quality. A large volume of data that needs updates isn’t a problem by itself, but it becomes one when providing such updates is nearly impossible. GitHub can review the existing public vulnerability data and update it with relevant details by correcting descriptions, package names, version information, and inaccurate severities, along with all the rest of the captured information. Being able to update the data quickly and easily is vital to maintaining a quality data set.
And it’s not just GitHub that can contribute to these data corrections: because the GitHub Advisory Database is stored in a public GitHub repository, anyone with a GitHub account can submit updates. If you notice an incorrect version or a spelling mistake in the description, the fix is one pull request away. Since GitHub repositories are historical archives, you get more than the ability to submit fixes: you can also look back in time at discussions, decisions, and questions. Much of the public vulnerability data today lacks transparency. Decisions might be made in private or by a single person, with no record of why. With the GitHub Advisory Database, we can see who did what, when, and why. Having a strong community makes open source work, and using the open source model with vulnerability data works great too.
We've got your back
We believe this change will be a significant improvement for all Grype users, but we don’t know everyone’s situation. Since Grype is a versatile tool, it’s easy to enable CPE matching if that’s something you still want to do. Just add the appropriate options to your .grype.yaml file or use the appropriate environment variables (see the Grype configuration for all the options).
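A configuration along the following lines is a sketch of what re-enabling CPE matching might look like. It assumes per-ecosystem `using-cpes` keys under `match`, and the list of ecosystems shown is illustrative, so confirm the exact key names and supported ecosystems against the Grype configuration reference for your version.

```yaml
# .grype.yaml (sketch; key names are assumed, verify against the Grype docs)
match:
  java:
    using-cpes: true
  python:
    using-cpes: true
  javascript:
    using-cpes: true
  ruby:
    using-cpes: true
  dotnet:
    using-cpes: true
```

If you only need CPE matching for one ecosystem, include just that entry and leave the rest at their defaults.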
We want to ensure Grype is the best vulnerability scanner that exists, which is a lofty goal. Today we made a big stride towards this goal. There will always be more work to do: better package detection, better vulnerability detection, and better vulnerability data. Grype and GrypeDB are open source projects, so if you would like to help, please join us.
But today, we celebrate saying goodbye to lots of false positives, so keep calm and scan on: your list of vulnerabilities just got shorter!