As we build Grype, our open source container vulnerability scanner, we are constantly thinking about the quality of our results and how to improve them. We have developed a number of methods to measure our results at development time, so that our scanner doesn’t regress as we change our scanning logic and how we parse sources of external data. We’ve incorporated some of these methods into a new tool: Yardstick, which inspects and compares the results of vulnerability scans between different scanner versions.
The most important thing for any vulnerability scanning software is the quality of its results. How do you measure the quality of vulnerability scan data? How do you know if your scanner quality is improving or declining? What impact do code changes have on the quality of your results? What about new or updated sources of external data? Can we incorporate them and show that scanner results actually improve?
Yardstick aims to answer these questions by characterizing matching performance quantitatively.
Vulnerability Scan Quality
A basic approach to measuring the quality of a vulnerability scan over time might be to simply compare the results from one version to another, for the same container image. But this only tells us whether the results changed, not whether they got better or worse, unless we manually review every result. There are a number of factors that could change the results of a scan:
- Code changes in the scanner itself
- New vulnerabilities might be added to upstream data sources
- Existing vulnerabilities might be changed or removed from upstream data sources
- The artifacts being scanned might have changed in some way, or our understanding of the contents of those artifacts might change because of changes to Syft (our SBOM generator, which Grype uses to analyze the contents of the artifacts being scanned)
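The naive version-to-version comparison described above can be sketched as a set difference over match identifiers. This is purely illustrative (Yardstick's actual data model is richer than simple `"CVE:package"` strings), but it shows why diffing alone is not enough: you learn *that* results changed, not whether each change is a fix or a regression.

```python
# Naive scan-to-scan comparison for the same image across two
# scanner versions. Illustrative sketch only, not Yardstick's API.

def diff_scans(old_matches: set[str], new_matches: set[str]) -> dict:
    """Compare two result sets keyed by hypothetical 'CVE:package' strings."""
    return {
        "added": sorted(new_matches - old_matches),
        "removed": sorted(old_matches - new_matches),
        "unchanged": sorted(old_matches & new_matches),
    }

# Example: one match disappeared and one appeared between versions.
old = {"CVE-2021-0001:openssl", "CVE-2021-0002:zlib"}
new = {"CVE-2021-0002:zlib", "CVE-2022-0003:curl"}
print(diff_scans(old, new))
```

Without ground truth, there is no way to tell from this diff whether the added match is a newly caught vulnerability (good) or a new false positive (bad), which motivates the labeled data described next.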
To move beyond simple result change detection, we’ve hand-curated an ever-growing set of labeled data from real container images. These “labels” are used as ground truth to compare against vulnerability results. We use the F1 score (a combination of True Positive, False Positive, and False Negative counts) and a few simple rules to make up Grype’s quality gate. Get more technical information on our scoring.
Positives and Negatives
For the most accurate results, we want to maximize “True Positives” while minimizing “False Negatives” and “False Positives”:
- True Positive: A vulnerability that the scanner correctly identifies. (good!)
- False Positive: A vulnerability that was reported but should not have been. (bad, but not as bad as a false negative.)
- False Negative: A vulnerability that the scanner should have reported, but didn’t. (bad!)
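The F1 score mentioned above is the harmonic mean of precision and recall, computed from these three counts. A minimal sketch of the standard formula (the counts here are made-up example numbers, not real Grype results):

```python
# F1 score from True Positive, False Positive, and False Negative counts.
# F1 = 2 * (precision * recall) / (precision + recall)

def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0  # fraction of reports that were correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # fraction of real vulns that were found
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical scan: 90 correct matches, 10 spurious, 5 missed.
print(round(f1_score(tp=90, fp=10, fn=5), 3))  # 0.923
```

Because F1 combines precision and recall, it penalizes a scanner both for flooding users with false positives and for silently missing vulnerabilities, which is exactly the trade-off described above.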
We have integrated Yardstick into our test and build infrastructure to compare the scan results from different versions of Grype, so that we can identify regressions in our vulnerability matching techniques. We also integrate a lot of external data from various sources, and our goal is to open the process by which the Grype vulnerability database is populated so that our community can add additional sources of data. All of this means that we need robust and comprehensive tools to ensure that our quality stays high.
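One way such a quality gate might be expressed, as a hedged illustration rather than Grype’s exact gate rules: compare a candidate version’s F1 score against a reference version and refuse any change that regresses the score or introduces newly missed vulnerabilities.

```python
# Illustrative quality-gate check for CI. The thresholds and rules here
# are hypothetical, not the exact rules used by Grype's real gate.

def quality_gate_passes(candidate_f1: float, reference_f1: float,
                        new_false_negatives: int,
                        tolerance: float = 0.0) -> bool:
    """Pass only if F1 has not regressed (beyond a tolerance)
    and the change introduces no newly missed vulnerabilities."""
    return (candidate_f1 >= reference_f1 - tolerance
            and new_false_negatives == 0)

# A change that improves F1 with no new misses passes:
print(quality_gate_passes(candidate_f1=0.95, reference_f1=0.93,
                          new_false_negatives=0))  # True
```

Gating on new false negatives separately, rather than on F1 alone, reflects the asymmetry noted earlier: a missed vulnerability is worse than a spurious report.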
Right now, Yardstick only has a driver for Grype, but it is extensible, so it’s possible to add support for other vulnerability matchers. We would be happy to see pull requests from the community to improve Yardstick’s capabilities, and we’d love to hear whether Yardstick is useful alongside the vulnerability scanning tools you use.
What does it look like?
Here are some screenshots and an animation to show you what Yardstick looks like in operation:
Frequently Asked Questions:
Q: Why didn’t you call it “Meterstick”?
A: In 1793, a ship sailing from Paris to America carrying objects to be used as references for a standard kilogram and meter was thrown off course by a storm, washed up in the Caribbean, and raided by British pirates who stole the objects. By the time a second ship with new reference pieces set sail, the United States had already decided to use the Imperial system of measurement. So, we have Yardstick. (source)
Q: If I just want to scan my containers for vulnerabilities, do I need to use Yardstick?
A: No, Yardstick is intended more as a tool for developers of vulnerability scanners. If you just want to scan your own images, you should just use Grype. If you want to participate in the development of Grype, you might want to explore Yardstick.
Q: Can Yardstick compare the quality of SBOMs (Software Bill of Materials)?
A: Not yet, but we have designed the tool with this goal in mind. If you’re interested in working on it, chat with us! PRs appreciated!
Q: Can Yardstick process results from other vulnerability scanners besides Grype?
A: Not yet, but PRs accepted!