Overview

What is Benchify?

Benchify is an automated code review service based on lightweight formal methods. Once you’ve installed Benchify in a repository, it will automatically review your code on every pull request. Given a pull request, Benchify takes the following sequence of actions.

  1. Benchify scans the diff to identify which functions or classes have changed.
  2. Benchify analyzes each changed function or class to determine its purpose.
  3. Based on this analysis, Benchify writes a series of logical properties that define the intended functionality of the modified code (see the sketch after this list for what such a property might look like).
  4. Benchify spawns an ephemeral job runner to test these properties.
  5. The job runner downloads the logical properties, clones the repository under test, writes the properties into the repository, and then runs them against the changed code.
  6. The job runner reports the results of the test run to Benchify, which generates a human-readable report. This report is then posted as a series of comments on the pull request.
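
To make step 3 concrete, here is a rough sketch of what one of these logical properties might look like if it were written as a Hypothesis-style property-based test in Python. The function name minDistance and the properties themselves are taken from the example report later on this page; the import path is hypothetical, and Benchify's actual generated properties may use a different framework or format.

    # Sketch only: Benchify's generated properties may look different in practice.
    from hypothesis import given, strategies as st

    from LeetCode.script import minDistance  # hypothetical import path

    @given(word=st.text())
    def test_min_distance_to_self_is_zero(word):
        # Property: the edit distance between a word and itself is zero.
        assert minDistance(word, word) == 0

    @given(word1=st.text(), word2=st.text())
    def test_min_distance_is_symmetric(word1, word2):
        # Property: the edit distance is the same in both directions.
        assert minDistance(word1, word2) == minDistance(word2, word1)

Each property is then exercised against many generated inputs, and any input that violates the property is reported as a counterexample.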

As a general policy, Benchify fails silently. If you install Benchify on a repository, open a pull request, and do not hear back within a few minutes, there is probably something about your code that Benchify is not yet equipped to handle. For example, your repository might be missing a dependency in its requirements.txt, or might require a specific package to be apt-installed, which Benchify does not (yet) know how to do. We are working on a system that turns run failures into suggestions (for example, “add this missing requirement to your requirements.txt”), but it is not yet ready for production use. You can confirm that Benchify ran, whether or not it did so successfully, by checking the Checks tab on your pull request.

How should I interpret Benchify’s results?

Each Benchify report follows the same format.

  • Summary: A brief, AI-generated description of the test results, posted as a comment on the PR.
  • Then, for each function in the diff, Benchify adds a comment with the following:
    • Results Table: A tabular summary of timeouts, exceptions, and failing and passing properties.
    • Unit tests: A drop-down with a list of unit tests you can run to reproduce Benchify’s results. Since Benchify may generate hundreds or even thousands of test cases, the unit test file includes only the first few examples for each row in the table, for readability and brevity.

The table will look something like this:

Status | File | Inputs | Count | Description
❌ | LeetCode/script.py | word='gprnQCkUF' | 13 | The minDistance between a word and itself was not zero.
✅ | LeetCode/script.py | word1='p', word2='p' | 100 | The minDistance function always returns a non-negative integer.
✅ | LeetCode/script.py | word1='Y', word2='' | 100 | The minDistance function is symmetric.
✅ | LeetCode/script.py | word1='wk', word2='', word3='' | 100 | The minDistance function satisfies the triangle inequality.

The first column in the table indicates the status of the property.

  • ❌ indicates that Benchify found at least one counterexample to the property. (The total number of counterexamples is listed in the Count column.)
  • ✅ indicates that Benchify found no counterexamples to the property. (The total number of passing examples is listed in the Count column.)
  • 💥 indicates that Benchify attempted to test the property, but ran into an exception. (The total number of times this happened for the given property, regardless of exception type, is listed in the Count column.)
  • ⏰ indicates that Benchify attempted to test the property, but the test timed out. (The total number of timeouts for the given property is listed in the Count column.) Benchify uses a default timeout value of one second.
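
To reproduce a failing property yourself, you can run the corresponding unit test from the drop-down described above. As a rough sketch, a reproduction test for the first (failing) row in the example table might look something like the following; the import path is hypothetical, and the exact tests Benchify emits may be formatted differently.

    # Sketch only: the actual generated unit test may differ.
    from LeetCode.script import minDistance  # hypothetical import path

    def test_min_distance_to_self_counterexample():
        # Input taken from the Inputs column of the failing row above.
        word = 'gprnQCkUF'
        # Property under test: the distance between a word and itself should be zero.
        assert minDistance(word, word) == 0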

Note that we make small improvements to the Benchify report format periodically, so the precise format is subject to change.