Often you may have inferences from multiple models on the same dataset - most commonly after you have trained a new model and want to compare it to an old model on a validation dataset.
Aquarium offers functionality to compare inferences from multiple models on the same dataset to see how your high-level metrics changed. You can also drill down into the exact circumstances where the inferences differ - where one model performs better or worse compared to the other.
To compare inferences, you will need to upload multiple sets of inferences against the same project and base dataset using the Python client API.
For example, uploading a labeled base dataset might look like the following:
python3 upload_to_aquarium.py --aquarium-project example_project --aquarium-dataset example_dataset
Now we may want to upload a set of inferences from model A on the base dataset:
python3 upload_to_aquarium.py --aquarium-project example_project --aquarium-dataset example_dataset --inferences-id model_a
And then we can upload another set of inferences from model B on the same base dataset:
python3 upload_to_aquarium.py --aquarium-project example_project --aquarium-dataset example_dataset --inferences-id model_b
Once you have multiple sets of inferences for the same dataset, you can select a second set of inferences from the top query bar by clicking the "+" icon and selecting another set of inferences from the dropdown. In this document, we'll refer to the first and second sets as the "base inferences" and "other inferences" respectively.
With a second set of inferences selected, you can now write queries that reference the labels in the labeled dataset, the base inference set, and the other inference set:
When comparing multiple sets of inferences, the label viewer will draw inferences with an additional diagonal hatching pattern -- bottom-left to top-right for the base inference set, and top-left to bottom-right for the other inference set.
When viewing multiple inferences for a semantic segmentation task, we present a separate view that captures pixel-wise changes.
On the left, we have the three masks rendered on top of the image. On the right, we have a diff overlay that shows the overall correctness of the two inference sets, as well as where they improved or worsened. The divider between the sections can be dragged to make certain images larger or smaller:
The diff map is colored according to the following scheme:
Blue: Both inference sets agree with the labels.
Yellow: Both inference sets disagree with the labels.
Green: The base inference set disagrees with the labels, and the other inference set agrees.
Red: The base inference set agrees with the labels, and the other inference set disagrees.
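Aquarium's actual rendering code isn't shown here, but as a rough sketch of the per-pixel logic, the four categories can be derived from three class-ID masks (labels, base predictions, other predictions — all names and values below are illustrative):

```python
import numpy as np

# Hypothetical (H, W) masks of per-pixel class IDs.
labels = np.array([[0, 1], [2, 2]])
base_pred = np.array([[0, 2], [2, 1]])
other_pred = np.array([[0, 1], [1, 1]])

base_correct = base_pred == labels
other_correct = other_pred == labels

# One category per pixel, mirroring the legend above.
BLUE, YELLOW, GREEN, RED = 0, 1, 2, 3
diff = np.full(labels.shape, YELLOW)          # both sets disagree with the labels
diff[base_correct & other_correct] = BLUE     # both sets agree with the labels
diff[~base_correct & other_correct] = GREEN   # the other set fixed the pixel
diff[base_correct & ~other_correct] = RED     # the other set regressed the pixel
```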
Each color in the legend can be toggled to show only a subset of the diffs:
Below the images, we also have a classification report and confusion matrix for the image, which captures the relative performance change from the base inference set to the other inference set. Cells that improved will be colored a shade of green, while those that worsened will be colored a shade of red.
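The relative performance change can be thought of as the element-wise difference between the two sets' confusion matrices. A minimal sketch (the `confusion_matrix` helper and all data here are hypothetical, not Aquarium's API):

```python
import numpy as np

def confusion_matrix(labels, preds, num_classes):
    """Count samples (or pixels) for each (label, prediction) pair."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for l, p in zip(labels.ravel(), preds.ravel()):
        cm[l, p] += 1
    return cm

labels = np.array([0, 0, 1, 1])
base_pred = np.array([0, 1, 1, 0])
other_pred = np.array([0, 0, 1, 1])

# Positive diagonal deltas mean more correct predictions (shaded green);
# positive off-diagonal deltas mean more confusion (shaded red).
delta = confusion_matrix(labels, other_pred, 2) - confusion_matrix(labels, base_pred, 2)
```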
Clicking on a cell of the confusion matrix will further filter the diff view, to just the relevant pixels for that kind of confusion.
When comparing multiple inferences, the metrics view switches to rendering the relative difference in metrics from the base inference set to the other inference set:
Clicking on a cell of the confusion matrix will allow you to view samples where there was a change in that confusion between the two inference sets. For example, clicking on the cyclist -> pedestrian confusion cell will show us examples where the base inference set confused a cyclist for a pedestrian, and the other inference set fixed that confusion:
Similarly, we can use the "New Confusion" tab to find samples where a confusion of that type was newly introduced.
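Conceptually, a "fixed" confusion is one present in the base inferences but absent in the other inferences for a sample, and a "new" confusion is the reverse. A sketch with made-up per-sample records (the data structures here are illustrative, not the Aquarium client's):

```python
# Hypothetical per-sample records of (label, prediction) confusions
# produced by each inference set.
base_confusions = {
    "img_1": {("cyclist", "pedestrian")},
    "img_2": {("cyclist", "pedestrian")},
    "img_3": set(),
}
other_confusions = {
    "img_1": set(),                        # confusion fixed
    "img_2": {("cyclist", "pedestrian")},  # still confused
    "img_3": {("cyclist", "pedestrian")},  # confusion newly introduced
}

target = ("cyclist", "pedestrian")
fixed = [s for s in base_confusions
         if target in base_confusions[s] and target not in other_confusions[s]]
new = [s for s in base_confusions
       if target not in base_confusions[s] and target in other_confusions[s]]
```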
For metrics that apply to the entire image, such as semantic segmentation confusion counts or user-provided metrics, clicking on a cell of the confusion matrix will compute the per-image change for that confusion type. You can then sort the returned images by the change in the per-image confusion count.
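The per-image sort described above amounts to ranking images by the delta in their confusion counts between the two inference sets. A short illustrative sketch (counts and image IDs are made up):

```python
# Hypothetical per-image counts of one confusion type for each inference set.
base_counts = {"img_1": 120, "img_2": 40, "img_3": 300}
other_counts = {"img_1": 20, "img_2": 90, "img_3": 310}

# Change from base to other; negative values mean fewer confused
# pixels in the other inference set, i.e. an improvement.
change = {img: other_counts[img] - base_counts[img] for img in base_counts}

# Most-improved images first.
ranked = sorted(change, key=change.get)
```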