Analyzing Model Inferences

Once you've uploaded a model's inferences on a dataset through the Aquarium API, you can analyze the model's performance and find insights in the underlying dataset more efficiently.

Ultimately, the goal of most ML teams is to improve model performance, so it's important to understand where your model is doing well or badly. You can also use this analysis to surface which parts of the dataset deserve more attention, either because the model is performing badly or because there's a problem with the underlying data.

To view inferences on a dataset, select a base dataset, choose a corresponding inferences set, and then click "Search." The Grid View will now render the data with both labels and model inferences. In this example, labels are rendered with solid lines and inferences with dashed lines.

Model Metrics View

The Model Metrics View is the fourth icon to the right of the search button.

Here, you can see a high level overview of the model's performance on the base dataset in a few forms:

You can also move the slider for parameters like confidence threshold and IOU matching threshold, and the metrics will recompute to reflect the model's performance with those parameters.
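
As a rough illustration of why the metrics change when you move the confidence slider, here is a minimal sketch. It is not Aquarium's implementation: the function name and input layout are hypothetical, and it assumes each inference has already been matched against the labels and flagged as correct or incorrect.

```python
def precision_recall(detections, num_labels, confidence_threshold):
    """Compute precision and recall at a given confidence threshold.

    detections: list of (confidence, is_correct) pairs (hypothetical layout,
        assuming matching against labels was already done).
    num_labels: total number of ground-truth labels.
    """
    # Keep only inferences at or above the threshold.
    kept = [is_correct for confidence, is_correct in detections
            if confidence >= confidence_threshold]
    true_positives = sum(kept)
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / num_labels if num_labels else 0.0
    return precision, recall


detections = [(0.9, True), (0.8, True), (0.6, False), (0.3, True)]
for threshold in (0.5, 0.7):
    print(threshold, precision_recall(detections, num_labels=4,
                                      confidence_threshold=threshold))
```

Raising the threshold from 0.5 to 0.7 drops the one incorrect inference, so precision rises while recall stays the same in this toy data.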

You can also click into an example in the confusion matrix, see examples of those types of confusions, and sort by the amount of confusion.

Exploring false positive car detections.

Many Class Metrics View

When the number of classes exceeds 25, the model metrics view switches from the grids above to a filterable table format for better legibility.

Selecting a row in the Classification Report (left) will filter the Confusion Matrix (right) to only rows where either GT or Prediction matches the selected class. Selecting a row in the Confusion Matrix will show examples of those types of confusions, and sort by the amount of confusion.

Surfacing Labeling Errors

Aquarium makes it easy to find the places where your model disagrees most with the data. Once you click on a confusion matrix box, you can sort examples to see high confidence disagreements, low confidence agreements, or by other common factors like IOU or box size.

These "high loss" examples tend to expose areas where the model is making egregious mistakes, or places where the model is right and the underlying labels are wrong! The following example shows the most confident cases where the model detected a car with no corresponding label. Most of these issues are due to missing labels on cars that the model correctly detects!

Most high confidence false positives are due to missing labels!

Whenever you upload a new model inference set to Aquarium, we highly recommend looking through some of these high loss examples to see if you have any labeling mistakes.
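
The sorting described in this section can be sketched outside the UI as well. The snippet below is a minimal illustration, not an Aquarium API: the record fields (`label`, `inference`, `confidence`) and the function name are assumptions made for the example.

```python
def high_confidence_disagreements(examples, label_class, inferred_class):
    """Return the examples in one confusion-matrix cell, most confident first.

    examples: list of dicts with hypothetical keys
        'label', 'inference', and 'confidence'.
    """
    cell = [e for e in examples
            if e["label"] == label_class and e["inference"] == inferred_class]
    # Highest-confidence disagreements are the best candidates for label review.
    return sorted(cell, key=lambda e: e["confidence"], reverse=True)


examples = [
    {"id": "a", "label": "background", "inference": "car", "confidence": 0.55},
    {"id": "b", "label": "background", "inference": "car", "confidence": 0.97},
    {"id": "c", "label": "car", "inference": "car", "confidence": 0.99},
]
for e in high_confidence_disagreements(examples, "background", "car"):
    print(e["id"], e["confidence"])
```

Here the cell for label 'background' and inference 'car' corresponds to false positive car detections, which is where missing labels tend to show up.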

Finding Model Failure Patterns

You can color datapoints in the embedding view based on model precision, recall, and F1. This lets you identify trends in model performance by finding the parts of the dataset where the model does particularly well or badly.

Our model has very good accuracy on pedestrians on a subset of the dataset...
Upon further inspection, it's because they're all the same scene! The model is overfitting to that scene.

When switching to crop embeddings, you can color datapoints by confusion type to identify object-level failure patterns.

By clicking the entries in the class legend on the left, we can toggle the visualization to only show false positive scenarios.

We can identify a cluster of false positive detections on the same object across multiple different frames.


Aquarium automatically computes performance metrics for common tasks. Some of these, like image classification, are well defined and consistent across most users. Others, like object detection, tend to vary in implementation from one library to another.

Here are some more details about our interpretation of common tasks and how we compute their metrics. If you have different logic you use to evaluate performance, you can also provide your own custom metrics.

Object Detection

In order to offer a confusion matrix (rather than just FP/TP/FN counts for each class), we have to consider all detection and label classes simultaneously, as opposed to considering each class individually as you might see in pycocotools and some other libraries/benchmarks.

That means we model it as a two-phased process:

  • A global matching stage (with unmatched entries implicitly matching to 'background')

  • A scoring stage that, once entries are matched, computes standard pairwise TP/FP metrics on them, as in a typical classification task.
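
The second stage can be sketched as follows, assuming the first stage has already produced (label class, inferred class) pairs with 'background' standing in for unmatched entries. The function name and pair layout are hypothetical and only meant to illustrate the idea.

```python
from collections import Counter


def per_class_counts(pairs):
    """Count TP/FP/FN per class from matched (label_class, inferred_class) pairs.

    Unmatched entries are assumed to be paired with the string 'background'.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for label_class, inferred_class in pairs:
        if label_class == inferred_class:
            tp[label_class] += 1
            continue
        # A wrong or spurious inference is a false positive for its class.
        if inferred_class != "background":
            fp[inferred_class] += 1
        # A missed or misclassified label is a false negative for its class.
        if label_class != "background":
            fn[label_class] += 1
    return tp, fp, fn


pairs = [("car", "car"), ("background", "car"),
         ("car", "truck"), ("truck", "background")]
tp, fp, fn = per_class_counts(pairs)
print(dict(tp), dict(fp), dict(fn))
```

Note that a cross-class confusion such as ('car', 'truck') counts twice: as a false negative for 'car' and a false positive for 'truck'.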

However, the matching phase has several interpretations, and the most "correct" one seems to vary a bit based on domain and model architecture. The sections below describe what we chose as a default.

Matching Logic

We first apply a confidence threshold, discarding all inferences below that threshold. Then we match GT labels with the remaining inferences based on optimal geometry overlap / localization (typically optimizing IOU).

To properly punish multiple detections, we consider each label and inference to be "used up" by its best-localization match. That is, a label can only be matched to either 0 or 1 inference.
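
The matching steps above can be sketched roughly as below. This is an illustration under stated assumptions, not Aquarium's actual code: it uses axis-aligned boxes as (x1, y1, x2, y2), a greedy pass over candidate pairs in descending IOU order as a simple stand-in for an optimal assignment, and hypothetical function names and input layouts.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match_frame(labels, inferences, confidence_threshold=0.5, iou_threshold=0.5):
    """Match labels to inferences for one frame.

    labels: list of (class, box); inferences: list of (class, box, confidence).
    Returns (label_class, inferred_class) pairs; unmatched entries are
    paired with 'background'.
    """
    # Step 1: discard inferences below the confidence threshold.
    kept = [inf for inf in inferences if inf[2] >= confidence_threshold]

    # Step 2: score every label/inference pair by IOU, ignoring class.
    candidates = []
    for li, (_, label_box) in enumerate(labels):
        for ki, (_, inf_box, _) in enumerate(kept):
            overlap = iou(label_box, inf_box)
            if overlap >= iou_threshold:
                candidates.append((overlap, li, ki))

    # Step 3: greedily take the best-localized pairs; each label and each
    # inference is "used up" after one match.
    used_labels, used_infs, pairs = set(), set(), []
    for _, li, ki in sorted(candidates, reverse=True):
        if li in used_labels or ki in used_infs:
            continue
        used_labels.add(li)
        used_infs.add(ki)
        pairs.append((labels[li][0], kept[ki][0]))

    # Step 4: anything left over implicitly matches 'background'.
    pairs += [(labels[li][0], "background")
              for li in range(len(labels)) if li not in used_labels]
    pairs += [("background", kept[ki][0])
              for ki in range(len(kept)) if ki not in used_infs]
    return pairs


labels = [("car", (0, 0, 10, 10))]
inferences = [("car", (0, 0, 10, 10), 0.9),      # best localization
              ("car", (1, 1, 10, 10), 0.8),      # duplicate detection
              ("truck", (50, 50, 60, 60), 0.2)]  # below confidence threshold
print(match_frame(labels, inferences))
```

In this toy frame the duplicate car detection cannot reuse the already matched label, so it is paired with 'background' and counts as a false positive.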

In most cases, that tends to be beneficial and more representative of real-world performance, where objects rarely overlap perfectly and false positives are bad. In others, especially ones where false positives are not a major problem, it can be a sub-optimal metric. In these cases, we recommend you provide your own custom metric in the form of a per-frame confusion matrix.
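
If you compute your own matching, a per-frame confusion matrix could be assembled along these lines. The exact format Aquarium expects for custom metrics is not described here, so the nested-list layout (rows are label classes, columns are inferred classes, with 'background' appended last) and the function name are assumptions for illustration only.

```python
from collections import Counter


def frame_confusion_matrix(pairs, classes):
    """Build a confusion matrix for one frame.

    pairs: (label_class, inferred_class) tuples from your own matching logic,
        using 'background' for unmatched entries.
    classes: ordered list of class names, without 'background'.
    """
    counts = Counter(pairs)
    names = list(classes) + ["background"]
    # Rows index the label class, columns index the inferred class.
    return [[counts[(gt, pred)] for pred in names] for gt in names]


pairs = [("car", "car"), ("background", "car"), ("truck", "car")]
for row in frame_confusion_matrix(pairs, ["car", "truck"]):
    print(row)
```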

Why not NMS? Why a confidence threshold instead of sorting by descending confidence?

Non-Maximum Suppression (NMS) solves a similar problem of prioritizing detections based on a combination of confidence and IOU. The main reason we don't use it directly is that we assume most customers are doing NMS somewhere in their stack, and that ultimately they ship a confidence threshold.

So, assuming they're going to pick some cutoff for what they'd want to run in production, we start by filtering out lower-confidence inferences. This way we're computing metrics that resemble the final shipped product's performance, as opposed to a metric that more abstractly measures the model's generic performance.

We've seen that if your implementation already does some NMS internally, then at this point you don't care as much about the difference between 0.92 confidence and 0.85 confidence, and so would rather prioritize localization (especially in domains with lots of partial overlap).