Analyzing Model Inferences

Once you've uploaded a model's inferences on a dataset through the Aquarium API, you can then begin to analyze your model's performance and more efficiently find insights in the underlying datasets.
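If you haven't done that yet, the upload step looks roughly like the sketch below. The client calls shown (`al.Client`, `al.Inferences`, `add_inference_2d_bbox`, `create_inferences`) are an approximation of the Aquarium Python client workflow; treat the exact names and signatures as assumptions and check the client reference for the current API.

```python
import aquariumlearning as al

# NOTE: method names and signatures below are approximations of the Aquarium
# Python client; consult the client reference for the exact current API.
al_client = al.Client()
al_client.set_credentials(api_key="YOUR_API_KEY")

inferences = al.Inferences()

# One inferences frame per labeled frame, matched by frame_id.
inf_frame = al.InferencesFrame(frame_id="frame_0001")
inf_frame.add_inference_2d_bbox(
    label_id="frame_0001_inf_0",
    classification="car",
    top=100, left=250, width=80, height=40,
    confidence=0.92,  # used later by the confidence threshold slider
)
inferences.add_frame(inf_frame)

al_client.create_inferences(
    "my_project",
    "my_dataset",
    inferences=inferences,
    inferences_id="model_v1",
)
```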

Ultimately, the goal of most ML teams is to improve model performance, so it's important to understand where your model does well and where it does badly. We can also use this analysis to surface which parts of the dataset deserve more attention, either because the model is performing badly there or because there's a problem with the underlying data.

To view inferences on a dataset, select a base dataset, choose a corresponding inferences set, and then click "Search." The Grid View will now render the data with both labels and model inferences. In this example, labels are rendered with solid lines and inferences with dashed lines.

Model Metrics View

The Model Metrics View is the fourth icon to the right of the search button.

Here, you can see a high-level overview of the model's performance on the base dataset in a few forms, including a confusion matrix and a classification report:

You can also move the sliders for parameters like the confidence threshold and the IOU matching threshold, and the metrics will recompute to reflect the model's performance with those parameters.
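Under the hood, those metrics come from re-matching predictions to labels at the chosen thresholds. The sketch below shows one plain-Python way to compute precision, recall, and F1 for a single frame with greedy IoU matching; the data shapes (`preds` and `labels` as lists of dicts) are illustrative, not Aquarium's internal representation.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def detection_metrics(preds, labels, conf_thresh=0.5, iou_thresh=0.5):
    """Greedy matching of predictions to labels at the given thresholds.

    preds:  list of dicts with "box" and "confidence"
    labels: list of dicts with "box"
    """
    # Drop low-confidence predictions, then match highest-confidence first.
    kept = sorted((p for p in preds if p["confidence"] >= conf_thresh),
                  key=lambda p: -p["confidence"])
    matched = set()
    tp = 0
    for p in kept:
        best, best_iou = None, iou_thresh
        for i, gt in enumerate(labels):
            if i in matched:
                continue
            overlap = iou(p["box"], gt["box"])
            if overlap >= best_iou:
                best, best_iou = i, overlap
        if best is not None:
            matched.add(best)
            tp += 1
    fp = len(kept) - tp
    fn = len(labels) - tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```

Moving the sliders simply re-runs this kind of matching with new `conf_thresh` / `iou_thresh` values, which is why the metrics update immediately.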

You can also click into a cell in the confusion matrix, see examples of those types of confusions, and sort them by the amount of confusion.

Exploring false positive car detections.
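Conceptually, each confusion matrix cell counts matched (ground truth class, predicted class) pairs, with unmatched labels and unmatched detections falling into "missed" and "background" cells; sorting the off-diagonal cells by count surfaces the biggest confusions first. A rough, illustrative sketch of that bookkeeping (not Aquarium's implementation):

```python
from collections import Counter

def confusion_counts(matched_pairs, unmatched_labels, unmatched_preds):
    """matched_pairs: (gt_class, pred_class) tuples from IoU matching.
    unmatched_labels / unmatched_preds: lists of class names."""
    counts = Counter(matched_pairs)
    counts.update((gt_class, "(missed)") for gt_class in unmatched_labels)
    counts.update(("(background)", pred_class) for pred_class in unmatched_preds)
    return counts

def biggest_confusions(counts, top_k=10):
    """Off-diagonal cells (GT class != predicted class), largest first."""
    off_diag = {cell: n for cell, n in counts.items() if cell[0] != cell[1]}
    return sorted(off_diag.items(), key=lambda kv: -kv[1])[:top_k]
```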

Many Class Metrics View

When the number of classes exceeds 25, the model metrics view switches from the grids above to a filterable table format for better legibility.

Selecting a row in the Classification Report (left) will filter the Confusion Matrix (right) to only rows where either GT or Prediction matches the selected class. Selecting a row in the Confusion Matrix will show examples of those types of confusions, sorted by the amount of confusion.

Surfacing Labeling Errors

Aquarium makes it easy to find the places where your model disagrees most with the data. Once you click on a confusion matrix cell, you can sort examples to see high-confidence disagreements or low-confidence agreements, or sort by other common factors like IOU or box size.

These "high loss" examples tend to expose areas where the model is making egregious mistakes or places where the model is right and the underlying labels is wrong! In the following example - showing the most confident examples where the model detected cars where there was no corresponding label - most issues are due to missing labels on cars that the model correctly detects!

Most high confidence false positives are due to missing labels!
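The intuition behind this sort is that a false positive predicted with very high confidence is more likely to be a missing label than a genuine model mistake. A minimal sketch, assuming a hypothetical list of unmatched detections:

```python
# Hypothetical list of unmatched detections ("disagreements" with the labels).
false_positives = [
    {"frame_id": "frame_0007", "class": "car", "confidence": 0.97},
    {"frame_id": "frame_0131", "class": "car", "confidence": 0.55},
    {"frame_id": "frame_0042", "class": "car", "confidence": 0.91},
]

# High-confidence disagreements first: these are the prime candidates for
# missing labels rather than true model mistakes.
for fp in sorted(false_positives, key=lambda d: -d["confidence"]):
    print(fp["frame_id"], fp["class"], fp["confidence"])
```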

Whenever you upload a new model inference set to Aquarium, we highly recommend looking through some of these high loss examples to see if you have any labeling mistakes.

Finding Model Failure Patterns

You can color datapoints in the embedding view based on model precision, recall, and F1. This lets you identify trends in model performance by finding the parts of the dataset where the model does particularly well or badly.

Our model has very good accuracy on pedestrians on a subset of the dataset...
Upon further inspection, it's because they're all the same scene! The model is overfitting to that scene.
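You can reproduce this kind of view offline if you export 2D embedding coordinates and compute a per-frame metric yourself. A minimal sketch with placeholder arrays (random data standing in for real embeddings and F1 scores):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder inputs, one row per frame: swap in your exported values.
xy = np.random.rand(500, 2)   # 2D embedding coordinates
f1 = np.random.rand(500)      # per-frame F1 scores

plt.scatter(xy[:, 0], xy[:, 1], c=f1, cmap="viridis", s=8)
plt.colorbar(label="per-frame F1")
plt.title("Embedding view colored by model performance")
plt.show()
```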

When switching to crop embeddings, you can color datapoints by confusion type to identify object-level failure patterns.

By clicking the entries in the class legend on the left, we can toggle the visualization to only show false positive scenarios.

We can identify a cluster of false positive detections on the same object across multiple different frames.
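Toggling a legend entry is effectively a filter on the confusion type attached to each crop. A minimal sketch with hypothetical crop records:

```python
import matplotlib.pyplot as plt

# Hypothetical crop records: 2D embedding coordinates plus the confusion type
# assigned during matching (placeholder data).
crops = [
    {"x": 0.12, "y": 0.80, "confusion": "false_positive"},
    {"x": 0.45, "y": 0.22, "confusion": "true_positive"},
    {"x": 0.14, "y": 0.78, "confusion": "false_positive"},
    {"x": 0.70, "y": 0.65, "confusion": "false_negative"},
]

# Toggling a legend entry amounts to filtering by confusion type.
show_only = {"false_positive"}
for confusion in sorted({c["confusion"] for c in crops}):
    if confusion not in show_only:
        continue
    xs = [c["x"] for c in crops if c["confusion"] == confusion]
    ys = [c["y"] for c in crops if c["confusion"] == confusion]
    plt.scatter(xs, ys, s=12, label=confusion)
plt.legend()
plt.title("Crop embeddings, false positives only")
plt.show()
```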