Fixing Model Failures


This guide assumes the following:

  • You have created a project in Aquarium, uploaded a dataset, and can visualize it in the webapp.

  • You have also uploaded your model inferences on that dataset and can visualize them in the webapp.

  • Ideally, you have embeddings available in the webapp, either generated by Aquarium or extracted + uploaded from your own model.


Diagnosing model failures assumes that you are already confident in the quality of your dataset, and that relabeling the existing dataset would yield at best diminishing returns.

Label quality issues and model failure issues usually call for different resolution paths; determining early which one you are looking at can save operational churn and wasted iteration while you work to improve model performance.

Looking for General Areas of Improvement

First, navigate to the Embedding View on the Explore page. Here, you can color elements by confusion (switch to plotting Crops if your primary task is object detection).

A cluster of "Disagree" in the relative center of the embedding space

Use the lassoing tool to select clusters of disagreements.

Using the popout in the top right corner, scan through the selected elements and confirm that the disagreements are a model failure. You can then add these selected elements to an issue.
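The "color by confusion" view is driven by whether each element's inference agrees with its ground-truth label. As a minimal sketch of that agree/disagree tagging (a hypothetical helper for illustration, not part of the Aquarium client):

```python
def tag_disagreements(labels, inferences):
    """Tag each element "Agree" or "Disagree" by comparing its
    ground-truth label against the model's predicted class.
    `labels` and `inferences` are parallel lists of class names."""
    return [
        "Agree" if label == pred else "Disagree"
        for label, pred in zip(labels, inferences)
    ]

labels = ["shiba_inu", "samoyed", "corgi"]
preds = ["samoyed", "samoyed", "corgi"]
print(tag_disagreements(labels, preds))  # ['Disagree', 'Agree', 'Agree']
```

Clusters of "Disagree" tags in embedding space are exactly what the lassoing workflow above is designed to surface.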

Identifying The Prevalence of a Specific Failure

You may already know a specific failure case (whether it's by specific id or a combination of metadata filters) that you want to improve; first, use the query bar to find a representative element.

Alternately, by looking through the confusion matrix in the Metrics View or exploring the dataset in other ways, you've discovered some examples that look like a pattern, but you're unsure whether it's part of a larger model shortcoming.

We find an unexpected Shiba Inu/Samoyed misclassification

Add the known problem samples to an issue, and then go to the issue. Click on the "Similar Dataset Elements" tab, and then opt to "Calculate Similar Dataset Elements." This will run a nearest-neighbors search across all the elements in your dataset, sorted by nearness in embedding space to the elements already in the issue. Through this search, you can find similar examples to add to the issue, or infer more details about the pattern. From these three examples, we might want to explore whether we have trouble differentiating white dogs in general.

We see that the similarity search did return primarily white dogs, but that the model correctly classifies the majority of them.
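Conceptually, the similarity search behaves like a nearest-neighbors query over the embedding vectors. A minimal sketch of the idea, assuming you have the embeddings as NumPy arrays (this illustrates the concept, not Aquarium's actual implementation):

```python
import numpy as np

def similar_elements(issue_embeddings, dataset_embeddings, k=5):
    """Rank dataset elements by distance in embedding space to the
    centroid of the elements already in the issue."""
    centroid = np.mean(issue_embeddings, axis=0)
    dists = np.linalg.norm(dataset_embeddings - centroid, axis=1)
    # Indices of the k nearest dataset elements, nearest first
    return np.argsort(dists)[:k]
```

Elements that rank highly here but were classified correctly (like most of the white dogs above) are evidence against the broader-pattern hypothesis.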

Improving Edge Case Performance

In the Shiba Inu example, we concluded that the errors are an edge case rather than a widespread metadata problem; rather than an issue with white dogs in general, the model struggles with white Shiba Inus. If we want to improve on this edge case, our next step is to go out and collect more labeled samples of white Shiba Inus to train the model on.

To do so, we could leverage Collection Campaigns, which take a corpus of unlabeled data and identify the subset we actually want to label, letting us send enough examples of white Shiba Inus to our labeling service without manually reviewing every collected sample.
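The selection step of such a campaign can be thought of as scoring each unlabeled sample by how close it sits to the known edge-case examples in embedding space, then keeping the closest ones up to a labeling budget. A rough sketch of that idea (an illustration only, not the Collection Campaigns API):

```python
import numpy as np

def select_for_labeling(edge_case_embs, unlabeled_embs, budget=100):
    """Pick the `budget` unlabeled samples whose embeddings sit closest
    to any of the known edge-case examples."""
    # Pairwise distances: (num_unlabeled, num_edge_cases)
    dists = np.linalg.norm(
        unlabeled_embs[:, None, :] - edge_case_embs[None, :, :], axis=2
    )
    # Score each unlabeled sample by its nearest edge-case example
    scores = dists.min(axis=1)
    return np.argsort(scores)[:budget]  # closest samples first
```

In practice the campaign handles this triage for you; the point is that only a targeted slice of the collected corpus ever reaches the labeling service.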

Measuring Impact and Validating Changes

Once you've made adjustments to your dataset or to your model, you can retrain and upload the results to Aquarium. Using the Project Summary page, you can compare F1/Precision/Recall metrics across datasets.
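Those metrics come from the standard per-class counts of true positives, false positives, and false negatives. A minimal sketch of how precision, recall, and F1 for one class are derived from labels and predictions (for illustration; Aquarium computes these for you):

```python
def prf1(labels, preds, cls):
    """Per-class precision, recall, and F1 from parallel lists of
    ground-truth labels and model predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```

Running this per class on the old and new inference runs gives you the same deltas the Project Summary page surfaces.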

If you upload new inferences on the same labeled dataset in Aquarium, you can click "Compare" in the Metrics View to directly compare the two sets of results against each other.

Per-class improvements: in this comparison, false positives decreased by 5
Confusion metrics improvements