Fixing Data + Label Quality Issues

Prerequisites

  • You have created a project in Aquarium, uploaded a dataset, and can visualize it in the webapp.

  • You have also uploaded your model inferences on that dataset and can visualize them in the webapp.

  • Ideally, you have embeddings available in the webapp, either generated by Aquarium or extracted + uploaded from your own model.

Motivation

The most important determinant of model performance is the quality of the data it's trained on. Usually the first thing you want to do when working with machine learning is to inspect the dataset and make sure that the data and labels conform to your expectations.

If you can't trust your data / labels, then you also can't trust your accuracy metrics on that data. Clean evaluation datasets give you confidence that your accuracy metrics correctly measure the performance of your model.

Training on bad data can also confuse your model and degrade its performance. Fixing corrupted data and incorrect labels + retraining on clean data can lead to improvements of up to 18% in model performance! Here's an example of how fixing bad labels improved model performance for one of our customers.

Inspect The Data For Obvious Issues

The grid view allows you to inspect your data at a glance and spot obvious issues.

First, navigate to your project and click "explore" on the dataset you'd like to work on. You will be presented with a view of your datapoints, your labels, and your inferences, if available.

You will want to spend some time scrolling through this view to see what your dataset "looks" like. This is a great way to see if there are any obvious issues in the dataset. Be on the lookout for issues like erroneous data (completely black images, images with strange artifacts, etc.) or incorrect labels (a picture of a dog that was labeled as a cat).

If you find bad datapoints, you can select them with the circle on the top-left of the view and use the "Add To Issue" button to add them to a collection. To bulk add datapoints, you can hold shift and then click the circle, which should save you a lot of clicking.

If you'd like to change the number of datapoints displayed per row, click the "Display Settings" button and change the "Cards Per Row" setting.
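
Visual inspection catches a lot, but you can also pre-screen for some of the obvious failure modes described above (completely black images, unreadable files) with a quick script before or alongside the grid view pass. Below is a minimal sketch using Pillow and NumPy; the image directory and thresholds are assumptions you would adapt to your own dataset.

```python
# Sketch: flag images that are likely corrupted or nearly all-black so they
# can be reviewed (or added to an issue) in Aquarium.
# Assumes images live in a local directory; adjust paths/thresholds as needed.
from pathlib import Path

import numpy as np
from PIL import Image

IMAGE_DIR = Path("data/images")   # hypothetical location of your dataset images
DARK_THRESHOLD = 5.0              # mean pixel value below this is "nearly black"

suspect_images = []
for path in sorted(IMAGE_DIR.glob("*.jpg")):
    try:
        with Image.open(path) as img:
            pixels = np.asarray(img.convert("L"), dtype=np.float32)
    except (OSError, ValueError):
        # Unreadable or truncated file
        suspect_images.append((path.name, "unreadable"))
        continue

    if pixels.mean() < DARK_THRESHOLD:
        suspect_images.append((path.name, "nearly black"))
    elif pixels.std() < 1.0:
        suspect_images.append((path.name, "almost no variation"))

for name, reason in suspect_images:
    print(f"{name}: {reason}")
```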

Understand The Data Distribution

Once you've spent some time and are sure the data seems reasonable, it's important to zoom out and understand the distribution of the dataset. You can click into the Histogram View on the left side of the screen, select various pieces of metadata, and see the distribution of that metadata across the dataset.

Looking at the class distribution, we see a relatively even number of datapoints for most classes across the dataset.
However, there are significantly fewer cats than dogs!
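
If you'd like to double-check the same distribution outside the webapp, you can tally classes directly from your label records. Here's a minimal sketch, assuming your labels are available as a list of dicts with a "label" field (that structure is an illustration, not Aquarium's export schema):

```python
# Sketch: compute the per-class label distribution for comparison against
# the Histogram View. The `labels` structure below is a hypothetical
# example, not Aquarium's export schema.
from collections import Counter

labels = [
    {"frame_id": "frame_001", "label": "dog"},
    {"frame_id": "frame_002", "label": "cat"},
    {"frame_id": "frame_003", "label": "dog"},
    # ... the rest of your dataset's labels
]

class_counts = Counter(record["label"] for record in labels)
total = sum(class_counts.values())

for cls, count in class_counts.most_common():
    print(f"{cls:>10}: {count:6d} ({count / total:.1%})")
```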

If you find a "slice" of data that's of interest of you, you can go to any UI view and apply a filter on the data.

Viewing only cats in the grid view.

This is especially useful for finding types of data that you may need to collect more of. It can also help you examine specific data of interest to gain intuition about what it looks like and make sure that there are no data or label errors in that slice.
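
The same kind of slicing is easy to reproduce on local metadata. As a small illustration, reusing the hypothetical label-record structure from the sketch above, you could pull out just the cat frames to review or to prioritize for further collection:

```python
# Sketch: pull out one slice of interest (frames labeled "cat") from the
# same kind of hypothetical label records used above.
labels = [
    {"frame_id": "frame_001", "label": "dog"},
    {"frame_id": "frame_002", "label": "cat"},
    {"frame_id": "frame_004", "label": "cat"},
]

cat_frames = [r["frame_id"] for r in labels if r["label"] == "cat"]
print(f"{len(cat_frames)} frames labeled as cat, e.g. {cat_frames[:5]}")
```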

Looking At High Loss Examples To Find Labeling Errors

Some datasets are very large, making it tedious to scan through the entire dataset to catch labeling mistakes that can be relatively rare. Luckily, Aquarium has functionality to quickly surface labeling errors out of large datasets to make this process faster.

If you have trained a model on the dataset, you can find labeling errors by finding the places where the model inferences disagree most with the labels. Select a dataset and an inference set to compare, then go to the Model Metrics View to see the model's accuracy stats.

You can then choose a class to examine in more detail - typically, you want to dive deeper into a class that is particularly important for your application or that has unusually poor accuracy. Then you can click on a cell of the confusion matrix and sort by most confident / largest boxes to see the examples where the model inferences and labels diverge the most. This should surface a lot of places where the model is confident and correct, but the labels are wrong! You can then select these mislabeled datapoints and add them to an issue using the "Add To Issue" button.

In the following example, we look at the model metrics view and notice the precision of our car detector is fairly low. We click on the bottom left cell of the confusion matrix to see datapoints labeled as background that the model detected as car. We see a lot of cases where the model correctly detects cars that are not labeled!

Looking at the model metrics view, we notice that the precision of our car detector is fairly low.
In fact, there are a lot of cases where the model correctly detects cars that are not labeled!
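
If you'd like to mine these candidates offline as well, the same idea (rank the examples where a confident model disagrees with the label) takes only a few lines. The sketch below assumes a simple classification-style setup where each record pairs a label with the model's predicted class and confidence; the field names are illustrative, not an Aquarium format.

```python
# Sketch: surface likely labeling errors by ranking the examples where a
# confident model disagrees with the label. Field names are illustrative.
records = [
    {"frame_id": "frame_010", "label": "cat", "pred": "dog", "confidence": 0.97},
    {"frame_id": "frame_011", "label": "dog", "pred": "dog", "confidence": 0.92},
    {"frame_id": "frame_012", "label": "cat", "pred": "dog", "confidence": 0.64},
    # ... the rest of your evaluation set
]

# Keep only disagreements, then sort so the most confident ones come first.
disagreements = [r for r in records if r["pred"] != r["label"]]
disagreements.sort(key=lambda r: r["confidence"], reverse=True)

# The top of this list is where the labels are most likely to be wrong.
for r in disagreements[:20]:
    print(f'{r["frame_id"]}: labeled {r["label"]}, predicted {r["pred"]} '
          f'({r["confidence"]:.0%} confidence)')
```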

Using Issues To Export Data

Once you've added some labeling errors to an issue, you can examine issue collections and then export the data outside of Aquarium to fix or exclude the bad data.

First, you can go to the Issues page to review all currently created issues and examine them in greater detail in case you want to modify the issue elements before export.

The fastest way to export data is to download a JSON / CSV. You can select multiple issues from the main Issues view using the checkboxes and click the "JSON" button.

Downloading JSONs from the Issues View

Alternatively, you can click into an issue and download the JSON there. You can also use the dropdown at the top of the page to mark the resolution status of the issue - whether it's still being triaged, currently being resolved, or already fixed.

The UI for an individual issue allows you to download as well.
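
However you download it, the export is just a file you can feed into your labeling or training pipeline. The sketch below reads a downloaded issue JSON and collects the affected frame IDs into an exclusion / relabel list; the filename and the "elements" / "frame_id" fields are assumptions about the export's shape, so inspect the actual file before relying on them.

```python
# Sketch: turn a downloaded issue export into an exclusion / relabel list.
# The file name and JSON structure here are assumptions about the export's
# shape; inspect the real download and adjust the field names accordingly.
import json

with open("issue_export.json") as f:
    issue = json.load(f)

# Collect the frame ids referenced by the issue's elements.
bad_frame_ids = {element["frame_id"] for element in issue.get("elements", [])}

# Write them out so a training script can skip them, or a labeling tool can re-queue them.
with open("frames_to_fix.txt", "w") as f:
    f.write("\n".join(sorted(bad_frame_ids)))

print(f"Exported {len(bad_frame_ids)} frames flagged for review.")
```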

You can also programmatically pull down or modify issues using the Python Client API.
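
As a rough sketch of what that can look like: the client setup below follows the aquariumlearning package's usual pattern, but the issue-related calls are placeholders rather than documented method names, so refer to the Python Client API docs for the exact functions.

```python
# Rough sketch of pulling issue contents with the Python client.
# Client setup follows the usual aquariumlearning pattern; the issue-related
# calls below are placeholders, so check the Python Client API docs for the
# exact method names and arguments.
import aquariumlearning as al

al_client = al.Client()
al_client.set_credentials(api_key="YOUR_API_KEY")

# Placeholder calls: fetch the issues for a project and iterate over their
# elements, e.g. to build a relabeling queue.
# issues = al_client.get_issues("your_project_name")
# for issue in issues:
#     for element in issue.elements:
#         print(issue.name, element.frame_id)
```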

Measuring The Impact

Once you've fixed a bunch of data / label quality issues, you probably want to retrain your model against the cleaned dataset and then measure whether the new model is better.

If you upload inferences from your old model and new model on the same dataset, you can compare their performance in Aquarium. To view a high level comparison of your models, you can select the two inference sets from the projects cards page to see the difference in performance at a glance.

If you click "Compare" and go to the Model Metrics view, you can then see the diffs in high-level metrics, click on confusion matrix boxes, and zoom into the precise places where the models differed.