Collect Relevant Data

How to collect relevant data in your labeled and unlabeled datasets using Aquarium

Overview

Deciding which data to label next and add to your training set is no easy task, and neither is determining which data will improve your model the most.

By focusing data collection and labeling on the highest-value data, you can get more model improvement in less time and at lower labeling cost than with random sampling.

If you have a large unlabeled dataset, Aquarium's Collection Campaign segment helps you quickly collect the subset you actually want to use, without anyone having to manually review all of your unlabeled data.

To use Collection Campaigns, you first need to set up a Collection Campaign segment within your dataset. Learn more about organizing your data with Segments.

Aquarium enables you to analyze your datasets to determine whether there are underrepresented areas within your data or areas where your model struggles. Once you have identified these difficult cases, you can group them into a Collection Campaign segment and find more examples like them in the unlabeled dataset. You can then send these examples to a labeling provider and use them to retrain your model, getting the most model improvement for the least labeling cost!

In this guide, the main steps we will cover are:

  • Navigating a Collection Campaign segment

  • Kicking off a similarity search through an unlabeled dataset

  • Exporting your newly collected data from Aquarium

Once completed you should feel comfortable:

  • Searching unlabeled datasets using example data you've collected

  • Reviewing the results of the similarity search and exporting your newly found data

Prerequisites

This guide assumes that you have already found subsets of your data that are of interest for targeted unlabeled data collection and added them to a Collection Campaign segment. Your team can use Aquarium's various views to understand where your training dataset could benefit from additional, targeted data.

We have an entire guide dedicated to the process of assessing your data quality here.

In summary, Aquarium has tools that help you find areas of confusion, low model metrics, and sparse representation, so that you can target your data collection toward the datapoints that are most helpful for improving your model.

Embeddings must be generated for the unlabeled dataset being searched through.

  • The model you use to generate embeddings for the new corpus must be the same model you used to generate embeddings for the elements in your Collection Campaign segment.

  • See this guide for uploading unlabeled data correctly using the embedding_version parameter
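Before uploading, it can help to sanity-check that the seed embeddings and the unlabeled embeddings really came from the same model. The following is a minimal sketch in plain Python, not part of the Aquarium client: the EmbeddingSet record and the check_compatible helper are hypothetical names introduced only for illustration, and the version strings are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSet:
    """Hypothetical record describing where a set of embeddings came from."""
    dataset_name: str
    embedding_version: str  # identifier of the model that produced the vectors
    dimension: int          # length of each embedding vector


def check_compatible(seed, unlabeled):
    """Return a list of problems; an empty list means the two sets look compatible."""
    problems = []
    if seed.embedding_version != unlabeled.embedding_version:
        problems.append(
            f"embedding_version mismatch: {seed.embedding_version!r} "
            f"vs {unlabeled.embedding_version!r}"
        )
    if seed.dimension != unlabeled.dimension:
        problems.append(
            f"dimension mismatch: {seed.dimension} vs {unlabeled.dimension}"
        )
    return problems


# Example with made-up values: the unlabeled set was embedded by a different model.
seed = EmbeddingSet("labeled_train", "model_v2", 512)
unlabeled = EmbeddingSet("unlabeled_pool", "model_v1", 256)
for problem in check_compatible(seed, unlabeled):
    print(problem)
```

A check like this only compares metadata you record yourself. It cannot detect two different models that happen to share a version string and vector length.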

To follow the step-by-step instructions, this guide assumes you have already:

Image vs Object Level Data Curation

Each Collection Campaign segment will be of type frame or type crop.

If you want to search unlabeled datasets for similar images, make sure your Collection Campaign Segment is type Frame.

If you want to search unlabeled datasets for similar objects, make sure your Collection Campaign Segment is type Crop.

For more information on how to create a frame- or crop-specific Collection Campaign segment, follow this guide.

Collecting Relevant Data User Guide

This guide runs through the complete flow of setting up a Collection Campaign and collecting new, unlabeled data similar to the data you've previously identified in a Collection Campaign segment.

1. Navigate to Your Collection Campaign Segment

At this point, you should have already created a Collection Campaign segment. If you need help using Aquarium to assess your data quality and find subsets of data to put into a Collection Campaign segment, check out this guide!

In the top navigation bar in Aquarium click the "Segments" button to be brought to the Segments page.

Once on the Segments page, you'll be able to view all your created segments. Navigate to the Data Collection tab to view all of your created Collection Campaign segments.

Once you have selected which Collection Campaign you would like to work with, click on the name in the table to view details regarding your specific segment.

2. Click on "Collected" Tab

For more information regarding a Collection Campaign segment, read here.

If you have not run a Collection Campaign before on this segment, your screen will look like this:

In the Collected tab, if you have properly uploaded an unlabeled dataset, you will see a dropdown with all of the valid unlabeled datasets that you are able to search through. Depending on your goals for the similarity search, it may make sense to split your unlabeled datasets up in different ways when uploading.

Click the "Calculate Similar Dataset Elements" button to the right of your chosen unlabeled dataset, and you'll see the text below change to reflect the status of your similarity search. You'll also see a green bar pop up at the top of your screen indicating that the similarity search has started.

Results can take anywhere from 10 seconds to a couple of minutes as Aquarium compares your subset of data to the indexed unlabeled dataset. Once the results are returned, your screen will look like this:

You can scroll through the returned images and use the Display Settings and the Sort By Ascending or Descending Similarity Score options to view the elements.
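To build intuition for what a similarity score ordering means, here is a minimal sketch that ranks unlabeled elements by cosine similarity to a set of seed embeddings. This guide does not specify how Aquarium computes its score, so cosine similarity, the best-match-over-seeds rule, and the function names below are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(seed_vectors, unlabeled, descending=True):
    """Score each unlabeled element by its best match against any seed vector.

    `unlabeled` maps an element id to its embedding. Returns (id, score) pairs
    sorted by score.
    """
    scored = [
        (element_id, max(cosine_similarity(vec, s) for s in seed_vectors))
        for element_id, vec in unlabeled.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=descending)


# Toy two-dimensional embeddings, made up for the example.
seed_vectors = [[1.0, 0.0]]
unlabeled = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
for element_id, score in rank_by_similarity(seed_vectors, unlabeled):
    print(element_id, round(score, 3))
```

Sorting descending surfaces the closest matches first, which is useful for confirming that the search found what you expected. Sorting ascending surfaces the borderline results, which are often where review effort matters most.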

At this point, you could export your data by following step 6 in this guide. For even better results, you may want to take it a step further and refine your search results.

5. Refine Your Search Results

Within Aquarium, you can iteratively refine your search results after reviewing the initial proposed elements.

To get started, set the review status for each returned element on a page. You can set the review status by:

  • Clicking the thumbs up button to change the status to accepted

  • Clicking the thumbs down button to change the status to rejected

To minimize clicks, use the default status to change all elements on a page to accepted or rejected, and then manually swap any individual outliers. This is most useful when the elements on a page are mostly good or mostly bad, and becomes less useful as the page approaches an even 50/50 split.
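The click savings from the default status can be shown with a little arithmetic. The sketch below is purely illustrative: the function name is hypothetical, and it counts only the per-element thumbs up / thumbs down clicks, not the click that sets the default or submits the page.

```python
def clicks_to_review_page(num_good, num_bad):
    """Return (default_status, manual_swaps) that minimizes per-element clicks.

    Setting the default to the majority status means only the minority
    elements need an individual thumbs up / thumbs down.
    """
    if num_good >= num_bad:
        return "accepted", num_bad
    return "rejected", num_good


# On a 20-element page with 18 good results, default to accepted and swap 2.
print(clicks_to_review_page(18, 2))
# On an evenly split page, the default saves only half of the clicks.
print(clicks_to_review_page(10, 10))
```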

Once you've set the correct status on the elements using a combination of the default status and the individual thumbs up / thumbs down, hit the submit button on the right side of the window to confirm those statuses and move on to the next page.

Only elements with confirmed statuses will be moved from the unsorted tab into the accepted or rejected tabs!

Repeat this process as many times as needed to refine your newly collected dataset.

Recalculating Similar Elements

Once you have accepted and rejected a portion of your data, you can rerun a similarity search using your selections as part of the search seed.

To return new or additional results from your Collection Campaign, click the Recalculate Similar Dataset Elements button.

The recalculate button may not always be present. It appears only when at least one of the following conditions is met:

  1. The seed dataset changes

    1. Add or remove frames or crops from the labeled element set in the Collection Campaign segment

  2. The search space changes

    1. Change the unlabeled dataset you are searching through

    2. Add or remove frames or crops from the unlabeled set

    3. Add or remove frames or crops from the labeled dataset

  3. Adjust the Precision/Recall slider in Collection Settings

  4. An input to the classifier changes

    1. Accept at least 10 results and reject at least 21 results

      1. OR accept/reject at least one and click override for the accept/reject classifier thresholds
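The conditions above can be summarized as a single boolean check. This is a sketch of the logic as described in the list, not Aquarium's actual implementation. The CampaignState fields are hypothetical names, and "accept/reject at least one" is read here as at least one reviewed element in total, which is an assumption.

```python
from dataclasses import dataclass

# Classifier thresholds as stated in the list above.
MIN_ACCEPTED = 10
MIN_REJECTED = 21


@dataclass
class CampaignState:
    """Hypothetical snapshot of what changed since the last similarity search."""
    seed_changed: bool = False          # frames/crops added to or removed from the seed
    search_space_changed: bool = False  # unlabeled or labeled dataset changed
    slider_changed: bool = False        # Precision/Recall slider adjusted
    accepted: int = 0                   # results accepted since the last search
    rejected: int = 0                   # results rejected since the last search
    override_thresholds: bool = False   # override clicked for classifier thresholds


def can_recalculate(state):
    """Return True if any condition that makes the recalculate button appear holds."""
    classifier_ready = (
        state.accepted >= MIN_ACCEPTED and state.rejected >= MIN_REJECTED
    )
    # Assumed reading: with override, a single reviewed element is enough.
    override_ready = (
        state.override_thresholds and (state.accepted + state.rejected) >= 1
    )
    return bool(
        state.seed_changed
        or state.search_space_changed
        or state.slider_changed
        or classifier_ready
        or override_ready
    )


print(can_recalculate(CampaignState(accepted=10, rejected=20)))  # one rejection short
print(can_recalculate(CampaignState(accepted=10, rejected=21)))
```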

6. Export Your Collected Data

Once you have finished running your similarity search and refining your data, Aquarium provides two options for exporting your newly created dataset:

  1. Batch export to JSON

  2. Use a webhook to export your data directly to a labeling provider.

We have separate pages in our docs dedicated to exporting data out of Aquarium. These docs show how the export data is formatted and how to set up a webhook with a labeling provider.

To access both options, use the dropdown button in the top right corner of your screen to select the export option you would like to use. Note that if you have not set up a webhook to the labeling provider, the button will be greyed out.

Within the Unsorted, Accepted, and Discarded tabs, you can also select individual elements to export instead of all of the data contained in the tab.

Your download should start within a few seconds. Depending on how much data you are exporting, it can take a little longer to complete.

And congrats! You have successfully located new targeted subsets of your unlabeled data to then label and add into your training set in order to improve model performance!

Have questions about other export formats, or want to discuss a more customized version of the workflow in this guide? Please feel free to reach out to us here.

Viewing Your Results

Page Sizes

Proposed elements found in the unlabeled dataset are presented for review in pages. Each page holds 20 elements by default, but you can increase or decrease the number of elements per page in the Collection Settings for easier reviewing!

Adjusting Precision and Recall

While reviewing your results, you'll see this button:

You will see a slider appear, with High Precision at the left end and High Recall at the right end.

  • Higher precision will reduce the search radius and usually return fewer results, but they'll be more similar to the seed.

  • Higher recall will increase the search radius and usually return more results, but they may be less similar to the seed.

Generally, if you're seeing good results, there is no need to change this setting. If you're not seeing enough results, you can move toward higher recall at the cost of needing to do a bit more review.
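The precision/recall trade-off can be pictured as a distance cutoff around the seed. The sketch below assumes each unlabeled element already has a distance to the seed and that the slider maps to a search radius. Both the function name and the distance values are made up for illustration, and this is not a description of how Aquarium implements the slider internally.

```python
def search_within_radius(distances, radius):
    """Return ids whose distance to the seed is within `radius`, nearest first.

    `distances` maps an element id to its distance from the seed.
    """
    hits = [(element_id, d) for element_id, d in distances.items() if d <= radius]
    return [element_id for element_id, _ in sorted(hits, key=lambda pair: pair[1])]


# Made-up distances for four unlabeled elements.
distances = {"a": 0.1, "b": 0.3, "c": 0.6, "d": 0.9}

# Smaller radius (higher precision): fewer results, all close to the seed.
print(search_within_radius(distances, radius=0.35))
# Larger radius (higher recall): more results, some less similar to the seed.
print(search_within_radius(distances, radius=0.7))
```

Every result returned at the smaller radius is also returned at the larger one, which is why moving toward recall adds review work rather than replacing what you have already seen.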
