Collection Campaign

Take a look at Aquarium's Collection Campaign Preview

If you have a large corpus of new, unlabeled data, Aquarium's Collection Campaign feature helps you quickly collect the subset you actually want to use—without the need for someone to manually review the full corpus.

Starting from a set of difficult edge cases identified in a pre-existing Issue, you can find more examples like them. You can then send these examples to a labeling provider, use them to retrain your model, and get the most model improvement for the least labeling cost!

This feature is currently in early preview (pre-alpha). Please reach out to us and let us know if you'd like to give it a try!

User Guide (Pre-Alpha)

This feature is currently still in pre-alpha 🧪, with active improvements in the pipeline. As a result, you may encounter some rough edges when using it.

Please direct any feedback or issues back to Aquarium staff using your most convenient contact method. Thanks for trying this out!

This guide runs through the complete flow of setting up a Collection Campaign and collecting new, unlabeled data similar to what you've previously identified in an Issue.

The flow will be something like this:

Collection Campaign Flow

Requirements

In order to successfully create a Collection Campaign, the following requirements must be met:

  • This feature will only work on Issues where all of the contained elements come from Datasets and Inference Sets uploaded on or after January 11th, 2021.

  • All elements within the Issue must be from the same Dataset or Inference Set.

    • NOTE: This means that an Issue can't have an element from a dataset and an element from the dataset's corresponding inference set. Those count as distinct sets.

  • Embeddings must be generated for the data corpus being searched through.

    • Whatever model you use to generate the new corpus's embeddings must be the same model you used to generate the Issue elements' embeddings (see the sketch after this list).

  • The data corpus must have its data accessible to Aquarium (URLs, GCS paths, etc.), much like how it currently is for uploaded Datasets and Inference Sets.
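To make the embedding requirement concrete, here's a hypothetical sketch. load_my_embedding_model and .embed() stand in for whatever embedding pipeline you already use; they are not Aquarium APIs:

# Hypothetical names: the point is to reuse the SAME checkpoint that produced
# the embeddings for the Dataset / Inference Set behind your Issue.
embedding_model = load_my_embedding_model("checkpoint_used_for_dataset_upload.pt")

for item in my_corpus_of_data:
    item.frame_embedding = embedding_model.embed(item.image)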

1. Start a Collection Campaign (Web App)

In order to start a Collection Campaign, first navigate to the Issues tab in the web app.

Since this feature is still in pre-alpha, you'll need to flip a feature flag to enable Collection Campaigns. To do this, open your browser's JavaScript console and enter the following command:

window.localStorage.setItem("aquarium_feature_flag_collection_campaigns", true);

After hard-refreshing (Ctrl-Shift-R), this should enable the Collection Campaign interface for your current browser session.

Now, navigate to an Issue that contains the sort of data that you want more examples of. (If you haven't done this yet, create such an Issue).

When you go to that Issue's page, you should see a Collection Campaign box in the right panel. Click the Start Campaign button within that box, as follows:

Once a campaign is started, you'll be able to see additional info such as its version and status:

If you no longer want the Python client (described below) to collect new samples for that particular Issue, you can click Deactivate Campaign.

The examples previously collected by a deactivated campaign will still be visible, and this campaign can easily be reactivated at any time by clicking Reactivate Campaign.

A Collection Campaign works as a point-in-time snapshot of all elements inside an Issue. The Python collection client will search for examples most similar to what was in that snapshot.

If you activate a Collection Campaign and subsequently modify the Issue it is based on (e.g. adding or removing elements), these changes won't be automatically picked up by the Python collection client.

However, a Commit New Campaign Version button will appear. To register any changes, you'll need to click this button:
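The Python client won't see the new version until its state is synced again. Once you've committed, re-run the state sync described in the next section so the client scores against the updated snapshot:

# Re-download campaign state so the client uses the new campaign version
al_client.sync_state()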

2. Collect and Upload Your Data (Python Client)

In this section, we'll cover the setup and API calls needed to scan your local data corpus and upload the examples that are most similar to the ones in your active Collection Campaigns.

You'll be using Aquarium's Python client to write a script, similar to how you've used it in the past to upload your data.

Setup

If you've reached out to Aquarium and opted into the Collection Campaign preview, you should've received the aquariumlearning Python package directly as an sdist. It can be installed via the following:

$ pip install aquariumlearning-0.0.19a1.tar.gz

Initialize Your Collection Client

First, you'll need to initialize a new collection client, much in the way you would initialize an aquariumlearning client when uploading data.

import aquariumlearning as al
al_client = al.CollectionClient() # Note: This is the new client to use
al_client.set_credentials(api_key="YOUR API KEY HERE")

Fetch the Latest Collection Campaign Info

One of the first commands to run is syncing state. The following command downloads information to the local client that represents all active Collection Campaigns.

al_client.sync_state()

NOTE: As the number of items in your Collection Campaigns increases, the amount of data downloaded also increases.

Please ensure that there is sufficient disk space to support your Collection Campaigns.
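As a rough aid (this uses Python's standard library, not the Aquarium client), you can check available disk space before syncing:

import shutil

# Check free disk space in the sync destination before downloading state;
# the amount downloaded grows with the number of items in your campaigns.
free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free disk space: {free_gb:.1f} GB")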

Preprocess Your Data Corpus

Now, you'll need to turn the data corpus you are scanning through into a construct the client can understand.

Luckily, the client already has a Labeled Frames data type to handle this. You can construct Labeled Frames much like you already do when you upload a dataset.

Unlike before, you'll add them directly to a list, rather than a Labeled Dataset:

corpus_of_data_frames = []
for item in my_corpus_of_data:
    # Create a Frame
    frame = al.LabeledFrame(frame_id=item.frame_id, date_captured=item.date_captured)
    # Add relevant metadata
    frame.add_user_metadata("location", item.location)
    frame.add_user_metadata("vehicle", item.vehicle)
    # Add the actual image url
    frame.add_image(
        sensor_id=item.sensor_id, image_url=item.image_url, date_captured=item.date_captured
    )
    # Add relevant embeddings
    frame.add_frame_embedding(embedding=item.frame_embedding)
    # Add the frame to the list of frames
    corpus_of_data_frames.append(frame)

Assign Similarity Scores to Each Data Corpus Frame

Now that you've transformed your data corpus into a list of Labeled Frames to scan, you'll make two simple API calls.

The first API call iterates through each frame in your list and assigns a similarity score between that frame and each of the active Collection Campaigns. This call does not upload any data:

# Can be called any number of times
al_client.sample_probabilities(corpus_of_data_frames)
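Since this call can be invoked any number of times, one way to handle a corpus too large to score at once is to process it in batches. A minimal sketch, where BATCH_SIZE is just an illustrative value:

BATCH_SIZE = 1000  # illustrative; tune to your memory budget

# Score the corpus in chunks. save_for_collection() (below) will upload every
# frame that passed the threshold across all of these calls.
for start in range(0, len(corpus_of_data_frames), BATCH_SIZE):
    batch = corpus_of_data_frames[start:start + BATCH_SIZE]
    al_client.sample_probabilities(batch)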

Filter and Upload Relevant Examples

The second API call will filter the frames based on an internally calibrated threshold. The frames that meet this threshold are the most similar examples, and will be uploaded back to Aquarium for analysis:

# Will upload all frames passing the threshold that had sample_probabilities
# called on them since the client was initialized
al_client.save_for_collection()

3. View your Collection Campaign (Web App)

Now you can view the collected samples in the web app!

To do so, simply navigate back to the Issue that contains the active Collection Campaign (or refresh the page if you already have it open). New data should've appeared, assuming that your data corpus had examples that passed the similarity threshold.

NOTE: Currently, we display the collected samples from all versions of a given Collection Campaign (not just the current one).

To export the collected frames, you can click the blue Download button, much like you already do when exporting Issue Elements from Aquarium today.

You can then send these to a labeling provider and use the results to retrain your model. This latest dataset iteration (and corresponding inference set) can be uploaded to Aquarium via the standard data ingestion flow, and you can continue repeating this process to improve your model performance!
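For example, the re-upload might look something like this sketch of the standard upload flow. The project/dataset names and the newly_labeled_items iterable are placeholders; mirror whatever your existing upload scripts do:

import aquariumlearning as al

# Standard (non-collection) client for the regular data ingestion flow
al_client = al.Client()
al_client.set_credentials(api_key="YOUR API KEY HERE")

dataset = al.LabeledDataset()
for item in newly_labeled_items:  # placeholder: your freshly labeled frames
    frame = al.LabeledFrame(frame_id=item.frame_id, date_captured=item.date_captured)
    frame.add_image(
        sensor_id=item.sensor_id, image_url=item.image_url, date_captured=item.date_captured
    )
    frame.add_frame_embedding(embedding=item.frame_embedding)
    # Attach the new labels here with the same add_label_* calls you
    # already use in your upload scripts
    dataset.add_frame(frame)

al_client.create_dataset("your_project", "dataset_v2", dataset)  # placeholder names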

Yay positive feedback loops