If you have a large corpus of new, unlabeled data, Aquarium's Collection Campaign feature helps you quickly collect the subset you actually want to use—without the need for someone to manually review the full corpus.
Based on a set of difficult edge cases identified in a pre-existing Issue, you can find more examples similar to these. You can then send these examples to a labeling provider, use them to retrain your model, and get the most model improvement for the least labeling cost!
This feature is currently in early preview (pre-alpha). Please reach out to us and let us know if you'd like to give it a try!
This guide runs through the complete flow of setting up a Collection Campaign and collecting new, unlabeled data similar to those you've previously identified in an Issue.
The flow will be something like this:
In order to successfully create a Collection Campaign, the following requirements must be met:
This feature will only work on Issues where all of the contained elements come from Datasets and Inference Sets uploaded on or after January 11th, 2021.
All elements within the Issue must be from the same Dataset or Inference Set.
NOTE: This means that an Issue can't have an element from a dataset and an element from the dataset's corresponding inference set. Those count as distinct sets.
Embeddings must be generated for the data corpus being searched through.
Whatever model you use to generate the new corpus's embeddings must be the same model you used to generate the Issue elements' embeddings.
The data corpus must have its data accessible to Aquarium (URLs, GCS paths, etc.), much like how it currently is for uploaded Datasets and Inference Sets.
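Because similarity search compares embeddings directly, embeddings produced by different models are not comparable. One cheap sanity check you can run before scanning is verifying that all embeddings share a single dimensionality — a necessary (though not sufficient) condition for having come from the same model. This helper is illustrative and not part of the Aquarium client:

```python
def check_embedding_compatibility(issue_embeddings, corpus_embeddings):
    """Verify that corpus embeddings can be compared against issue embeddings.

    A matching dimensionality is a necessary (though not sufficient)
    condition for the embeddings to have come from the same model.
    Returns the shared embedding dimension.
    """
    dims = {len(e) for e in issue_embeddings} | {len(e) for e in corpus_embeddings}
    if len(dims) != 1:
        raise ValueError(f"Mismatched embedding dimensions: {sorted(dims)}")
    return dims.pop()

# Example: a 4-dimensional embedding space
issue_embs = [[0.1, 0.2, 0.3, 0.4]]
corpus_embs = [[0.0, 0.1, 0.0, 0.9], [0.5, 0.5, 0.5, 0.5]]
print(check_embedding_compatibility(issue_embs, corpus_embs))  # → 4
```

If this check fails, regenerate the corpus embeddings with the same model (and same layer) used for the Issue elements before proceeding.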
In order to start a Collection Campaign, first navigate to the Issues tab in the web app.
Then hard-refresh the page (Ctrl-Shift-R); this should enable the Collection Campaign interface for your current browser session.
Now, navigate to an Issue that contains the sort of data that you want more examples of. (If you haven't done this yet, create such an Issue).
When you go to that Issue's page, you should see a Collection Campaign box in the right panel. Click the Start Campaign button within that box:
Once a campaign is started, you'll be able to see additional info such as its version and status:
If you no longer want the Python client (described below) to collect new samples for that particular Issue, you can click Deactivate Campaign.
The examples previously collected by a deactivated campaign will still be visible, and this campaign can easily be reactivated at any time by clicking Reactivate Campaign.
In this section, we'll cover the setup and API calls necessary to scan your local data corpus and upload the examples that are most similar to the ones in your active Collection Campaign.
You'll be using Aquarium's Python client to write a script, similar to how you've used it in the past to upload your data.
If you've reached out to Aquarium and opted into the Collection Campaign preview, you should have received a direct sdist (source distribution) of the aquariumlearning Python package. It can be installed as follows:
$ pip install aquariumlearning-0.0.19a1.tar.gz
First, you'll need to initialize a new collection client, much as you would initialize an aquariumlearning client when uploading data:
```python
import aquariumlearning as al

# Note: This is the new client to use
al_client = al.CollectionClient()
al_client.set_credentials(api_key="YOUR API KEY HERE")
```
One of the first commands to run is syncing state, which downloads information about all active Collection Campaigns to the local client.
Now, you'll need to turn the data corpus you are scanning through into a construct the client can understand.
Luckily, the client already has a Labeled Frames data type to handle this. You can construct Labeled Frames much like you already do when you upload a dataset.
Unlike before, you'll add them directly to a list, rather than a Labeled Dataset:
```python
corpus_of_data_frames = []

for item in my_corpus_of_data:
    # Create a Frame
    frame = al.LabeledFrame(frame_id=item.frame_id, date_captured=item.date_captured)

    # Add relevant metadata
    frame.add_user_metadata("location", item.location)
    frame.add_user_metadata("vehicle", item.vehicle)

    # Add the actual image url
    frame.add_image(
        sensor_id=item.sensor_id,
        image_url=item.image_url,
        date_captured=item.date_captured,
    )

    # Add relevant embeddings
    frame.add_frame_embedding(embedding=item.frame_embedding)

    # Add the frame to the list of frames
    corpus_of_data_frames.append(frame)
```
Now that you've transformed your data corpus into a list of Labeled Frames to scan, you'll call two simple API endpoints.
The first API call iterates through each frame in your list, and assigns a similarity score between this frame and each of the active Collection Campaigns. This call does not upload any data:
```python
# Can be called any number of times
al_client.sample_probabilities(corpus_of_data_frames)
```
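Conceptually, you can think of each score as a measure of how close a frame's embedding sits to the campaign's issue-element embeddings. A toy illustration of that idea, scoring frames by cosine similarity against the mean embedding of the campaign — this is a sketch of the concept, not Aquarium's actual scoring implementation, and the function names are hypothetical:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def score_frames(frame_embeddings, campaign_embeddings):
    """Score each frame against the centroid of the campaign's examples."""
    dim = len(campaign_embeddings[0])
    centroid = [
        sum(e[i] for e in campaign_embeddings) / len(campaign_embeddings)
        for i in range(dim)
    ]
    return [cosine_similarity(e, centroid) for e in frame_embeddings]

campaign = [[1.0, 0.0], [0.9, 0.1]]   # embeddings of the Issue's elements
frames = [[0.95, 0.05], [0.0, 1.0]]   # embeddings of new corpus frames
scores = score_frames(frames, campaign)
# The first frame scores far higher than the second: it is much more
# similar to the campaign's examples.
```

Frames with high scores are the candidates worth collecting; the next call handles the actual thresholding and upload.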
Filter and Upload Relevant Examples
The second API call will filter the frames based on an internally calibrated threshold. The frames that meet this threshold are the most similar examples, and will be uploaded back to Aquarium for analysis:
```python
# Uploads all frames passing the threshold that have had sample_probabilities
# called on them since the client was initialized
al_client.save_for_collection()
```
Now you can view the collected samples in the web app!
To do so, simply navigate back to the Issue that contains the active Collection Campaign (or refresh the page if you already have it open). New data should've appeared, assuming that your data corpus had examples that passed the similarity threshold.
To export the collected frames, you can click the blue Download button, much like you already do when exporting Issue Elements from Aquarium today.
You can then send these to a labeling provider and use the results to retrain your model. This latest dataset iteration (and corresponding inference set) can be uploaded to Aquarium via the standard data ingestion flow, and you can continue repeating this process to improve your model performance!