Collection Campaigns

Submit the most relevant data for labeling

If you have a large corpus of new, unlabeled data, Aquarium's Collection Campaign feature helps you quickly collect the subset you actually want to use, without the need for someone to manually review the full corpus.

Based on a set of difficult edge cases identified in a pre-existing Issue, you can find more examples similar to these. You can then send these examples to a labeling provider, use them to retrain your model, and get the most model improvement for the least labeling cost!

User Guide

This guide runs through the complete flow of setting up a Collection Campaign and collecting new, unlabeled data similar to the examples you've previously identified in an Issue.

The flow will be something like this:

Collection Campaign Flow

Requirements

In order to successfully create a Collection Campaign, the following requirements must be met:

  • This feature will only work on Issues where all of the contained elements come from Datasets and Inference Sets uploaded on or after January 11th, 2021.

  • All elements within the Issue must be from the same Dataset or Inference Set.

    • NOTE: This means that an Issue can't have an element from a dataset and an element from the dataset's corresponding inference set. Those count as distinct sets.

  • Embeddings must be generated for the data corpus being searched through.

    • Whatever model you use to generate the new corpus's embeddings must be the same model you used to generate the Issue elements' embeddings (see the sketch after this list).

  • The data corpus must have its data accessible to Aquarium (URLs, GCS paths, etc.), much like how it currently is for uploaded Datasets and Inference Sets.
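To make the same-model requirement concrete, here's a minimal, hypothetical sketch. load_model and my_embedding_model are illustrative stand-ins for your own embedding pipeline, not Aquarium APIs:

# Hypothetical sketch: reuse the SAME embedding model that produced
# the embeddings for your original Dataset / Inference Set uploads.
# `load_model` is a stand-in for your own model-loading code.
my_embedding_model = load_model("same_model_used_for_original_uploads")

for item in my_corpus_of_data:
    item.frame_embedding = my_embedding_model.embed(item.image)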

1. Start a Collection Campaign (Web App)

In order to start a Collection Campaign, first navigate to the Issues tab in the web app.

Navigate to an Issue that contains the sort of data that you want more examples of. (If you haven't done this yet, create such an Issue.)

For collection campaigns, a well-curated set of issue elements will help you achieve better sampling results.

Ordinarily, this can be a lengthy and tedious manual process, but you can use the process described in Finding Similar Elements Within a Dataset to speed that up.

When you go to that Issue's page, you should see a Collection Campaign box in the right panel. Click the Start Campaign button within that box, as follows:

Once a collection campaign is created for an issue, you should see a new Collection Samples tab pop up. There won't be any samples displayed, because the Python collection client hasn't run yet.

In the sidebar, you'll also be able to see additional info such as its version and status:

Deactivating and Reactivating Campaigns

If you no longer want the Python client (described below) to collect new samples for that particular Issue, you can click Deactivate Campaign.

The examples previously collected by a deactivated campaign will still be visible, and this campaign can easily be reactivated at any time by clicking Reactivate Campaign.

Set a Sampling Threshold (Optional)

The sampling threshold (default 0.5) allows you to control how "strict" you want to be for a given campaign. During sampling, a similarity score is calculated for each unlabeled dataframe, which determines whether it qualifies for upload. A lower threshold will result in more samples, but also more false positives. You can tune it according to your labeling needs.
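Conceptually, the threshold is just a cutoff on per-frame similarity scores. Here's a rough sketch of the idea; the real scoring happens inside the Python collection client, not in your code:

# Conceptual sketch only -- not the client's actual implementation.
DEFAULT_SAMPLING_THRESHOLD = 0.5

def qualifies_for_upload(similarity_score, threshold=DEFAULT_SAMPLING_THRESHOLD):
    # A frame is collected when its similarity to the campaign's
    # issue elements meets or exceeds the threshold.
    return similarity_score >= threshold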

Collection Campaigns work as a point-in-time snapshot of all items inside an issue. The Python collection client will search for examples most similar to what was in that snapshot.

If you activate a Collection Campaign and subsequently modify the Issue it is based on (e.g. adding or removing elements), these changes won't be automatically picked up by the Python collection client.

However, a Commit New Campaign Version button will appear. To register any changes, you'll need to click this button:
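After committing a new campaign version, it's a good idea to re-run the sync step in your collection script (described in the next section) so the client picks up the new snapshot:

# Re-sync so the local client sees the newly committed campaign version
al_client.sync_state()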


2. Collect and Upload Your Data (Python Client)

In this section, we'll cover the setup and API calls necessary to scan your local data corpus and upload the examples that are most similar to the ones in your active Collection Campaign.

You'll be using Aquarium's Python client to write a script, similar to how you've used it in the past to upload your data.

Initialize Your Collection Client

First, you'll need to initialize a new collection client, much in the way you would initialize an aquariumlearning client when uploading data.

import aquariumlearning as al
al_client = al.CollectionClient() # Note: This is the new client to use
al_client.set_credentials(api_key="YOUR API KEY HERE")

Fetch the Latest Collection Campaign Info

One of the first commands to run is syncing state. The following command downloads information to the local client that represents all active Collection Campaigns.

al_client.sync_state()

NOTE: As the number of items in your Collection Campaigns increases, the amount of data downloaded also increases.

Please ensure that there is sufficient disk space to support your Collection Campaigns.

Optionally, you can constrain sampling to collection campaigns from a specific list of projects:

project_names = ["some_project_1", "some_project_2"]
al_client.sync_state(target_project_names=project_names)

Alternately, if you want to specify individual issues, you can do so as follows:

issue_uuids = ["cf8c92f5-e720-47fd-bf8e-ed5b07d47372", "5e8cb31c-9b3e-4a97-89b8-3428543a9778"]
al_client.sync_state(target_issue_uuids=issue_uuids)

Preprocess Your Data Corpus

Now, you'll need to turn the data corpus you are scanning through into a construct that the client can understand.

Luckily, the client already has a LabeledFrame data type to handle this. You can construct Labeled Frames much like you already do when you upload a dataset.

Unlike before, you'll add them directly to a list, rather than a Labeled Dataset:

corpus_of_data_frames = []
for item in my_corpus_of_data:
    # Create a Frame
    frame = al.LabeledFrame(frame_id=item.frame_id, date_captured=item.date_captured)
    # Add relevant metadata
    frame.add_user_metadata("location", item.location)
    frame.add_user_metadata("vehicle", item.vehicle)
    # Add the actual image url
    frame.add_image(
        sensor_id=item.sensor_id, image_url=item.image_url, date_captured=item.date_captured
    )
    # Add relevant embeddings
    frame.add_frame_embedding(embedding=item.frame_embedding)
    # Add the frame to the list of frames
    corpus_of_data_frames.append(frame)

Labels (optional)

Since all the frames in your corpus are Labeled Frames, labels can also be added to each one if needed. Labels added to a frame must use the same task type as the dataset you are running the collection campaign on (e.g. use add_label_2d_classification if your dataset is a 2D classification task). Added frame labels will be visible in the collection campaign results.

NOTE: Adding confidence values to your labels is not currently supported in collection campaign results.

Here are some common label types, their expected formats, and how to work with them in Aquarium:

Classification

# Standard 2D case
frame.add_label_2d_classification(
    # The sensor id of the image this label corresponds to
    sensor_id='some_camera',
    # A unique id across all other labels in this dataset
    label_id='unique_id_for_this_label',
    classification='dog'
)

# 3D classification
frame.add_label_3d_classification(
    # A unique id across all other labels in this dataset
    label_id='unique_id_for_this_label',
    classification='dog',
    # Optional, defaults to implicit WORLD coordinate frame
    coord_frame_id='robot_ego_frame',
)


2D Bounding Box


frame.add_label_2d_bbox(
    # The sensor id of the image this label corresponds to
    sensor_id='some_camera',
    # A unique id across all other labels in this dataset
    label_id='unique_id_for_this_label',
    classification='dog',
    # Coordinates are in absolute pixel space
    top=200,
    left=300,
    width=250,
    height=150
)


3D Cuboid

Aquarium supports 3D cuboid labels, with 6-DOF position and orientation.

frame.add_label_3d_cuboid(
    label_id="unique_id_for_this_label",
    classification="car",
    # XYZ dimensions of this cuboid
    dimensions=[1.0, 0.5, 0.5],
    # XYZ position of the center of this object
    position=[2.0, 2.0, 1.0],
    # An XYZW ordered object rotation quaternion
    rotation=[0.0, 0.0, 0.0, 1.0],
    # Optional: If your cuboid is relative to a specific
    # coordinate frame, you can reference it by name here.
    coord_frame_id="robot_ego_frame"
)


2D Semseg

2D Semantic Segmentation labels are represented by an image mask, where each pixel is assigned an integer value in the range of [0,255]. For efficient representation across both servers and browsers, Aquarium expects label masks to be encoded as grey-scale PNGs of the same dimension as the underlying image.

If you have your label masks in the form of a numpy ndarray, we recommend using the Pillow Python library to convert it into a PNG:

! pip3 install pillow

from PIL import Image
...

# 2D array, where each value is [0,255] corresponding to a class_id
# in the project's label_class_map.
int_arr = your_2d_ndarray.astype('uint8')

Image.fromarray(int_arr).save(f"{imagename}.png")

Because this will be loaded dynamically by the web-app for visualization, this image mask will need to be hosted somewhere. To upload it as an asset to Aquarium, you can use the following utility:

mask_url = al_client.upload_asset_from_filepath(project_id, dataset_id, filepath)

This utility hosts and stores a copy of the label mask (not the underlying RGB image) with Aquarium. If you would like your label masks to remain outside of Aquarium, chat with us and we'll help figure out a good setup.

Now, we add the label to the frame like any other label type:

frame.add_label_2d_semseg(
    # The sensor id of the image this label corresponds to
    sensor_id='some_camera',
    # A unique id across all other labels in this dataset
    label_id='unique_id_for_this_label',
    # Expected to be a PNG, with values in [0,255] that correspond
    # to the class_id of classes in the label_class_map
    mask_url='url_to_greyscale_png'
)
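Putting those pieces together, here's a minimal end-to-end sketch of the mask flow, reusing names from the snippets above (your_2d_ndarray, imagename, project_id, dataset_id, and frame are assumed to be defined):

from PIL import Image

# Convert the ndarray mask to a greyscale PNG
int_arr = your_2d_ndarray.astype('uint8')
Image.fromarray(int_arr).save(f"{imagename}.png")

# Host the mask with Aquarium, then attach it to the frame
mask_url = al_client.upload_asset_from_filepath(project_id, dataset_id, f"{imagename}.png")
frame.add_label_2d_semseg(
    sensor_id='some_camera',
    label_id='unique_id_for_this_label',
    mask_url=mask_url
)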


2D Polygon Lists

Aquarium represents instance segmentation labels as 2D Polygon Lists. Each label is represented by one or more polygons, which do not need to be connected.

frame.add_label_2d_polygon_list(
    # The sensor id of the image this label corresponds to
    sensor_id='some_camera',
    # A unique id across all other labels in this dataset
    label_id='unique_id_for_this_label',
    classification='dog',
    # All coordinates are in absolute pixel space
    #
    # These are polygon vertices, not a line string. This means
    # that no vertices are duplicated in the lists.
    polygons=[
        {'vertices': [(x1, y1), (x2, y2), ...]},
        {'vertices': [(x1, y1), (x2, y2), ...]}
    ],
    # Optional: indicate the center position of the object
    center=[center_x, center_y]
)


Assign Similarity Scores to Each Data Corpus Frame

Now that you've transformed your data corpus into a list of Labeled Frames to scan, you'll call two simple API endpoints.

The first API call iterates through each frame in your list and assigns a similarity score between that frame and each of the active Collection Campaigns. This call does not upload any data:

# Can be called any number of times
al_client.sample_probabilities(corpus_of_data_frames)
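Since sample_probabilities can be called any number of times, one reasonable pattern for a large corpus is to score it in batches rather than all at once (the batch size below is an illustrative choice, not a client requirement):

# Score a large corpus in manageable batches
BATCH_SIZE = 1000
for start in range(0, len(corpus_of_data_frames), BATCH_SIZE):
    batch = corpus_of_data_frames[start:start + BATCH_SIZE]
    al_client.sample_probabilities(batch)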

Filter and Upload Relevant Examples

The second API call will filter the frames based on an internally calibrated threshold. This threshold is determined as follows:

  • If the override_sampling_threshold parameter is specified in the save_for_collection call, this threshold is used for all of the collection campaigns from the earlier sync_state call.

  • Otherwise, if a campaign's sampling threshold was specifically configured in the web app, this is the threshold used for that campaign.

  • If no override or campaign-specific threshold was set, a default of 0.5 is used.

The frames that meet this threshold are the most similar examples, and will be uploaded back to Aquarium for analysis:

# Will upload all frames passing the threshold that had
# sample_probabilities called on them since the client was
# initialized
al_client.save_for_collection()

# Alternately, you can specify an override threshold.
al_client.save_for_collection(override_sampling_threshold=0.7)

3. View your Collection Campaign (Web App)

Now you can view the collected samples in the web app!

To do so, simply navigate back to the Issue that contains the active Collection Campaign (or refresh the page if you already have it open). New data should've appeared, assuming that your data corpus had examples that passed the similarity threshold.

You can sort the samples according to similarity score or campaign version.

Understanding Why Samples were Selected

If you are using the most recent version of the client, you can now view the cluster of issue elements that a sample was closest to (which may help build intuition on why a sample was selected).

Simply click on the question mark displayed next to a particular sample's campaign version info:

Viewing Collection Rate

Note: Collection rate is not displayed for older collection campaigns, because some of the info required to calculate it was not recorded at the time.

In the sidebar, you can see the collection rate of your campaign:

This reports the number of samples uploaded, out of the number of dataframes actually processed by the Python collection client.

Note: Although uploaded samples are deduped by task_id, the collection client does not dedupe when tracking the number of frames that have been looked at.

Consequently, if you run the collection client over the same (or overlapping) set of unlabeled dataframes, your reported collection rate will be lower than it actually is. For example, running the client twice over the same 1,000 frames with 50 unique samples uploaded yields a reported rate of 2.5% (50/2,000) rather than the true 5% (50/1,000).

Discarding Bad Samples

To remove samples that don't match what you are looking for, you can select and discard them:

Exporting Samples for Labeling

To export the collected frames, you can click the blue Download button, much like you already do when exporting Issue Elements from Aquarium today.

You can then send these to a labeling provider and use the results to retrain your model. This latest dataset iteration (and corresponding inference set) can be uploaded to Aquarium via the standard data ingestion flow, and you can continue repeating this process to improve your model performance!

Yay positive feedback loops