Data Curation Workflow (Beta)

Over the last few months, we've built features that let an Aquarium user complete every step of a data curation workflow without writing any code: finding mistakes, searching unlabeled datasets for the best data to label, dispatching data to labeling, and updating the dataset with the new labels.

The following video highlights this end-to-end workflow:

Usage Guide

The Data Curation Workflow is a beta feature that is off by default. To opt in, please reach out to the Aquarium team and we'll enable it for your organization.

Currently, Aquarium supports this workflow only for:

  • Labelbox Integrations. We will be adding support for Scale and in-house labeling integrations. If you have another particular labeling configuration that you'd like support for, please reach out to our team to discuss.

  • Classification and 2D Bounding Box labeling tasks.

  • Issues sourced from a single dataset or inference set.

(1) Upload a Mutable Dataset

First, upload a mutable dataset if you do not already have one (see docs).

You have a mutable dataset available if successful "Streaming Uploads" are listed on your Project Details page.

(2) Add a Labeling Integration

If you don't already have labeling integrations set up for your org (you won't have any if you're trying this feature for the first time!), you will need to add a labeling integration by going to your Organization Settings:

You should see the option to add a new labeling integration:

Once you click it, you should see the following modal:

You will need to enter your provider-specific API key. In the case of Labelbox, you should be able to find or generate your API keys here: https://app.labelbox.com/account/api-keys.

We check whether the integration key appears valid and display an error if it does not.

(3) Create a "Rare Scenario" Issue

The data curation workflow also incorporates a new Create Issue UX. To create an issue that supports collections and labeling, you will need to use the + Create Issue button rather than the + Add to Issue button:

You will then see the following modal:

To be able to use the collection and labeling flows, you will need to select "Rare Scenario". If you simply want functionality similar to "legacy" issues, choose the "Generic" option.

You can then select your desired Labelbox project and dataset as follows:

Note that we do some validation to ensure that:

  • Your Aquarium project's primary_task is either Object Detection (the default) or Classification.

  • The Labelbox ontology schema matches your project type (e.g., if your Aquarium project has primary_task CLASSIFICATION, your corresponding Labelbox project needs to support a classification labeling flow).

  • Your Labelbox labels have the same names as the classmap for your Aquarium project.
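The three checks above can be sketched in a few lines of Python. This is only an illustration of the rules as described, not Aquarium's actual validation code: the `AquariumProject` and `LabelboxProject` classes, the `REQUIRED_FLOW` mapping, the flow names, and `check_pairing` are all hypothetical, and the sketch assumes class names must match exactly (case-sensitive, order ignored).

```python
from dataclasses import dataclass
from typing import List

# Hypothetical mapping from an Aquarium primary_task to the labeling flow
# the paired Labelbox project is assumed to need. Names are illustrative.
REQUIRED_FLOW = {
    "OBJECT_DETECTION": "bounding_box",
    "CLASSIFICATION": "classification",
}


@dataclass
class AquariumProject:
    """Simplified stand-in for an Aquarium project (hypothetical)."""
    primary_task: str
    classmap: List[str]


@dataclass
class LabelboxProject:
    """Simplified stand-in for a Labelbox project and its ontology (hypothetical)."""
    labeling_flow: str
    label_names: List[str]


def check_pairing(aq: AquariumProject, lb: LabelboxProject) -> List[str]:
    """Return a list of human-readable problems; an empty list means the pair passes."""
    problems = []

    # Checks 1 and 2: the task must be supported, and the Labelbox
    # labeling flow must match it.
    required = REQUIRED_FLOW.get(aq.primary_task)
    if required is None:
        problems.append(f"unsupported primary_task: {aq.primary_task}")
    elif lb.labeling_flow != required:
        problems.append(
            f"Labelbox flow '{lb.labeling_flow}' does not match "
            f"primary_task {aq.primary_task} (needs '{required}')"
        )

    # Check 3: label names must equal the classmap (assumed exact match).
    aq_names, lb_names = set(aq.classmap), set(lb.label_names)
    if aq_names != lb_names:
        only_aq = sorted(aq_names - lb_names)
        only_lb = sorted(lb_names - aq_names)
        problems.append(
            f"class names differ (Aquarium only: {only_aq}; Labelbox only: {only_lb})"
        )

    return problems
```

Running a comparison like this against your own class list before creating the issue can help you spot a naming mismatch (for example, "Car" versus "car") before the modal reports it.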

Your Issue Detail View will also look slightly different:

NOTE: To add to an existing issue, you will follow the same flow as before (via the + Add to Issue button).

(4) Grow Your Issue (Find Similar Dataset Elements)

Sometimes you may only have a few examples of your rare scenario, and you may want to find more to provide a good "seed" for collection campaigns.

You can use Find Similar Dataset Elements to find and add more elements from your existing dataset:

(5) Run a Collection (Python Client or Unlabeled Indexed)

We will have opted your team into the Unlabeled Indexed Collections feature (uploading unlabeled datasets + running collections via the UI) as a result of including you in the Data Curation workflow. If you prefer to run your collections via the Python client, let us know and we can adjust your org settings accordingly.

To run an Unlabeled Indexed Collection via the UI, you will first need to upload an unlabeled indexed dataset. (See docs)

If you have an object detection task, we recommend that you upload unlabeled frames with bounding boxes that correspond to your model's proposals.

You can see your unlabeled datasets in your Project Details page:

Then, in the Collected tab of your Issue Detail View, you can (1) select an unlabeled dataset to search through from the dropdown and (2) export your desired results to labeling:

(6) Export to Labeling and Review Results

Once you've exported new collection frames to Labelbox, the frames that are pending labeling will be visible in the Labeling tab:

Aquarium monitors the status of these frames and will update them once a labeler has completed them.

NOTE: The web app does not update live. Click the refresh icon in the upper right corner to get status updates.

(7) Add New Frames to Dataset

You can view completed, labeled frames in the Done tab. Select the frames you want to add back into your original dataset, and click the Add All Frames to Dataset button.

After a refresh, you will see a "loading icon" in the lower right corner of each frame, indicating that it is being processed:

Frames that have been successfully added will be marked with a check icon: