Mutable Datasets (Beta)

Previously, all datasets in Aquarium were immutable -- any change or addition to a dataset would require re-uploading it under a new name. We're now starting to roll out preview access to our new Mutable Datasets.

We've provided a button in the app to make this switchover easier! By clicking this button, you can clone an existing dataset as a new mutable dataset. See below for more details.

Key Features

Fully Versioned w/ Edit History

As a dataset grows and changes, we maintain a versioned history of every previous version. This has many benefits, including:

  • Reproducible experiment results. If an experiment produced inferences that were evaluated against version X of the dataset, it can be evaluated and explored against that version, even if the dataset continues to be updated in the background.

  • Time-travel / rollbacks. Do you want to know what the dataset looked like before a major relabeling effort? Did that effort introduce problems that you want to undo? Load up a previous version at any time!

  • Edit histories / Audit logs. Each entry is versioned by its ID, so you can always look up the full history for a given image and see each modification made to its labels.

Streaming Inserts + Partial Success

Traditional Aquarium datasets required batching the full dataset into a single upload operation, which would either fully succeed or fully fail after all entries were analyzed.

Mutable datasets also allow you to upload data in a streaming format -- for example, one at a time as you receive images back from a labeling provider. If one batch of updates encounters an error, only those will fail, and the rest of the dataset will be processed and available for users.
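To make the partial-success behavior concrete, here's a minimal, self-contained sketch (plain Python, not the Aquarium client -- the function and batch names are hypothetical) of how independently processed batches succeed or fail on their own:

```python
# Hypothetical sketch of partial-success semantics: each batch is validated
# and processed independently, so one bad batch doesn't block the rest.
def process_batches(batches, validate):
    succeeded, failed = [], []
    for batch_id, frames in batches:
        try:
            for frame in frames:
                validate(frame)
            succeeded.append(batch_id)   # this batch becomes available to users
        except ValueError:
            failed.append(batch_id)      # only this batch fails
    return succeeded, failed

def validate(frame):
    if not frame['frame_id']:
        raise ValueError('frame_id must be non-empty')

batches = [
    ('upload_0', [{'frame_id': 'cat_1'}, {'frame_id': 'cat_2'}]),
    ('upload_1', [{'frame_id': ''}]),            # invalid: empty frame_id
    ('upload_2', [{'frame_id': 'dog_1'}]),
]

succeeded, failed = process_batches(batches, validate)
# succeeded == ['upload_0', 'upload_2']; failed == ['upload_1']
```

The key point is that `upload_1` failing does not prevent `upload_0` and `upload_2` from being processed and served.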

Usage Guide

Mutable Datasets is a beta feature that is off by default. To opt in, please reach out to the Aquarium team and we'll enable it for your organization.

Python Client Upload Process

The upload process for mutable datasets is very similar to the non-mutable process, with some minor changes.

Note that legacy, non-mutable datasets/inference sets will not support updates. Mutability is only supported for newly created datasets/inference sets that meet the following conditions:

  1. The dataset (or both the inference set and its base dataset) must be created with the pipeline_mode="STREAMING" flag set.

  2. All subsequent updates to the dataset/inference set must also have the pipeline_mode="STREAMING" flag set.

(This is demonstrated in the subsequent sample code).

(1) Creating a Streaming Dataset

Just as before, you'll need to build a LabeledDataset and add the frames that you want in your dataset. It looks something like the following:

with open('./labels.json') as f:
    label_entries = json.load(f)

dataset = al.LabeledDataset(pipeline_mode="STREAMING")
for entry in label_entries:
    # Create a frame object, using the filename as an id
    frame_id = entry['file_name'].split('.jpg')[0]
    frame = al.LabeledFrame(frame_id=frame_id)

    # Add arbitrary metadata, such as the train vs test split
    frame.add_user_metadata('split_name', entry['split_name'])
    # You can also add arbitrary metadata as a list, such as a tag list
    frame.add_user_metadata_list('data_tags', ['pets', entry['species_name']])

    # Add an image to the frame
    image_url = "https://storage.googleapis.com/aquarium-public/quickstart/pets/imgs/" + entry['file_name']
    frame.add_image(sensor_id='cam', image_url=image_url)

    # Add the ground truth classification label to the frame
    label_id = frame_id + '_gt'
    frame.add_label_2d_classification(
        sensor_id='cam',
        label_id=label_id,
        classification=entry['class_name']
    )

    # Add the frame to the dataset collection
    dataset.add_frame(frame)

Once you have built your LabeledDataset, you can upload it using the new method create_or_update_dataset(...). This same function can be used either to (1) create a new dataset or (2) update an existing one, and it is where you specify pipeline_mode="STREAMING".

(NOTE: The legacy function create_dataset(...) will still work -- it simply points to the more accurately named create_or_update_dataset(...).)

AL_DATASET = 'fullset_0'
al_client.create_or_update_dataset(
AL_PROJECT,
AL_DATASET,
dataset=dataset,
# Poll for completion of the processing job
wait_until_finish=True,
# Preview the first frame before submission to catch mistakes
preview_first_frame=True,
# The pipeline_mode flag must be set to use mutable datasets
pipeline_mode="STREAMING",
)

(2) Updating a Streaming Dataset

Updating an existing dataset uses the exact same calls as above:

  • Add new frames: Instantiate a LabeledDataset with the new frames, and invoke the create_or_update_dataset(...) call with the existing dataset_name (here it's AL_DATASET) and the pipeline_mode="STREAMING" flag.

  • Update existing frames: Instantiate a LabeledDataset with the updated frames. Frames are keyed on frame_id. This means that if a new frame has the same frame_id as an already existing frame, it will show up as the most recent version (as if it were overwriting it -- though of course we still keep track of all prior versions!).

Note: You can add and update frames in a single create_or_update_dataset(...) call. This happens automatically if the LabeledDataset you pass in contains frames with both new and existing frame_ids.
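To illustrate the keying behavior described above, here's a small self-contained sketch (plain Python, not the Aquarium client) of how re-uploading a frame_id surfaces the new version as "latest" while prior versions remain in the history:

```python
# Minimal simulation of frame keying: frames are keyed on frame_id, and
# every upload of an existing id appends a new version rather than
# destroying the old one.
history = {}  # frame_id -> list of frame versions, oldest first

def upsert_frame(frame):
    history.setdefault(frame['frame_id'], []).append(frame)

def latest(frame_id):
    return history[frame_id][-1]

upsert_frame({'frame_id': 'cat_1', 'class_name': 'dog'})   # initial (mislabeled)
upsert_frame({'frame_id': 'cat_1', 'class_name': 'cat'})   # relabeled update
upsert_frame({'frame_id': 'cat_2', 'class_name': 'cat'})   # brand-new frame

# latest('cat_1') now shows the relabeled version, but both versions
# of 'cat_1' remain available in its history.
```

This mirrors what the backend does for you: the update call "overwrites" what users see, while the full edit history stays queryable.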

(3) Creating a Streaming Inference Set

As with datasets, creating a mutable inference set is very similar to the legacy process.

  1. Both the inference set and its base dataset must be created with the pipeline_mode="STREAMING" flag set.

  2. All subsequent updates to the inference set must also have the pipeline_mode="STREAMING" flag set.

  3. Inferences for a given frame can't be uploaded until its equivalent in the base dataset has been processed. (In other words, a frame_id referenced by the inference set must already exist in the base dataset.)

To avoid uploading inference frames before they exist in the base dataset (condition (3) above), you can use one of two methods:

  1. In your Python upload script, upload the dataset frames with the flag wait_until_finish=True, then make a subsequent call to upload inferences (which will then only run once the dataset has finished processing).

  2. Wait until you see your dataset elements marked as DONE in the web UI's project info uploads tab and then upload your inferences.
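If you'd rather check programmatically, here's a hedged sketch (plain Python, not a real client call -- partition_inferences is a hypothetical helper) of a pre-upload guard for condition (3): submit only inference frames whose frame_id is already processed in the base dataset, and hold the rest back for a later call.

```python
# Hypothetical pre-upload check: partition inference entries into those
# whose base-dataset frame is already processed ("ready") and those that
# must wait for a later upload call ("deferred").
def partition_inferences(inference_entries, processed_frame_ids):
    ready, deferred = [], []
    for entry in inference_entries:
        if entry['frame_id'] in processed_frame_ids:
            ready.append(entry)
        else:
            deferred.append(entry)
    return ready, deferred

# Frame ids you have confirmed as DONE in the base dataset
processed = {'cat_1', 'cat_2'}

entries = [
    {'frame_id': 'cat_1', 'class_name': 'cat'},
    {'frame_id': 'dog_9', 'class_name': 'dog'},  # base frame not processed yet
]

ready, deferred = partition_inferences(entries, processed)
# Upload `ready` now; retry `deferred` after the base frames finish.
```

You would then build an Inferences object from the `ready` entries and re-run the deferred ones after the corresponding dataset upload reaches DONE.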

Creating a new mutable inference set is very similar as well. You will need to build an Inferences object and add the frames that you want in your inference set.

with open('./inferences.json') as f:
    inference_entries = json.load(f)

inferences = al.Inferences()
for entry in inference_entries:
    # Create a frame object, using the same id as the base dataset frame
    frame_id = entry['frame_id']
    inf_frame = al.InferencesFrame(frame_id=frame_id)

    # Add the inferred classification label to the frame
    inf_label_id = frame_id + '_inf'
    inf_frame.add_inference_2d_classification(
        sensor_id='cam',
        label_id=inf_label_id,
        classification=entry['class_name'],
        confidence=entry['confidence']
    )

    # Add the frame to the inferences collection
    inferences.add_frame(inf_frame)

We can then submit it in much the same way.

al_client.create_or_update_inferences(
    AL_PROJECT,
    AL_DATASET,
    inferences=inferences,
    inferences_id='demo_model_1',
    wait_until_finish=True,
    # The pipeline_mode flag must be set to use mutable inferences
    pipeline_mode="STREAMING",
)

(4) Updating a Streaming Inference Set

Updating an existing inference set uses the exact same calls as above:

  • Add new frames/labels: Instantiate Inferences with the new frames, and invoke the create_or_update_inferences(...) call with the existing inferences_id and the pipeline_mode="STREAMING" flag.

  • Update existing frames: Instantiate Inferences with the updated frames. Frames are keyed on frame_id, and labels on a combination of frame_id and label_id. This means that if a new frame has the same frame_id as an already existing frame (or similarly a new label has an existing frame_id / label_id combo), it will show up as the most recent version. Again, as before, make sure to invoke the create_or_update_inferences(...) call with the existing inferences_id and the pipeline_mode="STREAMING" flag.
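The (frame_id, label_id) keying can be illustrated with a short self-contained sketch (plain Python, not the Aquarium client):

```python
# Minimal simulation of inference-label keying: labels are keyed on the
# (frame_id, label_id) pair, so re-uploading the same pair replaces the
# visible label, while distinct pairs coexist.
labels = {}  # (frame_id, label_id) -> label payload

def upsert_inference_label(frame_id, label_id, payload):
    labels[(frame_id, label_id)] = payload

upsert_inference_label('cat_1', 'cat_1_inf',
                       {'class_name': 'dog', 'confidence': 0.4})
# A corrected model output for the same frame/label pair supersedes it
upsert_inference_label('cat_1', 'cat_1_inf',
                       {'class_name': 'cat', 'confidence': 0.9})
# A different frame/label pair is a separate entry
upsert_inference_label('cat_2', 'cat_2_inf',
                       {'class_name': 'cat', 'confidence': 0.8})
```

As with dataset frames, the real backend also retains prior versions of each label; this sketch only shows which version is surfaced as current.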

Now you have successfully created and updated mutable datasets and inferences!

If you have any questions or concerns, please reach out to the Aquarium team and we can help you out.

Viewing Uploads in the App

Similar to batch uploads, you'll be able to view the status of your streaming uploads in the web app.

If you go to the Project Details page, you'll see a new Streaming Uploads tab (previous batch uploads under your project will still be visible under Uploads):

Each upload ID corresponds to a subset of your dataset/inference set (with the associated frame count + label count).

To view more details on which specific frames/labels are present in a given upload, you can click on the Status (e.g. DONE). A popup will appear with the following info:

In the case of a failed upload, you can debug via the Errors section (which exposes frame-specific debug logs), and download this info to determine which frames/crops may need to be re-uploaded.

If you are running into an error and the error logs are not sufficient to understand how to fix the issue, please reach out to the Aquarium team and we can help resolve your problem.

New UI Features

For now, the rest of the Web App UI is unchanged; new or updated dataset/inference elements become explorable once the upload enters the DONE state.

Clone Existing Legacy Datasets

If your org is opted into Mutable Datasets, you should be able to easily clone your legacy datasets as new mutable datasets, via the web app.

If you go to the "Datasets" tab of your "Project Details" page, each of the listed legacy datasets should now have a new teal "Clone as New Mutable Dataset" button:

When you click this button, the cloning will begin:

After a minute or so, if you refresh the page, the new dataset will appear with the prefix "MUTABLE_". The old dataset will also have a tooltip that points to the new dataset:

Depending on the size of your original dataset, it may take some more time for this new mutable dataset to actually be viewable in "Explore" view.

NOTE: A dataset's corresponding inference sets will not be cloned for now. Please reach out if that is a blocking issue.

Beta Notes & Limitations

Some Dataset Attributes are Still Immutable

This change allows all elements of a dataset (frames, metadata values, labels, bounding box geometry, etc.) to be added / updated / deleted, but they must still be compatible with the dataset as a whole.

Most notably, the following dataset attributes must remain consistent over time:

  • Set of known valid label classes

  • User provided metadata field schemas

  • Embedding source (i.e., embeddings are expected to be compatible across all frames in the dataset)

We plan to support changes to all of these in the future. Please let us know if any of them are particularly valuable for you.

Inference Sets are Pinned to a Specific Dataset Version

When an inference set is uploaded, it will be pinned to a specific version of the labeled dataset, which will default to the most up-to-date version at the time of submission.

Updates to the inference set itself will show up in the UI, but updates to the base dataset (ground truth) won't be incorporated.

Metrics shown will be computed against those pinned dataset labels, and any visualizations of the ground truth will be from that specific version.

Issues can Contain Out-of-Date Elements

When you add something to an issue, you're adding a specific version of it. If that element then gets updated in the dataset, the issue will continue referencing that older version of it.

For example, you might create an issue with example labels of "bounding box too loose." If you go re-label those boxes and update them in the dataset, the issue will still contain the original (poorly drawn) labels, with an icon indicating that it belongs to an older version of the dataset.

Warning icon indicating that an issue element is out-of-date.

It will be available for viewing, but some features (like within-dataset similarity search) may be disabled for out-of-date elements.