Updating Datasets

How to update datasets after upload

Overview

This page walks through best practices and shows you how to update your datasets and inference sets once they have been uploaded.

When updating a dataset or inference set, the new or modified data must use the same classmap that was originally defined in the project.
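As a reminder, the classmap is fixed when the project is created. The sketch below is an assumption about the standard project setup flow (al.LabelClassMap.from_classnames() and the class names shown are illustrative); any frames you add or modify later must only use labels from that same set:

import aquariumlearning as al

al_client = al.Client()
al_client.set_credentials(api_key='YOUR_API_KEY')

# classmap defined once at project creation (illustrative class names)
classmap = al.LabelClassMap.from_classnames(['swept', 'straight', 'delta'])
al_client.create_project('EXISTING_PROJECT_NAME', classmap)

# later dataset or inference set updates must stick to these classes;
# introducing a brand-new class name would be incompatible with the project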

Fully Versioned w/ Edit History

As a dataset grows and changes, we maintain a history of every previous version. This has many benefits, including:

  • Reproducible experiment results. If an experiment produced inferences that were evaluated against version X of the dataset, it can be evaluated and explored against that version, even if the dataset continues to be updated in the background.

  • Time-travel / rollbacks. Do you want to know what the dataset looked like before a major relabeling effort? Did that effort introduce problems that you want to undo? Load up a previous version at any time!

  • Edit histories / Audit logs. Each entry is versioned by its ID, so you can always look up the full history for a given image and see each modification made to its labels.

Versioning through Checkpoints

Aquarium released a feature named checkpoints that allows you to freeze the state of the dataset's frames and labels as of a point in time.

Use checkpoints to manage versions of your dataset over time and measure the impact of improving data quality or acquiring new data.

For more information regarding checkpoints, check out this page.

Streaming Inserts + Partial Success

Mutable datasets also allow you to upload data in a streaming format -- for example, one at a time as you receive images back from a labeling provider. If one batch of updates encounters an error, only that batch will fail; the rest of the dataset will still be processed and available to users.
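For example, here is a minimal sketch of streaming frames in as they come back from a labeling provider. The fetch_newly_labeled_batches() helper and the record fields are hypothetical stand-ins for however you receive labeled images:

import aquariumlearning as al

al_client = al.Client()
al_client.set_credentials(api_key='YOUR_API_KEY')

# fetch_newly_labeled_batches() is a hypothetical helper representing
# whatever mechanism delivers freshly labeled images to you
for batch in fetch_newly_labeled_batches():
    dataset_batch = al.LabeledDataset()

    for record in batch:
        frame = al.LabeledFrame(frame_id=record['frame_id'])
        frame.add_image(image_url=record['image_url'])
        dataset_batch.add_frame(frame)

    # each call appends just this batch; if one batch fails,
    # previously uploaded batches remain processed and available
    al_client.create_or_update_dataset('PROJECT_NAME', 'DATASET_NAME', dataset=dataset_batch)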

Updating Frame Data

The following sections discuss how to add frames, delete frames, and update existing frames in a dataset or inference set.

Adding Frames to an Existing Dataset and Inference Set

The process is almost identical to the one you used for the initial upload: upload frames with new frame_ids and call create_or_update_dataset() or create_or_update_inferences() instead of create_dataset() or create_inferences().

The steps are the same as those in the Uploading Data guide; the only difference is that the frame_ids added to the LabeledDataset or Inferences object are new and unique.

When updating a dataset, make sure you're uploading data to existing project names, dataset names, and/or inference set names in the client API.

Example code (extremely similar for create_or_update_inferences()):

al_client.create_or_update_dataset(
    # project name of an existing project with the same classmap
    EXISTING_PROJECT_NAME, 
    # dataset name of existing dataset
    AL_DATASET, 
    dataset=dataset
)
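For completeness, here is a hedged sketch of how the dataset object passed above might be assembled from new frames; the frame id, label id, class name, image URL, and box coordinates are all placeholders:

import aquariumlearning as al

dataset = al.LabeledDataset()

# a new, unique frame_id that does not already exist in the dataset
new_frame = al.LabeledFrame(frame_id='NEW_FRAME_ID')

# attach the image and labels exactly as in the initial upload
new_frame.add_image(image_url='IMAGE_SOURCE_URL')
new_frame.add_label_2d_bbox(
    label_id='NEW_LABEL_ID',
    classification='EXISTING_CLASS_NAME',
    top=100,
    left=150,
    width=50,
    height=40
)

dataset.add_frame(new_frame)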

Once the data has been uploaded, you can use the dropdown in the top right corner of the screen to view your dataset versions, organized by upload time.

Deleting/Removing Frames in a Dataset and Inference Set

To remove a frame from a dataset, you will use the delete_frame(frame_id) function in the Aquarium client.

Example usage:

import aquariumlearning as al

al_client = al.Client()
al_client.set_credentials(api_key='YOUR_API_KEY')

labeled_dataset = al.LabeledDataset()

# list the frame ids to delete
frame_ids = ['FRAME_ID']

# just like you would call add_frame to add to a labeled dataset,
# we call delete_frame to form an object we can pass to the client
# so it knows what to delete
for frame_id in frame_ids:
    labeled_dataset.delete_frame(frame_id)

# using create_or_update_dataset instead of just create_dataset
al_client.create_or_update_dataset(
    project_id='PROJECT_NAME',
    dataset_id='DATASET_NAME',
    dataset=labeled_dataset
)

Once you run a script like the one above to delete frames, you'll see a console message similar to a normal data upload, and an orange spinner next to your dataset name in the UI while the frames are being deleted.

You can also use the dropdown in the top right to view changes after a deletion. If you want to view your dataset as it was prior to a deletion, select the appropriate date in the dropdown.

Updating Existing Frames In a Dataset and Inference Set

To update existing frame data in your dataset, specify the original frame_id when re-uploading the frame so that Aquarium can link it to the original. To update frame data, use the update_dataset_frames() method; it is useful for bulk-updating frame metadata such as sensor data and external metadata. Any new information is appended to the frame as a new version, and the previous state of the frame remains available as an old version.

Example usage:

import aquariumlearning as al

al_client = al.Client()
al_client.set_credentials(api_key='YOURAPIKEY')

# create a list to hold the frames to modify
# unlike other functions, we don't add to a LabeledDataset;
# for updating frame-level features we use a plain list
labeled_frame_list_to_modify = []

# create a labeled frame using an existing frame_id
# make sure you set update_type to 'MODIFY'
frame = al.LabeledFrame(frame_id='FRAME_ID', update_type='MODIFY')

# in this example we are adding a new metadata field
frame.add_user_metadata('test_metadata_field', "test_added_metadata_value")

# add frame object with new metadata field to the list
labeled_frame_list_to_modify.append(frame)

# call the client to push changes
al_client.update_dataset_frames(
    project_id='PROJECT_NAME',
    dataset_id='DATASET_NAME',
    update=labeled_frame_list_to_modify
)

Once you update the metadata, you can also use the dropdown in the top right corner to view your data before and after the update.

Updating Label Data

The following sections discuss how to add or modify labels and how to delete labels.

Adding or Modifying Labels

When you want to add new labels to an existing frame or modify existing labels in an existing frame, the function to use is update_dataset_labels().

This function works with a class called UpdateGTLabelSet. This object is very similar to a frame object, and is used when adding labels to or modifying labels on existing frames.

Create an UpdateGTLabelSet object for each frame you are modifying. Then, just as you initially called a function to add a bounding box to the frame, call the same kind of function on the UpdateGTLabelSet object.

Add each UpdateGTLabelSet object to a list and pass that list into the update_dataset_labels() function.

To demonstrate how this works, we will modify the frame and label shown below.

The code snippet below shows how to modify a single label on the frame pictured above; in practice you will likely loop through your data to apply updates, so your code may differ:

import aquariumlearning as al

# configure Aquarium client
al_client = al.Client()
al_client.set_credentials(api_key='YOUR_API_KEY')

# specify project name and dataset name you will be working with
PROJECT_NAME = 'Rareplanes_Wingtype_Project'
DATASET_NAME = 'initial_train_labels'

# for the sake of example code
# this part will likely be looped but we have grabbed the frame id
# and label id pictured above
frame_id = '100_1040010039437200_tile_460'
label_id = '100_1040010039437200_tile_460_gt_0'

# define the list we will pass to update_dataset_labels
updateGTLabelSet_list = []

# defining an UpdateGTLabelSet for the frame that contains our label
update_GT_label_set = al.UpdateGTLabelSet(frame_id=frame_id)

# modifying an existing label, identified by its label id
# if the label id doesn't exist in the dataset, a new label will be added
update_GT_label_set.add_2d_bbox(
    label_id=label_id,
    classification = CLASSIFICATION,
    top = NEW_TOP_VALUE,
    left = NEW_LEFT_VALUE,
    width = NEW_WIDTH_VALUE,
    height = NEW_HEIGHT_VALUE,
    user_attrs=DICT_OF_METADATA
)

# add our modified UpdateGTLabelSet object to the list
updateGTLabelSet_list.append(update_GT_label_set)

al_client.update_dataset_labels(PROJECT_NAME, DATASET_NAME, updateGTLabelSet_list)

Once successfully run, you can see the newly modified label reflected in the UI!

Deleting Labels

Currently in Aquarium, to delete a label you replace the entire existing frame with the correct set of labels minus the ones you'd like to remove. The steps look almost identical to the initial label upload process:

  1. Create a new LabeledDataset object

  2. For each frame that has a label you would like to delete, create a LabeledFrame object making sure to set the update_type parameter to ADD

  3. Add all the correct/desired labels minus the ones you wish to delete to the LabeledFrame object

  4. Add LabeledFrame object to the LabeledDataset using add_frame()

  5. Finally, use create_or_update_dataset() passing in your project name, dataset name, and the LabeledDataset

You only need to create LabeledFrames for the frames that have labels needing updates; you don't have to repeat this process for every frame in your dataset!

Creating the LabeledFrame object with update_type='ADD' rewrites the labels associated with the frame. You'll be able to view this change in the frame's history and see both the old and new labels.

The example below shows the original frame version with four total labels (green boxes):

This image shows the result after we have removed all but one label:

Example code block below:

import aquariumlearning as al

al_client = al.Client()
al_client.set_credentials(api_key='YOUR_API_KEY')

PROJECT_NAME = 'PROJECT_NAME'
DATASET_NAME = 'DATASET_NAME'

# in the example images this would be 51_104001003D4C9C00_tile_264
FRAME_ID = 'FRAME_ID'

# create new dataset
labeled_dataset = al.LabeledDataset()

# create a new labeled frame object, remember the update_type param
new_labeled_frame = al.LabeledFrame(frame_id = FRAME_ID, update_type='ADD')

# add the appropriate labels
# you can use same label ids or new label ids
new_labeled_frame.add_label_2d_bbox(
    label_id='LABEL_ID',
    classification = 'straight',
    top = TOP_VAL,
    left = LEFT_VAL, 
    width = WIDTH_VAL,
    height = HEIGHT_VAL
)

# add your image to the frame
new_labeled_frame.add_image(image_url='ADD_IMAGE_SOURCE_URL')

# add your frame to dataset
labeled_dataset.add_frame(new_labeled_frame)

# upload newly created dataset using create_or_update_dataset
al_client.create_or_update_dataset(PROJECT_NAME, DATASET_NAME, dataset=labeled_dataset)

Notes & Limitations

There are a few things to be aware of when using mutable datasets:

Some Dataset Attributes Are Still Immutable

This change allows all elements of a dataset (frames, metadata values, labels, bounding box geometry, etc.) to be added / updated / deleted, but they must still be compatible with the dataset as a whole.

Most notably, the following dataset attributes must remain consistent over time:

  • Set of known valid label classes

  • User-provided metadata field schemas

  • Embedding source (i.e., embeddings are expected to be compatible across all frames in the dataset)

We plan to support changes to all of these in the future. Please let us know if any of them are particularly valuable for you.

Inference Sets Are Pinned to a Specific Dataset Version

When an inference set is uploaded, it will be pinned to a specific version of the labeled dataset, which will default to the most up-to-date version at the time of submission.

Updates to the inference set itself will show up in the UI, but updates to the base dataset (ground truth) won't be incorporated.

Metrics shown will be computed against those pinned dataset labels, and any visualizations of the ground truth will be from that specific version.

Segment Elements are Versioned

Elements (frames, labels, inferences) added to segments correspond to the specific version of the element when it was added to the segment. They do not automatically update to the latest version of the element within the dataset. This is intentional, but has tradeoffs:

  • At any time, you can review elements in segments as they were when they were created. This makes it easy to reproduce issues in datasets across members of the team, even as the labels or associated frames change.

  • Because you're viewing the version of the dataset as it was when it was added to the segment, reviewing a segment after a data quality issue has been corrected may show that the issue is still present, despite the current version of the dataset being correct. Use the segment state tracking features in Aquarium (and archive old segments that have been resolved) to mitigate any potential confusion.

For example, you might create a segment with example labels of "bounding box too loose." If you go re-label those boxes and update them in the dataset, the segment will still contain the original (poorly drawn) labels, with an icon indicating that it belongs to an older version of the dataset.

It will be available for viewing, but some features (like within-dataset similarity search) may be disabled for out-of-date elements.

Monitoring Upload Status

Similar to batch uploads, you'll be able to view the status of your streaming uploads in the web app.

If you go to the Project Details page, you'll see a Streaming Uploads tab (previous batch uploads under your project will still be visible under Uploads):

Each upload ID corresponds to a subset of your dataset/inference set (with the associated frame count + label count).

To view more details on which specific frames/labels are present in a given upload, you can click on the Status (e.g. DONE). A pop-up will appear with the following info:

In the case of a failed upload, you can debug via the Errors section (which exposes frame-specific debug logs), and download this info to determine which frames/crops may need to be re-uploaded.

If you are running into an error and the error logs are not sufficient to understand how to fix the issue, please reach out to the Aquarium team and we can help resolve your problem.

Migrating Projects with Immutable Datasets to Mutable

This section only applies if you previously created a dataset in BATCH mode. Any upload after June 2022 uses STREAMING mode by default, so you will not need this section.

Any scripts that make calls to client.create_dataset or client.create_or_update_dataset no longer need to pass the pipeline_mode argument in order to use streaming mode, as "STREAMING" is now the default argument. However, if you would prefer to continue using batch mode for your uploads, you will now have to specify pipeline_mode="BATCH".
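For instance, a minimal sketch of the difference (the project name, dataset name, and dataset object are placeholders):

# streaming mode is now the default; these two calls are equivalent
al_client.create_or_update_dataset('PROJECT_NAME', 'DATASET_NAME', dataset=dataset)
al_client.create_or_update_dataset('PROJECT_NAME', 'DATASET_NAME', dataset=dataset, pipeline_mode='STREAMING')

# to keep using the legacy batch pipeline, pass pipeline_mode explicitly
al_client.create_dataset('PROJECT_NAME', 'DATASET_NAME', dataset=dataset, pipeline_mode='BATCH')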

You may have existing immutable datasets that were uploaded via batch mode, and want to convert them to mutable datasets.

If you go to the "Datasets" tab of your "Project Details" page, each of the listed legacy datasets should now have a new teal "Clone as New Mutable Dataset" button:

When you click this button, the cloning will begin:

After a minute or so, if you refresh the page, the new dataset will appear with the prefix "MUTABLE_". The old dataset will also have a tooltip that points to the new dataset:

Depending on the size of your original dataset, it may take some more time for this new mutable dataset to be fully processed and become viewable in "Explore" view.

NOTE: A dataset's corresponding inference sets will not be automatically cloned for now, but can be uploaded to the mutable dataset using the Aquarium client. Please contact us if you have questions about migrating inference sets.
