Previously, all datasets in Aquarium were immutable -- any change or addition to a dataset would require re-uploading it under a new name. We're now starting to roll out preview access to our new Mutable Datasets.
As a dataset grows and changes, we maintain a versioned history of every previous state. This has many benefits, including:
Reproducible experiment results. If an experiment produced inferences that were evaluated against version X of the dataset, it can be evaluated and explored against that version, even if the dataset continues to be updated in the background.
Time-travel / rollbacks. Do you want to know what the dataset looked like before a major relabeling effort? Did that effort introduce problems that you want to undo? Load up a previous version at any time!
Edit histories / Audit logs. Each entry is versioned by its ID, so you can always look up the full history for a given image and see each modification made to its labels.
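As a rough mental model (an illustrative sketch only, not Aquarium's actual implementation), per-ID versioning behaves like an append-only history keyed on frame ID: updates never overwrite, so the latest version is what you see, and the full edit history remains queryable.

```python
from collections import defaultdict

class VersionedStore:
    """Toy model of per-ID versioning: updates append, never overwrite."""

    def __init__(self):
        self._history = defaultdict(list)  # frame_id -> list of versions

    def upsert(self, frame_id, payload):
        # Record a new version of the frame; older versions are kept.
        self._history[frame_id].append(payload)

    def latest(self, frame_id):
        # The most recent version is what shows up by default.
        return self._history[frame_id][-1]

    def history(self, frame_id):
        # Full audit log for a single frame.
        return list(self._history[frame_id])

store = VersionedStore()
store.upsert("img_001", {"class_name": "cat"})
store.upsert("img_001", {"class_name": "dog"})  # e.g. a relabeling effort

print(store.latest("img_001"))   # most recent version wins
print(store.history("img_001"))  # but the edit history survives for rollback
```

This is why rollbacks and audit logs fall out of the same mechanism: loading a previous dataset version is just reading an earlier index of each frame's history.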
Traditional Aquarium datasets required batching together a full dataset into a single operation, which would either fully succeed or fully fail after analyzing all entries.
Mutable datasets also allow you to upload data in a streaming fashion -- for example, one frame at a time as you receive images back from a labeling provider. If one batch of updates encounters an error, only that batch will fail; the rest of the dataset will still be processed and available to users.
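The failure-isolation behavior can be sketched as follows. This is an illustrative sketch only: upload_batch, stream_upload, and the batch size are stand-ins, not part of the Aquarium client.

```python
def upload_batch(batch):
    """Stand-in for a per-batch upload call; raises on bad entries."""
    if any(entry.get("corrupt") for entry in batch):
        raise ValueError("bad entry in batch")
    return len(batch)

def stream_upload(entries, batch_size=2):
    """Upload entries in independent batches; one bad batch doesn't block the rest."""
    succeeded, failed = 0, 0
    for i in range(0, len(entries), batch_size):
        batch = entries[i:i + batch_size]
        try:
            succeeded += upload_batch(batch)
        except ValueError:
            failed += len(batch)  # only this batch fails
    return succeeded, failed

entries = [{"id": 1}, {"id": 2}, {"id": 3, "corrupt": True}, {"id": 4}]
print(stream_upload(entries))  # -> (2, 2): one batch failed, the other landed
```

Contrast this with the legacy batch behavior, where a single bad entry would fail the entire upload after analyzing all entries.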
The upload process for mutable datasets is very similar to the non-mutable process, with some minor changes.
Just as before, you'll need to build a LabeledDataset and add the frames that you want to be in your dataset. It looks something like the following:
```python
import json
import aquariumlearning as al

with open('./labels.json') as f:
    label_entries = json.load(f)

dataset = al.LabeledDataset(pipeline_mode="STREAMING")

for entry in label_entries:
    # Create a frame object, using the filename (without extension) as an id
    frame_id = entry['file_name'].split('.jpg')[0]
    frame = al.LabeledFrame(frame_id=frame_id)

    # Add arbitrary metadata, such as the train vs test split
    frame.add_user_metadata('split_name', entry['split_name'])

    # You can also add arbitrary metadata as a list, such as a tag list
    frame.add_user_metadata_list('data_tags', ['pets', entry['species_name']])

    # Add an image to the frame
    image_url = "https://storage.googleapis.com/aquarium-public/quickstart/pets/imgs/" + entry['file_name']
    frame.add_image(sensor_id='cam', image_url=image_url)

    # Add the ground truth classification label to the frame
    label_id = frame_id + '_gt'
    frame.add_label_2d_classification(
        sensor_id='cam',
        label_id=label_id,
        classification=entry['class_name'],
    )

    # Add the frame to the dataset collection
    dataset.add_frame(frame)
```
Once you have built your LabeledDataset, you can upload it using the new method create_or_update_dataset(...). This same function can be used for either (1) creating a new dataset or (2) updating an existing dataset. This is also where you will need to specify pipeline_mode="STREAMING".
(NOTE: The legacy function create_dataset(...) will still work. It just points to the more accurately named create_or_update_dataset(...).)
```python
AL_DATASET = 'fullset_0'

al_client.create_or_update_dataset(
    AL_PROJECT,
    AL_DATASET,
    dataset=dataset,
    # Poll for completion of the processing job
    wait_until_finish=True,
    # Preview the first frame before submission to catch mistakes
    preview_first_frame=True,
    # The pipeline_mode flag must be set to use mutable datasets
    pipeline_mode="STREAMING",
)
```
Updating an existing dataset uses the exact same calls as above:
Add new frames: Instantiate a LabeledDataset with the new frames, and invoke the create_or_update_dataset(...) call with the name of the existing dataset (AL_DATASET) and the pipeline_mode="STREAMING" flag.
Update existing frames: Instantiate a LabeledDataset with the updated frames. Frames are keyed on frame_id. This means that if a new frame has the same frame_id as an already existing frame, it will show up as the most recent version (as if it were overwriting it -- though of course we still keep track of all prior versions!).
Creating a mutable inference set is very similar to the legacy process. However, to avoid uploading inference frames before their corresponding frames exist in the base dataset, you can use one of two methods:
In your Python upload script, upload the dataset frames with the flag wait_until_finish=True, and add a subsequent call to upload inferences (which will then only run once the dataset is finished).
Wait until you see your dataset elements marked as DONE in the web UI's project info uploads tab, and then upload your inferences.
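The ordering problem both methods solve can be sketched as follows. This is an illustrative sketch only; the function names here are made up for the example and are not part of the Aquarium API.

```python
dataset_frames = set()

def upload_dataset_frames(frame_ids):
    """Stand-in for the dataset upload finishing (wait_until_finish=True)."""
    dataset_frames.update(frame_ids)

def upload_inferences(frame_ids):
    """An inference frame can only attach to a frame_id that already exists."""
    missing = [fid for fid in frame_ids if fid not in dataset_frames]
    if missing:
        raise RuntimeError(f"base frames not uploaded yet: {missing}")
    return "OK"

# Wrong order: inferences submitted before the base dataset finishes.
try:
    upload_inferences(["img_001"])
except RuntimeError as err:
    print(err)

# Right order: dataset first, then inferences.
upload_dataset_frames(["img_001", "img_002"])
print(upload_inferences(["img_001"]))  # -> OK
```

Waiting on wait_until_finish=True (or on the DONE status in the UI) guarantees you are always in the "right order" branch.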
Creating a new mutable inference set is very similar as well. You will need to build an Inferences object and add the frames that you want to be in your inference set.
```python
with open('./inferences.json') as f:
    inference_entries = json.load(f)

inferences = al.Inferences()

for entry in inference_entries:
    # Create a frame object, using the same id as the base dataset frame
    frame_id = entry['frame_id']
    inf_frame = al.InferencesFrame(frame_id=frame_id)

    # Add the inferred classification label to the frame
    inf_label_id = frame_id + '_inf'
    inf_frame.add_inference_2d_classification(
        sensor_id='cam',
        label_id=inf_label_id,
        classification=entry['class_name'],
        confidence=entry['confidence'],
    )

    # Add the frame to the inferences collection
    inferences.add_frame(inf_frame)
```
We can then submit it in much the same way.
```python
al_client.create_or_update_inferences(
    AL_PROJECT,
    AL_DATASET,
    inferences=inferences,
    inferences_id='demo_model_1',
    wait_until_finish=True,
    # The pipeline_mode flag must be set to use mutable inferences
    pipeline_mode="STREAMING",
)
```
Updating an existing inference set uses the exact same calls as above:
Add new frames/labels: Instantiate an Inferences object with the new frames, and invoke the create_or_update_inferences(...) call with the existing inferences_id.
Update existing frames: Instantiate an Inferences object with the updated frames. Frames are keyed on frame_id, and labels on a combination of frame_id and label_id. This means that if a new frame has the same frame_id as an already existing frame (or similarly, a new label for an existing frame_id + label_id combo), it will show up as the most recent version.
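The keying described above can be sketched with a plain dict. This is an illustrative sketch only, not Aquarium's implementation: inference labels are identified by the (frame_id, label_id) pair, so re-submitting the same pair surfaces the newest version.

```python
labels = {}  # (frame_id, label_id) -> latest label payload shown in the UI

def upsert_label(frame_id, label_id, payload):
    """Re-submitting the same (frame_id, label_id) combo updates in place."""
    labels[(frame_id, label_id)] = payload

upsert_label("img_001", "img_001_inf", {"class_name": "cat", "confidence": 0.6})
# A later upload for the same combo shows up as the most recent version:
upsert_label("img_001", "img_001_inf", {"class_name": "dog", "confidence": 0.9})

print(labels[("img_001", "img_001_inf")]["class_name"])  # -> dog
```

A label with a new label_id on an existing frame would instead create a new key, i.e. a new label rather than an update.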
Again, as before, make sure to invoke the create_or_update_inferences(...) call with the existing inferences_id.
Now you have successfully created and updated mutable datasets and inferences!
If you have any questions or concerns, please reach out to the Aquarium team and we can help you out.
Similar to batch uploads, you'll be able to view the status of your streaming uploads in the web app.
If you go to the Project Details page, you'll see a new Streaming Uploads tab (previous batch uploads under your project will still be visible under Uploads):
Each upload ID corresponds to a subset of your dataset/inference set (with the associated frame count + label count).
To view more details on which specific frames/labels are present in a given upload, you can click on the Status (e.g. DONE). A popup will appear with the following info:
In the case of a failed upload, you can debug via the Errors section (which exposes frame-specific debug logs), and download this info to determine which frames/crops may need to be re-uploaded.
If you are running into an error and the error logs are not sufficient to understand how to fix the issue, please reach out to the Aquarium team and we can help resolve your problem.
For now, the rest of the Web App UI should be unchanged, and new/updated dataset/inference elements will be explorable once the upload enters the DONE state.
If your org is opted into Mutable Datasets, you should be able to easily clone your legacy datasets as new mutable datasets, via the web app.
If you go to the "Datasets" tab of your "Project Details" page, each of the listed legacy datasets should now have a new teal "Clone as New Mutable Dataset" button:
When you click this button, the cloning will begin:
After a minute or so, if you refresh the page, the new dataset will appear with the prefix "MUTABLE_". The old dataset will also have a tooltip that points to the new dataset:
Depending on the size of your original dataset, it may take some more time for this new mutable dataset to actually be viewable in "Explore" view.
This change allows all elements of a dataset (frames, metadata values, labels, bounding box geometry, etc.) to be added / updated / deleted, but they must still be compatible with the dataset as a whole.
Most notably, the following dataset attributes must remain consistent over time:
Set of known valid label classes
User provided metadata field schemas
Embedding source (i.e., embeddings are expected to be compatible between all frames in the dataset)
We plan to support changes to all of these in the future. Please let us know if any of them are particularly valuable for you.
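To make the consistency rule concrete, here is a hypothetical client-side check (an illustrative sketch only; not part of the Aquarium API): updates may add, change, or delete frames, but a new label whose class is outside the dataset's known class set would be incompatible.

```python
# Assumed for illustration: the valid class set is fixed when the dataset
# is created and must remain consistent across all later updates.
VALID_CLASSES = {"cat", "dog"}

def incompatible_classes(frames):
    """Return class names in an update that fall outside the dataset's class set."""
    return sorted({f["class_name"] for f in frames} - VALID_CLASSES)

ok_update = [{"frame_id": "img_001", "class_name": "dog"}]
bad_update = [{"frame_id": "img_002", "class_name": "hamster"}]

print(incompatible_classes(ok_update))   # -> [] (compatible update)
print(incompatible_classes(bad_update))  # -> ['hamster'] (schema violation)
```

Analogous checks apply to user metadata field schemas (same field names and types over time) and to the embedding source.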
When an inference set is uploaded, it will be pinned to a specific version of the labeled dataset, which will default to the most up-to-date version at the time of submission.
Updates to the inference set itself will show up in the UI, but updates to the base dataset (ground truth) won't be incorporated.
Metrics shown will be computed against those pinned dataset labels, and any visualizations of the ground truth will be from that specific version.
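The pinning behavior can be sketched as follows. This is an illustrative sketch only, not the Aquarium implementation: an inference set remembers which dataset version it was submitted against, so later ground-truth edits don't silently change its metrics.

```python
# Version 0 is the ground truth at submission time.
dataset_versions = [
    {"img_001": "cat"},
]

def submit_inferences(preds):
    """Pin the inference set to the latest dataset version at submission."""
    return {"preds": preds, "pinned_version": len(dataset_versions) - 1}

def accuracy(inference_set):
    """Metrics are computed against the pinned labels, not the latest ones."""
    gt = dataset_versions[inference_set["pinned_version"]]
    preds = inference_set["preds"]
    return sum(preds[k] == v for k, v in gt.items()) / len(gt)

inf_set = submit_inferences({"img_001": "cat"})

# The ground truth is later relabeled in a new dataset version...
dataset_versions.append({"img_001": "dog"})

# ...but metrics are still computed against the pinned version.
print(accuracy(inf_set))  # -> 1.0
```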
When you add something to an issue, you're adding a specific version of it. If that element then gets updated in the dataset, the issue will continue referencing that older version of it.
For example, you might create an issue with example labels of "bounding box too loose." If you go re-label those boxes and update them in the dataset, the issue will still contain the original (poorly drawn) labels, with an icon indicating that it belongs to an older version of the dataset.
The older version will still be available for viewing, but some features (like within-dataset similarity search) may be disabled for out-of-date elements.