Once you've uploaded a dataset through the Aquarium API, you can go to https://illume.aquariumlearning.com/ to begin to explore and understand what's in your dataset.
To begin, select your project with the dropdown from the top-right. Then select a dataset to look at using the "Datasets" dropdown and click "Search" to populate the page.
You can switch between different dataset views using the four icons to the right of the search bar.
You can use the "Display Settings" button to toggle settings like label transparency, number of datapoints displayed per row, etc.
The first and default view for dataset exploration is the Grid View. This view lets you quickly look through your dataset and understand what it "looks like" at a glance. Labels, if available, are overlaid over the underlying data.
You can click into an individual datapoint to see more details, such as its metadata (timestamp, device ID, etc.) and its labels.
You can also select multiple elements to group into issues. You can either select them one-by-one by clicking the circle on the top-left of each image card, or you can select a range of multiple items at once by shift-clicking.
The second view for dataset understanding is the Histogram View.
You often want to view the distribution of your metadata across your dataset. This is particularly useful for understanding if you have an even spread of classes, times of day, etc. in your dataset. Simply click the dropdown, select a metadata field, and you can see a histogram of the distribution of that value across the dataset.
The third view for dataset understanding is the Embedding View.
The previous methods of data exploration rely a lot on metadata to find interesting parts of your dataset to look at. However, there's not always metadata for important types of variation in your dataset. We can use neural network embeddings to index the raw data in your dataset to better understand its distribution.
The embedding view plots variation in the raw underlying data. Each point in the chart represents a single datapoint. In the Image view, each point is a whole image or "row" in your dataset. In the Crop view, each point represents an individual label or inference object that is in a part of the image.
The closer points are to each other, the more similar they are. The farther apart they are, the more different. Using the embedding view, you can understand the types of differences in your raw data, find clusters of similar datapoints, and examine outlier datapoints.
You can also color the embedding points with metadata to understand the distribution of metadata relative to the distribution of the raw data.
To select a group of points for visualization, you can hold shift + click and draw a lasso around a group of points. You can then scroll through individual examples in the panel with the arrows. You can also adjust the size of the detail panel by dragging the corner.
It's also possible to change what part of the image you're looking at in the preview pane. You can zoom in and out of the image preview by "scrolling" like you would with your mouse's scroll wheel / two finger scroll. You can also click and drag the image to pan the view around the image.
Coloring the embedding view with metadata can also be used to spot label errors and inconsistencies: when similar examples have different labels, that can indicate a problem with the labels.
In the following example, two very similar datapoints of sitting people are labeled as very different classes:
It's possible to scroll each view indefinitely to see the whole dataset, but that takes a lot of time. Using the Query Bar, you can add SQL-like filters to only look at interesting slices of the dataset.
Simply click on the query bar and you will see a dropdown containing metadata fields that you can filter on. Typically, users use the Histogram View to see the distribution of values for each metadata field, then use the Query Bar to filter down to a certain value.
Note: You need to press "enter" after each filter that you add, and you need to click "Search" after you're done for a query to run!
Query filters stack on top of each other. In the following example, we are filtering for datapoints with more than 10 labels of the class "Pedestrian":
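As a rough sketch, that filter might look like the following in the query bar. The exact per-image label-count field name is an assumption (it is not spelled out on this page), so check the query bar dropdown for the real field:

```
label_counts.Pedestrian:>10
```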
You can also run queries on user-uploaded metadata; these metadata fields are denoted with a `user__` prefix in the dropdown.
The Query Bar works in all parts of the tool (Grid View, Histogram View, Embedding View, and Model Metrics View), so you can filter metrics / visualizations to the subset of the dataset you're most interested in seeing.
You can also use various advanced operators while constructing the query.
By default, the search filters will search for exact matches. For example, `user__split:train` will match datapoints where the value of `user__split` is exactly `train`.
`!` will exclude the search term. For example, `user__split:!train` will exclude all examples where the value of `user__split` is exactly `train`.
`~` will do a fuzzy / approximate search. For example, `user__split:~train` will include all examples where `user__split` contains the value `train`.
You can also combine filter operators. For example, `user__split:!~train` will include datapoints where `user__split` does not contain the value `train`.
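To summarize, the operator variants described above are (the annotations to the right are added for clarity and are not part of the query syntax):

```
user__split:train      exact match: value is exactly "train"
user__split:!train     negation: value is not exactly "train"
user__split:~train     fuzzy: value contains "train"
user__split:!~train    negated fuzzy: value does not contain "train"
```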
Filtering by issues has its own syntax. For example, you can filter for examples based on whether or not they belong to any issue using `inIssue:*`. You can also run regex-based queries on issue names: `notInIssue:.*bad.*` will include all datapoints that don't belong to any issue with "bad" anywhere in its name.
Queries based on model performance metrics are also available, using prefixes such as `inferenceMetrics` and `otherInferenceMetrics`. Since these metrics depend on multiple parameters, they have a more complex query syntax.
Let's take a main query of `inferenceMetrics.false_positives:>2` (get all images with more than 2 false-positive detections). These metrics can depend on additional parameters, such as:
What is the minimum iou threshold to match a label to a detection?
What is the minimum confidence threshold to accept a detection?
Do we want that metric for all classes, or just one?
The query syntax then takes the answer to each of those as additional, comma-separated fields:
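For example, attaching explicit values for each parameter to the main query might look like the following. The field format is taken from the default configuration described on this page; the specific values are illustrative:

```
inferenceMetrics.false_positives:>2,iou=50,confidence=50,classification=Pedestrian
```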
Right now, those parameters need to be from a set of specific values:
iou and confidence can be any multiple of 10 under 100: [0, 10, 20, ..., 80, 90]
classification can be any class name in your project, or the special value "weighted avg" which will consider all classes, and do a weighted average if the metric requires it.
To simplify the user workflow for most cases, a default configuration of `iou=50,confidence=0,classification=weighted avg` is automatically appended after pressing enter if no custom values are entered.
All images where weighted average precision is >0.6, only counting detections with >= 80% confidence, where an iou of 40% is required to match.
All images where the f1 score for the dog class is >0.6, only counting detections with >= 50% confidence, where an iou of 70% is required to match.
All images where there are more than 5 false positives of any class, only counting detections with >= 50% confidence, where an iou of 50% is required to match.
All images where there are more than 5 false positives of the dog class, only counting detections with >= 50% confidence, where an iou of 50% is required to match.
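Reconstructed as queries using the comma-separated parameter syntax above, those four examples might look like the following. The metric names `precision` and `f1` are assumptions based on common detection metrics and are not confirmed on this page:

```
inferenceMetrics.precision:>0.6,iou=40,confidence=80,classification=weighted avg
inferenceMetrics.f1:>0.6,iou=70,confidence=50,classification=dog
inferenceMetrics.false_positives:>5,iou=50,confidence=50,classification=weighted avg
inferenceMetrics.false_positives:>5,iou=50,confidence=50,classification=dog
```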
If you want to change the color associated with a specific label, you can do so by clicking on the color square next to the label in the "Display Settings" menu: