Finding Similar Elements Within a Dataset

When you find one or more elements of interest in your dataset or inference set, you often want to identify other similar elements (for instance, to export them for relabeling).

While it's possible to find nearby points using the embedding view alone, that view reduces the embeddings to 2 dimensions for visualization purposes, so distances in it only approximate distances in the original embedding space.
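To make the distortion concrete, here is a minimal sketch with made-up 3-dimensional vectors. The "projection" simply drops the last coordinate, which is an assumption chosen for brevity and not the reduction method the embedding view uses. All names and values are hypothetical.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, candidates):
    """Return the name of the candidate vector closest to `query`."""
    return min(candidates, key=lambda name: euclidean(query, candidates[name]))

# Hypothetical 3-D embeddings (real embeddings have many more dimensions).
query = (0.0, 0.0, 0.0)
full = {"a": (1.0, 0.0, 5.0), "b": (2.0, 0.0, 0.0)}

# Stand-in "projection": keep only the first two coordinates.
projected = {name: vec[:2] for name, vec in full.items()}

nearest_in_2d = nearest(query[:2], projected)
nearest_in_full = nearest(query, full)
print(nearest_in_2d, nearest_in_full)  # prints "a b"
```

In the projected view, "a" looks closest to the query, but in the full space "b" is closer, because the dimension that separated "a" from the query was discarded.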

Aquarium provides two ways of finding nearby elements in actual embedding space:

  1. Related Images: For a single, specific element.

  2. Segment-based Similarity Search: For a set of "seed" elements. This is a similar approach to Collection Campaigns, but for elements within the same dataset (rather than new unlabeled elements).
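The difference between the two modes can be sketched as follows. This is an illustration only: it assumes cosine similarity and, for the seed-set case, scores each candidate by its best match against any seed. Neither choice is confirmed as what Aquarium does internally, and the function names and element IDs are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similar_to_element(embeddings, query_id, k):
    """Single-element mode: the k elements most similar to one element."""
    return similar_to_seeds(embeddings, [query_id], k)

def similar_to_seeds(embeddings, seed_ids, k):
    """Seed-set mode: rank non-seed elements by best similarity to any seed."""
    seeds = set(seed_ids)
    scored = []
    for element_id, vector in embeddings.items():
        if element_id in seeds:
            continue  # never suggest an element that is already a seed
        best = max(cosine_similarity(embeddings[s], vector) for s in seeds)
        scored.append((best, element_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [element_id for _, element_id in scored[:k]]

# Hypothetical 2-D embeddings keyed by element ID.
embeddings = {
    "img_1": [1.0, 0.0],
    "img_2": [0.9, 0.1],
    "img_3": [0.0, 1.0],
    "img_4": [0.1, 0.9],
    "img_5": [-1.0, 0.0],
}

single = similar_to_element(embeddings, "img_1", k=1)
from_seeds = similar_to_seeds(embeddings, ["img_1", "img_3"], k=2)
```

Here `single` contains only `img_2`, while `from_seeds` contains `img_2` and `img_4`, since each sits close to one of the two seeds.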

If you select a specific element from your dataset or inference set, Related Images lets you find similar, existing elements from that same set.

Note: This feature is not supported in the following situations (the web app will display a warning in these cases):

  • Older datasets (since they didn't undergo the necessary post-processing)

  • Segments with elements from multiple datasets

If you've identified a few problematic elements in a Segment, you may want an easy way to "grow" the issue by finding other similar elements.

Once an issue is created, you can generate similar elements by going to the Similar Dataset Elements tab and clicking the Calculate Similar Dataset Elements button.

You can then select and add the elements you want to your original issue.

Every time you add elements to or remove elements from your issue, you have the option of recalculating its similar elements.

If you've already created an issue (e.g. identified a few problematic labels or model failures from the Explore view of the app), you can iterate through the following loop to curate and grow it:

  1. Calculate similar dataset elements.

  2. Add some desired subset of the suggested "similar elements" to the original issue.

  3. Rinse and repeat!
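The loop above can be sketched in code. This is a toy model, not Aquarium's API: `suggest_similar` is a hypothetical stand-in for the Calculate Similar Dataset Elements step (here, nearest by squared Euclidean distance to any issue element), and `accept` is a hypothetical callback standing in for your manual review of each suggestion.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def suggest_similar(embeddings, issue_ids, k):
    """Hypothetical step 1: the k non-issue elements nearest to the issue."""
    candidates = [eid for eid in embeddings if eid not in issue_ids]
    candidates.sort(
        key=lambda eid: min(
            squared_distance(embeddings[eid], embeddings[i]) for i in issue_ids
        )
    )
    return candidates[:k]

def grow_issue(embeddings, issue_ids, accept, k=2, max_rounds=5):
    """Repeat suggest -> review -> add until a round adds nothing new."""
    issue = set(issue_ids)
    for _ in range(max_rounds):
        suggestions = suggest_similar(embeddings, issue, k)
        accepted = [eid for eid in suggestions if accept(eid)]  # step 2
        if not accepted:
            break  # nothing worth adding, so stop iterating
        issue.update(accepted)  # step 3: recalculate on the next pass
    return issue

# Hypothetical 1-D embeddings: "a", "b", "c" cluster together; "d", "e" do not.
embeddings = {"a": [0.0], "b": [1.0], "c": [2.0], "d": [10.0], "e": [11.0]}

# Pretend the reviewer confirms only "b" and "c" as the same kind of problem.
grown = grow_issue(embeddings, {"a"}, accept=lambda eid: eid in {"b", "c"})
```

Starting from the single seed "a", the first round adds "b" and "c"; the second round suggests only "d" and "e", which the reviewer rejects, so the loop stops with `grown` equal to `{"a", "b", "c"}`.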

This flow can be used to maximize the effectiveness of your Collection Campaigns. For Collection Campaigns, a well-curated set of issue elements will help you achieve better sampling results. Ordinarily, curating that set is a lengthy and tedious manual process, but similarity search helps speed it up.
