Data Sharing Methodologies

Methodologies for properly allowing Aquarium to render your data

If you don't see your specific configuration mentioned here, please reach out and we'll point you to the right cloud vendor resources!

To state it plainly: Aquarium works without needing to store or host sensitive raw data on Aquarium servers.

Aquarium's Data Model

Aquarium needs the following information to operate:

  • Raw data (images, pointclouds, etc.) in your dataset OR URLs to raw data hosted in your system.

  • Labels (bounding boxes, classifications, etc.) on your dataset.

  • Inferences from your model on your dataset.

  • Any additional metadata (timestamps, device id, etc.) that you'd like Aquarium to index.

  • Optionally, embeddings generated from your own neural network for each image and each object. If this is not available, Aquarium can use a set of pre-trained models to generate embeddings with generally good results.

Aquarium provides insights primarily based on metadata such as data URLs, labels and inferences, and embeddings.

If a user hosts their data on their own S3 bucket / similar service, they can provide a URL to that data instead of the data itself. This has the benefit of reducing the amount of data sent to Aquarium and makes uploads through our client library much faster.
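As a rough illustration of why passing a URL keeps uploads small, the sketch below builds a per-frame record that carries a reference to the image plus metadata rather than the image bytes. The record layout and the `build_frame_record` helper are hypothetical and are not the Aquarium client library's API; the label and `device_id` values are made up for the example.

```python
import json


def build_frame_record(frame_id, image_url, label, metadata=None):
    """Describe one frame by reference: a URL to the image, never its bytes.

    Hypothetical record layout, for illustration only.
    """
    return {
        "frame_id": frame_id,
        "image_url": image_url,
        "label": label,
        "metadata": dict(metadata or {}),
    }


record = build_frame_record(
    frame_id="Abyssinian_1",
    image_url="https://storage.googleapis.com/aquarium-public/quickstart/pets/imgs/Abyssinian_1.jpg",
    label="Abyssinian",  # illustrative label
    metadata={"device_id": "camera-01"},  # illustrative metadata
)

# The serialized record stays tiny no matter how large the image file is.
payload = json.dumps(record)
print(len(payload), "bytes describe this frame")
```

Whatever upload mechanism is used, the point is the same: the record's size depends on the URL and metadata, not on the size of the raw file.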

Raw data is only accessed for two purposes:

  • Visualization

  • Embedding Generation

Data Sharing

Before anything else, we should figure out data sharing, since you probably aren't working with public images. Aquarium offers several easy ways to securely work with your data assets on our platform. If you don't see a solution here, reach out -- we’ve worked with many enterprise IT teams on unique security schemes too.

Options with an asterisk (*) only allow your users to view raw data (images, point clouds, etc.) and never make that data accessible to Aquarium's servers.

This increased data privacy does mean that users will have to provide their own "image embeddings," as we won't be able to compute them for you. Embeddings are used for some features around clustering and similarity search.
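To make "similarity search" concrete, here is a minimal sketch of nearest-neighbor lookup over embeddings using cosine similarity. It is a generic technique shown for illustration, not a description of how Aquarium implements the feature; the function names and the three-dimensional toy vectors are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings, top_k=2):
    """Rank every other item by similarity to the query item."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings; real ones would come from your own model.
embeddings = {
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.1],
    "dog_1": [0.1, 0.9, 0.2],
}
print(most_similar("cat_1", embeddings, top_k=1))
```

Because this kind of lookup only needs the vectors, it can run without any access to the underlying images.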

Your data is free to share publicly.

This is the easiest option, assuming you're working on data without any security restrictions. Anywhere you need to provide a URL to an asset, just provide any URL that's accessible on the public internet. For example, the toy dataset example uses public URLs like the following:

image_url = "https://storage.googleapis.com/aquarium-public/quickstart/pets/imgs/Abyssinian_1.jpg"
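If your public assets share a common prefix, a small helper can build each URL from a file name. This is only a sketch: `public_image_url` is a hypothetical helper, and the prefix is taken from the example URL above.

```python
from urllib.parse import quote

# Prefix taken from the example URL; swap in your own public bucket.
BASE_URL = "https://storage.googleapis.com/aquarium-public/quickstart/pets/imgs/"


def public_image_url(filename, base_url=BASE_URL):
    """Join the prefix and a file name, percent-encoding unsafe characters."""
    return base_url + quote(filename)


image_url = public_image_url("Abyssinian_1.jpg")
print(image_url)
```

Percent-encoding matters only for file names containing spaces or other reserved characters; plain names like the one above pass through unchanged.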

Anonymous Mode

Aquarium currently doesn't have an on-prem offering. However, most users don't actually need on-prem (outside of specific health care and defense contexts) because Aquarium's Anonymous Mode satisfies their requirements for data security.

Anonymous Mode is a mode of operation that allows users to get full value out of Aquarium while protecting their sensitive data. To do so, users must provide access-controlled URLs that only they are allowed to see, and then submit their own embeddings so Aquarium doesn't need to generate them.

To learn more about Anonymous Mode, see the Anonymous Mode page in our docs.
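The two requirements for Anonymous Mode described above (an access-controlled URL and a user-supplied embedding for every item) can be sanity-checked before upload. The sketch below is an assumption-laden illustration: the record layout, the `check_anonymous_mode_ready` helper, and the `data.example.com` URLs are all hypothetical, and the same-dimension check is a general sanity check for embeddings from a single model rather than a documented Aquarium rule.

```python
def check_anonymous_mode_ready(records):
    """Return a list of problems; an empty list means every record
    has both a data URL and an embedding of a consistent size."""
    problems = []
    expected_dim = None
    for record in records:
        frame_id = record.get("frame_id", "<missing id>")
        if not record.get("image_url"):
            problems.append(f"{frame_id}: no data URL")
        embedding = record.get("embedding")
        if not embedding:
            problems.append(f"{frame_id}: no embedding")
            continue
        if expected_dim is None:
            expected_dim = len(embedding)
        elif len(embedding) != expected_dim:
            problems.append(
                f"{frame_id}: embedding has {len(embedding)} dims, "
                f"expected {expected_dim}"
            )
    return problems


# Hypothetical records pointing at access-controlled URLs.
records = [
    {
        "frame_id": "frame_001",
        "image_url": "https://data.example.com/private/frame_001.jpg",
        "embedding": [0.12, 0.40, 0.33],
    },
    {
        "frame_id": "frame_002",
        "image_url": "https://data.example.com/private/frame_002.jpg",
        "embedding": None,  # missing: would force server-side generation
    },
]
for problem in check_anonymous_mode_ready(records):
    print(problem)
```

A check like this catches missing embeddings early, before an upload that would otherwise depend on Aquarium reading the raw data.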

Embedding Generation

Aquarium can generate embeddings for your dataset as long as you are NOT using Anonymous Mode. Aquarium needs access to your raw data in order to generate these embeddings. In practice, raw data is touched in the following ways:

  • When users explore their datasets in Aquarium, the client-side code references a data URL to render a visualization in their browser.

  • If users do not provide their own embeddings, Aquarium will load the images a single time to run pretrained models on the raw data to generate embeddings.

If a user provides access-controlled data URLs and embeddings they generated themselves, Aquarium servers won't require access to the raw data to derive useful insights. In other words, a user can have full access to the functionality of Aquarium without ever exposing their raw data!
