Data Privacy and Anonymous Mode

A common question we are asked is: what data do we expose to Aquarium and how do we make sure it's safe?

To state it plainly: Aquarium works without needing to store or host sensitive raw data on Aquarium servers.

Aquarium's Data Model

Aquarium needs the following information to operate:

  • Raw data (images, pointclouds, etc.) in your dataset OR URLs to raw data hosted in your system.

  • Labels (bounding boxes, classifications, etc.) on your dataset.

  • Inferences from your model on your dataset.

  • Any additional metadata (timestamps, device id, etc.) that you'd like Aquarium to index.

  • Optionally, embeddings generated from your own neural network for each image and each object. If this is not available, Aquarium can use a set of pre-trained models to generate embeddings with generally good results.
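As a rough illustration of the list above, the information for one image could be pictured as a single record. This is a minimal sketch only: the class name `FrameRecord`, its field names, and the example URL are hypothetical and are not Aquarium's actual schema or client API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class FrameRecord:
    """Hypothetical per-image record; field names are illustrative only."""

    frame_id: str
    # URL to raw data hosted in your own system (instead of the data itself).
    image_url: str
    # Labels and inferences, e.g. bounding boxes or classifications.
    labels: List[Dict[str, Any]] = field(default_factory=list)
    inferences: List[Dict[str, Any]] = field(default_factory=list)
    # Additional metadata to index, e.g. timestamps or device id.
    metadata: Dict[str, Any] = field(default_factory=dict)
    # Optional embedding generated by your own neural network.
    embedding: Optional[List[float]] = None

    def needs_server_side_embedding(self) -> bool:
        """True when no embedding was supplied, so one would have to be generated."""
        return self.embedding is None


record = FrameRecord(
    frame_id="frame_0001",
    image_url="https://data.example.com/frame_0001.jpg",  # hypothetical URL
    metadata={"device_id": "cam-3"},
)
print(record.needs_server_side_embedding())
```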

Aquarium provides insights primarily based on metadata such as data URLs, labels and inferences, and embeddings. If a user hosts their data in their own S3 bucket or a similar service, they can provide a URL to that data instead of the data itself. This reduces the amount of data sent to Aquarium and makes uploads through our client library much faster.

Raw data is only accessed for two purposes: visualization and embedding generation.

  • When users explore their datasets in Aquarium, the client-side code references a data URL to render a visualization in their browser.

  • If users do not provide their own embeddings, Aquarium will load the raw data a single time and run pre-trained models on it to generate embeddings.

If a user provides access-controlled data URLs and embeddings they generated themselves, Aquarium servers won't require access to the raw data to derive useful insights. In other words, a user can have full access to the functionality of Aquarium without ever exposing their raw data! This leads us to Anonymous Mode.

Anonymous Mode

Aquarium currently doesn't have an on-prem offering. However, most users don't actually need on-prem (outside of specific health care and defense contexts) because Aquarium's Anonymous Mode satisfies their requirements for data security.

Anonymous Mode is a mode of operation that allows users to get full value out of Aquarium while protecting their sensitive data. To do so, users must provide access-controlled URLs that only they are allowed to see, and then submit their own embeddings so Aquarium doesn't need to generate them.

Generating Access-Controlled URLs

There are many ways to share data -- here are a few common approaches, ordered from least to most locked-down:

Authenticated URLs (Aquarium Auth Headers)

Users can provide URLs that are protected by standard authentication measures, such as URL signing endpoints for raw data stored in their own S3 / GCS buckets. Requests from Aquarium will include auth headers that identify Aquarium, so users can allow Aquarium to see data without exposing it to the wider world.
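To make the idea of a signed URL concrete, here is a minimal sketch of a generic HMAC-signed URL with an expiry time, using only the Python standard library. It is an illustration of the general technique, not the S3 or GCS presigning scheme and not Aquarium's auth header format. The secret, the host `data.example.com`, and the `expires` / `sig` parameter names are all assumptions.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical secret known only to your signing endpoint.
SECRET_KEY = b"replace-with-a-real-secret"


def _signature(path: str, expires: int) -> str:
    """HMAC-SHA256 over the resource path and its expiry timestamp."""
    message = f"{path}:{expires}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def sign_url(base_url: str, path: str, ttl_seconds: int = 300,
             now: Optional[float] = None) -> str:
    """Return a URL that is only valid for ttl_seconds."""
    issued_at = time.time() if now is None else now
    expires = int(issued_at + ttl_seconds)
    query = urlencode({"expires": expires, "sig": _signature(path, expires)})
    return f"{base_url}{path}?{query}"


def verify_url(url: str, now: Optional[float] = None) -> bool:
    """Check that the signature matches the path and has not expired."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    try:
        expires = int(params["expires"][0])
        signature = params["sig"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    if current > expires:
        return False
    # Constant-time comparison avoids leaking signature bytes via timing.
    return hmac.compare_digest(signature, _signature(parts.path, expires))


url = sign_url("https://data.example.com", "/images/0001.jpg")
print(url, verify_url(url))
```

In practice, most teams would rely on their storage provider's own presigned URL feature rather than a hand-rolled scheme like this one.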

Authenticated URLs (Company Identity Provider)

If your company has more complex access controls, we can also discuss integrating with identity providers like Okta and including that identity information when requesting resources. This means that all requests must come from an environment where the user has authenticated with the company's identity provider.

IP Restricted URLs

If users do not want Aquarium to have read access to their data at all, they can restrict permissions on their URLs to only be accessible to users within their corporate network or VPN. This way, Aquarium's servers won't have read access to the raw data. When someone uses Aquarium from within an approved network, the Aquarium browser client will be able to access and render the raw data correctly.

This can also be done in conjunction with other access control schemes, such as authenticated image signing URLs.
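The network restriction itself is usually configured in your storage service or proxy, but the underlying check is simple. The sketch below shows the idea with Python's standard `ipaddress` module. The CIDR ranges and the `is_allowed` helper are hypothetical placeholders for your own corporate network or VPN ranges.

```python
import ipaddress

# Hypothetical ranges: replace with your corporate network / VPN CIDR blocks.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("203.0.113.0/24"),
]


def is_allowed(client_ip: str) -> bool:
    """Return True only if client_ip falls inside an approved network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        # Malformed addresses are rejected rather than raising.
        return False
    return any(address in network for network in ALLOWED_NETWORKS)


print(is_allowed("10.1.2.3"))  # inside 10.0.0.0/8
print(is_allowed("8.8.8.8"))   # outside every approved range
```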

Serving Locally Hosted Images

Aquarium's only requirement is that images are available via an HTTP request from the user's browser. Users can host data from their own computer if needed, which is particularly useful for fast experimentation workflows. You can serve a folder of local images with a simple HTTP server by installing Python or Node.js and running one of the following commands:

# Python 3
python3 -m http.server 5000
# Python 2
python -m SimpleHTTPServer 5000
# Node.js (CORS enabled)
npx http-server --cors='*' --port=5000

If you're working with semantic segmentation or point cloud data, you'll need to use a local file server that supports Cross-Origin Resource Sharing (CORS). The recommended NPM package above supports it as written.

If you are unable to use an NPM package, please reach out and we can get you set up.

Afterwards, you can submit URLs to Aquarium that are formatted like http://localhost:5000/{image_path}.jpg. When the Aquarium client tries to render these URLs, it will load the images served by the HTTP server on your local machine with minimal latency.
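For readers who want a CORS-enabled local server without an NPM package, one possible approach is a small Python script that adds an `Access-Control-Allow-Origin` header to every response. This is a sketch built on the standard library (Python 3.7+), not an official Aquarium tool. The names `CORSRequestHandler`, `make_server`, and `local_image_url` are hypothetical, and the server speaks plain HTTP.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import quote


class CORSRequestHandler(SimpleHTTPRequestHandler):
    """Static file handler that allows cross-origin reads from any origin."""

    def end_headers(self):
        # Add the CORS header just before the header block is finalized.
        self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()


def make_server(directory: str, port: int = 5000) -> ThreadingHTTPServer:
    """Create a server for `directory`, bound to localhost only."""
    handler = partial(CORSRequestHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


def local_image_url(image_path: str, port: int = 5000) -> str:
    """Build the URL to submit for a file relative to the served directory."""
    return f"http://localhost:{port}/{quote(image_path.lstrip('/'))}"


# To serve a folder, uncomment the line below (it blocks until interrupted):
# make_server("path/to/images", 5000).serve_forever()
```

Binding to `127.0.0.1` keeps the files reachable only from your own machine, which matches the local experimentation workflow described above.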

Generating Embeddings

Aquarium needs access to raw data to generate embeddings when they are not provided by a user. However, if a user generates and submits their own embeddings, Aquarium can do all of its analysis on the user-provided embeddings. For more instructions on how to generate your own embeddings, look here!