Data Sharing Methodologies

Methodologies for properly allowing Aquarium to render your data

If you don't see your specific configuration mentioned here, please reach out and we'll point you to the right cloud vendor resources!

To state it plainly: Aquarium works without needing to store or host sensitive raw data on Aquarium servers.

Aquarium's Data Model

Aquarium needs the following information to operate:

Raw data (images, pointclouds, etc.) in your dataset OR URLs to raw data hosted in your system.
Labels (bounding boxes, classifications, etc.) on your dataset.
Inferences from your model on your dataset.
Any additional metadata (timestamps, device id, etc.) that you'd like Aquarium to index.
Optionally, embeddings generated from your own neural network for each image and each object. If this is not available, Aquarium can use a set of pre-trained models to generate embeddings with generally good results.
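
As a rough illustration of the items listed above, one datapoint could be modelled as the record below. This is only a sketch: FrameRecord, its field names, and the example label and metadata values are hypothetical and are not part of the Aquarium client library.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class FrameRecord:
    """Hypothetical container for one datapoint; not an Aquarium API type."""

    frame_id: str
    image_url: str  # URL to raw data hosted in your own system
    labels: List[Dict[str, Any]] = field(default_factory=list)
    inferences: List[Dict[str, Any]] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)
    # Leave as None if you want embeddings generated from pre-trained models.
    embedding: Optional[List[float]] = None


# Illustrative values only.
frame = FrameRecord(
    frame_id="Abyssinian_1",
    image_url="https://storage.googleapis.com/aquarium-public/quickstart/pets/imgs/Abyssinian_1.jpg",
    labels=[{"type": "classification", "value": "Abyssinian"}],
    metadata={"device_id": "cam-01"},
)
```

Only the URL, labels, inferences, metadata, and (optionally) embeddings appear in the record; the raw image bytes themselves are never part of it.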

Aquarium provides insights primarily based on metadata such as data URLs, labels and inferences, and embeddings.

If a user hosts their data on their own S3 bucket / similar service, they can provide a URL to that data instead of the data itself. This has the benefit of reducing the amount of data sent to Aquarium and makes uploads through our client library much faster.

Raw data is only accessed for two purposes:

Visualization
Embedding Generation

Before anything else, we should figure out data sharing, since you probably aren't working with public images. Aquarium offers several easy ways to securely work with your data assets on our platform. If you don't see a solution here, reach out -- we’ve worked with many enterprise IT teams on unique security schemes too.

Options with an asterisk (*) will only allow your users to view raw data (images, point clouds, etc.) and will never make them accessible to Aquarium's servers.

This increased data privacy does mean that users will have to provide their own "image embeddings," as we won't be able to compute them for you. Embeddings are used for some features around clustering and similarity search.
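
To make the role of embeddings more concrete, here is a minimal similarity-search sketch over a handful of made-up vectors. It is not Aquarium's implementation; the function names and the toy embeddings are assumptions for illustration, and it assumes every vector is non-zero and has the same length.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_id: str, embeddings: Dict[str, List[float]]) -> str:
    """Return the id of the embedding closest to the query, excluding itself."""
    query = embeddings[query_id]
    candidates = (key for key in embeddings if key != query_id)
    return max(candidates, key=lambda key: cosine_similarity(query, embeddings[key]))


# Toy three-dimensional embeddings; real ones come from your own network.
embeddings = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0],
    "img_c": [0.0, 1.0, 0.0],
}
nearest = most_similar("img_a", embeddings)  # "img_b"
```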

Your data is free to share publicly.

This is the easiest, assuming you're working on data without any security restrictions. Anywhere you need to provide a URL to an asset, just provide any URL that's accessible on the public internet. For example, the above toy dataset example uses public URLs like the following:

image_url = "https://storage.googleapis.com/aquarium-public/quickstart/pets/imgs/Abyssinian_1.jpg"

Your data needs to be kept secure, and it's ok for Aquarium to host a copy.

Letting Aquarium issue you a storage bucket and host your data lets us manage many of the annoying details that go into producing a snappy user experience. Browser cache control headers, CORS settings, user access credentials -- we'll handle all of these for you.

We'll issue you a Google Cloud Storage bucket of the form gs://aquarium-customer-abcd1234/ and a Google Cloud service account with read/write permissions for that bucket. If your organization uses Google Cloud, we can also directly grant permissions to your admin users.

To upload data:

# Install gsutil, google cloud's storage CLI utility
# https://cloud.google.com/storage/docs/gsutil_install#sdk-install

gsutil --version

# Activate the service account credentials
# https://cloud.google.com/sdk/gcloud/reference/auth/activate-service-account

gcloud auth activate-service-account \
    aquarium-customer-you@aquarium-266018.iam.gserviceaccount.com \
    --key-file /path/to/credentials.json

# Copy directories up! This command will recursively copy a directory
# Feel free to check out the docs for more options:
#    https://cloud.google.com/storage/docs/gsutil/commands/cp

gsutil -m rsync -r \
    ./projectA/imgs/ gs://aquarium-customer-abcd1234/projectA/imgs/

To reference the data:

image_url = "https://storage.cloud.google.com/aquarium-customer-abcd1234/path/to/image.png"
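
If you track uploaded files by their gs:// paths, a small helper can derive the URL form shown above. This is a sketch under assumptions: gs_path_to_url is a hypothetical helper (not part of any Aquarium or Google library), and the file name cat.png is made up.

```python
from urllib.parse import quote


def gs_path_to_url(gs_path: str) -> str:
    """Map gs://bucket/key to https://storage.cloud.google.com/bucket/key."""
    prefix = "gs://"
    if not gs_path.startswith(prefix):
        raise ValueError(f"expected a gs:// path, got {gs_path!r}")
    # Percent-encode characters such as spaces, but keep '/' separators.
    return "https://storage.cloud.google.com/" + quote(gs_path[len(prefix):])


image_url = gs_path_to_url("gs://aquarium-customer-abcd1234/projectA/imgs/cat.png")
```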

Your data is in a private storage bucket, and you don't want Aquarium to ever access the raw data. (*)

With this option you supply your own embeddings, so the underlying raw data is only needed for visualization. That means you can provide bucket paths that point to secure resources. When your users want to view the data in the application, they use local credentials to load it. Neither the credentials nor the data will ever leave your users' browser / local device.

To reference the data, simply use the bucket path as the data url:

image_url = "s3://yourbucket/path/to/img.jpg"
image_url = "gs://yourbucket/path/to/img.jpg"
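
Before uploading, it can help to sanity-check that each data URL really is a bucket path of one of the two forms above. The following is a minimal sketch using only the Python standard library; BucketPath and parse_bucket_path are hypothetical names, not part of the Aquarium client.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class BucketPath(NamedTuple):
    provider: str  # "s3" or "gs"
    bucket: str
    key: str


def parse_bucket_path(url: str) -> BucketPath:
    """Split s3://bucket/key or gs://bucket/key into its parts."""
    parsed = urlparse(url)
    if parsed.scheme not in ("s3", "gs"):
        raise ValueError(f"unsupported scheme in {url!r}")
    key = parsed.path.lstrip("/")
    if not parsed.netloc or not key:
        raise ValueError(f"missing bucket or object key in {url!r}")
    return BucketPath(parsed.scheme, parsed.netloc, key)


path = parse_bucket_path("s3://yourbucket/path/to/img.jpg")
```

Note that this simple parser treats "?" and "#" as URL delimiters, so object keys containing those characters would need extra handling.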

Then, each user can go to https://illume.aquariumlearning.com/settings/user to point to a local credentials file. This file is expected to match the formats provided by your cloud provider's admin/IAM console.

Here is a guide to generating local credentials in AWS!

Doing this securely requires modern browser capabilities that aren't available in all browsers yet. We recommend upgrading to the latest version of Google Chrome or Microsoft Edge if you want to use this data sharing scheme.

Anonymous Mode

Aquarium currently doesn't have an on-prem offering. However, most users don't actually need on-prem (outside of specific health care and defense contexts) because Aquarium's Anonymous Mode satisfies their requirements for data security.

Anonymous Mode is a mode of operation that allows users to get full value out of Aquarium while protecting their sensitive data. To do so, users must provide access-controlled URLs that only they are allowed to see, and then submit their own embeddings so Aquarium doesn't need to generate them.

To learn more about Anonymous Mode, here is a page in our docs!

Embedding Generation

Aquarium can generate embeddings for your dataset as long as you are NOT using Anonymous Mode. Aquarium needs access to your raw data in order to generate these embeddings.

When users explore their datasets in Aquarium, the client-side code references a data URL to render a visualization in their browser.
If users do not provide their own embeddings, Aquarium will load the images a single time to run pretrained models on the raw data to generate embeddings.

If a user provides access-controlled data URLs and embeddings they generated themselves, Aquarium servers won't require access to the raw data to derive useful insights. In other words, a user can have full access to the functionality of Aquarium without ever exposing their raw data!
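
Since Aquarium only needs to read raw data for frames that lack user-provided embeddings, a quick pre-upload check can flag those frames. This is a hedged sketch: frames_missing_embeddings and the dictionary layout are assumptions for illustration, not an Aquarium API.

```python
from typing import Dict, List, Optional, Sequence


def frames_missing_embeddings(
    embeddings_by_frame: Dict[str, Optional[Sequence[float]]],
) -> List[str]:
    """Return the sorted ids of frames that have no embedding, or an empty one."""
    return sorted(
        frame_id
        for frame_id, embedding in embeddings_by_frame.items()
        if embedding is None or len(embedding) == 0
    )


# Toy input: frame_2 has no embedding, so it would need one generated.
missing = frames_missing_embeddings(
    {"frame_1": [0.1, 0.2], "frame_2": None, "frame_3": [0.3, 0.4]}
)
```

An empty result means every frame already carries its own embedding, which is the precondition described above for keeping raw data away from Aquarium's servers.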

