Neural Network Embeddings

One of the unique aspects of Aquarium is its use of neural network embeddings to help with dataset understanding and model improvement. You may also see embeddings referred to as "features" extracted by neural networks.

Traditional "structured" data consists of numbers, strings, etc. that can be easily indexed in a normal SQL database. Understanding the distribution of this data can be as simple as gathering population statistics and plotting a histogram.

However, when looking at unstructured datatypes like imagery and audio that consist of thousands of values, it's extremely difficult to index, query, and plot the distribution of the data with conventional means. This is where neural network embeddings come in!

Neural network embeddings are a representation of what a deep neural network "thought" about a piece of data (from imagery to audio to structured data). When you run a neural network on a datapoint, you can extract an embedding from the activations of intermediate layers in a network.

The embeddings for a datapoint encode complex input data into a relatively short vector of floats. Aquarium can then identify outliers and group together examples in a dataset by analyzing the distances between these embeddings. We can also visualize the distribution of these embeddings (and therefore the underlying data) with the embedding view.
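As a rough illustration of how distances between embeddings can surface outliers, the sketch below scores each datapoint by its mean distance to its k nearest neighbors. This is not a description of Aquarium's internal algorithm: the function name `knn_outlier_scores`, the choice of Euclidean distance, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def knn_outlier_scores(embeddings, k=5):
    """Score each row by its mean Euclidean distance to its k nearest neighbors."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    # Pairwise distance matrix of shape (n, n).
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Ignore each point's zero distance to itself.
    np.fill_diagonal(dists, np.inf)
    # Keep the k smallest distances in every row and average them.
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)


# Synthetic stand-in for real embeddings: 50 points in a tight cluster
# plus one point placed far away from it (index 50).
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=0.1, size=(50, 16))
far_point = np.full((1, 16), 5.0)
embeddings = np.vstack([cluster, far_point])

scores = knn_outlier_scores(embeddings, k=5)
most_anomalous = int(np.argmax(scores))
print(most_anomalous)
```

Datapoints with the highest scores sit far from everything else in embedding space, which makes them reasonable candidates to inspect for malformed or mislabeled data.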

Here is some further reading on embeddings:

How Does Aquarium Use Them?

Aquarium uses embeddings to index and provide insights into your dataset. Embeddings allow Aquarium to:

  • Visualize the distribution of your dataset in the embedding view to find anomalous datapoints, interesting groups of datapoints, and patterns in metadata across the distribution of the dataset. Useful for spotting trends in model failures or data that may be malformed or mislabeled.

  • Compare the distributions of different datasets without requiring labels. For example, differences between train and test sets, labeled training sets vs unlabeled production sets, etc. Useful for finding data where models perform badly because they've never seen that type of data before.

  • Find examples similar to one or more target datapoints. Useful when you want to find and label examples that are similar to previous situations where the model struggles.
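To make the similarity search idea concrete, here is a minimal sketch that ranks datapoints by cosine similarity to a target embedding. The helper `most_similar` and the toy 2-D vectors are hypothetical and only meant to show the mechanics; Aquarium's own search may work differently.

```python
import numpy as np


def most_similar(query, embeddings, top_k=3):
    """Return indices of the top_k rows most cosine-similar to the query."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    query = np.asarray(query, dtype=np.float64)
    # Normalize so that a dot product equals cosine similarity.
    emb_unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    similarities = emb_unit @ query_unit
    # Sort from most to least similar and keep the first top_k.
    return np.argsort(-similarities)[:top_k]


# Toy 2-D "embeddings" so the result is easy to check by eye.
dataset = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])
target = np.array([1.0, 0.05])
neighbors = most_similar(target, dataset, top_k=2)
print(neighbors)
```

With real embeddings, the target would typically be the embedding of a datapoint where the model struggled, and the returned indices point at similar datapoints worth labeling.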

By default, Aquarium has a set of pretrained neural networks that can be used to generate embeddings on standard unstructured datatypes. A user who uploads data to Aquarium will be able to access embeddings generated by a pretrained ImageNet model.

However, users can also upload their own embeddings. These can be generated locally from the same pretrained models for the sake of data privacy. They can also be generated by the models that users train on their custom datasets, which tend to produce more specialized, higher-quality embeddings.

How Do I Generate My Own?

Using A Pretrained Model

By default, Aquarium uses its own pretrained models to generate embeddings for user data. However, it's quite easy to generate your own embeddings locally! For example, to generate an embedding for an image, you can install TensorFlow (which includes Keras) and run the following code:

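import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input
from tensorflow.keras.preprocessing import image

# include_top=False and pooling="avg" are assumptions: they drop the classifier
# head and average-pool the last feature map into one 1280-long vector per image.
model = MobileNetV2(
    input_shape=(224, 224, 3),
    include_top=False,
    pooling="avg",
)


def generate_embeddings(input_pil_image):
    # Resize to the model's input size and add a batch dimension.
    resized_image = input_pil_image.resize((224, 224))
    x = image.img_to_array(resized_image)
    x = np.expand_dims(x, axis=0)
    # Scale pixel values the way MobileNetV2 expects.
    x = preprocess_input(x)
    embeddings = model.predict(x).flatten().tolist()
    return embeddings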

Classification Models

We need to generate one embedding per datapoint (i.e., per image / audio clip). Generating embeddings from classification models is very similar to the procedure above for the pretrained model.

For most classification models, you want to extract activations from the layer right before the last fully connected (FC) + Softmax layers. The last layer will have dimensions matching the number of classes you're training the model to predict, and the vector of activations of the second-to-last layer will typically be between 500 and 5,000 elements long.

For example, in a normal ResNet-50 model, you'd want to extract embeddings from the output of the pool5 layer, which is 2048 elements long.
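A minimal sketch of this extraction with TensorFlow Keras is shown below. It builds a tiny toy classifier rather than a real ResNet-50, and the layer names `penultimate` and `class_probs` are hypothetical; with your own model you would look up the name of the layer just before the final FC + Softmax and use that instead.

```python
import numpy as np
from tensorflow import keras

# Toy classifier standing in for a real trained model.
inputs = keras.Input(shape=(32, 32, 3))
x = keras.layers.Conv2D(8, 3, activation="relu")(inputs)
x = keras.layers.GlobalAveragePooling2D()(x)
x = keras.layers.Dense(64, activation="relu", name="penultimate")(x)
outputs = keras.layers.Dense(10, activation="softmax", name="class_probs")(x)
classifier = keras.Model(inputs, outputs)

# Second model that shares the classifier's weights but stops at the
# layer right before the final classification layer.
embedding_model = keras.Model(
    inputs=classifier.input,
    outputs=classifier.get_layer("penultimate").output,
)

# Random batch of 4 "images"; real code would pass preprocessed inputs.
batch = np.random.rand(4, 32, 32, 3).astype("float32")
embeddings = embedding_model.predict(batch, verbose=0)
print(embeddings.shape)
```

Each row of `embeddings` is the embedding for one datapoint. Because the two models share layers, there is no need to retrain or copy weights.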

Detection Models

For detection models, you need to generate embeddings for:

  • Each datapoint (i.e., image, lidar scan, etc.)

  • Each labeled object within that datapoint (each 2D / 3D bounding box, instance mask, etc.)

  • Each inference object detected by your model within that datapoint

The way you extract embeddings for each will depend on whether you're using a two-stage or one-stage model. However, most detection models apply a non-maximum suppression (NMS) postprocessing step to the outputs of the network, and you may need to slightly modify this step so that the embedding for each surviving box is kept alongside it.
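One way to keep embeddings through NMS is to have the NMS routine return the indices of the surviving boxes, then use those indices to select the matching embedding rows. The sketch below is a plain NumPy greedy NMS written for illustration; the function names, the `[x1, y1, x2, y2]` box format, and the toy data are assumptions, and a real detection framework will have its own NMS implementation to adapt.

```python
import numpy as np


def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms_keep_indices(boxes, scores, iou_threshold=0.5):
    """Greedy NMS that returns the indices of the surviving boxes."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Drop every remaining box that overlaps the kept box too much.
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return np.array(keep, dtype=int)


# Three candidate boxes; the second heavily overlaps the first.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
embeddings = np.arange(12, dtype=float).reshape(3, 4)  # one row per candidate box

keep = nms_keep_indices(boxes, scores)
# The same indices select both the final boxes and their embeddings.
kept_boxes, kept_embeddings = boxes[keep], embeddings[keep]
```

The key design choice is returning indices rather than filtered boxes, so any per-box array (embeddings, class scores, masks) can be filtered consistently.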

Two-Stage Models

Diagram from https://ronjian.github.io/blog/2018/05/16/Understand-Mask-RCNN

Two-stage models like Faster RCNN or Mask RCNN have a sub-network that proposes bounding boxes (known as the RPN, or region proposal net) and another sub-network that predicts final boxes and classifications from those proposal boxes (known as the region of interest / ROI classification head).

  • For the whole-image embedding, you can extract activations from the feature map that the RPN runs on, which should be the smallest feature map at the end of the convolutional backbone (marked in the diagram as CNN). If the activations are very large, you can average pool them to reduce the size of that feature map to something manageable. We recommend a final size of fewer than 10,000 elements.

  • For per-inference object embeddings, you can look at the ROI classification head and take the last FC layer before the box regression + classification (in the diagram, the section that says "fully connected layers"). This is typically quite compact, around 500 elements long.

  • For per-label object embeddings, the procedure is similar to per-inference object embeddings but a little trickier, since the boxes proposed by the RPN tend not to be the same as the label boxes. However, two-stage model implementations often have an API where you can manually input ROIs instead of using the RPN to propose them. By manually inputting each label box as an ROI, you can run the per-ROI section of the net and get out an embedding for each label box.

There's some room for experimentation here. For example, with per-object embeddings, you can use the output of the ROI align layer (marked in the diagram as "fixed size feature map") instead of the FC layers and see what happens. Similarly, you may want to play with how aggressively you average pool down the per-image feature map.
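The average pooling step for the whole-image embedding can be sketched as follows. The helper `pool_to_budget` is hypothetical: it assumes a channels-last `(H, W, C)` feature map and picks the largest square grid that keeps the flattened result under an element budget, which is only one of several reasonable ways to choose how aggressively to pool. The feature map shape used here is made up for the example.

```python
import numpy as np


def pool_to_budget(feature_map, max_elements=10_000):
    """Average-pool an (H, W, C) feature map onto the largest square grid
    whose flattened size stays below max_elements (never smaller than 1x1)."""
    h, w, c = feature_map.shape
    grid = 1
    while (grid + 1) ** 2 * c < max_elements and grid + 1 <= min(h, w):
        grid += 1
    # Split rows and columns into `grid` nearly equal bins and average each bin.
    row_bins = np.array_split(np.arange(h), grid)
    col_bins = np.array_split(np.arange(w), grid)
    pooled = np.empty((grid, grid, c), dtype=feature_map.dtype)
    for i, rows in enumerate(row_bins):
        for j, cols in enumerate(col_bins):
            pooled[i, j] = feature_map[np.ix_(rows, cols)].mean(axis=(0, 1))
    return pooled.flatten()


# Random stand-in for a backbone feature map (870,400 elements).
rng = np.random.default_rng(0)
feature_map = rng.random((25, 34, 1024), dtype=np.float32)
embedding = pool_to_budget(feature_map)
print(feature_map.size, "->", embedding.size)
```

Lowering `max_elements` pools more aggressively, which is an easy knob to turn when experimenting with the per-image embedding size.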

One-Stage Models

Diagram from https://lilianweng.github.io/lil-log/2017/12/31/object-recognition-for-dummies-part-3.html

TBD, contact [email protected] for more info.

Segmentation Models

Diagram from https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/

Generating embeddings is similar to the classification task in that you only need to provide one embedding per datapoint (image, lidar scan, etc.).

However, the structure of the network is somewhat different, since there's no downsampling to a compact feature representation at the end of the network. Instead, most segmentation networks will spatially downsample an input image to a somewhat compact representation in the middle of the network, then progressively upsample this back to the output segmentation map at the end of the network, which has the same dimensions as the input image.

These segmentation maps at the end of the network are very large and are not great for embedding analysis! Instead, you will want to extract the embeddings from the most compact layer in the middle of the network. In the U-Net diagram above, this is the 54x54x512 feature map on the bottom right of the diagram, right before the up-conv operation. This layer is still very large (1,492,992 elements for U-Net), so you will want to average pool it down to something more reasonable, ideally fewer than 10,000 elements.
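For the U-Net numbers above, a fixed pooling window is enough to get under the 10,000-element target. The sketch below uses random values in place of real activations, and the helper `average_pool` and the window size of 18 are assumptions chosen for illustration: an 18x18 window turns the 54x54x512 layer into a 3x3x512 grid, or 4,608 elements.

```python
import numpy as np


def average_pool(feature_map, window):
    """Non-overlapping average pooling of an (H, W, C) array.

    H and W must both be divisible by `window`.
    """
    h, w, c = feature_map.shape
    if h % window or w % window:
        raise ValueError("height and width must be divisible by window")
    # Group pixels into window x window blocks, then average inside each block.
    blocks = feature_map.reshape(h // window, window, w // window, window, c)
    return blocks.mean(axis=(1, 3))


# Random stand-in for the 54x54x512 U-Net activations.
activations = np.random.default_rng(0).random((54, 54, 512), dtype=np.float32)
pooled = average_pool(activations, window=18)  # 54 / 18 = 3 cells per side
embedding = pooled.flatten()
print(activations.size, "->", embedding.size)  # 1492992 -> 4608
```

Other window sizes that divide 54 evenly (for example 27, giving 2x2x512 = 2,048 elements) trade spatial detail for a shorter embedding.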

Again, there's still room for experimentation here. You can try taking the layer with the full image width and height but before the class-wise segmentation map output, then try aggressively average pooling it down to a much smaller size.

Still Confused? Ask For Help!

Embedding generation using your own model can be a little more involved depending on your task and model architecture. Reach out to [email protected] if you still have questions and we can help out!