Metrics Methodology

Aquarium automatically computes performance metrics for common tasks. Some of these, like image classification, are well defined and consistent across most users and domains. Others, like object detection, tend to vary in implementation from one library to another.

In general, we aim to provide tools to diagnose performance based on how a shipped model would perform, as opposed to overall model performance. For example, many high-level metrics like mAP or AUC aim to capture the performance of a model across the full range of confidence thresholds. This is ideal when you're trying to evaluate a model's overall performance. However, for an ML product, you often make an informed decision about your preferred precision vs. recall trade-off and pick the point on the curve that represents it. In other words, you usually ship a single confidence threshold.

Here are some more details about our interpretation of common tasks and how we compute their metrics. If you have different logic you use to evaluate performance, you can also provide your own custom metrics.

Object Detection

In order to offer a confusion matrix (vs. just FP/TP/FN for each class), we have to consider all detection and label classes simultaneously, as opposed to considering each class individually like you might see in pycocotools and some other libraries/benchmarks.

That means we model it as a two-phased process:

  • A global matching stage (with unmatched entries implicitly matching to 'background')

  • A pairwise stage that computes standard TP/FP metrics on the matched pairs, like in a typical classification task.

However, the matching phase has several interpretations, and the most "correct" one seems to vary a bit based on domain and model architecture. This is our default metrics computation logic, presented in roughly computational order:


We assume you're evaluating the model with some thresholds to either consider or ignore a detection. Aquarium currently supports two threshold values:

  • Confidence Threshold: A value between 0.0 and 1.0, such that all detections below the threshold will be discarded.

  • IOU Threshold: A value between 0.0 and 1.0, such that all detections whose best-match IOU to ground truth is below the threshold will be discarded.

The confidence threshold is necessary to evaluate your model as it would be deployed. The IOU threshold allows you to punish very poorly localized detections as if the model never detected them.

Filter and Adjust Labels

Sometimes you have more information in the ground truth labels than your model tries to predict. Based on the project's label class map, we filter and/or re-classify the raw labels to match the task the model is performing.
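As a rough illustration of this step, the sketch below applies a class map to raw labels. The `CLASS_MAP` contents, the `adjust_labels` helper, and the dictionary layout of a label are all hypothetical and are not Aquarium's actual data model; the only idea taken from the text is that raw classes are either re-classified or filtered out.

```python
from typing import Dict, List, Optional

# Hypothetical class map: raw label class -> class the model predicts.
# A value of None means "the model does not predict this, drop the label".
CLASS_MAP: Dict[str, Optional[str]] = {
    "sedan": "car",
    "suv": "car",
    "pedestrian": "pedestrian",
    "traffic_cone": None,
}


def adjust_labels(raw_labels: List[dict], class_map: Dict[str, Optional[str]]) -> List[dict]:
    """Re-classify labels through the class map and drop unmapped ones."""
    adjusted = []
    for label in raw_labels:
        # Classes missing from the map are treated like None (filtered out).
        mapped = class_map.get(label["class"])
        if mapped is None:
            continue
        # Copy the label so the raw input is left untouched.
        adjusted.append({**label, "class": mapped})
    return adjusted
```

In this sketch, "sedan" and "suv" labels both become "car", while "traffic_cone" labels are removed before any matching happens.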

Pairwise IOU Calculation

For each pair between the labels and detections, we compute an IOU (intersection-over-union) score between 0.0 and 1.0 for how well the geometry overlaps.

For axis-aligned bounding boxes, we use the standard, well-defined IOU computation: the area of the intersection of the two boxes divided by the area of their union.
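For reference, a minimal sketch of the usual IOU computation for two axis-aligned boxes follows. The `(x_min, y_min, x_max, y_max)` tuple layout and the `box_iou` name are assumptions made for the example, not Aquarium's API.

```python
from typing import Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Return the intersection-over-union of two axis-aligned boxes."""
    # Width and height of the overlap region, clamped to zero when disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection

    # Degenerate (zero-area) boxes are given an IOU of 0.0.
    return intersection / union if union > 0 else 0.0
```

For example, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 region, giving an IOU of 1 / (4 + 4 - 1) = 1/7.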

For multi-polygon (Polygon List) type labels, we first take the geometry and attempt, where necessary, to fix minor geometry issues (such as bowties at corners) so that the shapes are true polygons. After that, we use the shapely library to compute the intersection and union of the two collections of polygons.

For 3D Cuboid type labels, we compute the 2D bird's-eye view IOU. To do so, we first apply all pose transforms (rotation + translation) to the cuboid geometry, then take the X and Y components to construct a 2D polygon. We then perform the same operations as for the multi-polygon case.
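The projection step can be sketched as follows, under a simplifying assumption: the cuboid is described by a center, a length and width, and a single yaw rotation about the vertical axis. A full pose with roll and pitch would need a complete 3D rotation before dropping the Z component. The `cuboid_bev_footprint` name and its parameters are hypothetical.

```python
import math
from typing import List, Tuple


def cuboid_bev_footprint(
    center_x: float, center_y: float, length: float, width: float, yaw: float
) -> List[Tuple[float, float]]:
    """Return the four (x, y) corners of a cuboid seen from above.

    Assumes the only rotation is `yaw` (radians, about the vertical axis).
    """
    half_l, half_w = length / 2.0, width / 2.0
    # Corners in the cuboid's own frame, in counter-clockwise order.
    local_corners = [
        (half_l, half_w),
        (-half_l, half_w),
        (-half_l, -half_w),
        (half_l, -half_w),
    ]
    cos_yaw, sin_yaw = math.cos(yaw), math.sin(yaw)
    # Rotate each corner by yaw, then translate by the cuboid center.
    return [
        (center_x + cos_yaw * x - sin_yaw * y, center_y + sin_yaw * x + cos_yaw * y)
        for x, y in local_corners
    ]
```

The resulting corner list is an ordinary 2D polygon, so it can be handed to the same polygon intersection and union step used for multi-polygon labels.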

Matching Logic

We first apply the confidence threshold, discarding all inferences below it. Then we match GT labels with the remaining inferences based on optimal geometry overlap / localization (typically optimizing IOU).

To properly punish multiple detections, we consider each label and inference to be "used up" by its best-localization match. That is, a label can only be matched to either 0 or 1 inference, and an inference to either 0 or 1 label.

In most cases, that tends to be beneficial and more representative of real-world performance, where objects rarely perfectly overlap and false positives are bad. In others, especially ones where false positives are not a major problem, it can be a sub-optimal metric. In these cases, we recommend you provide your own custom metric in the form of a per-frame confusion matrix.

At the end of this phase, we have two arrays of equal length (commonly referred to as y_true and y_pred in classification report utilities). Each array contains corresponding ground truth and inferred classes, with an implicit "background" value for when a match didn't occur.
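To make the shape of this phase concrete, here is a minimal sketch for a single frame. It is not Aquarium's implementation: it uses a simple greedy strategy (highest IOU first) as a stand-in for the "optimal overlap" matching, and the `match_frame` name, its parameters, and the `min_match_iou` cutoff are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

BACKGROUND = "background"


def match_frame(
    label_classes: Sequence[str],
    det_classes: Sequence[str],
    det_confidences: Sequence[float],
    iou_matrix: Sequence[Sequence[float]],
    confidence_threshold: float,
    min_match_iou: float = 0.0,
) -> Tuple[List[str], List[str]]:
    """Greedy one-to-one matching; iou_matrix[i][j] is label i vs detection j."""
    # Step 1: discard detections below the confidence threshold.
    kept = [j for j, conf in enumerate(det_confidences) if conf >= confidence_threshold]

    # Step 2: list overlapping pairs, best localization first (greedy assumption).
    candidates = sorted(
        (
            (iou_matrix[i][j], i, j)
            for i in range(len(label_classes))
            for j in kept
            if iou_matrix[i][j] > 0.0 and iou_matrix[i][j] >= min_match_iou
        ),
        reverse=True,
    )

    used_labels, used_dets = set(), set()
    y_true: List[str] = []
    y_pred: List[str] = []
    for _iou, i, j in candidates:
        if i in used_labels or j in used_dets:
            continue  # each label and detection is "used up" by its first match
        used_labels.add(i)
        used_dets.add(j)
        y_true.append(label_classes[i])
        y_pred.append(det_classes[j])

    # Step 3: unmatched labels and detections implicitly match "background".
    for i, cls in enumerate(label_classes):
        if i not in used_labels:
            y_true.append(cls)
            y_pred.append(BACKGROUND)
    for j in kept:
        if j not in used_dets:
            y_true.append(BACKGROUND)
            y_pred.append(det_classes[j])
    return y_true, y_pred
```

With one "car" label overlapped by two confident "car" detections, only the better-localized detection is matched; the other is paired with "background" and so counts as a false positive.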

Aggregation + Report Generation

We ultimately want to present these as F1/precision/recall reports. To do so, we first compute the confusion matrix by counting how many instances exist of a given [label_class, predicted_class] pair. If you have filters on the dataset, we will only include instances from images/frames that would be returned by that filter.

To help match the behavior of common systems like pycocotools, we default to ignoring results from images/frames with no ground truth labels present. That is, we don't punish the system for detections on unlabeled data, since those detections may be correct. If you know that your data is fully labeled, you can modify this behavior in the project settings.
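The counting step, including the default of skipping frames without ground truth, might look like the sketch below. The `aggregate_confusion` function, its `skip_unlabeled_frames` flag, and the convention that a frame is a `(y_true, y_pred)` pair are assumptions for illustration; they do not correspond to a documented Aquarium setting name.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple

BACKGROUND = "background"

# Assumed per-frame result for this sketch: a (y_true, y_pred) pair of equal-length lists.
Frame = Tuple[Sequence[str], Sequence[str]]


def aggregate_confusion(frames: Iterable[Frame], skip_unlabeled_frames: bool = True) -> Counter:
    """Count (label_class, predicted_class) pairs across frames."""
    total: Counter = Counter()
    for y_true, y_pred in frames:
        # A frame has ground truth if at least one entry is a real label.
        has_labels = any(cls != BACKGROUND for cls in y_true)
        if skip_unlabeled_frames and not has_labels:
            continue  # default: ignore detections on frames with no labels
        total.update(zip(y_true, y_pred))
    return total
```

Passing `skip_unlabeled_frames=False` in this sketch corresponds to the fully labeled case, where detections on empty frames count as false positives.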

Once we have the full confusion matrix, we can then trivially convert that into a classification report. In the case where a metric would have required dividing by zero (e.g., recall when true positives and false negatives are both zero), we treat the final metric as zero.
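As an illustration of that conversion, the sketch below derives precision, recall, and F1 for one class from a dictionary of confusion counts, returning zero whenever a denominator is zero. The `class_report` and `safe_div` helpers and the `{(label_class, predicted_class): count}` layout are assumptions made for the example.

```python
from typing import Dict, Tuple


def safe_div(numerator: float, denominator: float) -> float:
    """Divide, treating a zero denominator as a result of 0.0."""
    return numerator / denominator if denominator else 0.0


def class_report(counts: Dict[Tuple[str, str], int], cls: str) -> Dict[str, float]:
    """Compute precision/recall/F1 for `cls` from (label, prediction) counts."""
    tp = counts.get((cls, cls), 0)
    # Predicted as cls, but the label was something else (including background).
    fp = sum(n for (true, pred), n in counts.items() if pred == cls and true != cls)
    # Labeled as cls, but predicted as something else (including background).
    fn = sum(n for (true, pred), n in counts.items() if true == cls and pred != cls)

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For a class that never appears in either the labels or the predictions, all three values come out as 0.0 rather than raising a division error.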

Confusion Matrix Queries

When clicking into a confusion matrix cell to view specific examples, we use the same logic to determine whether a label <> inference pair would have contributed to that cell of the confusion matrix.


Why not NMS? Why a confidence threshold instead of sorting by descending confidence?

Non-Maximum Suppression (NMS) solves a similar problem of prioritizing detections based on a combination of confidence and IOU. The main reason we don't use it directly is that we assume most customers are doing NMS somewhere in their stack, and that ultimately they ship a confidence threshold.

So, assuming they're going to pick some cutoffs for what they'd run in production, we start by filtering out the lower-confidence detections. This way we're computing metrics that resemble the final shipped product's performance, as opposed to a metric that more abstractly measures the model's generic performance.

We've seen that if your implementation already does some NMS internally, then at this point you don't care as much about the difference between 0.92 confidence and 0.85 confidence, and so would rather prioritize localization (especially in domains with lots of partial overlap).