Inspecting Model Performance

Understanding your model's performance and how to analyze it through the Model Metrics View

Introduction to Model Metrics View

This page will cover how to use the different features within Aquarium so your team can analyze your model's performance and more efficiently find insights in the underlying datasets.

To get started, select a dataset and at least one inference set in the project path underneath the top navigation bar.

The Model Metrics View is split into two tabs, which we elaborate on in later sections:

Scenarios

Using Scenarios requires setting up model performance segments within your dataset. Learn more about organizing your data with Segments.

The Scenarios tab provides a summary view of your model's performance against pre-defined subsets of your dataset.

  • Scenarios allow you to define target thresholds for your models' performance against a known set of frames, and then evaluate all inference sets against those thresholds. This may be as simple as reaching a target F1 score on the test set, or as complex as multi-metric pass/fail regression tests against domain-specific problems.

Metrics

The Metrics tab provides a high-level overview of the overall dataset or an individual segment, with additional drill-through capability beyond the Scenarios view.

Configuring Thresholds for Model Performance Segments

When you create a Model Performance type segment, you can set thresholds for both precision and recall to easily evaluate the performance of an inference set. These thresholds are especially useful when used with Regression Test type segments because your teams can query the API and retrieve a pass/fail result.

Example Payload From Querying the Results of a Regression Test
# Note: the payload always also includes a
# similar result for your entire dataset, named 'All Frames'.
# The succeeds value will be true or false
# depending on the threshold criteria.

[
   {
      "segment_name":"Far away planes",
      "frame_count":12,
      "precision":{
         "score":0.9716981132075472,
         "threshold":0.8,
         "succeeds":true
      },
      "recall":{
         "score":0.7410071942446043,
         "threshold":0.8,
         "succeeds":true
      },
      "f1_score":{
         "score":0.8408163265306122,
         "threshold":0.7,
         "succeeds":true
      },
      "false_positives":{
         "score":0
      },
      "false_negatives":{
         "score":33
      },
      "total_confusions":{
         "score":3
      },
      "segment_type":"Regression Test",
      "uuid":"UUID_Value"
   }
]
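
As a minimal sketch of consuming this payload in a CI-style check, the following assumes the JSON above has been saved to a local file; the filename and the gating logic are illustrative and not part of the Aquarium client.

import json

# Hypothetical local copy of the JSON returned by your regression test query,
# in the format shown above.
with open("regression_test_results.json") as f:
    payload = json.load(f)

for segment in payload:
    # Gate only on Regression Test segments; "All Frames" and other
    # segment types are informational here.
    if segment["segment_type"] != "Regression Test":
        continue

    checks = {
        name: segment[name]["succeeds"]
        for name in ("precision", "recall", "f1_score")
        if "threshold" in segment[name]
    }
    status = "PASS" if all(checks.values()) else "FAIL"
    print(f"{status}: {segment['segment_name']} "
          f"({segment['frame_count']} frames) -> {checks}")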

Setting a Threshold for a Model Performance Segment

There are two ways to navigate to the page to set a model performance segment's thresholds.

  1. From a segment overview page, click on the Metrics tab

  2. From the Model Metrics View, click on the fly out button in one of the Scenario cards

On the Metrics page, for each metric you'll see a plotted dot representing each inference set associated with the base labeled dataset.

Once you have navigated to the Metrics Page, to set a threshold:

  1. Enter a number or use the arrows to set a value between 0.0 and 1.0 for the desired threshold (you must type the leading 0 first, e.g. 0.8)

  2. Click anywhere outside the input box for the value to take effect

You'll notice that once you set a threshold, the values will turn green or red depending on whether your inference set's metrics are above or below that threshold.

Once you set the thresholds, you'll also see dotted lines that represent the threshold values superimposed on the PR curves in the Model Metrics View:

Scenarios Tab

The Scenarios tab summarizes your models' performance across all defined Model Performance Segments.

Model Performance Segments are grouped into three primary categories:

  • Splits

    • Always includes a segment card for All Frames in the dataset.

    • Typically the test, training and validation subsets of your dataset.

  • Regression Tests

    • Sets of frames within your dataset on which the model must perform to a certain threshold in order to be considered for deployment.

    • Regression tests might be tied to overall business goals, specific model development experiments, difficult domain-specific performance scenarios, etc.

  • Scenarios

    • Any other subset of frames you'd like to evaluate the model's performance on (e.g. data source, labeling provider, embedding clusters, etc.)


From the Scenarios tab, select up to two inference sets to compare model performance.

  • Initially, the metrics calculations will respect the project-wide default IOU and confidence settings.

  • You can then use the metrics settings to adjust the confidence and IOU thresholds for both models together, or for either model independently.

Click the fly out button to open the Segment Details view.

From here you can:

  • View the performance of any uploaded inference set compared to the segment-specific precision, recall and F1 target thresholds

  • Modify the metric target thresholds

  • Manage the segment's elements and segment metadata

Click anywhere in the segment card to open the Metrics tab, pre-filtered to only the frames in that specific segment.

Metrics Tab

Here, you can see a high-level overview of the model's performance on the base dataset or a scenario subset in a few forms:

You can move the sliders for parameters like the confidence threshold and the IOU matching threshold, and the metrics will recompute to reflect the model's performance with those parameters. You can also change the Metric Class.
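
To build intuition for what the sliders do, here is a simplified, single-class sketch of how a confidence threshold and an IOU matching threshold feed into precision and recall. It uses a greedy matcher written purely for illustration; Aquarium's actual matching implementation may differ.

def iou(box_a, box_b):
    # Boxes given as (x1, y1, x2, y2).
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(preds, labels, conf_threshold=0.5, iou_threshold=0.5):
    # preds: list of ((x1, y1, x2, y2), confidence); labels: list of (x1, y1, x2, y2).
    # Predictions below the confidence threshold are dropped before matching.
    kept = sorted((p for p in preds if p[1] >= conf_threshold),
                  key=lambda p: p[1], reverse=True)
    matched = set()
    true_positives = 0
    for box, _confidence in kept:
        # Greedily match each surviving prediction to the best unmatched label.
        best_iou, best_idx = 0.0, None
        for i, label_box in enumerate(labels):
            if i in matched:
                continue
            overlap = iou(box, label_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, i
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_positives += 1
    false_positives = len(kept) - true_positives
    false_negatives = len(labels) - true_positives
    precision = true_positives / (true_positives + false_positives) if kept else 0.0
    recall = true_positives / (true_positives + false_negatives) if labels else 0.0
    return precision, recall

# One prediction survives the confidence threshold and matches the single label.
print(precision_recall([((0, 0, 10, 10), 0.9)], [(0, 0, 10, 10)]))  # (1.0, 1.0)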

Many Class Metrics View

When the number of classes exceeds 25, the model metrics view switches from the grids above to a filterable table format for better legibility.

Selecting a row in the Classification Report (left) will filter the Confusion Matrix (right) to only rows where either GT or Prediction matches the selected class. Selecting a row in the Confusion Matrix will show examples of those types of confusions, and sort by the amount of confusion.
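
To make that filtering behavior concrete, here is a rough pandas sketch; the confusions DataFrame and class names below are made up for illustration and are not how Aquarium stores this data.

import pandas as pd

# Hypothetical long-form confusion counts: one row per (label, prediction) pair.
confusions = pd.DataFrame(
    [
        ("car", "car", 420),
        ("car", "truck", 17),
        ("truck", "car", 9),
        ("bus", "car", 4),
        ("bus", "bus", 88),
    ],
    columns=["gt", "pred", "count"],
)

selected_class = "car"  # the row picked in the Classification Report

# Mirror the UI filter: keep cells where either the GT or the Prediction matches
# the selected class, then sort by the amount of confusion.
filtered = confusions[
    (confusions["gt"] == selected_class) | (confusions["pred"] == selected_class)
].sort_values("count", ascending=False)

print(filtered)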

Understanding the Confusion Matrix

The confusion matrix in the Metrics tab is extremely useful for identifying label/data quality issues. While we have a whole guide on how to move through a data quality workflow, this section will focus specifically on how to use all the features of the matrix.

Filtering Buttons

In the Metrics tab, these filtering buttons allow you to quickly filter and view subsets of your dataset.

For example, when you click on FN (False Negatives), you'll see specific cells highlighted in the matrix. After the cells are highlighted, those examples also populate below for review.

The Confused As and Confused From buttons reveal a dropdown where you can filter on specific classes. When you select a class, you'll again see the matching cells highlighted in the matrix and the examples that meet the criteria populate below.

Toggle Buttons

Absolute/Percentage Toggle Buttons

The toggle buttons on the left allow you to change the value displayed in each cell:

  • Absolute is the number of crops that meet the label/inference class criteria.

  • Percentage depends on which option is selected on the other toggle button (Row, Column, Value); see the sketch after this list.

    • If Row is selected, the cell's percentage represents the count of crops in that cell relative to the row total.

    • If Column is selected, the cell's percentage represents the count of crops in that cell relative to the column total.

    • If Value is selected, the cell's percentage represents the count of crops in that cell relative to the total number of crops across all classes.
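
The three percentage modes correspond to row, column, and global normalization of the raw counts. A small NumPy sketch of the arithmetic, using made-up matrix values:

import numpy as np

# Made-up confusion matrix counts: rows = label (GT) class, columns = inference class.
counts = np.array([
    [50,  3,  1],
    [ 4, 40,  6],
    [ 2,  5, 30],
], dtype=float)

row_pct    = counts / counts.sum(axis=1, keepdims=True)   # each cell vs. its row total
column_pct = counts / counts.sum(axis=0, keepdims=True)   # each cell vs. its column total
value_pct  = counts / counts.sum()                        # each cell vs. all crops

print(np.round(row_pct, 3))
print(np.round(column_pct, 3))
print(np.round(value_pct, 3))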

Row/Column/Value Toggle Buttons

The toggle buttons above the confusion matrix change the way you see the cell values colored and the numbers that display on cell hover.

The darker the color, the larger the percentage of the Row/Column/Overall Value that specific cell represents.

Also, depending on the toggled option, the denominator displayed on cell hover will reflect the total count for the row, the column, or the entire dataset.

Comparing Two Inference Sets

In the Model Metrics View it is possible to compare two inference sets at once.

When two inference sets are selected, both the value displayed in a cell and the values displayed when hovering over a cell will look different than with just a single inference set.

Note that whichever inference set you select first from the dropdown at the top is the one listed above the confusion matrix, and this ordering affects the results you will see in the confusion matrix.

Looking at the coloring in the matrix pictured above: the darker the blue, the better; the darker the red, the worse.

Breaking this statement down, the coloring depends on whether we are looking at values on the main diagonal or off the diagonal.

Here we mean the diagonal that represents the correct predictions from the model:

On the diagonal, any positive value is good: it is the increase in the number of correct classifications in the second inference set compared to the first. Since that is a positive change, positive numbers on the diagonal are colored blue. By the same logic, any negative numbers on the diagonal signify a decrease in performance and are colored red.

For any value off the diagonal, the colors mean the opposite, because each off-diagonal cell represents a specific kind of error. A positive number means more misclassifications of that kind in the second inference set than in the first, so it is colored red, while a negative number means fewer errors, i.e. better performance, and is colored blue.
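
Assuming you have the raw confusion matrix counts for both inference sets, the comparison boils down to a cell-wise difference whose sign is read one way on the diagonal and the opposite way off it. Here is a quick sketch; apart from the straight/delta pair used in the hover example below, the counts are made up.

# Made-up counts: rows = label (GT) class, columns = predicted class.
classes = ["straight", "delta"]
cm_first = [[50, 2],
            [5, 37]]
cm_second = [[58, 12],
             [3, 40]]

for row, gt_class in enumerate(classes):
    for col, pred_class in enumerate(classes):
        change = cm_second[row][col] - cm_first[row][col]  # value shown in the cell
        on_diagonal = row == col
        # On the diagonal, an increase in correct classifications is an improvement
        # (blue); off the diagonal, an increase in that confusion is a regression (red).
        if change == 0:
            color = "neutral"
        elif (change > 0) == on_diagonal:
            color = "blue (better)"
        else:
            color = "red (worse)"
        print(f"{gt_class} labeled, {pred_class} predicted: "
              f"{cm_first[row][col]} -> {cm_second[row][col]} ({change:+d}) -> {color}")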

Understanding the Values on Hover

Looking at the image above, when comparing two inference sets the format of the message on hover for a cell is slightly different.

The message for this cell reads:

delta - straight + 10 (2 -> 12) 

This reads as: for objects classified as delta but labeled straight, the second inference set had 10 more of these failures than the first. The first inference set had 2 examples of this particular failure and the second had 12 (2 + 10 = 12).
