In general, we aim to provide tools that diagnose performance in terms of how a shipped model would perform, rather than overall model performance. For example, many high-level metrics like mAP or AUC capture a model's performance across the full range of confidence thresholds. That's ideal when you're evaluating a model's overall quality. For an ML product, however, you often make an informed decision about your preferred precision vs. recall trade-off and pick the point on the curve that reflects it. You usually ship a confidence threshold.
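
As a rough sketch of the distinction, the snippet below contrasts a threshold-free summary metric (average precision, which integrates over every possible threshold) with precision and recall measured at the single confidence threshold you'd actually ship. The data and the `SHIP_THRESHOLD` value are made up for illustration; this assumes scikit-learn is available and isn't tied to any particular tooling.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_score, recall_score

# Toy labels and model confidences for a binary classifier (illustrative data).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_score = np.array([0.92, 0.40, 0.75, 0.63, 0.55, 0.10, 0.88, 0.35, 0.48, 0.20])

# Threshold-free summary: aggregates performance across all confidence thresholds.
print("average precision:", average_precision_score(y_true, y_score))

# Shipped operating point: a single confidence threshold chosen for the product.
SHIP_THRESHOLD = 0.6  # hypothetical value picked from the precision/recall curve
y_pred = (y_score >= SHIP_THRESHOLD).astype(int)
print("precision @ threshold:", precision_score(y_true, y_pred))
print("recall    @ threshold:", recall_score(y_true, y_pred))
```

The two precision/recall numbers at the bottom are what your users actually experience once the threshold is fixed, which is why they can tell a different story than the aggregate metric above them.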