Output optimal confidence threshold based on PR curve #2048

Closed
decent-engineer-decent-datascientist opened this issue Jan 27, 2021 · 17 comments · Fixed by #2057
Labels
enhancement New feature or request

Comments

@decent-engineer-decent-datascientist

🚀 Feature

Could we print out the maximum F1 score and the associated confidence threshold at the end of training?

Motivation

This should be done for every custom model, and the data is readily available after the final mAP calculations and PR curve drawing.

Pitch

Filter detections at different score/confidence thresholds, calculate P/R/F1, and then print the optimal threshold (max F1).
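For reference, the quantity being pitched can be written down as follows. The symbol c for the confidence threshold is introduced here for illustration and is not from the original post; P(c) and R(c) are precision and recall computed on detections with confidence at least c.

```latex
F_1(c) = \frac{2\,P(c)\,R(c)}{P(c) + R(c)}, \qquad c^{*} = \arg\max_{c \in [0, 1]} F_1(c)
```

The value to print would then be c* together with F_1(c*).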

Alternatives

Instead of printing the max F1, maybe write a CSV in the run directory containing metrics at different thresholds.
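A minimal sketch of what such a CSV export could look like, assuming the per-threshold metrics are already available as equal-length sequences. The function name `write_threshold_csv` and the column names are hypothetical, not something the repository provides.

```python
import csv


def write_threshold_csv(path, thresholds, precision, recall, f1):
    """Write one row per confidence threshold (column names are illustrative)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["confidence", "precision", "recall", "f1"])
        for row in zip(thresholds, precision, recall, f1):
            # Fixed 4-decimal formatting keeps the file easy to diff between runs
            writer.writerow([f"{v:.4f}" for v in row])
```

It could be called once at the end of training, for example with a path inside the run directory.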

@decent-engineer-decent-datascientist decent-engineer-decent-datascientist added the enhancement New feature or request label Jan 27, 2021
@glenn-jocher
Member

glenn-jocher commented Jan 27, 2021

@decent-engineer-decent-datascientist yes, that's an interesting idea. Unlike P and R, F1 should have a stable max point (P and R have unstable maxima and minima at 0.0 and 1.0 confidence thresholds). I'm not sure if the currently saved metrics suffice for this sort of analysis. At the moment we only save metrics at a fixed confidence, given by --conf for mAP and 0.1 hard coded for P and R (and also F1):

pr_score = 0.1 # score to evaluate P and R https://github.com/ultralytics/yolov3/issues/898

@glenn-jocher
Member

@decent-engineer-decent-datascientist evaluations at different PR scores might also help contribute to an additional feature, which would be an interactive P-R curve (1.0 conf top left and 0.0 conf bottom right).

@Ownmarc
Contributor

Ownmarc commented Jan 27, 2021

Oh yes I like this idea. I think it would also be interesting to vary the IoU threshold too and find the best settings to maximize the F1 score.
Could also give the option to weight classes differently in the calculation of this score.
Adding some flexibility on how "best.pt" is determined without requiring the user to jump in the code would be huge for this repo I think!

@decent-engineer-decent-datascientist

@Ownmarc love the expansion to IoU and best.pt. Definitely something I could use.

@glenn-jocher
Member

@decent-engineer-decent-datascientist @Ownmarc I reviewed the code and this seems feasible, though probably best implemented at a single IoU threshold (i.e. 0.5 only), rather than all 10 IoU thresholds (0.50 : 0.05 : 0.95).

The current implementation computes P and R at a fixed score/conf (0.1) and only at the 0.50 IoU threshold. Luckily the operation that does this is vectorizable, so I can extend the P and R op to a vector of confidences, say np.linspace(0, 1, 100). This should allow for some really cool P, R and F1 plots as a function of confidence.
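As a rough illustration of the vectorized idea (a sketch under stated assumptions, not the code that ended up in the PR): given per-detection confidences and true-positive flags at a single IoU threshold, cumulative sums give P and R at every detection, and `np.interp` resamples them onto a fixed confidence grid. The function name `pr_f1_curves` and its arguments are hypothetical, and at least one detection is assumed.

```python
import numpy as np


def pr_f1_curves(conf, tp, n_gt, num=100, eps=1e-16):
    """Precision, recall and F1 on a grid of confidence thresholds.

    conf: (n,) detection confidences
    tp:   (n,) bool, True if the detection matched a ground-truth box
    n_gt: number of ground-truth boxes
    """
    conf = np.asarray(conf, dtype=float)
    tp = np.asarray(tp, dtype=bool)
    order = np.argsort(-conf)  # sort detections by descending confidence
    conf, tp = conf[order], tp[order]

    tpc = np.cumsum(tp)   # true positives when keeping the top-k detections
    fpc = np.cumsum(~tp)  # false positives when keeping the top-k detections
    recall = tpc / (n_gt + eps)
    precision = tpc / (tpc + fpc)

    px = np.linspace(0, 1, num)
    # np.interp needs increasing x, so negate the descending confidences.
    # Above the highest confidence nothing is kept: recall 0, precision 1 by convention.
    r = np.interp(-px, -conf, recall, left=0.0)
    p = np.interp(-px, -conf, precision, left=1.0)
    f1 = 2 * p * r / (p + r + eps)
    return px, p, r, f1
```

The confidence at max F1 would then be `px[f1.argmax()]`, and the three returned curves can be plotted directly against `px`.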

@glenn-jocher
Member

glenn-jocher commented Jan 27, 2021

About best.pt, these checkpoints are saved anytime a new best fitness is observed, with default fitness defined as:

yolov5/utils/metrics.py

Lines 12 to 16 in f59f801

def fitness(x):
    # Model fitness as a weighted combination of metrics
    w = [0.0, 0.0, 0.1, 0.9]  # weights for [P, R, mAP@0.5, mAP@0.5:0.95]
    return (x[:, :4] * w).sum(1)
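To make the weighting concrete, here is a small worked example. The function is restated so the snippet runs on its own, and the per-epoch numbers are invented purely for illustration.

```python
import numpy as np


def fitness(x):
    # Same weighting as the snippet above, restated so this example is self-contained
    w = [0.0, 0.0, 0.1, 0.9]  # weights for [P, R, mAP@0.5, mAP@0.5:0.95]
    return (x[:, :4] * w).sum(1)


# Hypothetical results, one row per epoch: [P, R, mAP@0.5, mAP@0.5:0.95]
results = np.array([
    [0.90, 0.30, 0.50, 0.30],
    [0.60, 0.60, 0.55, 0.35],
    [0.50, 0.70, 0.60, 0.33],
])
scores = fitness(results)           # one fitness value per epoch
best_epoch = int(scores.argmax())   # index of the epoch that would become best.pt
```

With these made-up numbers the middle epoch wins (0.1 * 0.55 + 0.9 * 0.35 = 0.37), even though the last epoch has the higher mAP@0.5, because P and R carry zero weight and mAP@0.5:0.95 dominates.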

In the past we used to define fitness as inverse loss, though interestingly the min loss checkpoint rarely coincided with the highest mAP checkpoint, so we switched it to the current scheme following user feedback. It's possible that an array of 'best' checkpoints might better serve the varying community needs i.e.:

weights/
  last.pt
  best_map.pt
  best_f1.pt  # powered by our new all-confidence measurement
  best_loss.pt
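One way the bookkeeping for several "best" checkpoints might look, as a sketch only. The class name `BestTracker` and the metric keys are hypothetical, and the actual saving of weights is left out.

```python
import math


class BestTracker:
    """Tracks the best value seen so far for several named criteria."""

    def __init__(self, criteria):
        # criteria maps a name to 'max' (higher is better) or 'min' (lower is better)
        self.criteria = criteria
        self.best = {
            name: (-math.inf if mode == "max" else math.inf)
            for name, mode in criteria.items()
        }

    def update(self, metrics):
        """Return the names whose best value improved with these metrics."""
        improved = []
        for name, mode in self.criteria.items():
            value = metrics[name]
            if mode == "max":
                better = value > self.best[name]
            else:
                better = value < self.best[name]
            if better:
                self.best[name] = value
                improved.append(name)
        return improved


tracker = BestTracker({"map": "max", "f1": "max", "loss": "min"})
first = tracker.update({"map": 0.30, "f1": 0.50, "loss": 1.0})
second = tracker.update({"map": 0.35, "f1": 0.45, "loss": 1.2})
```

After each epoch, every name returned by `update` would correspond to a file such as `best_<name>.pt` being overwritten.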

@glenn-jocher
Member

@Ownmarc @decent-engineer-decent-datascientist I've opened up PR #2057 on this topic. YOLOv5m on COCO initial results below, evaluated and plotted at 0.50 IoU:

F1 Curve (image)

Precision Curve (image)

Recall Curve (image)

@decent-engineer-decent-datascientist

Oh gosh, that's so pretty. Do you have any worries merging this with master (performance, etc.)? The checks look good and no conflicts.

@glenn-jocher
Member

@decent-engineer-decent-datascientist the plotting time is only incurred on the last epoch and is not too material, but yes, I still need to profile the added computation, which runs every epoch.

Also perhaps figure out how to best integrate these new plots into the overall picture. We have 4 curves now (PR + 3 above), none of which are interactive unfortunately. One simple change might be in test.py to update the P and R printouts to display at max F1 confidence rather than at 0.1 confidence, and to add an output for F1.

@decent-engineer-decent-datascientist

Any reason we can’t use this:
https://wandb.ai/lavanyashukla/visualize-predictions/reports/Visualize-Model-Predictions--Vmlldzo1NjM4OA#Plots-2

Seems like an easy enough addition, and it’d let us play around with the data in wandb.

@glenn-jocher
Member

I've merged #2057. All four plots will be output by default now, and P and R metrics are now logged (and printed to screen) at the optimal F1 confidence during training (which I assume may vary over the training epochs).

@decent-engineer-decent-datascientist

Awesome, I’ll test it out now. Is there a reason you upload the results as an image rather than passing the plots? I believe passing the matplotlib figure will actually convert it into an interactive plotly plot on the wandb interface.

@glenn-jocher
Member

Just lack of time. We'd want to transition the entire logging infrastructure (or all plots at least) to an interactive local logger-agnostic environment (i.e. local plotly/bokeh dashboard) first and then transition remote logging to view the same.

@glenn-jocher
Member

@decent-engineer-decent-datascientist if you'd like to take a stab at this feel free to! The main precedent for passing logger sources through plotting functions is here.

A loggers dict is constructed here with room for future growth:

yolov5/train.py

Line 134 in f639e14

loggers = {'wandb': wandb} # loggers dict

which is then passed down to lower level plotting functions:

yolov5/train.py

Line 204 in f639e14

plot_labels(labels, save_dir, loggers)

and run at the end of a given plotting function:

yolov5/utils/plots.py

Lines 295 to 299 in f639e14

# loggers
for k, v in loggers.items() or {}:
    if k == 'wandb' and v:
        v.log({"Labels": [v.Image(str(x), caption=x.name) for x in save_dir.glob('*labels*.jpg')]}, commit=False)

Like I was saying, though, we'd want to provide a consistent cross-platform experience, so we'd want to begin the transition locally (i.e. through all plots in utils/plots.py) and then migrate those changes to wandb etc. where possible.

@decent-engineer-decent-datascientist

@glenn-jocher ah, time. The things we could do without that constraint haha. I'll mess around with it though, thank you for the pointers!

@decent-engineer-decent-datascientist

@glenn-jocher also would you like me to close this issue now that we've got a nice f1 vs conf graph?

@glenn-jocher
Member

Sure, sounds good!

Ah, issue was closed on linked PR merge.
