Output optimal confidence threshold based on PR curve #2048
Comments
@decent-engineer-decent-datascientist yes, that's an interesting idea. Unlike P and R, F1 should have a stable max point (P and R have unstable maxima and minima at the 0.0 and 1.0 confidence thresholds). I'm not sure if the currently saved metrics suffice for this sort of analysis: at the moment we only save metrics at a fixed confidence, given by --conf for mAP and 0.1 hard-coded for P and R (and also F1): Line 41 in d68afed
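As an aside (not from the original comment): F1 is the harmonic mean of precision and recall, F1 = 2·P·R / (P + R). Since P generally rises and R generally falls as the confidence threshold increases, their harmonic mean peaks at an intermediate threshold, which is what makes a single max-F1 confidence well defined.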
@decent-engineer-decent-datascientist evaluations at different PR scores might also help contribute to an additional feature, which would be an interactive P-R curve (1.0 conf top left and 0.0 conf bottom right).
Oh yes, I like this idea. I think it would also be interesting to vary the IoU threshold and find the best settings to maximize the F1 score.
@Ownmarc love the expansion to IOU and best.pt. Definitely something I could use.
@decent-engineer-decent-datascientist @Ownmarc I reviewed the code and this seems feasible, though it is probably best implemented at a single IoU threshold (i.e. 0.5 only) rather than all 10 IoU thresholds (0.50 : 0.05 : 0.95). The current implementation computes P and R at a fixed score/conf (0.1) only at the 0.50 IoU threshold. Luckily the operation that does this is vectorizable, so I can extend the P and R op to a vector of confidences, say np.linspace(0, 1, 100). This should allow for some really cool P, R and F1 plots as a function of confidence.
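Not the actual implementation, just a minimal sketch of the vectorized sweep described above; all names here are hypothetical:

```python
import numpy as np

def pr_f1_vs_conf(scores, is_tp, n_gt, conf_grid=np.linspace(0, 1, 100)):
    """Precision, recall and F1 as a function of confidence threshold (sketch).

    scores: (N,) detection confidences for one class
    is_tp:  (N,) True where the detection matched a ground-truth box at 0.50 IoU
    n_gt:   number of ground-truth boxes for that class
    """
    scores = np.asarray(scores, dtype=float)
    is_tp = np.asarray(is_tp, dtype=float)
    keep = scores[None, :] >= conf_grid[:, None]   # (len(conf_grid), N) keep mask
    tp = (keep * is_tp[None, :]).sum(1)            # true positives per threshold
    n_pred = keep.sum(1)                           # detections kept per threshold
    p = tp / np.maximum(n_pred, 1e-16)
    r = tp / max(n_gt, 1e-16)
    f1 = 2 * p * r / np.maximum(p + r, 1e-16)
    return p, r, f1
```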
About best.pt, these checkpoints are saved anytime a new best fitness is observed, with default fitness defined as: Lines 12 to 16 in f59f801
In the past we used to define fitness as inverse loss, though interestingly the min-loss checkpoint rarely coincided with the highest-mAP checkpoint, so we switched it to the current scheme following user feedback. It's possible that an array of 'best' checkpoints might better serve varying community needs, i.e.:
weights/
    last.pt
    best_map.pt
    best_f1.pt   # powered by our new all-confidence measurement
    best_loss.pt
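If the multi-checkpoint idea were pursued, the bookkeeping could look roughly like this; a sketch only, where the fitness weights, dict keys and save_fn callback are illustrative rather than the repository's actual code:

```python
import numpy as np

def fitness(x, w=(0.0, 0.0, 0.1, 0.9)):
    # x = [P, R, mAP@0.5, mAP@0.5:0.95]; w = illustrative weights
    return float((np.asarray(x) * np.asarray(w)).sum())

best = {'map': -np.inf, 'f1': -np.inf, 'loss': np.inf}   # running bests across epochs

def save_best_checkpoints(metrics, f1_max, val_loss, save_fn):
    """Keep a separate 'best' checkpoint per criterion (sketch)."""
    if fitness(metrics) > best['map']:
        best['map'] = fitness(metrics)
        save_fn('weights/best_map.pt')
    if f1_max > best['f1']:
        best['f1'] = f1_max
        save_fn('weights/best_f1.pt')
    if val_loss < best['loss']:
        best['loss'] = val_loss
        save_fn('weights/best_loss.pt')
```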
@Ownmarc @decent-engineer-decent-datascientist I've opened up PR #2057 on this topic. YOLOv5m on COCO initial results below, evaluated and plotted at 0.50 IoU: [F1 Curve, Precision Curve and Recall Curve plots attached]
Oh gosh, that's so pretty. Do you have any worries merging this with master (performance, etc.)? The checks look good and no conflicts. |
@decent-engineer-decent-datascientist the plotting time is only incurred on the last epoch and is not too material, but yes, I still need to profile the added computation, which runs every epoch. Also we should figure out how best to integrate these new plots into the overall picture. We have 4 curves now (PR + the 3 above), none of which are interactive unfortunately. One simple change might be to update the P and R printouts in test.py to display at the max-F1 confidence rather than at 0.1 confidence, and to add an output for F1.
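The printout change mentioned above is essentially an argmax over the F1 curve; a hedged sketch, reusing the hypothetical sweep from the earlier snippet:

```python
import numpy as np

def report_at_max_f1(p, r, f1, conf_grid):
    """Print P, R and F1 at the confidence that maximizes F1 (sketch)."""
    i = int(np.argmax(f1))
    print(f'best conf {conf_grid[i]:.3f}: P {p[i]:.3f}  R {r[i]:.3f}  F1 {f1[i]:.3f}')
    return conf_grid[i]
```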
Any reason we can’t use this: Seems like an easy enough addition, and it’d let us play around with the data in wandb. |
I've merged #2057, all four plots will output by default now, and P and R metrics are now logged (and printed to screen) at the optimal F1 confidence during training (which I assume may vary over the training epochs). |
Awesome, I’ll test it out now. Is there a reason you upload the results as an image rather than passing the plots? I believe passing the matplotlib figure will actually convert it into an interactive plotly plot on the wandb interface.
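For reference, the wandb behaviour described here is that logging the matplotlib figure object (rather than a rendered image) lets wandb convert it to an interactive plot; a minimal sketch, assuming a wandb run is active and with a made-up project name and dummy data:

```python
import numpy as np
import matplotlib.pyplot as plt
import wandb

wandb.init(project='yolov5-curves')          # hypothetical project name

conf = np.linspace(0, 1, 100)
f1 = 2 * conf * (1 - conf)                   # dummy curve purely for illustration

fig, ax = plt.subplots()
ax.plot(conf, f1)
ax.set_xlabel('confidence')
ax.set_ylabel('F1')

wandb.log({'F1_curve': fig})                 # figure object -> interactive plot in the wandb UI
# wandb.log({'F1_curve': wandb.Image(fig)})  # by contrast, this logs a static image
```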
Just lack of time. We'd want to transition the entire logging infrastructure (or all plots at least) to an interactive local logger-agnostic environment (i.e. local plotly/bokeh dashboard) first and then transition remote logging to view the same. |
@decent-engineer-decent-datascientist if you'd like to take a stab at this, feel free to! The main precedent for passing logger sources through plotting functions is here. A loggers dict is constructed here with room for future growth: Line 134 in f639e14
which is then passed down to lower level plotting functions: Line 204 in f639e14
and run at the end of a given plotting function: Lines 295 to 299 in f639e14
Like I was saying though, we'd want to provide a consistent cross-platform experience, so we'd want to begin the transition locally (i.e. through all plots in utils/plots.py) and then migrate those changes to wandb etc. where possible.
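Not the repository's actual code, but the pattern being described — building a loggers dict once and threading it through the plotting functions so each plot is saved locally and optionally mirrored to a remote logger — looks roughly like this (names and keys are illustrative):

```python
import matplotlib.pyplot as plt

# constructed once in train.py with room to grow; entries are filled in when a logger is enabled
loggers = {'wandb': None, 'tb': None}

def plot_f1_curve(conf, f1, save_path='F1_curve.png', loggers=None):
    """Save the plot locally and, if available, pass the figure to the remote logger (sketch)."""
    fig, ax = plt.subplots()
    ax.plot(conf, f1)
    ax.set_xlabel('confidence')
    ax.set_ylabel('F1')
    fig.savefig(save_path)                       # always produce the local artifact
    if loggers and loggers.get('wandb'):
        loggers['wandb'].log({'F1_curve': fig})  # mirror the live figure to wandb
    plt.close(fig)
```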
@glenn-jocher ah, time. The things we could do without that constraint haha. I'll mess around with it though, thank you for the pointers! |
@glenn-jocher also would you like me to close this issue now that we've got a nice f1 vs conf graph? |
Sure, sounds good! Ah, issue was closed on linked PR merge. |
🚀 Feature
Could we print out the maximum F1 score and the associated confidence threshold at the end of training?
Motivation
This is something that should be done for every custom model, and the data is readily available after the final mAP calculations and PR curve drawing.
Pitch
Filter detections at different score/confidence thresholds, calculate P/R/F1, and then print the optimal threshold (max F1).
Alternatives
Instead of printing the max P/R, maybe write a CSV in the run directory containing metrics at different thresholds.
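A sketch of that alternative, assuming per-threshold P/R/F1 arrays are already computed; the file name and column layout are illustrative:

```python
import csv
import numpy as np

def write_threshold_metrics(conf, p, r, f1, path='threshold_metrics.csv'):
    """Write per-confidence-threshold metrics to a CSV in the run directory (sketch)."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['conf', 'precision', 'recall', 'f1'])
        for row in zip(conf, p, r, f1):
            writer.writerow([f'{v:.4f}' for v in row])
    i = int(np.argmax(f1))
    print(f'optimal conf {conf[i]:.3f} (F1 {f1[i]:.3f}) -> metrics written to {path}')
```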