Cannot replicate results from object detection task guide #30557

adam-homeboost · 2024-04-29T22:00:53Z

System Info

transformers version: 4.40.1
Platform: Linux-5.15.0-1058-aws-x86_64-with-glibc2.31
Python version: 3.10.14
Huggingface_hub version: 0.22.2
Safetensors version: 0.4.3
Accelerate version: 0.29.3
Accelerate config: not found
PyTorch version (GPU?): 2.2.0 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: yes
Using distributed or parallel set-up in script?: no

I am following the examples given in https://huggingface.co/docs/transformers/en/tasks/object_detection

I am following it as closely as I possibly can. The only difference is that I am not pushing the training results up to hugging face and am instead saving (and reloading them) locally.

When I run the evaluation I get terrible results that look nothing like what the examples do. Instead of mAPs in the 0.3 - 0.7 range, I am getting results well under 0.1.

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.053
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.096
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.044
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.039
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.018
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.046
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.078
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.214
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.281
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.128
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.157
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.263

Instead of the expected:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.352
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.681
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.292
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.168
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.208
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.429
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.274
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.484
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.501
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.191
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.323
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.590

This much difference makes for a "tuned" model that does not work at all.

If I use the tuned dataset on huggingface as documented (uploaded by the author?) then I get expected results. For some reason, following the same steps, I cannot get to a tuned model with anywhere near the performance the author did.

I have tried this on different hardware and gotten different results as well. The above numbers are from nvidia gpus. I tried this on mac m3 gpus and got even worse numbers (all less than 0.001).

I am new to machine learning and this toolset. I would appreciate any suggestions or guidelines as to what I could be doing wrong here. I also do not understand why running this on different hardware and different cpu vs gpu mixes ends up with different scores.

Any help or suggestions appreciated!

Who can help?

@amyeroberts

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Follow task code exactly, except:

After trainer.train(), do trainer.save_model()
in eval steps, instantiate AutoImageProcessor and AutoModelForObjectDetection from_pretrained from the saved model in step 1.

Expected behavior

Expect same or reasonably close scores in the eval step.

The text was updated successfully, but these errors were encountered:

NielsRogge · 2024-04-30T07:05:38Z

Relevant to #29964 and cc @qubvel

qubvel · 2024-04-30T12:00:12Z

Hi @adam-homeboost, thanks for reporting the problem! I am working on refining object detection examples.
You could take a look at #30422. In this PR new examples are going to be provided which gives even better metrics.

adam-homeboost · 2024-04-30T18:18:46Z

Thank you @qubvel . I definitely appreciate the work you are doing to update the example!

Based on your comments, I ran the original example's training out to 100 epochs instead of the documented 10 and got much better results that matched close enough. So, definitely a documentation issue there. I see that your new example properly show the correct number of epochs.

As a beginner in ML, can I offer a couple of suggestions for your new examples? These are questions I have:

It is unclear to me how you decide what specific image transformations are needed. Although I have poked into the original documentation/code on the detr model you are using, I didn't find anything as clear or obvious as the transformer you set up. I think it would be helpful to have a note on methodology there, especially for those cases where a different model is being adapted. What should people look for when picking a different base model?
I see your two examples of trainer vs not trainer. It would be helpful to understand why someone would choose to forego the trainer. What is the tradeoff there?

g1y5x3 · 2024-04-30T20:00:06Z

It is unclear to me how you decide what specific image transformations are needed. Although I have poked into the original documentation/code on the detr model you are using, I didn't find anything as clear or obvious as the transformer you set up. I think it would be helpful to have a note on methodology there, especially for those cases where a different model is being adapted. What should people look for when picking a different base model?

If you are wondering about detr architecture, it should be the same as here and you can find the paper in the documentation page as well.

I see your two examples of trainer vs not trainer. It would be helpful to understand why someone would choose to forego the trainer. What is the tradeoff there?

My understanding is Trainer is a much more flexible API that can enable you to switch between models very quickly for rapid development. The no-trainer leverages the accelerate API which allows you to perform multigpu training to speed up the process once you have decided which model you want to invest a bit more. And it requires a few more configurations in the code as you can probably tell.

qubvel · 2024-04-30T21:33:14Z

@adam-homeboost great questions, it will help to improve examples!

@g1y5x3 thanks for the answers, I will add more here:

It is unclear to me how you decide what specific image transformations are needed. Although I have poked into the original documentation/code on the detr model you are using, I didn't find anything as clear or obvious as the transformer you set up. I think it would be helpful to have a note on methodology there, especially for those cases where a different model is being adapted. What should people look for when picking a different base model?

We might consider two types of transformations:

Image "standardization". It is usually resizing[+padding] the image to some particular defined size + normalization. The resizing strategy could be dataset-specific and may depend on image size, aspect ratios, and size of objects. Normalization is usually taken the same as in the original pretrained model (very often imagenet_mean, imagenet_std).
Image augmentations. It is a way to enlarge dataset variability. Also could be very dataset-specific, and should be chosen based on metrics. As a starting point, you could choose some that do not significantly change data distribution (image/annotations are not looking "strange" after applying augmentations for this specific task). Here is a space where you can try different augmentations, change their parameters and inspect the result.

I see your two examples of trainer vs not trainer. It would be helpful to understand why someone would choose to forego the trainer. What is the tradeoff there?

I have a bit different opinion rather than @g1y5x3. Both APIs support multigpu training. The Trainer provides you with a simple API, it has a training/evaluation loop already implemented, but its less flexible. Despite Trainer having a large number of training arguments at some point you might want to implement some custom logic (e.g. custom learning rate scheduling or data sampling) and you will need access to the training/evaluation loop. For that case accelerate example is provided, which has explicit training and evaluation loops.

Please, let me know if it becomes a bit clearer and if you have any follow-up questions 🤗

g1y5x3 · 2024-05-01T03:50:28Z

Thank you for the clarification. It makes total sense. Quick question, after looking through your PR, it didn't touch the example. However, I remember that this needs a bit clarification as training that dataset for 10 epochs won't yield any good predictions, I tried to run it with 100 epochs, would take ~3 hours on a A6000. Does it need to be updated?

qubvel · 2024-05-01T09:23:11Z

@g1y5x3 yes, you are right, it needs to be updated. Any help is appreciated, in case you want to contribute, you can open a PR that aligns notebook example with python one and ping me for any help or review 🙂

NielsRogge · 2024-05-01T12:42:18Z

That's what my PR #29967 addresses, I'd prefer to have it merged to unblock people.

qubvel · 2024-05-01T13:14:59Z

Yes, let's have it merged. And the next PR can address other points in a notebook example from your issue (taking as a base #30422)

github-actions · 2024-05-30T08:03:19Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

amyeroberts added Examples Which is related to examples in general Vision labels Apr 30, 2024

NielsRogge mentioned this issue May 1, 2024

Improve object detection task guideline #29967

Merged

qubvel mentioned this issue May 6, 2024

Update object detection guide #30683

Merged

4 tasks

github-actions bot closed this as completed Jun 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cannot replicate results from object detection task guide #30557

Cannot replicate results from object detection task guide #30557

adam-homeboost commented Apr 29, 2024

NielsRogge commented Apr 30, 2024

qubvel commented Apr 30, 2024

adam-homeboost commented Apr 30, 2024

g1y5x3 commented Apr 30, 2024

qubvel commented Apr 30, 2024

g1y5x3 commented May 1, 2024 •

edited

Loading

qubvel commented May 1, 2024 •

edited

Loading

NielsRogge commented May 1, 2024

qubvel commented May 1, 2024

github-actions bot commented May 30, 2024

Cannot replicate results from object detection task guide #30557

Cannot replicate results from object detection task guide #30557

Comments

adam-homeboost commented Apr 29, 2024

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

NielsRogge commented Apr 30, 2024

qubvel commented Apr 30, 2024

adam-homeboost commented Apr 30, 2024

g1y5x3 commented Apr 30, 2024

qubvel commented Apr 30, 2024

g1y5x3 commented May 1, 2024 • edited Loading

qubvel commented May 1, 2024 • edited Loading

NielsRogge commented May 1, 2024

qubvel commented May 1, 2024

github-actions bot commented May 30, 2024

g1y5x3 commented May 1, 2024 •

edited

Loading

qubvel commented May 1, 2024 •

edited

Loading