[examples/summarization] deal with None in data records #14816
As I said before, the examples are not supposed to work out of the box on every dataset and we shouldn't strive for that. Adding more complexity should be on the user's side when they want to deal with another dataset. Cf the second paragraph of the examples README:
cc @LysandreJik @patrickvonplaten @patil-suraj if you have a different opinion, we can evolve our philosophy.
Oh, sorry, I think I misinterpreted your comment on Slack. I thought you were agreeing that this fix should go in. In this particular situation ideally But on the other hand this is just defensive programming, since the example code takes arbitrary csv files and can't expect them to be free of problems. So I think that, as an example, this is a good demonstration of data sanitizing. Am I wrong? I do hear you saying that every additional piece of code makes the examples more complex. I'm not disagreeing with that.
It's true that this one is borderline and generally useful, so I'm curious about other people's opinions.
Don't have a strong opinion, but I'm more in favor of it than against it. It's quite easy to understand as a reader what's going on there IMO. I would slightly favor to not use
The reason it's complex is that, unless I misunderstood your proposal, I don't think it would work. We have two parallel arrays, so they need to be filtered together to keep the pairs aligned. Here is some sample code to show what's going on:
If you filter them separately, you will end up with mismatched pairs. Here is a version that is simpler to understand, but it's slower of course.
But by all means, I'd be happy to use simpler code if you can think of some.
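The alignment problem described above can be illustrated with a minimal sketch. It reuses the `x` example quoted later in this thread. It is not the snippet originally posted here, and the variable names `a_alone`, `b_alone`, `misaligned` and `pairs` are made up for illustration.

```python
x = {"a": [1, None, 3, 4], "b": [5, 6, None, 7]}

# Filtering each column on its own drops different positions,
# so the surviving values no longer line up as the original pairs.
a_alone = [v for v in x["a"] if v is not None]
b_alone = [v for v in x["b"] if v is not None]
misaligned = list(zip(a_alone, b_alone))

# Filtering the pairs together keeps only rows where both values exist.
pairs = [(a, b) for a, b in zip(x["a"], x["b"]) if a is not None and b is not None]
# Guard against the case where every row was dropped.
a, b = (list(col) for col in zip(*pairs)) if pairs else ([], [])
```

In `misaligned`, 3 ends up paired with 6 even though they never shared a row, whereas the joint filter keeps only the rows `(1, 5)` and `(4, 7)`.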
Ah I see - thanks for explaining it in more detail! Your proposal

```python
x = {"a":[1,None,3,4], "b":[5,6,None,7]}
a, b = [], []
for i in range(len(x["a"])):
    if x["a"][i] is not None and x["b"][i] is not None:
        a.append(x["a"][i])
        b.append(x["b"][i])
```

looks very nice. I don't think speed is really relevant here
Pushed the slower, but easier to read version as suggested by Patrick.
Thanks a lot!
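As a rough sketch of how the agreed-upon loop might be wrapped for use on a batch of records: the helper name `drop_incomplete`, the column names `text` and `headline`, and the sample batch below are all hypothetical and are not taken from `run_summarization.py`.

```python
def drop_incomplete(examples, text_column, summary_column):
    """Return aligned (inputs, targets) lists, skipping rows where either field is None."""
    inputs, targets = [], []
    for text, summary in zip(examples[text_column], examples[summary_column]):
        if text is not None and summary is not None:
            inputs.append(text)
            targets.append(summary)
    return inputs, targets


# Hypothetical batch in column-oriented form, with two incomplete records.
batch = {
    "text": ["how to bake", None, "how to swim"],
    "headline": ["Bake.", "Lost.", None],
}
inputs, targets = drop_incomplete(batch, "text", "headline")
```

Only the first record survives here, because the other two each lack one of the two fields.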
When trying to use https://huggingface.co/datasets/wikihow with run_summarization.py, I ran into incomplete records in the manually downloaded dataset (the data is not on the hub and requires a user to download it manually):
This PR fixes that by filtering out incomplete records. Now it's possible to run:
For context: I was trying to deal with issue #14804 when I ran into this problem, and this fix was needed for me to be able to reproduce that issue. In other words, I wasn't just trying some random dataset for the heck of it; I was dealing with a bug report. And this dataset is not random: we report a performance score on it for https://huggingface.co/google/pegasus-wikihow, which was originally reported in #6844. If we can't use our own tools to reproduce a result reported by us, then I don't know how to move forward here.
@sgugger