Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[examples/summarization] deal with None in data records #14816

Merged
merged 2 commits into from
Dec 21, 2021

Conversation

stas00
Copy link
Contributor

@stas00 stas00 commented Dec 17, 2021

When trying to use https://huggingface.co/datasets/wikihow with run_summarization.py I run into incomplete records in the manually downloaded dataset (the data is not on the hub and requires a user to download it manually):

  [...]
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/arrow_dataset.py", line 1990, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "examples/pytorch/summarization/run_summarization.py", line 450, in preprocess_function
    inputs = [prefix + inp for inp in inputs]
  File "examples/pytorch/summarization/run_summarization.py", line 450, in <listcomp>
    inputs = [prefix + inp for inp in inputs]
TypeError: can only concatenate str (not "NoneType") to str

This PR is fixing that by filtering out incomplete records. Now it's possible to run:

python examples/pytorch/summarization/run_summarization.py --model_name_or_path \
google/pegasus-wikihow --max_source_length 512 --max_target_length 256 --do_eval \
--per_device_eval_batch_size 8 --predict_with_generate --num_beams 8 --overwrite_output_dir \
--output_dir output_dir --validation_file data/wikihowAll.csv --text_column text --summary_column \
headline --max_eval_samples 10

For context: I was trying to deal with this issue #14804 when I run into this problem. And this fix was needed for me to be able to reproduce the issue. In other words this wasn't me just randomly trying some random dataset for the heck of it, I was trying to deal with a bug report. And this dataset is not random, since we report a performance score on it for https://huggingface.co/google/pegasus-wikihow which was originally reported here: #6844 and if we can't use our own tools to reproduce a report made by us, then I don't know how to move forward here.

@sgugger

@sgugger
Copy link
Collaborator

sgugger commented Dec 17, 2021

As I said before, the examples are not supposed to work out of the box on every dataset and we shouldn't strive for that. Adding more complexity should be on the user's side when they want to deal with another dataset. Cf the second paragraph of the examples README:

While we strive to present as many use cases as possible, the scripts in this folder are just examples. It is expected that they won't work out-of-the box on your specific problem and that you will be required to change a few lines of code to adapt them to your needs. To help you with that, most of the examples fully expose the preprocessing of the data. This way, you can easily tweak them.

cc @LysandreJik @patrickvonplaten @patil-suraj if you have a different opinion, we can evolve our philosophy.

@stas00
Copy link
Contributor Author

stas00 commented Dec 17, 2021

Oh, sorry, I think I have misinterpreted your comment on Slack. I thought you were agreeing that this fix should go in.

In this particular situation ideally datasets should import that data into its online storage for https://huggingface.co/datasets/wikihow (as it's not there at the moment) and it could do the fixing as part of the import script.

But on the other hand this is just defensive programming since the example code takes random csv files and can't expect them to be without problems. So I think as an example this is a good demonstration of data sanitizing, am I wrong?

I do hear you saying that every additional code makes the examples more complex. I'm not disagreeing with that.

@sgugger
Copy link
Collaborator

sgugger commented Dec 20, 2021

It's true that this one is borderline and generally useful, so I'm curious of other people's opinion.

@patrickvonplaten
Copy link
Contributor

Don't have a strong opinion, but I'm more in favor of it than against it. It's quite easy to understand as a reader what's going on there IMO. I would slightly favor to not use map(...) and zip(...) however, but rather just two list comprehensions [x for x in batch[...] if x[...] is not None] for improved readability.

@stas00
Copy link
Contributor Author

stas00 commented Dec 20, 2021

I would slightly favor to not use map(...) and zip(...) however, but rather just two list comprehensions [x for x in batch[...] if x[...] is not None] for improved readability.

The reason it's complex is because unless I misunderstood your proposal I don't think it would work. This is because we have 2 parallel arrays thus you need to filter them together to keep the alignment between the pairs.

Here is a sample code to understand what's going on:

x = {"a":[1,None,3,4], "b":[5,6,None,7]}
a, b = map(list, zip(*([x["a"][i], x["b"][i]] for i in range(len(x["a"])) if x["a"][i] is not None and x["b"][i] is not None))) 

If you filter them out separately you will end up with mismatching pairs.

Here is a simpler to understand version, but it's slower of course.

x = {"a":[1,None,3,4], "b":[5,6,None,7]}
a, b = [], []
for i in range(len(x["a"])):
    if x["a"][i] is not None and x["b"][i] is not None:
        a.append(x["a"][i])
        b.append(x["b"][i])

But by all means I'd be happy to use a simpler code if you can think of one.

@patrickvonplaten
Copy link
Contributor

I would slightly favor to not use map(...) and zip(...) however, but rather just two list comprehensions [x for x in batch[...] if x[...] is not None] for improved readability.

The reason it's complex is because unless I misunderstood your proposal I don't think it would work. This is because we have 2 parallel arrays thus you need to filter them together to keep the alignment between the pairs.

Here is a sample code to understand what's going on:

x = {"a":[1,None,3,4], "b":[5,6,None,7]}
a, b = map(list, zip(*([x["a"][i], x["b"][i]] for i in range(len(x["a"])) if x["a"][i] is not None and x["b"][i] is not None))) 

If you filter them out separately you will end up with mismatching pairs.

Here is a simpler to understand version, but it's slower of course.

x = {"a":[1,None,3,4], "b":[5,6,None,7]}
a, b = [], []
for i in range(len(x["a"])):
    if x["a"][i] is not None and x["b"][i] is not None:
        a.append(x["a"][i])
        b.append(x["b"][i])

But by all means I'd be happy to use a simpler code if you can think of one.

Ah I see - thanks for explaining it in more detail! Your proposal

x = {"a":[1,None,3,4], "b":[5,6,None,7]}
a, b = [], []
for i in range(len(x["a"])):
    if x["a"][i] is not None and x["b"][i] is not None:
        a.append(x["a"][i])
        b.append(x["b"][i])

looks very nice. I don't think speed is really relevant here

@stas00
Copy link
Contributor Author

stas00 commented Dec 21, 2021

Pushed the slower, but easier to read version as suggested by Patrick.

@stas00 stas00 merged commit 033c3ed into master Dec 21, 2021
@stas00 stas00 deleted the examples-better-preprocess branch December 21, 2021 17:17
@patrickvonplaten
Copy link
Contributor

Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants