[examples/summarization] deal with None in data records #14816

stas00 · 2021-12-17T19:22:59Z

When trying to use https://huggingface.co/datasets/wikihow with run_summarization.py I run into incomplete records in the manually downloaded dataset (the data is not on the hub and requires a user to download it manually):

  [...]
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/arrow_dataset.py", line 1990, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "examples/pytorch/summarization/run_summarization.py", line 450, in preprocess_function
    inputs = [prefix + inp for inp in inputs]
  File "examples/pytorch/summarization/run_summarization.py", line 450, in <listcomp>
    inputs = [prefix + inp for inp in inputs]
TypeError: can only concatenate str (not "NoneType") to str

This PR is fixing that by filtering out incomplete records. Now it's possible to run:

python examples/pytorch/summarization/run_summarization.py --model_name_or_path \
google/pegasus-wikihow --max_source_length 512 --max_target_length 256 --do_eval \
--per_device_eval_batch_size 8 --predict_with_generate --num_beams 8 --overwrite_output_dir \
--output_dir output_dir --validation_file data/wikihowAll.csv --text_column text --summary_column \
headline --max_eval_samples 10

For context: I was trying to deal with this issue #14804 when I run into this problem. And this fix was needed for me to be able to reproduce the issue. In other words this wasn't me just randomly trying some random dataset for the heck of it, I was trying to deal with a bug report. And this dataset is not random, since we report a performance score on it for https://huggingface.co/google/pegasus-wikihow which was originally reported here: #6844 and if we can't use our own tools to reproduce a report made by us, then I don't know how to move forward here.

@sgugger

sgugger · 2021-12-17T19:30:14Z

As I said before, the examples are not supposed to work out of the box on every dataset and we shouldn't strive for that. Adding more complexity should be on the user's side when they want to deal with another dataset. Cf the second paragraph of the examples README:

While we strive to present as many use cases as possible, the scripts in this folder are just examples. It is expected that they won't work out-of-the box on your specific problem and that you will be required to change a few lines of code to adapt them to your needs. To help you with that, most of the examples fully expose the preprocessing of the data. This way, you can easily tweak them.

cc @LysandreJik @patrickvonplaten @patil-suraj if you have a different opinion, we can evolve our philosophy.

stas00 · 2021-12-17T20:06:27Z

Oh, sorry, I think I have misinterpreted your comment on Slack. I thought you were agreeing that this fix should go in.

In this particular situation ideally datasets should import that data into its online storage for https://huggingface.co/datasets/wikihow (as it's not there at the moment) and it could do the fixing as part of the import script.

But on the other hand this is just defensive programming since the example code takes random csv files and can't expect them to be without problems. So I think as an example this is a good demonstration of data sanitizing, am I wrong?

I do hear you saying that every additional code makes the examples more complex. I'm not disagreeing with that.

sgugger · 2021-12-20T18:23:16Z

It's true that this one is borderline and generally useful, so I'm curious of other people's opinion.

patrickvonplaten · 2021-12-20T19:02:38Z

Don't have a strong opinion, but I'm more in favor of it than against it. It's quite easy to understand as a reader what's going on there IMO. I would slightly favor to not use map(...) and zip(...) however, but rather just two list comprehensions [x for x in batch[...] if x[...] is not None] for improved readability.

stas00 · 2021-12-20T19:39:55Z

I would slightly favor to not use map(...) and zip(...) however, but rather just two list comprehensions [x for x in batch[...] if x[...] is not None] for improved readability.

The reason it's complex is because unless I misunderstood your proposal I don't think it would work. This is because we have 2 parallel arrays thus you need to filter them together to keep the alignment between the pairs.

Here is a sample code to understand what's going on:

x = {"a":[1,None,3,4], "b":[5,6,None,7]}
a, b = map(list, zip(*([x["a"][i], x["b"][i]] for i in range(len(x["a"])) if x["a"][i] is not None and x["b"][i] is not None)))

If you filter them out separately you will end up with mismatching pairs.

Here is a simpler to understand version, but it's slower of course.

x = {"a":[1,None,3,4], "b":[5,6,None,7]}
a, b = [], []
for i in range(len(x["a"])):
    if x["a"][i] is not None and x["b"][i] is not None:
        a.append(x["a"][i])
        b.append(x["b"][i])

But by all means I'd be happy to use a simpler code if you can think of one.

patrickvonplaten · 2021-12-20T22:28:11Z

I would slightly favor to not use map(...) and zip(...) however, but rather just two list comprehensions [x for x in batch[...] if x[...] is not None] for improved readability.

The reason it's complex is because unless I misunderstood your proposal I don't think it would work. This is because we have 2 parallel arrays thus you need to filter them together to keep the alignment between the pairs.

Here is a sample code to understand what's going on:
x = {"a":[1,None,3,4], "b":[5,6,None,7]}
a, b = map(list, zip(*([x["a"][i], x["b"][i]] for i in range(len(x["a"])) if x["a"][i] is not None and x["b"][i] is not None))) 
If you filter them out separately you will end up with mismatching pairs.

Here is a simpler to understand version, but it's slower of course.
x = {"a":[1,None,3,4], "b":[5,6,None,7]}
a, b = [], []
for i in range(len(x["a"])):
    if x["a"][i] is not None and x["b"][i] is not None:
        a.append(x["a"][i])
        b.append(x["b"][i])
But by all means I'd be happy to use a simpler code if you can think of one.

Ah I see - thanks for explaining it in more detail! Your proposal

x = {"a":[1,None,3,4], "b":[5,6,None,7]}
a, b = [], []
for i in range(len(x["a"])):
    if x["a"][i] is not None and x["b"][i] is not None:
        a.append(x["a"][i])
        b.append(x["b"][i])

looks very nice. I don't think speed is really relevant here

stas00 · 2021-12-21T03:37:38Z

Pushed the slower, but easier to read version as suggested by Patrick.

patrickvonplaten · 2021-12-21T20:43:52Z

Thanks a lot!

[examples/summarization] deal with None in data records

42b43be

rewrite to use a simpler (slower) variant

9be0e83

sgugger approved these changes Dec 21, 2021

View reviewed changes

stas00 merged commit 033c3ed into master Dec 21, 2021

stas00 deleted the examples-better-preprocess branch December 21, 2021 17:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[examples/summarization] deal with None in data records #14816

[examples/summarization] deal with None in data records #14816

stas00 commented Dec 17, 2021 •

edited

Loading

sgugger commented Dec 17, 2021

stas00 commented Dec 17, 2021 •

edited

Loading

sgugger commented Dec 20, 2021

patrickvonplaten commented Dec 20, 2021

stas00 commented Dec 20, 2021 •

edited

Loading

patrickvonplaten commented Dec 20, 2021

stas00 commented Dec 21, 2021

patrickvonplaten commented Dec 21, 2021

[examples/summarization] deal with None in data records #14816

[examples/summarization] deal with None in data records #14816

Conversation

stas00 commented Dec 17, 2021 • edited Loading

sgugger commented Dec 17, 2021

stas00 commented Dec 17, 2021 • edited Loading

sgugger commented Dec 20, 2021

patrickvonplaten commented Dec 20, 2021

stas00 commented Dec 20, 2021 • edited Loading

patrickvonplaten commented Dec 20, 2021

stas00 commented Dec 21, 2021

patrickvonplaten commented Dec 21, 2021

stas00 commented Dec 17, 2021 •

edited

Loading

stas00 commented Dec 17, 2021 •

edited

Loading

stas00 commented Dec 20, 2021 •

edited

Loading