Understanding train-roto-ptrs.txt file #26

wingedRuslan · 2021-01-27T19:22:29Z

Hi Ratish,

Thanks a lot for the insightful research paper and for making the codebase publicly available! I was able to train the model on boxscore-data and then use it for inference. Now I am interested in training your model on my own dataset and then using it for text generation.

Unfortunately, I encountered a problem with the following step (from the README page):

The train-roto-ptrs.txt file is available along with the dataset and can also be created by the following command:

python data_utils.py -mode ptrs -input_path $BASE/rotowire/train.json -train_content_plan $BASE/rotowire/inter/train_content_plan.txt -output_fi $BASE/rotowire/train-roto-ptrs.txt

Since my dataset differs from boxscore-data, I have to transform it manually into a format suitable for model training.

"The input dataset for data2text-plan-py can be created by running the script create_dataset.py in scripts folder."

I have successfully prepared my dataset in the same format as the files in boxscore-data/rotowire/, namely: src_train.txt, train_content_plan.txt, tgt_train.txt, inter/train_content_plan.txt; src_valid.txt, tgt_valid.txt, valid_content_plan.txt, inter/valid_content_plan.txt; and test/src_test.txt, test/tgt_test.txt. The structure of these files is the same as that of the files obtained from boxscore-data.

[Preprocessing] I could not run the preprocess.py script, because it requires the train-roto-ptrs.txt file.

The function for creating this file is:

data2text-plan-py/data_utils.py

Line 574 in 4b74535

    
           def make_pointerfi(outfi, inp_file="rotowire/train.json", content_plan_inp="inter/train_content_plan", resolve_prons=False):

Because I have a different dataset, I cannot use the above-mentioned function, so I need to create the train-roto-ptrs.txt file myself. Unfortunately, even after going through the function implementation several times and analyzing the content of the file (please see comments), I could not figure out how to create such a file from my dataset.

Can you please elaborate on the purpose of the train-roto-ptrs.txt file and briefly describe how it was created?

This is my current bottleneck and I would highly appreciate your help with this issue!

Many thanks in advance,
Ruslan

wingedRuslan · 2021-01-27T19:37:36Z

Analyzing the content of a file

Looking into the content of the original train-roto-ptrs.txt file did not help me understand how the file was created.

From my perspective, its entries are not related to any information in the other files.

@ratishsp, could you please help me figure out how the train-roto-ptrs.txt file was created?

ratishsp · 2021-01-28T12:53:50Z

Hi Ruslan,
Glad to know that you find the paper and code useful!
The logic for creating train-roto-ptrs.txt is in the following method:

data2text-plan-py/data_utils.py

Line 574 in 4b74535

    
           def make_pointerfi(outfi, inp_file="rotowire/train.json", content_plan_inp="inter/train_content_plan", resolve_prons=False):

The core idea is to provide supervision while training the copy mechanism.

data2text-plan-py/onmt/modules/CopyGenerator.py

Line 199 in 4b74535

    
           switch_loss = self.switch_loss_criterion(p_copy, align.ne(0).float().view(-1, 1))
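To make the quoted line more concrete, here is a minimal pure-Python sketch of how a copy-switch target could be derived from an alignment vector. It assumes, as the `align.ne(0)` call suggests, that a zero in `align` means "no pointer for this summary token", and it assumes the criterion behaves like a mean binary cross-entropy. The helper names `switch_targets` and `binary_cross_entropy` and the toy values are hypothetical and not part of the repository.

```python
import math


def switch_targets(align):
    """Mirror align.ne(0).float(): 1.0 where a pointer exists, else 0.0."""
    return [1.0 if a != 0 else 0.0 for a in align]


def binary_cross_entropy(p_copy, targets, eps=1e-12):
    """Mean binary cross-entropy between copy probabilities and targets.

    Assumption: the repository's switch_loss_criterion behaves like this.
    """
    total = 0.0
    for p, t in zip(p_copy, targets):
        # Clamp to avoid log(0) for probabilities of exactly 0 or 1.
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(targets)


# Toy example: summary tokens 2 and 4 point at content plan entries.
align = [0, 0, 39, 0, 12]
p_copy = [0.1, 0.2, 0.9, 0.1, 0.8]

targets = switch_targets(align)
loss = binary_cross_entropy(p_copy, targets)
print(targets, round(loss, 4))
```

Under these assumptions, the loss pushes `p_copy` towards 1 for summary tokens that have a pointer and towards 0 for all other tokens.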

The entries in train-roto-ptrs.txt map each summary token to the matching entry in the content plan. For example, the last entry 245,39 in train_roto_ptrs[1] indicates that the 245th token in the summary matches the 39th content plan entry.
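The format described above can be sketched as follows. This is a simplified illustration, not the logic of `make_pointerfi`: it assumes that each line of the file holds space-separated `summary_index,plan_index` pairs, that indices are 0-based, and that a summary token matches a content plan entry when it equals the entry's value. The function names and the toy data are hypothetical.

```python
def build_pointers(summary_tokens, plan_values):
    """Pair each summary token with a content plan entry of the same value.

    Simplified assumption: exact string match, first matching entry wins.
    """
    first_pos = {}
    for j, value in enumerate(plan_values):
        first_pos.setdefault(value, j)
    return [
        (i, first_pos[tok])
        for i, tok in enumerate(summary_tokens)
        if tok in first_pos
    ]


def format_pointer_line(pointers):
    """Serialize pointers as 'i,j i,j ...' (assumed on-disk layout)."""
    return " ".join(f"{i},{j}" for i, j in pointers)


def parse_pointer_line(line):
    """Inverse of format_pointer_line."""
    return [tuple(int(x) for x in pair.split(",")) for pair in line.split()]


# Toy data: the values of four content plan entries and one summary.
plan_values = ["TeamA", "PlayerX", "17", "13"]
summary = "The TeamA won as PlayerX scored 17 points and grabbed 13 rebounds".split()

pointers = build_pointers(summary, plan_values)
line = format_pointer_line(pointers)
print(line)
```

For a custom dataset, the matching rule in `build_pointers` is the part that would most likely need to change, for instance to handle multi-token values or repeated numbers.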

Having said that, in later experiments I realized that such supervision is not strictly required: the model learns an accurate value of p_copy even without it.
Hence I would recommend commenting out any code that uses the supervision through pointers; the model should still work well enough.

wingedRuslan · 2021-02-02T21:49:22Z

Hi Ratish,

Thanks a lot for your prompt answer!

Your comment was so useful that I was able to create the same file for another dataset!

The entries in train-roto-ptrs.txt map each summary token to the matching entry in the content plan. For example, the last entry 245,39 in train_roto_ptrs[1] indicates that the 245th token in the summary matches the 39th content plan entry.

So far I have not tried commenting out the code that uses the supervision through pointers, since I want to keep things just the way you did them. Later I plan to run such an experiment and see whether the supervision is required.

ratishsp · 2021-02-03T09:29:02Z

Great!
All the best.

happycjksh · 2021-07-01T15:13:42Z

When I run data_utils.py, the terminal reports an index out of range error at line 593. Did you run into this problem?

ratishsp closed this as completed Feb 3, 2021

linkAmy mentioned this issue Feb 5, 2021

Train failed: inconsistent sequence length #29

Closed

ratishsp mentioned this issue May 9, 2021

data_utils.py List index out of range #8

Open
