Understanding train-roto-ptrs.txt file #26

Closed
wingedRuslan opened this issue Jan 27, 2021 · 5 comments


@wingedRuslan

Hi Ratish,

Thanks a lot for the insightful research paper and for making the codebase publicly available! I was able to train the model on boxscore-data and then use it for inference. Now I am interested in training your model on my own dataset and then using it for text generation.

Unfortunately, I encountered a problem with the following step (from the README page):

The train-roto-ptrs.txt file is available along with the dataset and can also be created by the following command:

python data_utils.py -mode ptrs -input_path $BASE/rotowire/train.json -train_content_plan $BASE/rotowire/inter/train_content_plan.txt -output_fi $BASE/rotowire/train-roto-ptrs.txt

Since my dataset is different from boxscore-data, I have to transform it manually into a format suitable for model training.

  1. "The input dataset for data2text-plan-py can be created by running the script create_dataset.py in scripts folder."

I have successfully prepared my dataset in the same format as the files in boxscore-data/rotowire/, namely: src_train.txt, train_content_plan.txt, tgt_train.txt, inter/train_content_plan.txt; src_valid.txt, tgt_valid.txt, valid_content_plan.txt, inter/valid_content_plan.txt; and test/src_test.txt, test/tgt_test.txt. The structure of these files is the same as that of the files obtained from boxscore-data.

  2. [Preprocessing] I could not run the preprocess.py script, since it requires the train-roto-ptrs.txt file.

The function for creating this file is:

def make_pointerfi(outfi, inp_file="rotowire/train.json", content_plan_inp="inter/train_content_plan", resolve_prons=False):

Because I have a different dataset, I cannot use the above-mentioned function, so I need to create the train-roto-ptrs.txt file myself. Unfortunately, despite going through the function implementation several times and analyzing the content of the file (please see the comments), I could not figure out how to create such a file from my dataset.

Can you please elaborate on the purpose of the train-roto-ptrs.txt file and briefly describe the steps by which it was created?

This is my current bottleneck and I would highly appreciate your help with this issue!

Many thanks in advance,
Ruslan

@wingedRuslan
Author

Analyzing the content of a file

Looking into the content of the original train-roto-ptrs.txt file did not help to understand how the file was created.

image

From my perspective, its entries are not related to any information in the other files:

image

image

@ratishsp, could you please help me figure out how the train-roto-ptrs.txt file was created?

@ratishsp
Owner

Hi Ruslan,
Glad to know that you find the paper and code useful!
The logic for train-roto-ptrs.txt is in the method

def make_pointerfi(outfi, inp_file="rotowire/train.json", content_plan_inp="inter/train_content_plan", resolve_prons=False):

The core idea is to provide supervision while training the copy mechanism.
switch_loss = self.switch_loss_criterion(p_copy, align.ne(0).float().view(-1, 1))
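
To make the role of `align` in that line concrete, here is a minimal pure-Python sketch of how a binary copy target can be derived from alignment indices and scored against `p_copy`. It assumes that `switch_loss_criterion` behaves like a binary cross-entropy summed over tokens, which the thread does not confirm; `copy_targets` and `switch_loss` are hypothetical names used only for illustration.

```python
import math

def copy_targets(align):
    """Mirror align.ne(0).float(): 1.0 where a token has a pointer, else 0.0."""
    return [1.0 if a != 0 else 0.0 for a in align]

def switch_loss(p_copy, align, eps=1e-12):
    """Summed binary cross-entropy between p_copy and the copy targets.

    Assumption: the real criterion behaves like a summed BCE loss.
    """
    targets = copy_targets(align)
    loss = 0.0
    for p, t in zip(p_copy, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return loss

# Three summary tokens: only the second one is aligned (to plan entry 39).
align = [0, 39, 0]
print(copy_targets(align))  # [0.0, 1.0, 0.0]
print(switch_loss([0.1, 0.9, 0.2], align))
```

In this sketch, a token with a nonzero alignment pushes `p_copy` towards 1 and a token without one pushes it towards 0, which is the supervision the pointer file would provide.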

The entries in train-roto-ptrs.txt contain a mapping between a summary token and the corresponding matching token in the content plan. For example, the last entry 245,39 in train_roto_ptrs[1] indicates that the 245th token in the summary matches the 39th content plan entry.
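
For a custom dataset, the mapping described above could be sketched roughly as follows. This is not the repository's `make_pointerfi`; it is a simplified illustration that assumes each content plan record stores its value as the first field before a `￨` (U+FFE8) separator, that a summary token is aligned to the first record whose value equals it exactly, and that one line holds space-separated `summary_idx,plan_idx` pairs. Indices are 0-based here purely for illustration; the offset the training code expects is not stated in this thread. `build_pointers` and `format_pointers` are hypothetical helpers.

```python
SEP = "\uffe8"  # assumed field separator inside a content plan record

def build_pointers(summary_tokens, plan_records, sep=SEP):
    """Return (summary_idx, plan_idx) pairs for tokens that match a record value.

    Assumptions (not taken from the repository): the record value is the
    first field of each record, matching is exact string equality, and the
    first matching record wins.
    """
    first_record_for_value = {}
    for plan_idx, record in enumerate(plan_records):
        value = record.split(sep)[0]
        first_record_for_value.setdefault(value, plan_idx)

    pointers = []
    for summary_idx, token in enumerate(summary_tokens):
        if token in first_record_for_value:
            pointers.append((summary_idx, first_record_for_value[token]))
    return pointers

def format_pointers(pointers):
    """Serialise pairs as one line of space-separated 'i,j' entries."""
    return " ".join(f"{i},{j}" for i, j in pointers)

# Made-up example data, not taken from the dataset.
summary = "Kyle Lowry scored 22 points".split()
plan = [
    SEP.join(["Kyle", "Kyle_Lowry", "FIRST_NAME", "HOME"]),
    SEP.join(["22", "Kyle_Lowry", "PTS", "HOME"]),
]
print(format_pointers(build_pointers(summary, plan)))  # 0,0 3,1
```

Exact string matching is the simplest possible rule; a real dataset may need extra handling, for instance for number words or multi-token entity names.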

Having said that, in later experiments I realized that such supervision is not strictly required. The model learns an accurate value of p_copy even without it.
Hence I would recommend commenting out any code that uses the supervision through pointers; the model should still work well enough.

@wingedRuslan
Author

Hi Ratish,

Thanks a lot for your prompt answer!

Your comment was so useful that I was able to create the same file for my own dataset!

The entries in train-roto-ptrs.txt contain a mapping between a summary token and the corresponding matching token in the content plan. For example, the last entry 245,39 in train_roto_ptrs[1] indicates that the 245th token in the summary matches the 39th content plan entry.
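
A generated file can be sanity-checked against this description with a small parser. The sketch below assumes one line per training example and space-separated `summary_idx,plan_idx` entries, which is inferred from the `245,39` example rather than confirmed; it also treats indices as 0-based, which is an assumption. `parse_pointer_line` and `check_bounds` are hypothetical names.

```python
def parse_pointer_line(line):
    """Parse 'i,j i,j ...' into a list of (summary_idx, plan_idx) tuples."""
    pairs = []
    for entry in line.split():
        summary_idx, plan_idx = entry.split(",")
        pairs.append((int(summary_idx), int(plan_idx)))
    return pairs

def check_bounds(pairs, summary_len, plan_len):
    """Return the pairs whose indices fall outside the summary or the plan."""
    return [
        (i, j) for i, j in pairs
        if not (0 <= i < summary_len and 0 <= j < plan_len)
    ]

pairs = parse_pointer_line("12,3 245,39")
print(pairs[-1])                     # (245, 39)
print(check_bounds(pairs, 246, 40))  # []
print(check_bounds(pairs, 200, 40))  # [(245, 39)]
```

Running such a check over every line, using the lengths of the matching summary and content plan, would flag pointers that refer to positions that do not exist.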

I have not yet tried commenting out the code that uses the supervision through pointers, since I want to keep things just the way you did! Later I plan to run such an experiment and see whether the supervision is required or not.

@ratishsp
Owner

ratishsp commented Feb 3, 2021

Great!
All the best.

@happycjksh

When I run data_utils.py, the terminal reports an index-out-of-range error at line 593. Did you run into the same problem?
