setup.sh fails when untarring cnn_stories.tgz #2

Open
JohnGiorgi opened this issue Jul 17, 2022 · 5 comments

Comments

@JohnGiorgi

Hi,

I am trying to re-create the data under the data subdirectory by following the instructions. With a Docker daemon running, I set up the environment as follows:

# Set up the environment
pyenv install miniconda3-3.7-4.12.0
conda create -n re-examining
conda activate re-examining

# Install re-examining
git clone https://github.com/CogComp/re-examining-correlations.git
cd re-examining-correlations
pip install -r requirements.txt
pip install sacrerouge
pip install repro

and then run

sh data/setup.sh

This process runs until it reaches the following step:

Untarring temp/summeval/raw/cnn_stories.tgz (it's pretty slow...)

at which point it errors out:

Traceback (most recent call last):
  File "/Users/johngiorgi/.pyenv/versions/miniconda3-3.7-4.12.0/bin/sacrerouge", line 8, in <module>
    sys.exit(main())
  File "/Users/johngiorgi/.pyenv/versions/miniconda3-3.7-4.12.0/lib/python3.7/site-packages/sacrerouge/__main__.py", line 8, in main
    args.func(args)
  File "/Users/johngiorgi/.pyenv/versions/miniconda3-3.7-4.12.0/lib/python3.7/site-packages/sacrerouge/commands/setup_dataset.py", line 25, in run
    args.subfunc(args)
  File "/Users/johngiorgi/.pyenv/versions/miniconda3-3.7-4.12.0/lib/python3.7/site-packages/sacrerouge/datasets/fabbri2020/subcommand.py", line 28, in run
    setup.setup(args.output_dir, args.force)
  File "/Users/johngiorgi/.pyenv/versions/miniconda3-3.7-4.12.0/lib/python3.7/site-packages/sacrerouge/datasets/fabbri2020/setup.py", line 349, in setup
    setup_documents(cnn_tar, dailymail_tar, output_dir, force)
  File "/Users/johngiorgi/.pyenv/versions/miniconda3-3.7-4.12.0/lib/python3.7/site-packages/sacrerouge/datasets/fabbri2020/setup.py", line 130, in setup_documents
    with tarfile.open(tar_path, 'r') as tar:
  File "/Users/johngiorgi/.pyenv/versions/miniconda3-3.7-4.12.0/lib/python3.7/tarfile.py", line 1580, in open
    raise ReadError("file could not be opened successfully")
tarfile.ReadError: file could not be opened successfully

Any idea why untarring cnn_stories.tgz would fail? The issue appears to originate in sacrerouge.

@danieldeutsch
Collaborator

Could you check to make sure that the cnn_stories file successfully downloaded? It looks like there's a problem opening it, so my guess is something is wrong with the file.
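
One way to run this check is sketched below. It reports the file size, whether the file starts with the two gzip magic bytes, and whether Python's tarfile module can open it. The helper name check_tgz is hypothetical and is not part of sacrerouge.

```python
import os
import tarfile


def check_tgz(path):
    """Report basic facts about a file that is supposed to be a .tgz archive."""
    with open(path, "rb") as f:
        head = f.read(2)
    return {
        "size_bytes": os.path.getsize(path),
        # Every gzip stream starts with the bytes 0x1f 0x8b.
        "gzip_magic": head == b"\x1f\x8b",
        # False when tarfile cannot open the file in any supported format.
        "is_tarfile": tarfile.is_tarfile(path),
    }
```

For example, check_tgz("temp/summeval/raw/cnn_stories.tgz") should show both flags as True for a complete download. If either is False, the file is probably not the archive.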

@JohnGiorgi
Author

JohnGiorgi commented Jul 18, 2022

I tried re-downloading cnn_stories.tgz by deleting temp/* and running sh data/setup.sh again, but I get the same error.

If I manually inspect the contents of the file, I can see it only managed to grab some HTML:

<!DOCTYPE html><html><head><title>Google Drive - Virus scan warning</title><meta http-equiv="content-type" content="text/html; charset=utf-8"/><style nonce="TOeDQ4nep9OjaIugDC8ZVg">/* Copyright 2022 Google Inc. All Rights Reserved. */
.goog-inline-block{position:relative;display:-moz-inline-box;display:inline-block}* html .goog-inline-block{display:inline}*:first-child+html .goog-inline-block{display:inline}.goog-link-button{position:relative;color:#15c;text-decoration:underline;cursor:pointer}.goog-link-button-disabled{color:#ccc;text-decoration:none;cursor:default}body{color:#222;font:normal 13px/1.4 arial,sans-serif;margin:0}.grecaptcha-badge{visibility:hidden}.uc-main{padding-top:50px;text-align:center}#uc-dl-icon{display:inline-block;margin-top:16px;padding-right:1em;vertical-align:top}#uc-text{display:inline-block;max-width:68ex;text-align:left}.uc-error-caption,.uc-warning-caption{color:#222;font-size:16px}#uc-download-link{text-decoration:none}.uc-name-size a{color:#15c;text-decoration:none}.uc-name-size a:visited{color:#61c;text-decoration:none}.uc-name-size a:active{color:#d14836;text-decoration:none}.uc-footer{color:#777;font-size:11px;padding-bottom:5ex;padding-top:5ex;text-align:center}.uc-footer a{color:#15c}.uc-footer a:visited{color:#61c}.uc-footer a:active{color:#d14836}.uc-footer-divider{color:#ccc;width:100%}</style><link rel="icon" href="null"/></head><body><div class="uc-main"><div id="uc-dl-icon" class="image-container"><div class="drive-sprite-aux-download-file"></div></div><div id="uc-text"><p class="uc-warning-caption">Google Drive can't scan this file for viruses.</p><p class="uc-warning-subcaption"><span class="uc-name-size"><a href="/open?id=0BwmD_VLjROrfTHk4NFg2SndKcjQ">cnn_stories.tgz</a> (151M)</span> is too large for Google to scan for viruses. 
Would you still like to download this file?</p><form id="downloadForm" action="https://docs.google.com/uc?export=download&amp;id=0BwmD_VLjROrfTHk4NFg2SndKcjQ&amp;confirm=t&amp;uuid=656a37f4-386b-46c5-ab03-a42da6c349fa" method="post"><input type="submit" id="uc-download-link" class="goog-inline-block jfk-button jfk-button-action" value="Download anyway"/></form></div></div><div class="uc-footer"><hr class="uc-footer-divider"></div></body></html>

I don't know how to avoid this. It looks like the sacrerouge.common.util.download_file_from_google_drive function is not working as expected. I see similar HTML content for dailymail_stories.tgz.
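
The warning page pasted above contains a form with id "downloadForm" whose action attribute holds the "Download anyway" URL. As a minimal sketch, assuming Google keeps serving a page with that layout (it may change at any time), the URL can be pulled out with the standard library. The function name extract_confirm_url is hypothetical, and this is not how sacrerouge handles the download.

```python
from html.parser import HTMLParser


class _DownloadFormParser(HTMLParser):
    """Remember the action of the first <form id="downloadForm"> seen."""

    def __init__(self):
        super().__init__()
        self.action = None

    def handle_starttag(self, tag, attrs):
        if tag == "form" and self.action is None:
            attributes = dict(attrs)
            if attributes.get("id") == "downloadForm":
                # HTMLParser has already unescaped &amp; to & here.
                self.action = attributes.get("action")


def extract_confirm_url(html_text):
    """Return the confirm-download URL from a warning page, or None."""
    parser = _DownloadFormParser()
    parser.feed(html_text)
    return parser.action
```

The sketch only parses the page and sends no request. Note that the form in the pasted HTML declares method="post", so a client following the extracted URL would presumably need to submit it the same way.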

@danieldeutsch
Collaborator

Ok, I have seen this problem before. The library I use to download from Google Drive is not reliable.

Until I have time to fix it, a workaround is to download the respective files from here and put them in the temp/summeval/raw/ directory. I believe the downloading code should see the files already exist and skip trying to download them.
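
The skip-if-present behavior described here can be sketched as follows. This illustrates the general pattern only and is not the actual sacrerouge code. The helper ensure_file and its fetch callback are hypothetical names.

```python
from pathlib import Path


def ensure_file(path, fetch):
    """Call fetch(path) only when path is missing or empty.

    Returns True if fetch was called, False if the existing file was kept.
    """
    path = Path(path)
    if path.is_file() and path.stat().st_size > 0:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    fetch(path)
    return True
```

With a check like this, a cnn_stories.tgz placed manually in temp/summeval/raw/ would be left alone. A leftover HTML error page would be left alone too, which is why bad files have to be deleted before rerunning.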

@JohnGiorgi
Author

JohnGiorgi commented Jul 19, 2022

Gotcha, this gets me past the first Google Drive-related hurdle, thanks! Unfortunately, the GDrive requests then start getting blocked:

AssertionError: pnbert_out_lstm_pn_rl has unequal lines in its src, ref, and out files, likely because Google Drive began denying requests. Delete the bad files and rerun.

If I manually inspect the files, I can see that is the case:

<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"/><title>Sorry...</title><style> body { font-family: verdana, arial, sans-serif; background-color: #fff; color: #000; }</style></head><body><div><table><tr><td><b><font face=sans-serif size=10><font color=#4285f4>G</font><font color=#ea4335>o</font><font color=#fbbc05>o</font><font color=#4285f4>g</font><font color=#34a853>l</font><font color=#ea4335>e</font></font></b></td><td style="text-align: left; vertical-align: bottom; padding-bottom: 15px; width: 50%"><div style="border-bottom: 1px solid #dfdfdf;">Sorry...</div></td></tr></table></div><div style="margin-left: 4em;"><h1>We're sorry...</h1><p>... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.</p></div><div style="margin-left: 4em;">See <a href="https://support.google.com/websearch/answer/86640">Google Help</a> for more information.<br/><br/></div><div style="text-align: center; border-top: 1px solid #dfdfdf;"><a href="https://www.google.com">Google Home</a></div></body></html>
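
To find the bad files that the assertion message says to delete, one option is to scan the download directory for files that begin with an HTML tag, as both error pages in this thread do. This is a rough sketch and find_html_responses is a hypothetical helper. It only lists candidates, so deleting them is left to the user.

```python
from pathlib import Path

# Lowercased prefixes seen at the start of the error pages in this thread.
HTML_PREFIXES = (b"<!doctype html", b"<html")


def find_html_responses(root):
    """Return files under root whose first bytes look like an HTML page."""
    bad = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as f:
            head = f.read(64).lstrip().lower()
        if head.startswith(HTML_PREFIXES):
            bad.append(path)
    return bad
```

Running it on the temp directory and reviewing the list before deleting anything avoids removing files that downloaded correctly.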

Any idea how long to wait between runs before GDrive allows these requests to go through again?
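
The thread does not establish how long Google Drive keeps blocking requests. One generic approach is to retry with a delay that doubles after each failure. In the sketch below, the function name, the 60-second starting delay, and the five tries are all arbitrary assumptions, not documented limits.

```python
import time


def retry_with_backoff(attempt, max_tries=5, base_delay=60.0, sleep=time.sleep):
    """Call attempt() until it returns True, doubling the wait after each failure.

    Returns True on success and False once max_tries attempts have failed.
    """
    for i in range(max_tries):
        if attempt():
            return True
        if i < max_tries - 1:
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * 2 ** i)
    return False
```

Here attempt would be a callable that deletes the bad files, reruns the download, and returns True once the files are no longer HTML.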

@danieldeutsch
Collaborator

Hopefully you were able to fix this 😬. Downloading files from Google Drive has always been a pain point, and I don't know of any way to download them more reliably. One day I will get around to fixing it...
