Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Downloading IMDB dataset for benchmarks gives 404 Not Found #13896

Closed
alihan-synnada opened this issue Dec 24, 2024 · 11 comments · Fixed by #13903
Closed

Downloading IMDB dataset for benchmarks gives 404 Not Found #13896

alihan-synnada opened this issue Dec 24, 2024 · 11 comments · Fixed by #13903
Labels
bug Something isn't working

Comments

@alihan-synnada
Copy link
Contributor

Describe the bug

Attempting to download the IMDB dataset gives the following error:

tar: Error opening archive: Unrecognized archive format

An IMDB.tgz is created with the following content:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL was not found on this server.</p>
</body></html>

It seems the dataset is removed or unavailable.

To Reproduce

Run benchmarks/bench.sh data imdb

Expected behavior

It should download the dataset, extract the csv files and convert to parquet.

Additional context

The related part in bench.sh

# Downloads the csv.gz files IMDB datasets from Peter Boncz's homepage(one of the JOB paper authors)
# http://homepages.cwi.nl/~boncz/job/imdb.tgz
data_imdb() {
local imdb_dir="${DATA_DIR}/imdb"
local imdb_temp_gz="${imdb_dir}/imdb.tgz"
local imdb_url="https://homepages.cwi.nl/~boncz/job/imdb.tgz"

@alihan-synnada alihan-synnada added the bug Something isn't working label Dec 24, 2024
@Spaarsh
Copy link
Contributor

Spaarsh commented Dec 24, 2024

@alihan-synnada This is more of an issue on the the CWI website. I went to the http://homepages.cwi.nl/~boncz/job/ URL (removing the last part of the url in the code), and found the imdb.tgz dataset link. Upon hovering over it, I came across the link that actually hosts the data: https://event.cwi.nl/da/job/imdb.tgz

I changed the url in the script here and now the code works as expected:
20241225_001005(1)

If this is an acceptable solution, I will make a PR.

Here is the content of the page displayed at http://homepages.cwi.nl/~boncz/job/ (note that even the protocol is wrong, maybe due to changes in the host website, it should be https), notice the imdb.tgz at the bottom:
image

@alihan-synnada
Copy link
Contributor Author

@Spaarsh Nice find! It works for me too. I think you can open the PR and hopefully it will be merged without much delay.

Spaarsh added a commit to Spaarsh/datafusion that referenced this issue Dec 25, 2024
The URL to the external website was returning a 404. Presuming recent changes in the external website's structure, the required data has been moved to a different URL. The commit ensures the new URL is used.
@Spaarsh
Copy link
Contributor

Spaarsh commented Dec 25, 2024

@alihan-synnada Done! Thanks! I came across an interesting thing. While reproducing the bug, the HTML response file had been stored at the same path /data/imdb as the original file was intended to. So when i ran the command again after making the changes, I had to manually remove that file since I was getting the "file already exists" error. I suppose these kind of commands ought to be atomic? So that in case of failure, all actions are rolled back? Should I open an issue for this?

@alihan-synnada
Copy link
Contributor Author

@Spaarsh Good catch. I'm not sure what the best approach be since deleting is a destructive operation. We can either prevent the creation of a bad file in the first place (e.g., by verifying its MD5 hash after downloading) and keep the "file already exists" error, or prompt the user for an overwrite if a file exists. I don't know how to do either in bash but I guess you could add it to the PR if it's a simple fix. Otherwise another issue would be great 👍

alamb pushed a commit that referenced this issue Dec 25, 2024
The URL to the external website was returning a 404. Presuming recent changes in the external website's structure, the required data has been moved to a different URL. The commit ensures the new URL is used.
@berkaysynnada
Copy link
Contributor

@Spaarsh Good catch. I'm not sure what the best approach be since deleting is a destructive operation. We can either prevent the creation of a bad file in the first place (e.g., by verifying its MD5 hash after downloading) and keep the "file already exists" error, or prompt the user for an overwrite if a file exists. I don't know how to do either in bash but I guess you could add it to the PR if it's a simple fix. Otherwise another issue would be great 👍

@Spaarsh do you plan to work on @alihan-synnada suggestions?

@alamb
Copy link
Contributor

alamb commented Dec 25, 2024

We can either prevent the creation of a bad file in the first place (e.g., by verifying its MD5 hash after downloading) and keep the "file already exists" error, or prompt the user for an overwrite if a file exists

Since this only affects a small number of people (anyone who has tried to download this data recently) I think it is a relatively minor thing to try and fix

If we want to fix it, checking the md5 sum would be my preferred way rather than making bench.sh ask for user input. Asking for user input will make it harder to script / run without attendance

@Spaarsh
Copy link
Contributor

Spaarsh commented Dec 25, 2024

@Spaarsh Good catch. I'm not sure what the best approach be since deleting is a destructive operation. We can either prevent the creation of a bad file in the first place (e.g., by verifying its MD5 hash after downloading) and keep the "file already exists" error, or prompt the user for an overwrite if a file exists. I don't know how to do either in bash but I guess you could add it to the PR if it's a simple fix. Otherwise another issue would be great 👍

@Spaarsh do you plan to work on @alihan-synnada suggestions?

Sure! I do have a different approach though. Before the file is not entirely downloaded, it can be named something similar to imdb.tmp.tgz. If any error occurs, we purge the file before exiting. If no errors are encountered, change the filename to the intended one. This will not need the user's input.

I think git also uses a similar mechanism while cloning a repo. I'll look into its details and show my findings here.

@alamb
Copy link
Contributor

alamb commented Dec 26, 2024

I think git also uses a similar mechanism while cloning a repo. I'll look into its details and show my findings here.

💯 makes a lot of sense

@Spaarsh
Copy link
Contributor

Spaarsh commented Dec 27, 2024

Okay so I went through the code of git clone command here. I also ran the command and did ls mid run, revealing that it had made the repo with the original name. When I cancelled the clone command, it simply removed all the contents and the directory as well. The git clone source code has a remove_junk function for this purpose. And just to clarify, the removal of a downloaded file shall be done only in case of an error or exception since those files won't be usable anyway.

I was wondering if there might me similar instances in other parts of our code? Should I make a PR for making this action atomic?

@alamb
Copy link
Contributor

alamb commented Dec 28, 2024

I was wondering if there might me similar instances in other parts of our code? Should I make a PR for making this action atomic?

Making downloads atomic in bench.sh seems like a good improvement to me (as it will prevent issues due to potentially large downloads being interrupted causing confusion)

Not sure about other areas in the code

@Spaarsh
Copy link
Contributor

Spaarsh commented Dec 29, 2024

I have worked a bit on these lines. I have also added traps that ensure that cleanup takes place if the user interrupts the downloads intentionally as well. Trying to apply it to other functions as well now. The output is this so far:

For the benchmarks/bench.sh data imdb command:

20241230_012510

For the benchmarks/bench.sh data clickbench_1 command:

20241230_012240

If this looks good enough, should open an issue for this and make a PR? Or directly open a PR?
I have pushed this code to my fork here. Suggestions are welcome!

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants