Downloading IMDB dataset for benchmarks gives 404 Not Found #13896
Comments
@alihan-synnada This is more of an issue on the CWI website. I went to http://homepages.cwi.nl/~boncz/job/ (the URL in the code with its last part removed) and found the imdb.tgz dataset link. Hovering over it showed the link that actually hosts the data:
https://event.cwi.nl/da/job/imdb.tgz
I changed the URL in the script and now the code works as expected. If this is an acceptable solution, I will make a PR. Here is the content of the page displayed at http://homepages.cwi.nl/~boncz/job/ (note that even the protocol is wrong, maybe due to changes in the host website; it should be https). Notice the imdb.tgz at the bottom.
@Spaarsh Nice find! It works for me too. I think you can open the PR and hopefully it will be merged without much delay.
The URL to the external website was returning a 404. Presuming recent changes in the external website's structure, the required data has been moved to a different URL. The commit ensures the new URL is used.
@alihan-synnada Done! Thanks! I came across an interesting thing. While reproducing the bug, the HTML response file had been stored at the same path, /data/imdb, where the original file was meant to go. So when I ran the command again after making the changes, I had to manually remove that file because I was getting the "file already exists" error. I suppose these kinds of commands ought to be atomic, so that in case of failure all actions are rolled back? Should I open an issue for this?
@Spaarsh Good catch. I'm not sure what the best approach would be, since deleting is a destructive operation. We can either prevent the creation of a bad file in the first place (e.g., by verifying its MD5 hash after downloading) and keep the "file already exists" error, or prompt the user for an overwrite if a file exists. I don't know how to do either in bash, but I guess you could add it to the PR if it's a simple fix. Otherwise another issue would be great 👍
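The checksum idea above could look roughly like the following. This is a minimal sketch, not code from bench.sh: `verify_md5` is a hypothetical helper, it assumes GNU `md5sum` is available (macOS ships `md5` instead), and the expected digest of imdb.tgz is not given anywhere in this thread, so it has to be supplied by whoever maintains the script.

```shell
#!/usr/bin/env bash
# Sketch only: verify_md5 is a hypothetical helper, not part of bench.sh.

# verify_md5 FILE EXPECTED_MD5
# Returns 0 when FILE's MD5 digest equals EXPECTED_MD5, and 1 otherwise.
# It never deletes anything; the caller decides what to do on a mismatch.
verify_md5() {
    local file="$1" expected="$2" actual
    # md5sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(md5sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" != "$expected" ]; then
        echo "checksum mismatch for $file: expected $expected, got '$actual'" >&2
        return 1
    fi
}

# Example call (both arguments are placeholders):
#   verify_md5 path/to/imdb.tgz "<known digest>" || echo "bad download" >&2
```

A check like this would have flagged the saved HTML error page right away, since its digest cannot match that of the real archive.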
@Spaarsh do you plan to work on @alihan-synnada's suggestions?
Since this only affects a small number of people (anyone who has tried to download this data recently), I think it is a relatively minor thing to try and fix. If we want to fix it, checking the
Sure! I do have a different approach though. Until the file is entirely downloaded, it can be named something like imdb.tmp.tgz. If any error occurs, we purge the file before exiting. If no errors are encountered, we rename the file to the intended name. This does not need the user's input. I think git also uses a similar mechanism while cloning a repo. I'll look into its details and show my findings here.
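The temporary-name approach described above can be sketched as follows. The names `fetch` and `atomic_download` are hypothetical and not taken from bench.sh, and the use of `curl` is an assumption; the script's actual download command is not shown in this thread.

```shell
#!/usr/bin/env bash
# Sketch only: fetch and atomic_download are hypothetical helpers.

# fetch URL OUTPUT
# Thin wrapper so the transport can be swapped out. With --fail, curl exits
# non-zero on HTTP errors such as 404 instead of reporting success after
# saving the error page.
fetch() {
    curl --fail --silent --show-error --location --output "$2" "$1"
}

# atomic_download URL DEST
# Downloads to DEST.tmp and renames it to DEST only when the fetch succeeds,
# so a failed run never leaves a file at DEST.
atomic_download() {
    local url="$1" dest="$2" tmp="$2.tmp"
    if fetch "$url" "$tmp"; then
        mv -- "$tmp" "$dest"
    else
        rm -f -- "$tmp"
        echo "download of $url failed; nothing written to $dest" >&2
        return 1
    fi
}

# Example call (destination path is a placeholder):
#   atomic_download "https://event.cwi.nl/da/job/imdb.tgz" path/to/imdb.tgz
```

Because the rename happens within one directory, the final file either appears complete or not at all, and an existing "file already exists" check on DEST keeps working unchanged.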
💯 makes a lot of sense
Okay, so I went through the code of the git clone command. I also ran the command and did
I was wondering if there might be similar instances in other parts of our code. Should I make a PR to make this action atomic?
Making downloads atomic in bench.sh seems like a good improvement to me (it will prevent confusion when potentially large downloads are interrupted). Not sure about other areas in the code.
I have worked a bit along these lines. I have also added traps that ensure cleanup takes place if the user intentionally interrupts the download. I am now trying to apply it to other functions as well. The output is this so far:
For the benchmarks/bench.sh data imdb command:
For the benchmarks/bench.sh data clickbench_1 command:
If this looks good enough, should I open an issue for this and make a PR? Or directly open a PR? Thanks!
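The code from that work is not shown in this thread, so the following is only an illustration of how trap-based cleanup might be arranged. `with_cleanup` is a hypothetical helper, and the exit codes 130 and 143 follow the usual shell convention of 128 plus the signal number.

```shell
#!/usr/bin/env bash
# Sketch only: with_cleanup is a hypothetical helper, not the code from the PR.

# with_cleanup DEST CMD [ARGS...]
# Runs CMD with its stdout redirected to DEST.tmp and renames the result to
# DEST on success. The temporary file is removed on failure or interruption.
with_cleanup() {
    local dest="$1" tmp="$1.tmp"
    shift
    (
        # The subshell keeps these traps from leaking into the caller's shell.
        trap 'rm -f -- "$tmp"' EXIT   # runs however the subshell ends
        trap 'exit 130' INT           # Ctrl-C: exit, which fires the EXIT trap
        trap 'exit 143' TERM
        "$@" > "$tmp" && mv -- "$tmp" "$dest"
    )
}

# Example call (curl writes the body to stdout by default; path is a placeholder):
#   with_cleanup path/to/imdb.tgz curl --fail --silent --location \
#       "https://event.cwi.nl/da/job/imdb.tgz"
```

After a successful rename the temporary file no longer exists, so the EXIT trap's `rm -f` does nothing in that case.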
Describe the bug
Attempting to download the IMDB dataset gives the following error:
An IMDB.tgz file is created with the following content:
It seems the dataset has been removed or is unavailable.
To Reproduce
Run benchmarks/bench.sh data imdb
Expected behavior
It should download the dataset, extract the CSV files, and convert them to Parquet.
Additional context
The related part in bench.sh:
datafusion/benchmarks/bench.sh, lines 458 to 463 in 6cfd1cf