Cannot load timit_asr data set #4422

Closed
bhaddow opened this issue May 30, 2022 · 6 comments · Fixed by #4424
Labels
bug Something isn't working

Comments

@bhaddow

bhaddow commented May 30, 2022

Describe the bug

I am trying to load the timit_asr data set. I have tried with a copy from the LDC, and a copy from deepai. In both cases they fail with a "duplicate key" error. With the LDC version I have to convert the file extensions all to upper-case before I can load it at all.

Steps to reproduce the bug

import datasets
# Sample code to reproduce the bug (data_dir points at the local TIMIT copy)
timit = datasets.load_dataset("timit_asr", data_dir="/path/to/dataset")

Expected results

The data set should load without error. It worked for me before the LDC url change.

Actual results

datasets.keyhash.DuplicatedKeysError: FAILURE TO GENERATE DATASET !
Found duplicate Key: SA1
Keys should be unique and deterministic in nature
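
The key `SA1` in this error is a TIMIT sentence ID. The two dialect sentences (SA1 and SA2) are read by every speaker, so a file stem alone cannot serve as a unique example key. Below is a minimal sketch of one way to derive unique keys, assuming the usual TIMIT layout `<split>/<dialect>/<speaker>/<sentence>.WAV`. The helper name `make_key` and the sample paths are illustrative assumptions, and this is not necessarily how the fix in #4424 works.

```python
from pathlib import PurePosixPath


def make_key(wav_path: str) -> str:
    """Build a key from dialect region, speaker ID, and sentence ID.

    Hypothetical helper: assumes <split>/<dialect>/<speaker>/<sentence>.WAV.
    """
    p = PurePosixPath(wav_path)
    sentence_id = p.stem             # e.g. "SA1"
    speaker_id = p.parent.name       # e.g. "FCJF0"
    dialect = p.parent.parent.name   # e.g. "DR1"
    return f"{dialect}_{speaker_id}_{sentence_id}"


# Two different speakers reading the same sentence (illustrative paths).
paths = ["TRAIN/DR1/FCJF0/SA1.WAV", "TRAIN/DR1/FDAW0/SA1.WAV"]

stems = [PurePosixPath(p).stem for p in paths]  # file stems collide
keys = [make_key(p) for p in paths]             # composite keys do not
```

Using the stem as the key gives `SA1` twice, which is the situation the error describes; including the speaker directory makes each key distinct.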

Environment info

  • datasets version: 2.2.2
  • Platform: Linux-5.4.0-90-generic-x86_64-with-glibc2.17
  • Python version: 3.8.12
  • PyArrow version: 8.0.0
  • Pandas version: 1.4.2
bhaddow added the bug label May 30, 2022
@albertvillanova
Member

Thanks for reporting, @bhaddow.

I'm fixing it.

@bhaddow
Author

bhaddow commented May 31, 2022

Thanks for the quick fix!

@albertvillanova
Member

albertvillanova commented Jun 1, 2022

@bhaddow we have also made a fix so that you don't have to convert the file extensions of the LDC data to uppercase.

Would you mind checking if it works OK now for you and reporting if there are any issues? Thanks.

@bhaddow
Author

bhaddow commented Jun 1, 2022

Hi @albertvillanova, it loads fine on a copy of the data from deepai, although I have to remove the duplicate copies of the .WAV files (the ones with the extension .WAV.wav). On the copy of the data obtained from the LDC, the glob still fails to find the files. The LDC copy looks like it was copied from a CD in 2004, so its structure may differ from a current download.

@bhaddow
Author

bhaddow commented Jun 1, 2022

Ah, if I change the train/ and test/ directories to TRAIN/ and TEST/ then it works!

@albertvillanova
Member

Thanks for your investigation and report, @bhaddow. I'm adding another fix for the TRAIN/train and TEST/test directory names.
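
For anyone hitting the same layout differences before upgrading, here is a rough sketch of case-insensitive file discovery that tolerates both `TRAIN/` and `train/` directories and both `.WAV` and `.wav` extensions. The function name `find_wavs` is hypothetical, and the sketch assumes the duplicate copies mentioned earlier in the thread are named like `SA1.WAV.wav`; it is not taken from the `timit_asr` loading script.

```python
from pathlib import Path


def find_wavs(data_dir, split):
    """Return audio files for a split, ignoring case in names.

    Hypothetical helper, not part of the datasets library.
    """
    root = Path(data_dir)
    # Match the split directory regardless of case (TRAIN/ or train/).
    split_dirs = [
        d for d in root.iterdir()
        if d.is_dir() and d.name.lower() == split.lower()
    ]
    wavs = []
    for split_dir in split_dirs:
        for path in split_dir.rglob("*"):
            name = path.name.lower()
            # Assumption: converted duplicates end in ".wav.wav"; skip them.
            if name.endswith(".wav.wav"):
                continue
            if path.is_file() and name.endswith(".wav"):
                wavs.append(path)
    return sorted(wavs)
```

Calling `find_wavs("/path/to/dataset", "train")` would then return the same files whether the copy came from an old CD layout or a recent download, provided the directory depth below the split folder is otherwise the same.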


2 participants