Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add error check in get_ext(file_name) function in gdc.py #1

Open
arp95 opened this issue Feb 19, 2019 · 3 comments
Open

Add error check in get_ext(file_name) function in gdc.py #1

arp95 opened this issue Feb 19, 2019 · 3 comments

Comments

@arp95
Copy link
Contributor

arp95 commented Feb 19, 2019

Hi,

The get_ext(filename) functions splits the filename on '.' and checks for extensions supported by a constant called _SUPPORTED_FILE_TYPES.

There is a case in which if the filename doesnt have any supported format available in the _SUPPORTED_FILE_TYPES array, in that case we are returning the format of the file which is not supported by our code. We should add an error check and return NULL or a default extension supported by _SUPPORTED_FILE_TYPES array incase the filename doesnt have a valid ext supported by our code.

@maryjgoldman
Copy link

@jingchunzhu

@yunhailuo yunhailuo assigned yunhailuo and unassigned yunhailuo Feb 19, 2019
@Ayoob7
Copy link

Ayoob7 commented Mar 8, 2019

Hello, I would like to fix the issue with the function, whats the procedure to push my branch ? Thanks in advance.

@yunhailuo
Copy link
Collaborator

Thank you for your interests and PR, @Ayoob7 . This issue is a little bit special. The bottom line is it is not a bug but can use some improvements in coding and documentation.

If I remember it correctly, back in the day, GDC has some files named with only UUID which I found not useful to me and decided to rename them while downloading. So if you check the use of get_ext, you would find that they are used for constructing the new filename. Here is why I have this function and how I'd like it to work:

Say we have a file named "uuid.txt.tar.gz", I want to keep all the "txt.tar.gz" part and replace uuid with some meaningful name, say becoming "patient1_cancer.txt.tar.gz" not "patient1_cancer.gz". So get_ext should return "txt.tar.gz". This is important because if you unzip and extract the file, you can and should still have a "txt" extension. This is why I didn't simply keep the last postfix, "gz", but has get_ext to return that joint string. For what @arp95 has mentioned, say txt, tar and gz are not supported. So the filename doesn't have any supported format available, it will return the last postfix, gz, which is not "supported". Yes indeed but it is what I want. I want to get at least the last part rather than nothing because I want an extension no matter what. In another word, get_ext would check filename from left to right and return as soon as it found a recognized extension type and if none of them is recognized it should return the last one so that I have a relevant extension for file renaming rather than nothing. In this sense, I say it is not a bug because it functions as I want.

That being said, this issue can be improved. Either it should not be called "supported" but something else like "recognized" and documented clearly, or the whole file downloading and renaming part should be rewrite. In my opinion, those are directions this issue should go.

I'll close #7 for now but feel free to comment or submit new PR if you have any suggestions. Again, thank you both for your interests.

yunhailuo added a commit to yunhailuo/xena-GDC-ETL that referenced this issue Jul 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants