
Is there a download link for the dataset used in Table 3 #58

Open
song2000012138 opened this issue Oct 15, 2023 · 17 comments

Comments

@song2000012138

How can I obtain the second dataset? Can you give me some guidance?

@song2000012138
Author

Is this data-2016.tar.gz the dataset used in Table 3?

@JNU-luyi

I feel like they are "train_data" and "test_data," but I'm curious about how these two files were generated.

Based on your description, the data collected before January 2016 serves as the training set, and the experimental annotations collected between January and October 2016 are used as the test set. I have already downloaded the UniProt Swiss-Prot data files for January 2016 and October 2016 from the FTP site. Is there any code available for extracting the test set from this data? I am eager to learn how this part is done. Thank you!

@coolmaksat
Contributor

Hi,
We use the uniprot_sprot.dat.gz files from UniProt and parse them using the uni2pandas.py script. It generates a pandas DataFrame which is saved in the swissprot.pkl file. This file is then passed to the deepgoplus_data.py script, which splits it into train/valid/test sets.
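
For readers following along, here is a minimal sketch of what that parsing step might look like. It is not the project's actual uni2pandas.py; it assumes BioPython is installed, and the DataFrame column names and evidence-code filter are illustrative:

```python
# Sketch of the uniprot_sprot.dat.gz -> swissprot.pkl step, analogous in
# spirit to uni2pandas.py but not the repository's exact code.
import gzip

import pandas as pd
from Bio import SwissProt

# Evidence codes commonly treated as experimental (an assumption here).
EXP_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC"}

def parse_swissprot(path):
    rows = []
    with gzip.open(path, "rt") as handle:
        for record in SwissProt.parse(handle):
            # GO cross-references look like
            # ('GO', 'GO:0005524', 'F:ATP binding', 'IDA:UniProtKB').
            gos = [xref[1] for xref in record.cross_references
                   if xref[0] == "GO" and len(xref) > 3
                   and xref[3].split(":")[0] in EXP_CODES]
            if gos:
                rows.append({"proteins": record.entry_name,
                             "accessions": record.accessions,
                             "sequences": record.sequence,
                             "annotations": gos})
    return pd.DataFrame(rows)

df = parse_swissprot("uniprot_sprot.dat.gz")
df.to_pickle("swissprot.pkl")
```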

@JNU-luyi

[screenshot: portion of deepgoplus_data.py]

My screenshot is a portion of deepgoplus_data.py. It seems that the purpose of this script is solely to split swissprot.pkl into training and validation sets based on a specified ratio.

I did indeed do it this way. I first downloaded the uniprot_sprot.dat files for January 2016 and October 2016. Subsequently, I used the uni2pandas.py script to generate swissprot.pkl. However, the deepgoplus_data.py script only divides this swissprot.pkl into training and testing sets based on a specified ratio, rather than following the procedure outlined in the paper, where data up to January 2016 is used as the training set and experimental annotations added from January to October are used as the testing set. I'd like to know how that part of the code is implemented.
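
For contrast, the ratio-based behaviour described above is roughly of this shape. This is an illustrative sketch only; the 0.9 ratio, the seed, and the column handling are assumptions, not the repository's exact code:

```python
import numpy as np
import pandas as pd

# Random ratio-based split: shuffle row positions, then cut at a fixed
# fraction. The 0.9 ratio is an assumed example value.
df = pd.read_pickle("swissprot.pkl")

index = np.arange(len(df))
np.random.seed(0)        # fixed seed so the split is reproducible
np.random.shuffle(index)

split = int(len(df) * 0.9)
train_df = df.iloc[index[:split]]
test_df = df.iloc[index[split:]]
```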

@coolmaksat
Contributor

Hi,
Yes, you are right. For some reason, I changed this implementation to a random split. I found the old code which does the time split here: https://github.com/bio-ontology-research-group/deepgoplus/blob/7fb4af440db6df67d3e3e90cb663156110bfaf09/deepgoplus_data.py
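
The time-based split in that old commit boils down to a set difference on protein identifiers: proteins present in the October 2016 snapshot but absent from the January 2016 snapshot form the test set. A minimal sketch of the idea, with assumed file and column names rather than the commit's exact code:

```python
import pandas as pd

# Time-based split: the January 2016 snapshot is the training set, and
# October 2016 proteins not seen in January become the test set.
# File and column names here are assumed for illustration.
train_df = pd.read_pickle("swissprot_exp201601.pkl")
new_df = pd.read_pickle("swissprot_exp201610.pkl")

old_prots = set(train_df["proteins"])
test_df = new_df[~new_df["proteins"].isin(old_prots)]

train_df.to_pickle("train_data.pkl")
test_df.to_pickle("test_data.pkl")
```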

@JNU-luyi

Thank you very much. Is the link you just provided from an older version of the project's scripts? Can that old version be found in your repository?

@coolmaksat
Contributor

Yes, it is a link to an old commit.

@JNU-luyi

I attempted to view 'swissprot_exp201601.pkl' and 'swissprot_exp201610.pkl' in the 'data-2016' directory. The former contains 65,028 records, which aligns with the results mentioned in the paper. However, I'm wondering why 'swissprot_exp201610.pkl' shows only 55,871 records; shouldn't the October data have a larger volume? I also tried downloading the January 2016 and October 2016 data from the UniProt website and processed them using the uni2pandas.py script, which yielded 65,028 proteins for January 2016; however, my October 2016 data does not have 55,871 records. I'm currently trying to generate the test set of 1,788 proteins following the method in your paper, but I'm unsure where the issue lies and am unable to complete it.

@coolmaksat
Contributor

Hi, I think it doesn't have to have more records. Sometimes UniProtKB removes some of the records and adds new ones.
Try to regenerate the pkl files with the originals from UniProtKB.

@JNU-luyi

[screenshots: UniProt download source and generated October 2016 dataset]

I downloaded the data from the same source, and using that file I generated a dataset for October 2016 with 67,271 entries. However, your 'swissprot_exp201610.pkl' file contains only 55,871 entries. I've been struggling with this issue for several days now and have been unable to generate the 1,788-protein test set mentioned in the paper. I can't identify the root cause of the problem and would sincerely appreciate your assistance.

@coolmaksat
Contributor

How many proteins do you get in each sub-ontology test set?

@JNU-luyi

JNU-luyi commented Oct 27, 2023 via email

@coolmaksat
Contributor

You will not get 1,788 proteins in a single set; that is the number of unique proteins across all sub-datasets. What you get is three test sets, one for each sub-ontology. Please check the numbers and compare the test sets for each sub-ontology.
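
Under that reading, 1,788 is the size of the union of proteins across the three per-ontology test sets, which can be checked with a few lines (the file layout and column name are assumptions):

```python
import pandas as pd

# Assumed layout: one test pickle per sub-ontology directory (mf, bp, cc).
unique_proteins = set()
for ont in ("mf", "bp", "cc"):
    test_df = pd.read_pickle(f"{ont}/test_data.pkl")
    unique_proteins |= set(test_df["proteins"])

# With the paper's data this union should come to 1,788 proteins.
print(len(unique_proteins))
```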

@JNU-luyi

Possibly you may not have understood my point. My current concern is the generation of the test set in your second dataset, specifically the creation of the 1,788 data points mentioned in the paper.

According to the logic in your script, the general process is as follows:

1. Download the original source files for uniprot_2016_01 and uniprot_2016_10.
2. Use the uni2pandas.py script to convert these two datasets into pandas DataFrames and save them as pkl files.
3. During the conversion, I observed that the 2016_01 data indeed contains 65,028 entries, which can be used directly as the training set. However, for the 2016_10 data, the pkl file I generated contains 67,271 entries, while your "swissprot_exp201610.pkl" file includes only 55,871.
4. In the deepgoplus_data.py file you recently sent me, there is code for dividing the test set based on time. Roughly, it removes the protein IDs contained in the January 2016 data from the October 2016 data, leaving the rest as the test set. Because the protein entries in the 2016_10 pkl file I generated are inconsistent with your "swissprot_exp201610.pkl", the number of entries in the generated test set is also inconsistent.

This is where my confusion lies. I'm not sure whether the issue is with the UniProt source files I downloaded or with the data processing; I used your project's scripts for the processing. I'll include screenshots of the URLs of my source file downloads below; see also the comparison sketch that follows.

[screenshot: UniProt source file download URLs]
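
One way to narrow down such a discrepancy is to diff the protein sets of the locally generated pickle against the released one. A small diagnostic sketch; the local filename and the 'proteins' column are assumptions:

```python
import pandas as pd

# Compare a locally generated October 2016 DataFrame against the released
# swissprot_exp201610.pkl to see which entries differ on each side.
# 'my_swissprot_exp201610.pkl' is an assumed local filename.
mine = pd.read_pickle("my_swissprot_exp201610.pkl")
theirs = pd.read_pickle("swissprot_exp201610.pkl")

only_mine = set(mine["proteins"]) - set(theirs["proteins"])
only_theirs = set(theirs["proteins"]) - set(mine["proteins"])
print(f"only in mine: {len(only_mine)}, only in theirs: {len(only_theirs)}")
```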

@JNU-luyi

> Is this data-2016.tar.gz the dataset used in Table 3?

Have you solved this issue? May I ask you some questions?

@JNU-luyi

> You will not get 1,788 proteins in a single set; that is the number of unique proteins across all sub-datasets. What you get is three test sets, one for each sub-ontology.

Hi

@lamsiharsiahaan

[screenshot: CAFA5 Train and Test (Target) data]

Here I have the Train and Test (Target) data from CAFA5. How can I apply the data I have to the uni2pandas.py code file? Can you help me?
