Is there a download link for the dataset used in Table 3? #58
Is this data-2016.tar.gz the dataset used in Table 3?
I feel like they are "train_data" and "test_data", but I'm curious how these two files were generated. Based on your description, the data collected before January 2016 serves as the training set, and the experimental annotations collected between January and October 2016 are used as the test set. I have already downloaded the UniProt Swiss-Prot data files for January 2016 and October 2016 from the FTP. Is there any code available for extracting the test set from this data? I am eager to learn how this part is done. Thank you!
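For illustration, here is a minimal sketch of the time-based split described above, assuming both snapshots have already been converted into pandas DataFrames; the 'proteins' column name and the file paths are assumptions for illustration, not the repository's confirmed layout:

```python
import pandas as pd

# Load the two Swiss-Prot snapshots (already converted to DataFrames).
jan_df = pd.read_pickle('data/swissprot_exp201601.pkl')  # January 2016
oct_df = pd.read_pickle('data/swissprot_exp201610.pkl')  # October 2016

# Training set: proteins with experimental annotations in January 2016.
train_proteins = set(jan_df['proteins'])

# Test set: proteins that gained experimental annotations only after
# January 2016, i.e. present in the October snapshot but absent in January.
test_df = oct_df[~oct_df['proteins'].isin(train_proteins)]
print(len(jan_df), 'training proteins,', len(test_df), 'test proteins')
```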
Hi,
My screenshot shows a portion of deepgoplus_data.py. It seems that this script only splits swissprot.pkl into training and validation sets at a specified ratio. That is in fact what I did: I first downloaded the uniprot_sprot.dat files for January 2016 and October 2016, then used the uni2pandas.py script to generate swissprot.pkl. However, deepgoplus_data.py only divides this swissprot.pkl into training and testing sets by ratio, rather than following the procedure outlined in the paper, where data up to January 2016 is used as the training set and experimental annotations added between January and October are used as the testing set. I'd like to know how that part of the code is implemented.
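For contrast, this is roughly what a ratio-based split like the one in the current deepgoplus_data.py looks like; the 0.9 fraction and the seed are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.read_pickle('data/swissprot.pkl')

# Shuffle row indices deterministically, then cut at a fixed ratio.
index = np.arange(len(df))
np.random.seed(0)
np.random.shuffle(index)
split = int(len(df) * 0.9)

train_df = df.iloc[index[:split]]
valid_df = df.iloc[index[split:]]
```

This produces a random split of a single snapshot, which is different from the time-based split described in the paper.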
Hi,
Thank you very much. Is the link you just provided from the scripts of your previous project? Can that old project be found in your repository?
Yes, it is a link to an old commit.
I attempted to view 'swissprot_exp201601.pkl' and 'swissprot_exp201610.pkl' in the 'data-2016' directory. The former contains 65,028 records, which aligns with the results reported in the paper. However, I'm wondering why 'swissprot_exp201610.pkl' contains only 55,871 records; shouldn't the October data be larger? I also downloaded the January 2016 and October 2016 data from the UniProt website and processed them with the 'uni2pandas.py' script, which yielded 65,028 proteins for January 2016, but my October 2016 data does not come out to 55,871 records. I'm trying to generate the test set of 1,788 proteins following the method in your paper, but I can't tell where the issue lies and am unable to complete it.
Hi, I think it doesn't have to have more records. Sometimes UniProtKB removes some records and adds new ones.
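A quick way to check this is to diff the protein sets of the two snapshots. A sketch, assuming both pickles are DataFrames with a 'proteins' column (an assumption carried over from the earlier messages):

```python
import pandas as pd

jan = set(pd.read_pickle('data/swissprot_exp201601.pkl')['proteins'])
oct_rel = set(pd.read_pickle('data/swissprot_exp201610.pkl')['proteins'])

# Set differences show how the releases changed between snapshots.
print('removed since January:', len(jan - oct_rel))
print('added since January:', len(oct_rel - jan))
print('present in both:', len(jan & oct_rel))
```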
I downloaded the data from the same source, and from that file I generated a dataset for October 2016 with 67,271 entries. However, your 'swissprot_exp201610.pkl' file contains only 55,871 entries. I've been struggling with this for several days and have been unable to generate the 1,788-protein test set mentioned in the paper. I can't identify the root cause of the problem and would sincerely appreciate your assistance.
How many proteins do you get in each sub-ontology test set?
I didn't check the counts for each sub-ontology. My current problem is that the protein entries I generated from the October 2016 data differ from the entries in the files you provided, so I couldn't extract the 1,788-protein test set.
You will not get 1,788 proteins in a single set; that is the number of unique proteins across all sub-datasets. What you get is three test sets, one for each sub-ontology. Please check the numbers and compare the test sets for each sub-ontology.
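To verify this, one could count the union of proteins across the three sub-ontology test sets. The file names below are hypothetical placeholders for the per-ontology test pickles, not the repository's confirmed names:

```python
import pandas as pd

unique_proteins = set()
for sub in ('mf', 'bp', 'cc'):
    df = pd.read_pickle(f'data/test-{sub}.pkl')  # hypothetical file name
    print(sub, 'test set size:', len(df))
    unique_proteins |= set(df['proteins'])

# The 1,788 figure should match this union, not any single set's size.
print('unique proteins across all three:', len(unique_proteins))
```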
Have you solved this issue? May I ask you some questions?
Hi, how can I get the second dataset? Can you give me some guidance?