
Is there a download link for the dataset used in Table 3 #58

Open
song2000012138 opened this issue Oct 15, 2023 · 17 comments

Comments

@song2000012138

How can I obtain the second dataset? Can you give me some guidance?

@song2000012138
Author

Is this data-2016.tar.gz the dataset used in Table 3?

@JNU-luyi

I feel like they are "train_data" and "test_data," but I'm curious about how these two files were generated.

Based on your description, the data collected before January 2016 serves as the training set, and the experimental annotations collected between January and October 2016 are used as the test set. I have already downloaded the UniProt Swiss-Prot data files for January 2016 and October 2016 from the FTP site. Is there any code available for extracting the test set from this data? I am eager to learn how this part is done. Thank you!

@coolmaksat
Contributor

Hi,
We use the uniprot_sprot.dat.gz files from UniProt and parse them using the uni2pandas.py script. It generates a pandas DataFrame which is saved in the swissprot.pkl file. This file is then passed to the deepgoplus_data.py script, which splits it into train/valid/test sets.
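
For readers following along, here is a minimal sketch of what that parsing step might look like. It is not the project's actual uni2pandas.py; it assumes BioPython is installed, and the DataFrame column names and evidence-code filter are illustrative:

```python
# Sketch of the uniprot_sprot.dat.gz -> swissprot.pkl step, analogous in
# spirit to uni2pandas.py but not the repository's exact code.
import gzip

import pandas as pd
from Bio import SwissProt

# Evidence codes commonly treated as experimental (an assumption here).
EXP_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC"}

def parse_swissprot(path):
    rows = []
    with gzip.open(path, "rt") as handle:
        for record in SwissProt.parse(handle):
            # GO cross-references look like
            # ('GO', 'GO:0005524', 'F:ATP binding', 'IDA:UniProtKB').
            gos = [xref[1] for xref in record.cross_references
                   if xref[0] == "GO" and len(xref) > 3
                   and xref[3].split(":")[0] in EXP_CODES]
            if gos:
                rows.append({"proteins": record.entry_name,
                             "accessions": record.accessions,
                             "sequences": record.sequence,
                             "annotations": gos})
    return pd.DataFrame(rows)

df = parse_swissprot("uniprot_sprot.dat.gz")
df.to_pickle("swissprot.pkl")
```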

@JNU-luyi

[screenshot: portion of deepgoplus_data.py]

My screenshot is a portion of deepgoplus_data.py. It seems that the purpose of this script is solely to split swissprot.pkl into training and validation sets based on a specified ratio.

I did indeed do it this way. I first downloaded the uniprot_sprot.dat files for January 2016 and October 2016. Subsequently, I used the uni2pandas.py script to generate swissprot.pkl. However, the deepgoplus_data.py script only divides this swissprot.pkl into training and testing sets based on a specified ratio, rather than following the procedure outlined in the paper, where data up to January 2016 is used as the training set and experimental annotations added from January to October are used as the testing set. I'd like to know how that part of the code is implemented.
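
For contrast, the ratio-based behaviour described above is roughly of this shape. This is an illustrative sketch only; the 0.9 ratio, the seed, and the column handling are assumptions, not the repository's exact code:

```python
import numpy as np
import pandas as pd

# Random ratio-based split: shuffle row positions, then cut at a fixed
# fraction. The 0.9 ratio is an assumed example value.
df = pd.read_pickle("swissprot.pkl")

index = np.arange(len(df))
np.random.seed(0)        # fixed seed so the split is reproducible
np.random.shuffle(index)

split = int(len(df) * 0.9)
train_df = df.iloc[index[:split]]
test_df = df.iloc[index[split:]]
```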

@coolmaksat
Contributor

Hi,
Yes, you are right. For some reason, I changed this implementation to a random split. I found the old code which does the time split here: https://github.com/bio-ontology-research-group/deepgoplus/blob/7fb4af440db6df67d3e3e90cb663156110bfaf09/deepgoplus_data.py
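
The time-based split in that old commit boils down to a set difference on protein identifiers: proteins present in the October 2016 snapshot but absent from the January 2016 snapshot form the test set. A minimal sketch of the idea, with assumed file and column names rather than the commit's exact code:

```python
import pandas as pd

# Time-based split: the January 2016 snapshot is the training set, and
# October 2016 proteins not seen in January become the test set.
# File and column names here are assumed for illustration.
train_df = pd.read_pickle("swissprot_exp201601.pkl")
new_df = pd.read_pickle("swissprot_exp201610.pkl")

old_prots = set(train_df["proteins"])
test_df = new_df[~new_df["proteins"].isin(old_prots)]

train_df.to_pickle("train_data.pkl")
test_df.to_pickle("test_data.pkl")
```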

@JNU-luyi

Thank you very much. Is the link you just provided from an older version of the project's scripts? Can that old version be found in your repository?

@coolmaksat
Contributor

Yes, it is a link to an old commit.

@JNU-luyi

I attempted to view 'swissprot_exp201601.pkl' and 'swissprot_exp201610.pkl' in the 'data-2016' directory. The former contains 65,028 records, which aligns with the results mentioned in the paper. However, I'm wondering why 'swissprot_exp201610.pkl' shows only 55,871 records; shouldn't the October data have a larger volume? I also tried downloading the January 2016 and October 2016 data from the UniProt website and processed them using the uni2pandas.py script, which yielded 65,028 proteins for January 2016; however, my October 2016 data does not have 55,871 records. I'm currently trying to generate the test set of 1,788 proteins following the method in your paper, but I'm unsure where the issue lies and am unable to complete it.

@coolmaksat
Contributor

Hi, I think it doesn't have to have more records. Sometimes UniProtKB removes some of the records and adds new ones.
Try to regenerate the pkl files with the originals from UniProtKB.

@JNU-luyi

[screenshots: UniProt download source and generated October 2016 dataset]

I downloaded the data from the same source, and using that file I generated a dataset for October 2016 with 67,271 entries. However, your 'swissprot_exp201610.pkl' file contains only 55,871 entries. I've been struggling with this issue for several days now and have been unable to generate the 1,788-protein test set mentioned in the paper. I can't identify the root cause of the problem and would sincerely appreciate your assistance.

@coolmaksat
Contributor

How many proteins do you get in each sub-ontology test set?

@JNU-luyi

JNU-luyi commented Oct 27, 2023 via email

@coolmaksat
Contributor

You will not get 1,788 proteins in a single set; that is the number of unique proteins across all sub-datasets. What you get is three test sets, one for each sub-ontology. Please check the numbers and compare the test sets for each sub-ontology.
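
Under that reading, 1,788 is the size of the union of proteins across the three per-ontology test sets, which can be checked with a few lines (the file layout and column name are assumptions):

```python
import pandas as pd

# Assumed layout: one test pickle per sub-ontology directory (mf, bp, cc).
unique_proteins = set()
for ont in ("mf", "bp", "cc"):
    test_df = pd.read_pickle(f"{ont}/test_data.pkl")
    unique_proteins |= set(test_df["proteins"])

# With the paper's data this union should come to 1,788 proteins.
print(len(unique_proteins))
```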

@JNU-luyi

Possibly you may not have understood my point. My current concern is the generation of the test set in your second dataset, specifically the creation of the 1,788 data points mentioned in the paper.

According to the logic in your script, the general process is as follows:

1. Download the original source files for uniprot_2016_01 and uniprot_2016_10.
2. Use the uni2pandas.py script to convert these two datasets into pandas DataFrames and save them as pkl files.
3. During the conversion, I observed that the 2016_01 data indeed contains 65,028 entries, which can be used directly as the training set. However, for the 2016_10 data, the pkl file I generated contains 67,271 entries, while your "swissprot_exp201610.pkl" file includes only 55,871.
4. In the deepgoplus_data.py file you recently sent me, there is code for dividing the test set based on time. Roughly, it removes the protein IDs contained in the January 2016 data from the October 2016 data, leaving the rest as the test set. Because the protein entries in the 2016_10 pkl file I generated are inconsistent with your "swissprot_exp201610.pkl", the number of entries in the generated test set is also inconsistent.

This is where my confusion lies. I'm not sure whether the issue is with the UniProt source files I downloaded or with the data processing; I used your project's scripts for the processing. I'll include screenshots of the URLs of my source file downloads below; see also the comparison sketch that follows.

[screenshot: UniProt source file download URLs]
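
One way to narrow down such a discrepancy is to diff the protein sets of the locally generated pickle against the released one. A small diagnostic sketch; the local filename and the 'proteins' column are assumptions:

```python
import pandas as pd

# Compare a locally generated October 2016 DataFrame against the released
# swissprot_exp201610.pkl to see which entries differ on each side.
# 'my_swissprot_exp201610.pkl' is an assumed local filename.
mine = pd.read_pickle("my_swissprot_exp201610.pkl")
theirs = pd.read_pickle("swissprot_exp201610.pkl")

only_mine = set(mine["proteins"]) - set(theirs["proteins"])
only_theirs = set(theirs["proteins"]) - set(mine["proteins"])
print(f"only in mine: {len(only_mine)}, only in theirs: {len(only_theirs)}")
```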

@JNU-luyi

> Is this data-2016.tar.gz the dataset used in Table 3?

Have you solved this issue? May I ask you some questions?

@JNU-luyi

> You will not get 1,788 proteins in a single set; that is the number of unique proteins across all sub-datasets. What you get is three test sets, one for each sub-ontology.

Hi

@lamsiharsiahaan

[screenshot: CAFA5 Train and Test (Target) data]

Here I have the Train and Test (Target) data from CAFA5. How can I apply the data I have to the uni2pandas.py code file? Can you help me?
