[BUG] Duplicated metadata when querying metadata for single run accession #89

Closed
kpj opened this issue Dec 16, 2020 · 9 comments
Labels
bug Something isn't working

Comments

@kpj

kpj commented Dec 16, 2020

Describe the bug
In some cases, SRAweb.sra_metadata returns multiple metadata rows when called with a single run accession. It would be more sensible to return only the metadata for the requested run accession.
This is problematic, for example, when retrieving metadata for a list of samples and expecting the number of rows to equal the number of queried samples.

To Reproduce
Execute the following code:

>>> from pysradb.sraweb import SRAweb

>>> db = SRAweb()
>>> db.sra_metadata('SRR12169246', detailed=True)  # returns metadata for both SRR12169246 and SRR12169247
#   run_accession study_accession experiment_accession  ...                                                                       ena_fastq_ftp ena_fastq_ftp_1 ena_fastq_ftp_2
# 0  SRR12169247   SRP270837       SRX8684079           ...  [email protected]:vol1/fastq/SRR121/047/SRR12169247/SRR12169247.fastq.gz  N/A             N/A           
# 1  SRR12169246   SRP270837       SRX8684079           ...  [email protected]:vol1/fastq/SRR121/046/SRR12169246/SRR12169246.fastq.gz  N/A             N/A           

[2 rows x 32 columns]
>>> db.sra_metadata('SRR12169247', detailed=True)  # returns metadata for both SRR12169246 and SRR12169247
#   run_accession study_accession experiment_accession  ...                                                                       ena_fastq_ftp ena_fastq_ftp_1 ena_fastq_ftp_2
# 0  SRR12169247   SRP270837       SRX8684079           ...  [email protected]:vol1/fastq/SRR121/047/SRR12169247/SRR12169247.fastq.gz  N/A             N/A           
# 1  SRR12169246   SRP270837       SRX8684079           ...  [email protected]:vol1/fastq/SRR121/046/SRR12169246/SRR12169246.fastq.gz  N/A             N/A           

[2 rows x 32 columns]

Desktop:

  • OS: Linux
  • Python version: 3.8.5
  • pysradb version: 0.11.2-dev0
@kpj kpj added the bug Something isn't working label Dec 16, 2020
@saketkc
Owner

saketkc commented Dec 16, 2020

Thanks for the bug report @kpj! I think two runs are returned because the same thing happens when you search for this accession on the NCBI SRA website. For example, see: https://www.ncbi.nlm.nih.gov/sra/?term=SRR12169246
That said, it can be handled internally - I will get to it this week.

@kpj
Author

kpj commented Dec 16, 2020

Thanks! I came across a similar issue when fetching metadata manually and ended up subsetting the dataframe.

Maybe there's a better way of handling this.
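
A minimal sketch of the subsetting workaround described above, assuming the result is a pandas DataFrame with a `run_accession` column as in the reported output. The helper name `keep_requested_runs` is hypothetical, and the small frame below is a stand-in for the real query result (only the three accession columns from the report are reproduced).

```python
import pandas as pd


def keep_requested_runs(metadata: pd.DataFrame, run_accessions) -> pd.DataFrame:
    """Return only the rows whose run_accession was explicitly requested.

    Hypothetical helper, not part of pysradb.
    """
    if isinstance(run_accessions, str):
        run_accessions = [run_accessions]
    mask = metadata["run_accession"].isin(run_accessions)
    return metadata.loc[mask].reset_index(drop=True)


# Stand-in for the frame returned by
# db.sra_metadata("SRR12169246", detailed=True) in the report above.
metadata = pd.DataFrame(
    {
        "run_accession": ["SRR12169247", "SRR12169246"],
        "study_accession": ["SRP270837", "SRP270837"],
        "experiment_accession": ["SRX8684079", "SRX8684079"],
    }
)

subset = keep_requested_runs(metadata, "SRR12169246")
print(subset)  # a single row, for SRR12169246
```

Because the filter uses `isin`, the same helper also accepts a list of run accessions, which makes it easy to check afterwards that the number of rows equals the number of queried runs.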

@saketkc
Owner

saketkc commented Dec 25, 2020

For now, I would recommend the fix you have in place. It is slightly tricky to deal with this internally, given that the passed-in argument could be anything (SRP/SRR/SRX/GSM etc.). The origin of this is not on the pysradb end but in what the NCBI search itself returns (see the comment above).
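
To illustrate why the argument type matters, here is a rough sketch of how one might guess which column an accession belongs to from its prefix. This is not how pysradb works internally; the function name and the mapping are assumptions. The `run_accession`, `experiment_accession`, and `study_accession` columns appear in the output above, while `sample_accession` is assumed to exist. GEO identifiers such as GSM are deliberately left unmapped because no matching column is shown in this thread.

```python
import re
from typing import Optional

# Assumed mapping from the fourth letter of an SRA/ENA/DDBJ accession
# (e.g. the second "R" in "SRR") to a metadata column name.
_LETTER_TO_COLUMN = {
    "R": "run_accession",
    "X": "experiment_accession",
    "P": "study_accession",
    "S": "sample_accession",  # assumption: column not shown in the thread
}

# S = NCBI SRA, E = ENA, D = DDBJ; followed by "R" and the object type.
_ACCESSION_PATTERN = re.compile(r"^[SED]R([RXPS])\d+$")


def guess_accession_column(accession: str) -> Optional[str]:
    """Return the column an accession would be matched against, or None."""
    match = _ACCESSION_PATTERN.match(accession.strip().upper())
    if match is None:
        # Covers GSM/GSE and anything else this sketch does not recognize.
        return None
    return _LETTER_TO_COLUMN[match.group(1)]


for accession in ["SRR12169246", "SRX8684079", "SRP270837", "GSM1234567"]:
    print(accession, guess_accession_column(accession))
```

The `None` case is where the difficulty mentioned above shows up: for such inputs there is no obvious column to filter on.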

@kpj
Author

kpj commented Dec 25, 2020

Is the main issue to figure out which column to detect duplicates in/which column to select the accessions from?
In that case it might be an idea to add a parameter such as duplicate_accession_removal_column which would be run_accession when input accessions are of the form ERR4413803.

This is certainly not very elegant and maybe there are other issues making this more difficult, so I am happy either way :)
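
A sketch of what the proposed parameter could look like if it lived in a thin wrapper rather than in pysradb itself. `duplicate_accession_removal_column` is the name suggested above and does not exist in pysradb; `fetch_without_extra_rows` and `fake_fetch` are hypothetical, and `fake_fetch` only imitates the two-row result from the original report so that the example runs offline.

```python
from typing import Callable

import pandas as pd


def fetch_without_extra_rows(
    fetch: Callable[..., pd.DataFrame],
    accession: str,
    duplicate_accession_removal_column: str = "run_accession",
    **kwargs,
) -> pd.DataFrame:
    """Call fetch(accession, **kwargs) and keep only rows matching accession."""
    metadata = fetch(accession, **kwargs)
    if duplicate_accession_removal_column not in metadata.columns:
        raise KeyError(
            f"column {duplicate_accession_removal_column!r} not in metadata"
        )
    keep = metadata[duplicate_accession_removal_column] == accession
    return metadata.loc[keep].reset_index(drop=True)


def fake_fetch(accession: str, detailed: bool = False) -> pd.DataFrame:
    """Offline stand-in that returns both runs of the experiment."""
    return pd.DataFrame(
        {
            "run_accession": ["SRR12169247", "SRR12169246"],
            "study_accession": ["SRP270837", "SRP270837"],
            "experiment_accession": ["SRX8684079", "SRX8684079"],
        }
    )


result = fetch_without_extra_rows(fake_fetch, "SRR12169247", detailed=True)
print(result)  # a single row, for SRR12169247
```

With a real connection, one would presumably pass `db.sra_metadata` in place of `fake_fetch`. Keeping the column as an explicit argument leaves the choice with the caller, which sidesteps the question of what to do with inputs whose type cannot be inferred.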

@fatyang799

I ran into the same issue, and I am confused about the relationship between multiple SRR IDs within a single SRX ID. Are these SRR IDs technical replicates from a shared sequencing library?
The NCBI manual left me really confused, and I would appreciate it if you could share your understanding of this question.

@saketkc
Owner

saketkc commented Feb 22, 2023

Yes, SRRs for the same SRX are technical replicates. Here are some slides that might help: https://f1000research.com/slides/8-1183
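
The SRX-to-SRR relationship can be made visible in the metadata by grouping runs by experiment. This is only an illustrative sketch: the frame below is a stand-in containing the accessions from the original report, not a live query result.

```python
import pandas as pd

# Stand-in for a metadata frame; both runs belong to the same experiment.
metadata = pd.DataFrame(
    {
        "run_accession": ["SRR12169247", "SRR12169246"],
        "experiment_accession": ["SRX8684079", "SRX8684079"],
    }
)

# Collect the runs that share each experiment accession.
runs_per_experiment = (
    metadata.groupby("experiment_accession")["run_accession"].apply(list).to_dict()
)
print(runs_per_experiment)
```

Any experiment whose list holds more than one run is a case where querying a single SRR can bring its sibling runs along.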

@fatyang799

> Yes, SRRs for the same SRX are technical replicates. Here are some slides that might help: https://f1000research.com/slides/8-1183

Many thanks for your quick reply!!

In passing, I would like to raise another problem that I ran into while using pysradb. The metadata I fetch with pysradb metadata --detailed does not include some important information.

For example, I want to get the antibody information for a ChIP-seq experiment ([SRX027872](https://www.ncbi.nlm.nih.gov/sra/SRX027872[accn])). On the NCBI website, I can see the antibody information (in the Experiment attributes section), but there is no corresponding information in the metadata I fetch with pysradb.

@saketkc
Owner

saketkc commented Feb 22, 2023

@sheep-liu thanks for bringing it to my attention. I have pushed 7da562f, which enables fetching the experiment protocol. It will be in the next release (you can install the develop version from GitHub for now).

In the future, please create a new issue. I will close this for now, as I think the original issue is best handled downstream.

@saketkc saketkc closed this as completed Feb 22, 2023
@fatyang799

> @sheep-liu thanks for bringing it to my attention. I have pushed 7da562f, which enables fetching the experiment protocol. It will be in the next release (you can install the develop version from GitHub for now).
>
> In the future, please create a new issue. I will close this for now, as I think the original issue is best handled downstream.

Roger! And thanks a lot.
