record name cleaning #20

M-Zeeb · 2023-04-24T08:19:13Z

Hi,

thanks for the great tool.

I stumbled upon a small issue while following the instructions for obtaining viral marker genes (HIV in my case).
It seems that "clean_fasta_cdna_cds.py" does not clean the record names sufficiently: underscores ("_") in the names caused problems downstream, resulting in KeyErrors at various steps, for example when generating the references.
I may have misunderstood the instructions, but after I manually removed all underscores the problem was resolved.

Here is an example of the error:

Example name: "02495|KC156214.1_AGF30950.1_2 [02495]"
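For illustration, the manual workaround (removing all underscores from the record names) can be sketched in Python as below. This is only a minimal sketch and not the code in "clean_fasta_cdna_cds.py"; the helper names `clean_record_id` and `clean_fasta_lines` are hypothetical, and I am assuming that the ID is everything before the first space in the header.

```python
def clean_record_id(header: str) -> str:
    """Remove underscores from the ID part of a FASTA header.

    Hypothetical helper: the ID is assumed to be everything before the
    first space; the description after it is left untouched.
    """
    record_id, sep, description = header.partition(" ")
    return record_id.replace("_", "") + sep + description


def clean_fasta_lines(lines):
    """Yield FASTA lines with cleaned header lines; sequence lines pass through."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + clean_record_id(line[1:].rstrip("\n")) + "\n"
        else:
            yield line


example = "02495|KC156214.1_AGF30950.1_2 [02495]"
print(clean_record_id(example))  # 02495|KC156214.1AGF30950.12 [02495]
```

The same cleaning presumably has to be applied to both the protein and the cDNA file, so that the record names still match each other afterwards.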

Error during reference generation
(I could actually work around this by splitting on "OG" instead of "" in lines 326-328 of "OGSet.py", but then I got errors at the final merging step):

```
read2tree  --standalone_path  marker_genes/  --reference --dna_reference  all_cdna_out.fa

--- Load OGs with min 0 species from oma marker_genes - mode = marker_genes ---

Loading files for pre-filter: 100%|███████████| 9/9 [00:00<00:00, 8355.19 OGs/s]
2023-04-24 10:07:05,211 - read2tree.OGSet - INFO - 

--- Load ogs and find their corresponding DNA seq from all_cdna_out.fa ---

2023-04-24 10:07:05,211 - read2tree.OGSet - INFO - Loading all_cdna_out.fa into memory. This might take a while . . . 
Loading OGs:   0%|                                      | 0/9 [00:00<?, ? OGs/s]

Traceback (most recent call last):
  File "/Users/mz/opt/anaconda3/envs/r2t/bin/read2tree", line 16, in <module>
    main(sys.argv[1:], exe_name=exe_name(), desc=desc)
  File "/Users/mz/opt/anaconda3/envs/r2t/lib/python3.10/site-packages/read2tree/main.py", line 289, in main
    ogset = OGSet(args, oma_output=oma_output, progress=progress)  # Generate the OGs with their DNA sequences
  File "/Users/mz/opt/anaconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 79, in __init__
    self.ogs = self._load_ogs()
  File "/Users/mz/opt/anaconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 186, in _load_ogs
    ogs[name].dna = self._get_dna_records(ogs[name].aa,
  File "/Users/mz/opt/anaconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 365, in _get_dna_records
    og_cdna.append(self._get_dna_from_fasta(record, db))
  File "/Users/mz/opt/anaconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 326, in _get_dna_from_fasta
    return self._get_dna_from_REST(record) 
  File "/Users/mz/opt/anaconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 282, in _get_dna_from_REST
    seq = oma_record.json()['cdna']
KeyError: 'cdna'
```

Original files:
https://ftp.ncbi.nlm.nih.gov/genomes/genbank/viral/Human_immunodeficiency_virus_1/all_assembly_versions/GCA_003202495.1_ASM320249v1/GCA_003202495.1_ASM320249v1_translated_cds.faa.gz
https://ftp.ncbi.nlm.nih.gov/genomes/genbank/viral/Human_immunodeficiency_virus_1/all_assembly_versions/GCA_003202495.1_ASM320249v1/GCA_003202495.1_ASM320249v1_cds_from_genomic.fna.gz

sinamajidian · 2023-04-24T15:35:09Z

Dear @M-Zeeb

I've just updated the script, which you can download from here. It is separate from the read2tree package, so the update doesn't affect your read2tree installation. I tested the new version with the assembly you provided and it works. Please make sure that you remove the output from the previous run, and let me know whether it works for you. I'm sorry for the inconvenience.

Regards,
Sina

M-Zeeb · 2023-04-25T12:46:34Z

Dear Sina,

thanks for the quick response!
It works now.

Best,
Marius

sinamajidian added a commit that referenced this issue Apr 24, 2023

update clean_fasta_cdna_cds #20

e10f629

M-Zeeb closed this as completed Apr 25, 2023

sci-study mentioned this issue Jul 14, 2023

read2tree can't find corresponding CDS for each OMA group #33

Open
