Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

show read fate - headers from fastq #48

Open
handibles opened this issue Jan 22, 2020 · 17 comments
Open

show read fate - headers from fastq #48

handibles opened this issue Jan 22, 2020 · 17 comments
Assignees
Labels
enhancement New feature or request next version

Comments

@handibles
Copy link

handibles commented Jan 22, 2020

Hey devs,

Thanks for all the work on this. I'm checking the fate of the fastq reads moving through mOTUs (i.e., which specific reads are assigned, which are not etc.), so interested in the headers from the fastq input.

These headers make up the bottom 99% of the BAM/SAM file and can be obtained as below - but could this header output be set as a flag during the pipeline? Or perhaps in the wiki, outline the general format & composition of a BAM file w.r.t to the work mOTUs is doing? Could help the next person.

## get yr headers

# generate motus profile and save out the BAM file
motus profile -s test1.fastq -I test1.bam

# convert binary BAM to scanable SAM, all reads
samtools view -h test1.bam > test1.SAM

      # compare the start of the SAM file-    
      head -100 test1.SAM

      # and the end of the file - 
      tail -100 test1.SAM

# using samtools to get all MATCHED reads
samtools view -F 4 test1.bam > test1_matched.SAM

# or using samtools to get all UNMAPPED reads    -    edit :: WON'T WORK, unmapped not recorded, see below
samtools view -f 1 test1.bam > test1_lonely.SAM


@AlessioMilanese AlessioMilanese self-assigned this Jan 22, 2020
@AlessioMilanese AlessioMilanese added the enhancement New feature or request label Jan 22, 2020
@AlessioMilanese
Copy link
Member

Hi @handibles,

Thanks for your interest in mOTUs!

If I understand correctly, you are suggesting how to improve the tool/wiki.

These headers make up the bottom 99% of the BAM/SAM file

This depends on how many reads there are in the fasta file. With few reads, then yes the majority is header.

but could this header output be set as a flag during the pipeline?

It would be possible to do it, but this is a standard format (BAM or SAM). Printing only the reads would make the file invalid. It's more clean to save a correct SAM/BAM file and then visualise it with samtools (as you showed in your example).

Or perhaps in the wiki, outline the general format & composition of a BAM file w.r.t to the work mOTUs is doing? Could help the next person.

Sure, I added a comment at the end of this wiki page:
https://github.com/motu-tool/mOTUs_v2/wiki/Profile-one-sample

@handibles
Copy link
Author

handibles commented Jan 22, 2020

Ha, nice!

It would be possible to do it, but this is a standard format (BAM or SAM). Printing only the reads would make the file invalid.

Agreed, the last thing we (I) need is invalid files being output into the workflow. However, sometimes it is useful to be able to (e.g.) easily export all the reads that weren't assigned to a separate fastq etc., for later consideration etc.

Anyone who is interested in scrutinising their reads further could try web-searching for samtools or e.g. see the samtool & SAM format guide for info on how to interact with investigating where reads ended up.

As you say - suggestions. Thanks again for all the diligence.

@handibles
Copy link
Author

handibles commented Jan 23, 2020

Apologies for the dread return.

Realised after some more digging that those commands don't return the outputs I expected/described. To clarify:

  1. samtools view -f 4 test1.sam should return all reads not aligned from a given BAM/SAM file - however the mOTUs BAM/SAM file output through BWA only reports reads that successfully aligned to references. I know that I could get the unaligned by subtracting the aligned from the total reads, but is there somewhere else in the output that I could check?

  2. samtools view -F 4 test1.sam(note the capital 'F') does return all read alignments as expected, but as it shows multiple mOTU matches for each read, I'm not sure how to determine which of these matches were ultimately selected for mOTUs' final count output.

My end goal is to determine what reference each successfully aligned read was ultimately assigned to - is there a different portion of the output that I should be looking at?

H

@handibles handibles reopened this Jan 23, 2020
@AlessioMilanese
Copy link
Member

So, we parse the result from bwa. Here is a brief description of what happen in motus map_tax (it's the first step, i.e. mapping and filtering the reads) [check bin/runBWA.py]:

  1. we map reads (for and rev are map separately)
  2. we add information of the lane and type of read. For example, if a read header is @read1, then the read is changed to something like read1.lane0.single
  3. we filter reads (basically at least 75 nucleotide aligning with 97% identity, but there are some other filters)
  4. save the file

These reads are then used to understand to which marker gene cluster they belong with motus calc_mgc (second step, check bin/map_genes_to_mOTUs.py). Note that here we select the best read from the assignments done by bwa.

Moreover, if there are paired-end data, it will select the pair with the best score. Hence, not always the read with the best score is selected. If, for example, read1_for map to MGC1 with score 75 amd MGC2 with score 65; and read2_rev map to MGC1 with score 73 and to MGC2 with score 75. Then, we will select for read1 (both for and rev) to be mapping to MGC1 (score 75+73) and not MGC2 (score 65+75). But, if you look at the reads alone, then you would chose read1_rev to MGC2.

Also, we choose all reads that map correctly, and then we distribute proportionally the multiple mappers. For example, if read1 map with score 75 to MGC1 and MGC2, then we cannot decide to who to assign it just by looking at the SAM file.

@AlessioMilanese
Copy link
Member

It is not so easy to understand exactly what happen of the reads, because at certain point we abstract from the single reads and we have just read counts for MGCs, and do some counts/split on that.

That said, all the reads produce by motus map_tax or motus profile -I are used in the profiler. In 90% of the cases the best match is the one used (unless there are multiple mappers/forward and reverse decisions).

If you try to run:
motus map_tax -s test1.fq | samtools view

where test1.fq.gz is attached:

test1.fq.gz

There are 3 reads (@a, @b and @unmapped_read), where @unmapped_read does not pass the filter. The result of the previous call is:

a.lane0.single	0	metaMG0002471.COG0012	197	3	177M	*	0	0	GAAGCAGCTAATTATCCTTTTGCGACAATTGATCCAAATGTCGGGCGGGTAGAAGTTCCGGATGCTCGTTTGGATAAATTAACAGAACTCATCAAACCACAGAAGAAAGTCCCAACAACATTCGAATTTACTGACATTGCTGGAATTGTGAAAGGTGCTTCAAAAGGAGAAGGACTA	hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh	NM:i:0	MD:Z:177	AS:i:177	XS:i:172
a.lane0.single	256	metaMG0002483.COG0012	197	0	177M	*	0	0	*	*	NM:i:1	MD:Z:14C162	AS:i:172
a.lane0.single	272	metaMG0002468.COG0012	959	0	177M	*	0	0	*	*	NM:i:1	MD:Z:162G14	AS:i:172
a.lane0.single	256	refMG0036809.COG0012	182	0	177M	*	0	0	*	*	NM:i:1	MD:Z:14C162	AS:i:172
a.lane0.single	272	metaMG0002487.COG0012	959	0	177M	*	0	0	*	*	NM:i:1	MD:Z:162G14	AS:i:172
a.lane0.single	256	refMG0030819.COG0012	182	0	177M	*	0	0	*	*	NM:i:4	MD:Z:107G2T8C5G51	AS:i:157
b.lane0.single	0	metaMG0002471.COG0012	817	0	158M	*	0	0	AGCTACAGAAAATGCTGAAGTTGTTATCATTTCAGCGCGTGCAGAGGAAGAAATTTCTGAATTGGACGATGAAGATAAGTCAGAATTTCTGGAAGCTATCGGCTTGACAGAATCAGGTGTGGATAAACTGACCAGAGCAGCTTATCATCTATTGGGAC	hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh	NM:i:0	MD:Z:158	AS:i:158	XS:i:158
b.lane0.single	272	metaMG0002468.COG0012	358	0	158M	*	0	0	*	*	NM:i:0	MD:Z:158	AS:i:158
b.lane0.single	256	refMG0036809.COG0012	802	0	158M	*	0	0	*	*	NM:i:0	MD:Z:158	AS:i:158
b.lane0.single	272	metaMG0002479.COG0012	358	0	158M	*	0	0	*	*	NM:i:0	MD:Z:158	AS:i:158
b.lane0.single	256	refMG0022381.COG0012	802	0	158M	*	0	0	*	*	NM:i:1	MD:Z:150G7	AS:i:153
b.lane0.single	256	metaMG0002483.COG0012	817	0	158M	*	0	0	*	*	NM:i:1	MD:Z:150G7	AS:i:153
b.lane0.single	272	metaMG0002477.COG0012	358	0	158M	*	0	0	*	*	NM:i:1	MD:Z:122A35	AS:i:153
b.lane0.single	272	metaMG0002487.COG0012	358	0	158M	*	0	0	*	*	NM:i:2	MD:Z:106C25C25	AS:i:148
b.lane0.single	272	metaMG0002492.COG0012	358	0	158M	*	0	0	*	*	NM:i:3	MD:Z:106C23A1C25	AS:i:143

I would like to make you notice that:

  • all read in the result (there are 2 reads) pass the filter and are used to calculate the relative abundances
  • read @unmapped_read is not in the file because it doesn't pass the filter
  • @a map uniquely, in fact it has 177M (check CIGAR strings) and NM:i:0 (edit distance to the reference). All other reads have a higher value for NM (four with NM:i:1 and one with NM:i:4)
  • @b map to four genes equally well (158M and NM:i:0), so it's not sure to which it will be assigned. Maybe it's a multiple mapper, but It can even be that all genes are in the same MGC, and then it's not ambiguous.

@Cantalapiedra
Copy link

Cantalapiedra commented Jul 7, 2022

Hi @AlessioMilanese ,

Sorry to comment in an old post, but I don't see the need to ask this separately.
I am inspecting some cases in which I have many reads mapping to genes from a single MGC, but then such MGC is not counted (it is not present in the output of calc_mgc). My only guess so far is that it has something to do with multiple mappers, because among those mappings there are good quality reads, and good enough alignments (both length and identity).

If I understand correctly from some of your answers (in this one and other issues) multiple mappers are counted, distributing the counts among the different genes/MGCs. However, looking at the code, it seems that there are cases where multiple mappers are completely ignored.

map_genes_to_mOTUs.py, line 956
"#if non of the hit references have a unique mapper and more this insert should be shared by more than 3 references --> ignore it"

However, from the code and the comments alone I cannot fully understand what is happening there.

So, my question is: are there cases where multiple mapper reads are discarded?
If yes, which are those cases?

Thank you.

Best,
Carlos

@AlessioMilanese
Copy link
Member

Hi Carlos,

can you send me one or two of these reads?

Best,
alessio

@AlessioMilanese
Copy link
Member

But basically, If there are more than 3 multiple mappings and there are no unique mapper, then the multiple mappers are ignored.

The idea is that if you have a read that map equally well to E. coli, E. albertii and E. fergusonii, but you don't have any unique mapper to any of these species; then it's highly probable that it is an error.

And even if you would like to count it, you would have to split the read in three and count 1/3 for each species. So you would have really low read count for these 3 species which would be removed in further analyses.

In case you have many reads that are all more than 3 multiple mappers, but there is no single mapper for that species; then something fishy is happening.

@Cantalapiedra
Copy link

Hi Carlos,

can you send me one or two of these reads?

Best, alessio

I uploaded all the reads (actually, the SAM mappings; 74022) mapping to the genes (17) from a single MGC (ref_mOTU_v3_03673) which is not appearing after calc_mgc.

reads_example_motus_no_mgc.sam.gz

@Cantalapiedra
Copy link

Cantalapiedra commented Jul 8, 2022

But basically, If there are more than 3 multiple mappings and there are no unique mapper, then the multiple mappers are ignored.

The idea is that if you have a read that map equally well to E. coli, E. albertii and E. fergusonii, but you don't have any unique mapper to any of these species; then it's highly probable that it is an error.

And even if you would like to count it, you would have to split the read in three and count 1/3 for each species. So you would have really low read count for these 3 species which would be removed in further analyses.

In case you have many reads that are all more than 3 multiple mappers, but there is no single mapper for that species; then something fishy is happening.

Thank you very much for the explanation.

EDIT: ignore the next sentences below. I checked the same read with all the mappings (not only the uploaded ones to the single MGC), and it seems to be mapping to genes from other MGCs, so I guess it is a multiple mapper also.

However, after inspecting the mappings I uploaded I am not completely sure this is the case.

For example, the read "ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/1" seems to be mapping uniquely to "refMG0012003.COG0172". It is true that its pair is mapping with flag 272 to the same gene, so I am not sure if this is enough to classify it as multiple mapper...

If you could give me a clue of what is happening here it would be great.

Thank you @AlessioMilanese .

Best,
Carlos

@Cantalapiedra
Copy link

Hi again,

From the file I uploaded before, I extracted the list of reads, and from this list I obtained all the mappings from those reads (i.e. those mapping to genes from any MGC).

reads_example_motus_no_mgc.all_mappings.sam.gz

I counted how many times I see each read, and none of them seem to be found only once (ranging from 3 to 54 mappings). So yes, this could be the case where no unique mapping is found. Now I have to wonder why is this happening xD

Best,
Carlos

@AlessioMilanese
Copy link
Member

AlessioMilanese commented Jul 8, 2022

I'm wondering if the problem is another one. I create a small fastq file (paired end) with the read you mentioned (ST-K00119:198:HHCNHBBXY:7:1101:13017:33897). And I run mOTUs with:

motus profile -f test1.fq -r test2.fq -I test.bam -M test.MGC -o test.motus

where I save the genes (-I), the MGCs (-M) and the mOTUs (-o).

I see then that it is counted as a multiple mapper (from the motus log):

   Total number of inserts: 1
     - Unique mappers: 0
     - Multiple mappers: 1
     - Ignored multiple mapper without unique hit: 0

Now if we look at what genes it maps to (samtools view test.bam):

read1.2.lane0/1	0	refMG0012003.COG0172	451	0	150M	*	0	0	AACCGAAAACGTTGAAGTCCGCCGCTGGGGCACACCTCGCGAATTTGATTTTGAAATCAAGGATCACGTCGATGTCGGCGGTCCCTTAGGCCTCGACTTTGACGTCGGCGCTAAGCTTGCCGGAACACGCTTCTGCTTCATGAAAGGCCA	AAFFFJJJJFJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJAFJJJAJJJJAJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJ-<AFJJJFJJJJJJJJJJJJJJJAFJF	NM:i:0	MD:Z:150	AS:i:150	XS:i:150
read1.2.lane0/1	256	metaMG0004572.COG0172	451	0	150M	*	0	0	*	*	NM:i:0	MD:Z:150	AS:i:150
read1.2.lane0/1	256	metaMG0004633.COG0172	451	0	150M	*	0	0	*	*	NM:i:2	MD:Z:84G2G62	AS:i:140
read1.2.lane0/1	272	metaMG0004593.COG0172	903	0	150M	*	0	0	*	*	NM:i:0	MD:Z:150	AS:i:150
read1.2.lane0/2	0	metaMG0004633.COG0172	683	0	149M	*	0	0	CCGTACATTGCTAATCAGGAAACCCTGATCGGTACCGGCCAGCTGCCTAAATTCGAAGAAGACTTGTTTGCAGCTAAGAAGGGCGGCCAGGAAGGCGAAGGCGAAATCTTCTACCTCATCCCGACTTCTGAAGTTACGCTCACAAACTC	JFJJJFJJFAAFF<FJFF<AF<FFJJA-7JJJJJJFJ<<FAA-JJJJJJJJJFJJJJJJJJJJJJJJFJJJJJJJJJJJJ<JJJJJJJJJJJJJJJFJJJJJJFFJAJ7FJFJFJJJJFJJJJFJFJFFJJJJJJAJAFJJJJJFFA-A	NM:i:0	MD:Z:149	AS:i:149	XS:i:149
read1.2.lane0/2	256	GUT_GENOME022162.COG0172	683	0	149M	*	0	0	*	*	NM:i:2	MD:Z:44T62T41	AS:i:139
read1.2.lane0/2	256	GUT_GENOME025181.COG0172	683	0	149M	*	0	0	*	*	NM:i:3	MD:Z:44T62T14A26	AS:i:134
read1.2.lane0/2	256	GUT_GENOME246333.COG0172	683	0	149M	*	0	0	*	*	NM:i:3	MD:Z:44T62T14A26	AS:i:134
read1.2.lane0/2	256	metaMG0004572.COG0172	683	0	149M	*	0	0	*	*	NM:i:0	MD:Z:149	AS:i:149
read1.2.lane0/2	256	refMG0012003.COG0172	683	0	149M	*	0	0	*	*	NM:i:0	MD:Z:149	AS:i:149
read1.2.lane0/2	272	metaMG0004577.COG0172	672	0	149M	*	0	0	*	*	NM:i:2	MD:Z:41A62A44	AS:i:139
read1.2.lane0/2	272	metaMG0004593.COG0172	672	0	149M	*	0	0	*	*	NM:i:0	MD:Z:149	AS:i:149

If I check test.MGC I see:

COG0172.mOTU.v2.0002266     0.5000000000
COG0172.mOTU.v2b.0000252    0.5000000000

It reports the MGC because even if there are not unique mappers (I put only one read), it maps to only 2 MGCs (if it would be more than 3 then it would not be reported).

But then there are no mOTUs reported:

   Warning: the relative abundance is 0 for all the mOTUs

     Number of ref-mOTUs:  0
     Number of meta-mOTUs: 0
     Number of ext-mOTUs:  0

Because there is only one MGC detected. It needs at least 3 MGCs from one mOTUs (default -g 3) to report a mOTUs.

So changing the command to -g 1 would make this mOTU appear. Call:

motus profile -f test1.fq -r test2.fq -g 1 -o test.motu

And we can check with cat test.motu | grep -v "0.0000000000":

Acinetobacter sp. [ref_mOTU_v3_03673]	0.5000000000
Burkholderiales species incertae sedis [meta_mOTU_v3_12785]	0.5000000000

@Cantalapiedra
Copy link

Hi @AlessioMilanese ,

Thank you very much for testing this.

When I run it with all the reads it is not reported neither with -g 1 or -g 3 (obviously, since no MGC is reported after all).

So I guess (let me know if I am wrong), that what you suggest is that the problem is this one "It reports the MGC because even if there are not unique mappers (I put only one read), it maps to only 2 MGCs (if it would be more than 3 then it would not be reported)." So the problem is that I have multiple mapping reads to genes without unique mappers, and that those genes belong to more than 2 MGCs.

If this is the case, I guess I am just losing this information, because these reads are ambiguous (map to conserved regions of the genes?), and they are impossible to assign with certain confidence. I know this is difficult to address (maybe impossible), but in this case it is not only one or more Species which are not detected, but an Order which is not present when these mappings are discarded.

I wonder whether these mappings should/could be assigned to the Order (assuming that all the mapped genes belong to such Order), even when no Species is assigned.

Thank you very much once more, for your time and patience.

Best,
Carlos

@Cantalapiedra
Copy link

Cantalapiedra commented Jul 8, 2022

I checked whether the reads map to genes only from a single order. To do it fine I guess I should keep only the best mappings (discard those with lower identity, etc), but to make things easier I included all the hits. I obtained 2 orders though (Burkholderiales and Pseudomonadales, both from class Betaproteobacteria if I not mistaken).

I was wondering also now... if I had a sample with less sequencing depth, then in some cases (like this example we are discussing) I would detect things just because there would be no enough depth (just by chance) to have reads mapping to more than 2 MGCs?

edit: sorry, I am thinking now the last paragraph makes no sense, because we are considering individual reads here
edit2: sorry again jaja but if they were individual reads, then why when you use a single read the MGC counts are reported, but when I use all the reads they are not. Then it must be that all reads together are considered for the MGC counts?

@Cantalapiedra
Copy link

Cantalapiedra commented Jul 8, 2022

I checked my mappings for the same read, and I have the same results as you, so I wonder why I don't get any MGC count nor detect mOTU ref_mOTU_v3_03673 (meta_mOTU_v3_12785 is detected though):

ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/1      0       refMG0012003.COG0172    451     0       150M    *       0       0       AACCGAAAACGTTGAAGTCCGCCGCTGGGGCACACCTCGCGAATTTGATTTTGAAATCAAGGATCACGTCGATGTCGGCGGTCCCTTAGGCCTCGACTTTGACGTCGGCGCTAAGCTTGCCGGAACACGCTTCTGCTTCATGAAAGGCCA    AAFFFJJJJFJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJAFJJJAJJJJAJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJ-<AFJJJFJJJJJJJJJJJJJJJAFJF  NM:i:0  MD:Z:150        AS:i:150        XS:i:150
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/1      256     metaMG0004572.COG0172   451     0       150M    *       0       0       *       *       NM:i:0  MD:Z:150        AS:i:150
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/1      256     metaMG0004633.COG0172   451     0       150M    *       0       0       *       *       NM:i:2  MD:Z:84G2G62    AS:i:140
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/1      272     metaMG0004593.COG0172   903     0       150M    *       0       0       *       *       NM:i:0  MD:Z:150        AS:i:150
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/2      16      metaMG0004572.COG0172   683     0       149M    *       0       0       CCGTACATTGCTAATCAGGAAACCCTGATCGGTACCGGCCAGCTGCCTAAATTCGAAGAAGACTTGTTTGCAGCTAAGAAGGGCGGCCAGGAAGGCGAAGGCGAAATCTTCTACCTCATCCCGACTTCTGAAGTTACGCTCACAAACTC     JFJJJFJJFAAFF<FJFF<AF<FFJJA-7JJJJJJFJ<<FAA-JJJJJJJJJFJJJJJJJJJJJJJJFJJJJJJJJJJJJ<JJJJJJJJJJJJJJJFJJJJJJFFJAJ7FJFJFJJJJFJJJJFJFJFFJJJJJJAJAFJJJJJFFA-A   NM:i:0  MD:Z:149        AS:i:149        XS:i:149
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/2      256     metaMG0004577.COG0172   672     0       149M    *       0       0       *       *       NM:i:2  MD:Z:41A62A44   AS:i:139
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/2      256     metaMG0004593.COG0172   672     0       149M    *       0       0       *       *       NM:i:0  MD:Z:149        AS:i:149
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/2      272     GUT_GENOME022162.COG0172        683     0       149M    *       0       0       *       *       NM:i:2  MD:Z:44T62T41   AS:i:139
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/2      272     GUT_GENOME025181.COG0172        683     0       149M    *       0       0       *       *       NM:i:3  MD:Z:44T62T14A26        AS:i:134
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/2      272     GUT_GENOME246333.COG0172        683     0       149M    *       0       0       *       *       NM:i:3  MD:Z:44T62T14A26        AS:i:134
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/2      272     metaMG0004633.COG0172   683     0       149M    *       0       0       *       *       NM:i:0  MD:Z:149        AS:i:149
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/2      272     refMG0012003.COG0172    683     0       149M    *       0       0       *       *       NM:i:0  MD:Z:149        AS:i:149

@AlessioMilanese
Copy link
Member

If I understand correctly, the MGC table is empty?
So if you run motus profile -I test.MGC [...], then test.MGC is empty?

It seems there is a problem in the way the code is running. Can you try to run motus profile --test; the result should be:

 ------------------------------------------------------------------------------
|                               TEST mOTU profiler                             |
 ------------------------------------------------------------------------------

1-- run setup.py: done

2-- Tools and versions:
- python:    correct
- bwa:       correct
- samtools:  correct
- metaSNV:   correct

3-- Taxonomy profiling test:
- Run motus (-v 1, only error messages):
- end motus call

Check resulting profile: correct

@Cantalapiedra
Copy link

Cantalapiedra commented Jul 8, 2022

Hi @AlessioMilanese ,

No, the MGC file is not empty. It is just missing the results that you showed for the single read you tested.
The result of motus profile --test is exactly as the one you showed.

Actually, I ran again motus calc_mgc -i SAM -o COUNTS using as SAM the file I uploaded above (which includes all mappings of the reads which map to genes from mOTU ref_mOTU_v3_03673), and these are the results:

# map_tax unknown | gene database: unknown | 100 | calc_mgc 3.0.1 -y insert.scaled_counts -l 75
unnamed sample
COG0495.mOTU.v2.0001725 2496.6341950236147
COG0215.mOTU.v2.0001782 2818.302496956346
COG0016.mOTU.v2.0001928 2486.4341914358342
COG0533.mOTU.v2.0001812 1470.2779572060317
COG0012.mOTU.v2.0002033 2134.7448621374724
COG0018.mOTU.v2.0001793 5478.124453140666
COG0541.mOTU.v2.0002154 2306.6405502276
COG0018.ext_mOTU_v3_003165      678.7739440996514
COG0172.mOTU.v2.0002266 1273.9701453371777
COG0552.mOTU.v2.0001822 1205.097204435369

If I did everything right, none of those MGCs are from mOTU ref_mOTU_v3_03673, which includes the next MGCs:

COG0012.mOTU.v2b.0000253
COG0016.mOTU.v2b.0000239
COG0018.mOTU.v2b.0000243
COG0172.mOTU.v2b.0000252
COG0215.mOTU.v2b.0003523
COG0495.mOTU.v2b.0000246
COG0533.mOTU.v2b.0000247
COG0541.mOTU.v2b.0000300
COG0552.mOTU.v2b.0000348

When I run, as you did, only the mappings from the single read above (ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0), I get the next results

# map_tax unknown | gene database: unknown | 100 | calc_mgc 3.0.1 -y insert.scaled_counts -l 75
unnamed sample
COG0172.mOTU.v2.0002266 0.5
COG0172.mOTU.v2b.0000252        0.5

COG0172.mOTU.v2b.0000252 belongs to mOTU ref_mOTU_v3_03673. Why this result is only present when I use a single read, and not when the same read mappings are along with mappings from other reads?

Maybe I am doing some error which is leading to this confusion, I don't know.

Thank you once more.

Best,
Carlos

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request next version
Projects
None yet
Development

No branches or pull requests

4 participants