Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cdr3 nucleotide sequences #73

Merged
merged 9 commits into from
Aug 31, 2018
Merged

Add cdr3 nucleotide sequences #73

merged 9 commits into from
Aug 31, 2018

Conversation

dcroote
Copy link
Contributor

@dcroote dcroote commented Aug 1, 2018

Fixes #46 . This PR adds the the CDR3 nucleotide sequence reported by IgBLAST to tracer assemble and summarise outputs. The following walks through changes I've made using tracer test as an example. I've also tested the updated Docker image based on this branch for functionality using tracer test, some of my cells, and an empty cell.

I've updated the Dockerfile to download and unzip the necessary optional_file IgBLAST directory to where the IGDATA environmental variable is set in docker_helper_files/docker_wrapper.sh. IgBLAST will then report the sequence as a separate chunk within the -outfmt 7 output. For example:

(From cell1/IgBLAST_output/cell1_TCR_A.IgBLASTOut)

# Sub-region sequence details (nucleotide sequence, translation, start, end)
CDR3    GCTGTGAGGGGGAAGGAGAGGCAATACCGGAAAACTCATC        AVRGKERQYRKTH   280     319

This is then parsed in the process_chunk function of tracer_func.py and included in a new Recombinant class attribute cdr3nt. Finally, the sequence is reported in the (un)filtered_TCR_seqs/(un)filtered_TCRs.txt outputs as a new line for each recombinant:

------------------
cell1
------------------
TCR_A recombinants: 1/2
TCR_B recombinants: 1/2


#TCR_A#
##TRINITY_DN0_c0_g1_i3##
V segment:      TRAV4-2*02
J segment:      TRAJ43*01
ID:     TRAV4-2_TTGAGAATAA_TRAJ43
TPM:    187511.0
Productive:     True
Stop codon:     False
In frame:       True
CDR3nt: GCTGTTGAGAATAACAACAATGCCCCACGA

The CDR3 is also reported in the summarise output recombinants.txt:

cell_name       locus   recombinant_id  productive      reconstructed_length    CDR3nt
cell1   A       TRAV4-2_TTGAGAATAA_TRAJ43       True    325     GCTGTTGAGAATAACAACAATGCCCCACGA
cell1   A       TRAV9D-4_GTGAGGGGGAAGGAGAGGCA_TRAJ37    False   347     GCTGTGAGGGGGAAGGAGAGGCAATACCGGAAAACTCATC
cell1   B       TRBV4_AGCTACAACTCCT_TRBJ2-7     True    334     GCCAGCAGCTACAACTCCTATGAACAGTAC
cell1   B       TRBV12-1_GCTCTACAACAGGGGGGGCACCG_TRBJ2-2        False   100     GCCAGCTCTACAACAGGGGGGGCACCGGGCAGCTCTAC

cell2   A       TRAV9D-4_GTGAGGGGGAAGGAGAGGCA_TRAJ37    False   347     N/A
cell2   A       TRAV4-2_TTGAGAATAA_TRAJ43       True    325     N/A
cell2   B       TRBV12-1_GCTCTACAACAGGGGGGG(C)ACCGG_TRBJ2-2     False   100     N/A
cell2   B       TRBV4_AGCTACAACTCCT_TRBJ2-7     True    334     N/A

cell3   A       TRAV7-5_TGAGCGACACC_TRAJ27      True    334     N/A
cell3   A       TRAV3-3_CAGTGGGGGAACTA_TRAJ26   False   339     N/A
cell3   B       TRBV31_AGTCTTGACACAAGA_TRBJ2-5  False   335     N/A
cell3   B       TRBV31_TGGAGCCCCGGGACAGGGCTCAACC_TRBJ1-5        True    343     N/A

Note that the code is backwards compatible: cell2 and cell3 in the test_data folder come pre-processed and consequently have N/A as the CDR3 sequence.

@mstubb
Copy link
Member

mstubb commented Aug 2, 2018

Thanks @dcroote! Couple of questions:

  1. Do we need to change the README to mandate downloading of the IgBLAST optional file for people who aren't using the Docker container?
  2. TraCeR doesn't use the IgBLAST databases that are distributed with IgBLAST. Instead, we built our own (albeit based on IMGT sequences) and distributed them in resources. One reason that we didn't include the optional_file in the original distribution is that I didn't take the time to confirm that the databases we built were compatible with the data in the optional_file. Are you able to confirm that this is the case?
  3. Are there any implications for people who build their own custom resource files (using tracer build) for which there is no data in the optional_file?

Thanks again. Happy to discuss any of these points.

Mike

@dcroote
Copy link
Contributor Author

dcroote commented Aug 2, 2018

Thanks for taking a look @mstubb ! Here are the answers:

  1. Yes good catch, I've added a commit to this PR updating the README
  2. I've looked into the TCR_J.fa files in the igblast_dbs folder and because the optional_file germline J gene start / CDR3 / frame sequence information uses the same IMGT sequence names as tracer, IgBLAST functions correctly.
  3. The impact should be minimal. I've confirmed that if the entire optional_file is missing, for example, tracer still runs successfully and the only difference is that the IgBLAST output lacks the CDR3 region chunk and consequently the tracer CDR3nt results are N/A. I've confirmed the same happens if, for whatever reason, the optional file directory is present but lacks the specific J genes used in the search database.

@mstubb
Copy link
Member

mstubb commented Aug 2, 2018

Looks great, thanks. One more thing has occurred to me.

At the moment, when the Recombinant class initialises it tries to find the CDR3 AA sequence (using _get_cdr3) by translating the DNA sequence and then using a fairly simplistic way of looking for the canonical motif. If it finds something, it stores it in self.cdr3. This is old vestigial code that ended up not being used. I think now would be a great time to throw that out and to just take the IgBLAST-reported CDR3 AA sequence and store that as self.cdr3. It would also be great to report that alongside the CDR3 DNA sequence in the outputs. Would you have time to implement that as well?

@dcroote
Copy link
Contributor Author

dcroote commented Aug 2, 2018

Sure! I'll have the changes done within the next couple days.

@dcroote
Copy link
Contributor Author

dcroote commented Aug 7, 2018

Should be all set! The amino acid CDR3 sequences are reported in the (un)filtered_TCR_seqs/(un)filtered_TCRs.txt outputs, e.g.:

#TCR_A#
##TRINITY_DN0_c0_g1_i3##
V segment:      TRAV4-2*02
J segment:      TRAJ43*01
ID:     TRAV4-2_TTGAGAATAA_TRAJ43
TPM:    187511.0
Productive:     True
Stop codon:     False
In frame:       True
CDR3aa: AVENNNNAPR
CDR3nt: GCTGTTGAGAATAACAACAATGCCCCACGA

as well as the recombinants.txt output, e.g.:

cell_name       locus   recombinant_id  productive      reconstructed_length    CDR3aa  CDR3nt
cell1   A       TRAV4-2_TTGAGAATAA_TRAJ43       True    325     AVENNNNAPR      GCTGTTGAGAATAACAACAATGCCCCACGA
cell1   A       TRAV9D-4_GTGAGGGGGAAGGAGAGGCA_TRAJ37    False   347     AVRGKERQYRKTH   GCTGTGAGGGGGAAGGAGAGGCAATACCGGAAAACTCATC
cell1   B       TRBV12-1_GCTCTACAACAGGGGGGGCACCG_TRBJ2-2        False   100     ASSTTGGAPGSST   GCCAGCTCTACAACAGGGGGGGCACCGGGCAGCTCTAC
cell1   B       TRBV4_AGCTACAACTCCT_TRBJ2-7     True    334     ASSYNSYEQY      GCCAGCAGCTACAACTCCTATGAACAGTAC

@mstubb
Copy link
Member

mstubb commented Aug 7, 2018

Cool, thanks! This looks great. I have a couple more thoughts / questions:

  1. Please can you update the test_data/expected_summary directory to contain the new output where cell1 gets CDR3 sequences. I've opened a new issue ( Regenerate test output with new CDR3 code #74) to remind me to do regenerate the test output for cell2 and cell3.

  2. What do you think we should do for cases where the sequence is non-productive (frame shift after the CDR3) but where a CDR3 can be found and translated. I'd like to do something to emphasise this in the output files because otherwise it might be a little confusing for people at first glance. I think the options are:

  • suppress reporting of the sequence (maybe unnecessary loss of data?)
  • somehow write the sequence differently. Maybe in lower-case and surrounded by brackets? eg [avrgkerqyrkth] to indicate it's different?

I'd be interested to hear your thoughts.

@dcroote
Copy link
Contributor Author

dcroote commented Aug 8, 2018

  1. Updated!
  2. That is an interesting case. I personally would leave the output as is because I am guessing most users are already filtering their results based on productive sequences, which they can do in this case using the productive column of the summary recombinants.txt data table. If you would like to add emphasis I would vote for lowercase over removing the sequence.

@lincoln-harris
Copy link

Hey Derek I've been testing this out on a set of ~150 cells, using your updated Docker image. The output looks exactly the same as the standard Tracer output, ie. I'm missing CDR3aa & CDR3nt columns in recombinants.txt. Is there an argument I have to pass to tracer assemble to tell it to output CDR3s?

@dcroote
Copy link
Contributor Author

dcroote commented Aug 8, 2018

Hmm. It is working on my end. @lincoln-harris did you pull the add-cdr3 build and run assemble / summarise with it? Running the following:

docker pull dcroote/tracer:add-cdr3

docker run -it -v $PWD:/scratch dcroote/tracer:add-cdr3 assemble -s Hsap /scratch/R1.fastq.gz /scratch/R2.fastq.gz testcell1 /scratch/cells

docker run -it -v $PWD:/scratch dcroote/tracer:add-cdr3 summarise -s Hsap /scratch/cells

I get:

cell_name       locus   recombinant_id  productive      reconstructed_length    CDR3aa  CDR3nt
testcell1       A       TRAV39_CCGTGTATACCGGAGG_TRAJ53  True    337     AVYTGGSNYKLT    GCCGTGTATACCGGAGGTAGCAACTATAAACTGACA
testcell1       B       TRBV5-1_ATCGGGAGGAGGGTACAA_TRBJ2-1      True    340     ASRSGGGYNEQF    GCCAGCAGATCGGGAGGAGGGTACAATGAGCAGTTC

@lincoln-harris
Copy link

lincoln-harris commented Aug 9, 2018

ah ok. I just did docker pull dcroote/tracer . Ill try it again.
UPDATE: working perfectly now. Thanks Derek!

@mstubb
Copy link
Member

mstubb commented Aug 14, 2018

Hi @dcroote,

I’m now on vacation so I’ll look at this properly when I’m back. But, i think lowercase for non-productive sequences would be great to help avoid any cognitive dissonance.

Thanks!

Mike

@dcroote
Copy link
Contributor Author

dcroote commented Aug 22, 2018

Sounds good- non-productive CDR3 sequences are now lowercase.

For (un)filtered_TCR_seqs/(un)filtered_TCRs.txt:

ID:     TRAV9D-4_GTGAGGGGGAAGGAGAGGCA_TRAJ37
TPM:    87776.7
Productive:     False
Stop codon:     False
In frame:       False
CDR3aa: avrgkerqyrkth
CDR3nt: gctgtgagggggaaggagaggcaataccggaaaactcatc

...

ID:     TRAV4-2_TTGAGAATAA_TRAJ43
TPM:    187511.0
Productive:     True
Stop codon:     False
In frame:       True
CDR3aa: AVENNNNAPR
CDR3nt: GCTGTTGAGAATAACAACAATGCCCCACGA

and for recombinants.txt:

cell_name       locus   recombinant_id  productive      reconstructed_length    CDR3aa  CDR3nt
cell1   A       TRAV9D-4_GTGAGGGGGAAGGAGAGGCA_TRAJ37    False   347     avrgkerqyrkth   gctgtgagggggaaggagaggcaataccggaaaactcatc
cell1   A       TRAV4-2_TTGAGAATAA_TRAJ43       True    325     AVENNNNAPR      GCTGTTGAGAATAACAACAATGCCCCACGA

@mstubb
Copy link
Member

mstubb commented Aug 31, 2018

Great thanks! Two more small things then I think we're ready to merge.

  1. Please update the README (this section https://github.com/Teichlab/tracer#output-1) to explain the new output format.

  2. Please check that the 'expected_summary` contents are consistent with the lowercase output.

Mike

@dcroote
Copy link
Contributor Author

dcroote commented Aug 31, 2018

Thanks, both updated!

@mstubb mstubb merged commit 6bc0182 into Teichlab:master Aug 31, 2018
@mstubb
Copy link
Member

mstubb commented Aug 31, 2018

Thanks @dcroote! This is awesome :)

@dcroote
Copy link
Contributor Author

dcroote commented Sep 1, 2018

My pleasure! Looks like the updates still need to be pushed to Docker Hub though.

@mstubb
Copy link
Member

mstubb commented Sep 1, 2018

Thanks for the reminder. Repo updates are supposed to trigger new docker builds automatically but we've never got it to work. I've manually started a build so it should be updated on Dockerhub soon.

@dcroote dcroote deleted the add-cdr3 branch August 30, 2020 18:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants