Skip to content

rfci md5 branch notes

Eric Nawrocki edited this page Jun 21, 2019 · 27 revisions

'Independent seeds': in rfci-md5 branch, seed sequences do not need to be in the Rfamseq-Genomes sequence database

Seed sequences S must have name in <name>/<start>-<end> format and satisfy one or more of:

  1. <name> exists in Rfamseq-Genomes and md5 matches.
  2. <name> exists as valid accession.version in GenBank and md5 matches.
  3. <name> exists as valid URS_taxid in RNAcentral and md5 matches.
  • where 'md5 matches' means md5 of sequence S is identical to md5 of subsequence of <name> from <start> to <end> from relevant DB. This is checked by fetching subsequence from DB and calculating md5, and comparing to calculated md5 of unaligned S from SEED file. Further for S RNAcentral matches, <start> must be 1 and <end> must be L where L is the length of the sequence (TODO: check that this is enforced in the code)

Changes to Perl code required to get info on seed sequences into rfmake output files (outlist, species, taxinfo etc.) files, so curators see seed seq info alongside Rfamseq-Genomes hits:

  1. SEEDTBLOUT and SEEDSCORES added to SVN files
  2. rfsearch.pl now searches fasta-ized version of SEED in addition to Rfamseq-Genomes and reversed db
  3. rfmake.pl now outputs SEEDTBLOUT and SEEDSCORES, containting seed seq taxonomic info from RfamLive rfamseq and taxonomy tables if S exists there, or fetched from GenBank DB and NCBI taxonomy DB if not. Bio::Rfam::FamilyIO::fetch_seed_sequence_info())

Changes to RfamLive updates by rfci.pl for family F to maintain information on seed sequences S with tax id (ncbi_id) T not in Rfamseq-Genomes:

  1. seed_regions table is updated by removing all entries for rfam_acc F and then adding them back (search->delete() then populate())
  2. rfamseq table is updated by adding new S with <name> not yet in table (populate()); S already in rfamseq are not modified
  3. taxonomy table is updated by adding new T(populate()); T already in taxonomy are not modified. *Importantly, T here is the tax id fetched from the GenBank xml for S (db_xref/taxon:<T>), and it is possible that when information for T is fetched from NCBI taxonomy that the db entry for T2 != T is fetched - this will occur if T has been merged with T2 in the NCBI taxonomy database (example as of writing: T=9478 and T2=1868482)
  4. rnacentral_matches is always updated for all seed seqs (update_or_create()) by checking via md5 if an exact match exists in RNAcentral

Updating families for 'independent seeds' approach:

'Independent seeds' requires addition of the SEEDTBLOUT and SEEDSCORES files to SVN, which are created by running rfsearch.pl and then rfmake.pl. Steps for doing that for each family:

  1. rfco.pl checkout family from SVN without SEEDSCORES and SEEDTBLOUT; this requires using a temporary subroutine: Bio::Rfam::FamilyIO::loadRfamFromLocalFile_preSEED to be called by rfco.pl so it doesn't try to check-out SEEDSCORES and SEEDTBLOUT which do not yet exist.
  2. run rfsearch.pl to get new TBLOUT, REVTBLOUT and SEEDTBLOUT files
  3. run rfmake.pl (using current GA??) to create SCORES and SEEDSCORES
  4. rfci.pl commit family (including SEEDSCORES and SEEDTBLOUT) and have it update RfamLive tables seed_region, taxonomy, rfamseq, and rnacentral_matches as above. Verify that this updates tables in RfamLive correctly.

After updating all families:

  1. Remove Bio::Rfam::FamilyIO::loadRfamFromLocalFile_preSEED subroutine and replace calls to it with calls to loadRfamFromLocalFile.
  2. Checkout some test families and
    • Verify rfco.pl checks out SEEDSCORES and SEEDTBLOUT
    • Verify rfmake.pl makes seedoutlist and seedspecies without needing to run rfsearch.pl first

Notes/Questions

  1. Do we use current GA when rerunning rfmake.pl for all families?
  2. I think it makes sense to test it all out by updating a few families all the way, before embarking on all families. This will require using the loadRfamFromLocalFile_preSEED for those test families, and then removing it to test rfco.pl after the update.
  3. At what point do we merge the rfci-md5 branch with master?
  4. How do we make sure RfamLive is not screwed up? At what point do I stop testing on Rfam-dev?

TODO list from 5/28/19 slack call

  • step 1: Eric runs >=1 existing family rfco/rfsearch/rfmake/rfci on rfci-md5 branch and his environment
  • step 2: Anton runs >=1 existing family on Anton and Ioanna's environment with rfci-seed branch and also tries to build a family (IS THIS STEP NECESSARY?)
  • step 3: Ioanna merge rfci-md5 branch into master (also possibly merges rfam-cloud)
  • step 4: Eric runs >= 1 existing family rfco/rfsearch/rfmake/rfci with new master
  • step 5: Anton runs >= 1 existing and >=1 new family rfco/rfsearch/rfmake/rfci with new master
  • step 6: Ioanna runs all existing families rfco/rfsearch/rfmake/rfci

Instructions for updating a family (started 06/02/19):

  • make sure you are using rfci-md5 branch (which rfco.pl should return path to local copy of rfci-md5 branch)
  • Eric's local md5 branch is here: /homes/nawrocki/src/rfam-family-pipeline (it's possible I have some local files that have not been added to the git repo that you may need to copy over)
  • make sure your RFAM_CONFIG environment variable points to a copy with SEEDTBLOUT and SEEDSCORES added to files->family_file list (Eric's rfam.conf and rfam_local.conf he used to successfully check in families are in this dir: /nfs/production/xfam/users/nawrocki/notebook/19_0318_rfam_seedoutlist/)
  • Choose a family RFXXXXX to update
  • checkout the family and move into that directory: rfco.pl RFXXXXX; cd RFXXXXX;
  • connect to RfamLive and run the following shell script (/nfs/production/xfam/users/nawrocki/notebook/19_0318_rfam_seedoutlist/testing/20190529/run-mysql.sh) that executes 4 MySQL queries with a command like sh run-mysql.sh RFXXXXX (TODO: improve these queries):
#!/bin/bash
if [ $# -eq 0 ]; then
    echo "usage: run-mysql.sh <rfam-accession>"
    exit 1
fi
# seed_region query 1
mysql --user rfamro --host mysql-rfam-live.ebi.ac.uk --port 4445 --database rfam_live -B -e "select * from seed_region where rfam_acc = \"$1\";" > $1.sr.mysql.txt
# rfamseq query 1
mysql --user rfamro --host mysql-rfam-live.ebi.ac.uk --port 4445 --database rfam_live -B -e "select s.rfam_acc, s.rfamseq_acc, r.ncbi_id  from seed_region s join rfamseq r on r.rfamseq_acc = s.rfamseq_acc where s.rfam_acc = \"$1\";" > $1.rs.mysql.txt
# taxonomy query 1
mysql --user rfamro --host mysql-rfam-live.ebi.ac.uk --port 4445 --database rfam_live -B -e "select s.rfam_acc, s.rfamseq_acc, t.ncbi_id, t.align_display_name from seed_region s join rfamseq r on r.rfamseq_acc = s.rfamseq_acc join taxonomy t on t.ncbi_id = r.ncbi_id where s.rfam_acc = \"$1\";" > $1.tx.mysql.txt
# rnacentral_matches query 1
mysql --user rfamro --host mysql-rfam-live.ebi.ac.uk --port 4445 --database rfam_live -B -e "select s.rfam_acc, c.seq_start, c.seq_end, c.type, r.rfamseq_acc, r.ncbi_id, c.rnacentral_id from seed_region s join rfamseq r on r.rfamseq_acc = s.rfamseq_acc join rnacentral_matches c on c.rfamseq_acc = r.rfamseq_acc where s.rfam_acc = \"$1\"" > $1.rc.mysql.txt

Run rfsearch/rfmake/svn-add/rfci from within the checked out family dir:

  $ rfsearch.pl > rfsearch.out
  # or 
  $ bsub -q research-rh7 -J RFXXXXX -o /dev/null -e RFXXXXX.err "rfsearch.pl > rfsearch.out"
  # after rfsearch.pl finishes, run enforce current GA threshold in DESC with rfmake.pl
  $ rfmake.pl
  # to enforce current threshold in `DESC`
  $ svn add SEEDTBLOUT SEEDSCORES
  $ cd ..; rfci.pl -m "Updates family to independent seed paradigm" -preseed RFXXXXX
  # The -preseed option makes rfci.pl use a temporary subroutine 
  # (FamilyIO::loadRfamFromSVN_preSEED() which calls 
  # FamilyIO::loadRfamFromLocalFile_preSEED) that does not expect 
  # SEEDTBLOUT and SEEDSCORES to be in the SVN repo yet (which they
  # won't be since we are about to add them with this call to rfci.pl)
  #
  # Run RfamLive queries from above and make sure output matches/differs as expected.

Changes to make to code after all families have been updated to 'independent seed' paradigm

  1. Remove FamilyIO::loadRfamFromSVN_preSEED() and FamilyIO::loadRfamFromLocalFile_preSEED subroutines and calls (in rfci.pl and jiffies/update_seed_region_for_family.pl
  2. THIS LIST MAY NOT BE COMPLETE