-
Notifications
You must be signed in to change notification settings - Fork 2
rfci md5 branch notes
Eric Nawrocki edited this page Jun 21, 2019
·
27 revisions
'Independent seeds': in rfci-md5 branch, seed sequences do not need to be in the Rfamseq-Genomes sequence database
-
<name>
exists in Rfamseq-Genomes and md5 matches. -
<name>
exists as validaccession.version
in GenBank and md5 matches. -
<name>
exists as validURS_taxid
in RNAcentral and md5 matches.
- where 'md5 matches' means md5 of sequence
S
is identical to md5 of subsequence of<name>
from<start>
to<end>
from relevant DB. This is checked by fetching subsequence from DB and calculating md5, and comparing to calculated md5 of unalignedS
fromSEED
file. Further forS
RNAcentral matches,<start>
must be1
and<end>
must beL
whereL
is the length of the sequence (TODO: check that this is enforced in the code)
Changes to Perl code required to get info on seed sequences into rfmake output files (outlist
, species
, taxinfo
etc.) files, so curators see seed seq info alongside Rfamseq-Genomes hits:
-
SEEDTBLOUT
andSEEDSCORES
added to SVN files -
rfsearch.pl
now searches fasta-ized version ofSEED
in addition to Rfamseq-Genomes and reversed db -
rfmake.pl
now outputsSEEDTBLOUT
andSEEDSCORES
, containting seed seq taxonomic info from RfamLiverfamseq
andtaxonomy
tables ifS
exists there, or fetched from GenBank DB and NCBI taxonomy DB if not.Bio::Rfam::FamilyIO::fetch_seed_sequence_info()
)
Changes to RfamLive updates by rfci.pl
for family F
to maintain information on seed sequences S
with tax id (ncbi_id
) T
not in Rfamseq-Genomes:
-
seed_regions
table is updated by removing all entries forrfam_acc F
and then adding them back (search->delete()
thenpopulate()
) -
rfamseq
table is updated by adding newS
with<name>
not yet in table (populate()
);S
already inrfamseq
are not modified -
taxonomy
table is updated by adding newT
(populate()
);T
already intaxonomy
are not modified. *Importantly,T
here is the tax id fetched from the GenBank xml forS
(db_xref/taxon:<T>
), and it is possible that when information forT
is fetched from NCBI taxonomy that the db entry forT2 != T
is fetched - this will occur ifT
has been merged withT2
in the NCBI taxonomy database (example as of writing:T=9478
andT2=1868482
) -
rnacentral_matches
is always updated for all seed seqs (update_or_create()
) by checking via md5 if an exact match exists in RNAcentral
'Independent seeds' requires addition of the SEEDTBLOUT
and SEEDSCORES
files to SVN, which are created by running rfsearch.pl
and then rfmake.pl
. Steps for doing that for each family:
-
rfco.pl
checkout family from SVN withoutSEEDSCORES
andSEEDTBLOUT
; this requires using a temporary subroutine:Bio::Rfam::FamilyIO::loadRfamFromLocalFile_preSEED
to be called byrfco.pl
so it doesn't try to check-outSEEDSCORES
andSEEDTBLOUT
which do not yet exist. - run
rfsearch.pl
to get newTBLOUT
,REVTBLOUT
andSEEDTBLOUT
files - run
rfmake.pl
(using current GA??) to createSCORES
andSEEDSCORES
- rfci.pl commit family (including
SEEDSCORES
andSEEDTBLOUT
) and have it update RfamLive tablesseed_region
,taxonomy
,rfamseq
, andrnacentral_matches
as above. Verify that this updates tables in RfamLive correctly.
- Remove
Bio::Rfam::FamilyIO::loadRfamFromLocalFile_preSEED
subroutine and replace calls to it with calls toloadRfamFromLocalFile
. - Checkout some test families and
- Verify
rfco.pl
checks outSEEDSCORES
andSEEDTBLOUT
- Verify
rfmake.pl
makesseedoutlist
andseedspecies
without needing to runrfsearch.pl
first
- Verify
- Do we use current GA when rerunning
rfmake.pl
for all families? - I think it makes sense to test it all out by updating a few families all the way, before embarking on all families. This will require using the
loadRfamFromLocalFile_preSEED
for those test families, and then removing it to testrfco.pl
after the update. - At what point do we merge the
rfci-md5
branch withmaster
? - How do we make sure RfamLive is not screwed up? At what point do I stop testing on Rfam-dev?
- step 1: Eric runs >=1 existing family rfco/rfsearch/rfmake/rfci on rfci-md5 branch and his environment
- step 2: Anton runs >=1 existing family on Anton and Ioanna's environment with rfci-seed branch and also tries to build a family (IS THIS STEP NECESSARY?)
- step 3: Ioanna merge rfci-md5 branch into master (also possibly merges rfam-cloud)
- step 4: Eric runs >= 1 existing family rfco/rfsearch/rfmake/rfci with new master
- step 5: Anton runs >= 1 existing and >=1 new family rfco/rfsearch/rfmake/rfci with new master
- step 6: Ioanna runs all existing families rfco/rfsearch/rfmake/rfci
- make sure you are using rfci-md5 branch (
which rfco.pl
should return path to local copy of rfci-md5 branch) - Eric's local md5 branch is here:
/homes/nawrocki/src/rfam-family-pipeline
(it's possible I have some local files that have not been added to the git repo that you may need to copy over) - make sure your
RFAM_CONFIG
environment variable points to a copy withSEEDTBLOUT
andSEEDSCORES
added tofiles->family_file
list (Eric'srfam.conf
andrfam_local.conf
he used to successfully check in families are in this dir:/nfs/production/xfam/users/nawrocki/notebook/19_0318_rfam_seedoutlist/
) - Choose a family RFXXXXX to update
- checkout the family and move into that directory:
rfco.pl RFXXXXX; cd RFXXXXX;
- connect to RfamLive and run the following shell script (
/nfs/production/xfam/users/nawrocki/notebook/19_0318_rfam_seedoutlist/testing/20190529/run-mysql.sh
) that executes 4 MySQL queries with a command likesh run-mysql.sh RFXXXXX
(TODO: improve these queries):
#!/bin/bash
if [ $# -eq 0 ]; then
echo "usage: run-mysql.sh <rfam-accession>"
exit 1
fi
# seed_region query 1
mysql --user rfamro --host mysql-rfam-live.ebi.ac.uk --port 4445 --database rfam_live -B -e "select * from seed_region where rfam_acc = \"$1\";" > $1.sr.mysql.txt
# rfamseq query 1
mysql --user rfamro --host mysql-rfam-live.ebi.ac.uk --port 4445 --database rfam_live -B -e "select s.rfam_acc, s.rfamseq_acc, r.ncbi_id from seed_region s join rfamseq r on r.rfamseq_acc = s.rfamseq_acc where s.rfam_acc = \"$1\";" > $1.rs.mysql.txt
# taxonomy query 1
mysql --user rfamro --host mysql-rfam-live.ebi.ac.uk --port 4445 --database rfam_live -B -e "select s.rfam_acc, s.rfamseq_acc, t.ncbi_id, t.align_display_name from seed_region s join rfamseq r on r.rfamseq_acc = s.rfamseq_acc join taxonomy t on t.ncbi_id = r.ncbi_id where s.rfam_acc = \"$1\";" > $1.tx.mysql.txt
# rnacentral_matches query 1
mysql --user rfamro --host mysql-rfam-live.ebi.ac.uk --port 4445 --database rfam_live -B -e "select s.rfam_acc, c.seq_start, c.seq_end, c.type, r.rfamseq_acc, r.ncbi_id, c.rnacentral_id from seed_region s join rfamseq r on r.rfamseq_acc = s.rfamseq_acc join rnacentral_matches c on c.rfamseq_acc = r.rfamseq_acc where s.rfam_acc = \"$1\"" > $1.rc.mysql.txt
Run rfsearch/rfmake/svn-add/rfci
from within the checked out family dir:
$ rfsearch.pl > rfsearch.out
# or
$ bsub -q research-rh7 -J RFXXXXX -o /dev/null -e RFXXXXX.err "rfsearch.pl > rfsearch.out"
# after rfsearch.pl finishes, run enforce current GA threshold in DESC with rfmake.pl
$ rfmake.pl
# to enforce current threshold in `DESC`
$ svn add SEEDTBLOUT SEEDSCORES
$ cd ..; rfci.pl -m "Updates family to independent seed paradigm" -preseed RFXXXXX
# The -preseed option makes rfci.pl use a temporary subroutine
# (FamilyIO::loadRfamFromSVN_preSEED() which calls
# FamilyIO::loadRfamFromLocalFile_preSEED) that does not expect
# SEEDTBLOUT and SEEDSCORES to be in the SVN repo yet (which they
# won't be since we are about to add them with this call to rfci.pl)
#
# Run RfamLive queries from above and make sure output matches/differs as expected.
- Remove
FamilyIO::loadRfamFromSVN_preSEED()
andFamilyIO::loadRfamFromLocalFile_preSEED
subroutines and calls (inrfci.pl
andjiffies/update_seed_region_for_family.pl
- THIS LIST MAY NOT BE COMPLETE