FCS GX quickstart

Eric Tvedte edited this page Apr 17, 2024 · 5 revisions

FCS-GX detects contamination from foreign organisms in genome sequences. This tool is one module within the NCBI Foreign Contamination Screening (FCS) program suite.

We recommend running FCS-GX after the initial contig assembly and on the final assembly prior to GenBank submission. If additional valid contaminants are identified in the final assembly, we recommend re-screening after contaminant removal.

FCS-GX operates in six main steps:

  1. Repeat and low-complexity sequence masking
  2. Alignment to reference database using GX aligner
  3. Alignment refinement with high-scoring taxa matches
  4. Classifying sequences to assign taxonomic divisions
  5. Generating contaminant cleaning actions
  6. Cleaning the genome

Quickstart

Prerequisites

  1. Docker or Singularity. The current Singularity image is built with Singularity version 3.4.0.
  2. Python 3.7+.
  3. 470 GiB of disk space to save a local copy of the database files.
  4. A host with 512 GiB shared memory to hold the database and accessory files. Execution can utilize up to 48 CPU cores. Not running on a large-RAM server will result in extremely long run times (as much as a 10000x difference in performance).
  5. A genome assembly in FASTA format.
  6. The tax-id of the organism.
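
The prerequisites above can be checked before downloading anything. The following is a minimal sketch, assuming a Linux host; `DB_DIR` and the function names are hypothetical and only used for illustration, and the thresholds are the ones stated in the list.

```shell
# Pre-flight report for the prerequisites listed above.
# DB_DIR (hypothetical) is where the database would be stored; defaults to the current folder.
DB_DIR="${DB_DIR:-.}"

have() { command -v "$1" >/dev/null 2>&1; }

check_prereqs() {
    # Container runtime: either Docker or Singularity is enough.
    if have docker || have singularity; then
        echo "container runtime: ok"
    else
        echo "container runtime: MISSING (need docker or singularity)"
    fi

    # Python 3.7 or newer.
    if have python3 && python3 -c 'import sys; sys.exit(sys.version_info < (3, 7))'; then
        echo "python3 >= 3.7: ok"
    else
        echo "python3 >= 3.7: MISSING"
    fi

    # Free space in GiB on the filesystem that holds DB_DIR (470 GiB needed).
    free_gib=$(df -Pk "$DB_DIR" | awk 'NR==2 {print int($4 / 1048576)}')
    echo "free disk at $DB_DIR: ${free_gib} GiB (need 470)"

    # Total RAM in GiB (512 GiB recommended); /proc/meminfo exists on Linux only.
    if [ -r /proc/meminfo ]; then
        mem_gib=$(awk '/^MemTotal:/ {print int($2 / 1048576)}' /proc/meminfo)
        echo "total memory: ${mem_gib} GiB (recommend 512)"
    fi
}

check_prereqs
```

The script only reports what it finds and never aborts, so it is safe to run on any host before committing to the download.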

Note: FCS-GX can be run in AWS or GCP. See the Amazon Web Services wiki or the Google Cloud wiki to get started on creating a VM with Docker. The source code is in the ncbi/fcs-gx repo.

Download FCS-GX

  1. Retrieve the fcs.py runner script:

    curl -LO https://github.com/ncbi/fcs/raw/main/dist/fcs.py
    

    The Docker image is the default; the runner script downloads and uses it automatically.

  2. For Singularity users:
    Retrieve the Singularity image file fcs-gx.sif and set the environment variable to use the image with the runner:

    curl https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/FCS/releases/latest/fcs-gx.sif -Lo fcs-gx.sif
    export FCS_DEFAULT_IMAGE=fcs-gx.sif
    

    To see the version of your Singularity image, you can run:

    singularity inspect fcs-gx.sif  
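
For Singularity users, a failed download can leave a missing or empty fcs-gx.sif behind. A minimal sketch that exports `FCS_DEFAULT_IMAGE` only when the image file exists and is non-empty; the helper name `use_sif_image` is hypothetical.

```shell
# use_sif_image [PATH]: export FCS_DEFAULT_IMAGE only if PATH is a non-empty file.
# PATH defaults to fcs-gx.sif in the current folder, as in the download step above.
use_sif_image() {
    sif="${1:-fcs-gx.sif}"
    if [ -s "$sif" ]; then
        export FCS_DEFAULT_IMAGE="$sif"
        echo "using Singularity image: $FCS_DEFAULT_IMAGE"
    else
        echo "image not found or empty: $sif (FCS_DEFAULT_IMAGE left unchanged)" >&2
        return 1
    fi
}

# Example:
#   use_sif_image fcs-gx.sif
```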
    

Download the FCS-GX database

  1. Download the db (for a slower alternative, see FCS-GX input):

    curl -LO https://github.com/peak/s5cmd/releases/download/v2.0.0/s5cmd_2.0.0_Linux-64bit.tar.gz
    tar -xvf s5cmd_2.0.0_Linux-64bit.tar.gz
    
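    LOCAL_DB="/path/to/db/folder"
    # Later steps read the files from "$LOCAL_DB/gxdb", so download into that subfolder.
    mkdir -p "$LOCAL_DB/gxdb"
    ./s5cmd --no-sign-request cp --part-size 50 --concurrency 50 "s3://ncbi-fcs-gx/gxdb/latest/all.*" "$LOCAL_DB/gxdb/"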
    
  2. Check that the database was downloaded successfully to $LOCAL_DB/gxdb:

    ls "$LOCAL_DB/gxdb"
    
    all.README.txt
    all.assemblies.tsv
    all.blast_div.tsv.gz
    all.gxi
    all.gxs
    all.manifest
    all.meta.jsonl
    all.seq_info.tsv.gz
    all.taxa.tsv
    
  3. If you have access to a tmpfs- or ramfs-backed filesystem, e.g., /dev/shm, you can copy the downloaded database to RAM so that it remains available for successive runs on the same server:

    sudo mkdir /my_tmpfs
    sudo mount -t tmpfs tmpfs /my_tmpfs -o size=470G
    
    python3 fcs.py db get --mft "$LOCAL_DB/gxdb/all.manifest" --dir /my_tmpfs/gxdb
    
  4. Check for differences between the source 'all' db and the downloaded 'all' db. SOURCE_DB_MANIFEST is not set by the earlier steps; point it at the manifest of the source database before running the first command. If you have access to a tmpfs- or ramfs-backed filesystem, you can also compare the downloaded 'all' db with the cached 'all' db:

    python3 fcs.py db check --mft "$SOURCE_DB_MANIFEST" --dir "$LOCAL_DB/gxdb"
    python3 fcs.py db check --mft "$LOCAL_DB/gxdb/all.manifest" --dir /my_tmpfs/gxdb
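
Before running the slower `db check`, a quick presence test can catch an incomplete download. This is a minimal sketch: the file names are copied from the directory listing above, the helper name `check_gxdb_files` is hypothetical, and the test only confirms that each file exists and is non-empty. It does not verify file contents, so it complements `fcs.py db check` rather than replacing it.

```shell
# Expected file names, copied from the directory listing in step 2.
GXDB_FILES="all.README.txt all.assemblies.tsv all.blast_div.tsv.gz all.gxi all.gxs all.manifest all.meta.jsonl all.seq_info.tsv.gz all.taxa.tsv"

# check_gxdb_files DIR: print every expected file that is missing or empty in DIR.
# Returns 0 when all files are present, 1 otherwise.
check_gxdb_files() {
    dir="$1"
    missing=0
    for f in $GXDB_FILES; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}

# Example with the paths used in this guide:
#   check_gxdb_files "$LOCAL_DB/gxdb" && check_gxdb_files /my_tmpfs/gxdb
```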
    

Screen the genome

  1. Set GXDB_LOC to the folder that contains the gxdb directory (the screen command passes "$GXDB_LOC/gxdb" to --gx-db).
  • If you are using the db stored on local disk: GXDB_LOC=/path/to/db/folder
  • If you are using the db stored in RAM: GXDB_LOC=/my_tmpfs
  2. Retrieve the organism tax-id from NCBI Taxonomy.
  3. Screen the genome:

    python3 ./fcs.py screen genome --fasta h_sapiens.fa.gz --out-dir ./gx_out/ --gx-db "$GXDB_LOC/gxdb" --tax-id 9606
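
The choice between the RAM copy and the on-disk copy can be made automatically. A minimal sketch, assuming the RAM copy counts as usable when it holds a non-empty gxdb/all.gxi; the helper name `pick_gxdb_loc` is hypothetical.

```shell
# pick_gxdb_loc RAM_DIR DISK_DIR: print the folder to use as GXDB_LOC.
# Prefers RAM_DIR when it already holds gxdb/all.gxi, otherwise falls back to DISK_DIR.
pick_gxdb_loc() {
    ram_dir="$1"
    disk_dir="$2"
    if [ -s "$ram_dir/gxdb/all.gxi" ]; then
        echo "$ram_dir"
    else
        echo "$disk_dir"
    fi
}

# Example with the paths used in this guide:
#   GXDB_LOC=$(pick_gxdb_loc /my_tmpfs /path/to/db/folder)
#   python3 ./fcs.py screen genome --fasta h_sapiens.fa.gz --out-dir ./gx_out/ \
#       --gx-db "$GXDB_LOC/gxdb" --tax-id 9606
```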
    

Clean the genome

  1. Perform cleaning actions on input genome:

    zcat h_sapiens.fa.gz | python3 ./fcs.py clean genome --action-report ./gx_out/h_sapiens.fa.9606.fcs_gx_report.txt --output clean.fasta --contam-fasta-out contam.fasta
    
  2. Split on internal contaminants instead of masking:

    sed -i 's/FIX/SPLIT/g' ./gx_out/h_sapiens.fa.9606.fcs_gx_report.txt
    
    zcat h_sapiens.fa.gz | python3 ./fcs.py clean genome --action-report ./gx_out/h_sapiens.fa.9606.fcs_gx_report.txt --output clean.fasta --contam-fasta-out contam.fasta
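
The `sed -i` command above edits the action report in place, so the original FIX actions are overwritten. A minimal sketch that writes the SPLIT variant to a separate file instead; the helper name `make_split_report` and the `.split.txt` suffix are hypothetical choices.

```shell
# make_split_report REPORT: write a copy of REPORT with FIX replaced by SPLIT.
# The copy is named like REPORT but ends in .split.txt; the original is left untouched.
# Prints the path of the new file.
make_split_report() {
    report="$1"
    split_report="${report%.txt}.split.txt"
    sed 's/FIX/SPLIT/g' "$report" > "$split_report"
    echo "$split_report"
}

# Example with the report from the screen step:
#   SPLIT_REPORT=$(make_split_report ./gx_out/h_sapiens.fa.9606.fcs_gx_report.txt)
#   zcat h_sapiens.fa.gz | python3 ./fcs.py clean genome --action-report "$SPLIT_REPORT" \
#       --output clean.fasta --contam-fasta-out contam.fasta
```

Keeping both files makes it possible to rerun the cleaning step with either masking or splitting without repeating the screen.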
    

Usage Examples

Test that FCS-GX is operating normally on a small FASTA file.

  1. Download the test FASTA:

    curl -LO https://zenodo.org/records/10932013/files/FCS_combo_test.fa
    
  2. Screen the genome:

    python3 ./fcs.py screen genome --fasta FCS_combo_test.fa --out-dir ./gx_out/ --gx-db /my_tmpfs/gxdb --tax-id 4932
    

    A successful FCS-GX run will print the parameters of the run, sequence masking progress, and a contamination summary report:

    -----------------------------------------------------------------------------
    
    tax-id    : 4932
    fasta     : /sample-volume/FCS_combo_test.fa
    size      : 12.18 MiB
    split-fa  : True
    BLAST-div : budding yeasts
    gx-div    : fung:budding yeasts
    w/same-tax: True
    bin-dir   : /app/bin
    gx-db     : /app/db/gxdb/gxdb/all.gxi
    gx-ver    : Nov 27 2023 12:29:26; git:v0.5.0
    output    : /output-volume//FCS_combo_test.4932.taxonomy.rpt
    
    -----------------------------------------------------------------------------
    
    Collecting masking statistics...
    Collected masking stats:  0.0125624 Gbp; 3.36762s; 3.73035 Mbp/s. Baseline: 1.04906
    
    Processed 420 queries, 12.5732Mbp in 4.35433s. (2.88751Mbp/s); num-jobs:120
    Species                    : None
    Asserted div               : fung:budding yeasts
    Inferred primary-divs      : ['fung:budding yeasts', 'fung:ascomycetes']
    Corrected primary-divs     : ['fung:budding yeasts', 'fung:ascomycetes']
    Putative contaminant divs  : ['prok:g-proteobacteria', 'anml:primates']
    Aggregate coverage         : 100%
    Minimum contam. coverage   : 20%
    
    -----------------------------------------------------------------------------
    
    fcs_gx_report.txt contamination summary:
    ----------------------------------------
                                    seqs      bases
                                ----- ----------
    TOTAL                            405     404339
    -----                          ----- ----------
    prok:g-proteobacteria            202     201923
    anml:primates                    201     200894
    virs:eukaryotic viruses            1       1000
    anml:nematodes                     1        522
    
    -----------------------------------------------------------------------------
    
    fcs_gx_report.txt action summary:
    ---------------------------------
                                    seqs      bases
                                ----- ----------
    TOTAL                            405     404339
    -----                          ----- ----------
    EXCLUDE                          401     400522
    FIX                                2       1922
    TRIM                               2       1895
    
    -----------------------------------------------------------------------------
    

    The output directory will contain the following files:

    FCS_combo_test.4932.taxonomy.rpt
    FCS_combo_test.4932.fcs_gx_report.txt
    

    The output should resemble the example taxonomy report (taxonomy.rpt) and action report (fcs_gx_report.txt). Note: Minor differences in output content are expected with code and database changes. See the FCS-GX Output page for more information on interpreting the outputs.

  3. Clean the genome:

    cat FCS_combo_test.fa | python3 ./fcs.py clean genome --action-report ./gx_out/FCS_combo_test.4932.fcs_gx_report.txt --output clean.fasta --contam-fasta-out contam.fasta
    

    By default this will exclude 401 sequences (EXCLUDE in action report), trim 2 sequences (TRIM), and hardmask 2 sequences at internal contaminants (FIX):

    Applied 405 actions; 402417 bps dropped; 1922 bps hardmasked.
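
The totals in the final line can be reconciled with the action summary from the screen step. The sketch below uses only the base counts shown above; the mapping (EXCLUDE and TRIM bases are dropped, FIX bases are hardmasked) is an observation from this example's numbers rather than a documented rule.

```shell
# Base counts copied from the action summary printed by the screen step.
exclude_bases=400522
trim_bases=1895
fix_bases=1922

# In this example, EXCLUDE and TRIM bases are removed and FIX bases are hardmasked.
dropped=$((exclude_bases + trim_bases))
hardmasked=$fix_bases
total=$((exclude_bases + trim_bases + fix_bases))

echo "dropped=$dropped hardmasked=$hardmasked total=$total"
# prints: dropped=402417 hardmasked=1922 total=404339
```

The total matches the TOTAL row of the action summary (404339 bases).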
    