SARG+ is a manually curated database of Antibiotic Resistance Genes (ARGs), designed to enhance read-based environmental surveillance at species-level resolution. It extends existing databases (NDARO, CARD, and SARG) by incorporating a comprehensive collection of protein sequences from RefSeq that are annotated through the same evidence sources (BlastRules or Hidden Markov Models provided by the NCBI Prokaryotic Genome Annotation Pipeline, PGAP) as experimentally validated ARGs. This expansion addresses the limitations of existing databases, which often include only a single or a few representative sequences per ARG, and allows for the use of more stringent cutoffs while maintaining sensitivity.
SARG+ (sarg.fa) consists of two components:
- SARG+ reference (sarg_ref.fa): Contains only experimentally validated sequences.
- SARG+ extension (sarg_ext.fa): Contains computationally derived homologs of the reference sequences.
Create a new conda environment with the necessary dependencies:
conda create -n sarg-curation -c bioconda -c conda-forge blast diamond mmseqs2 seqkit
conda activate sarg-curation
Install additional Python modules and Jupyter for running the notebooks:
conda install jupyter regex json5 wget tqdm biopython pandas
Note: Extracting sequences and creating diamond databases can be time-consuming. Consider running blastdbcmd and diamond makedb in parallel to utilize multiple cores.
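As a rough illustration of the note on parallel execution, independent commands can be dispatched concurrently from Python. This is a minimal sketch, not part of the SARG+ pipeline: `run_to_file` and `run_parallel` are hypothetical helpers, and the commented job list reuses the two blastdbcmd calls from this section with paths that only exist once the databases are downloaded.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_to_file(job):
    """Run one command and write its standard output to a file."""
    argv, out_path = job
    with open(out_path, "w") as out:
        return subprocess.run(argv, stdout=out, check=False).returncode


def run_parallel(jobs, max_workers=2):
    """Run independent (argv, out_path) jobs concurrently; return exit codes in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_to_file, jobs))


# Hypothetical usage, mirroring the two blastdbcmd calls in this section:
# jobs = [
#     (["blastdbcmd", "-db", "tmp/protein/nr/nr", "-entry", "all"], "tmp/nr.fa"),
#     (["blastdbcmd", "-db", "tmp/protein/env_nr/env_nr", "-entry", "all"], "tmp/env_nr.fa"),
# ]
# exit_codes = run_parallel(jobs)
```

Threads are sufficient here because the heavy work happens in the child processes; any cleanup of the source databases should wait until all exit codes are zero.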
Download nr and env_nr from NCBI FTP and extract sequences:
for db in nr env_nr
do
curl ftp://ftp.ncbi.nlm.nih.gov/blast/db/ \
| grep -o '[^ ]*.gz$' \
| grep ^$db \
| xargs -P 48 -I {} wget -qN --show-progress https://ftp.ncbi.nlm.nih.gov/blast/db/{} -P tmp/protein/$db
for file in tmp/protein/$db/*.tar.gz; do tar -xvf $file -C tmp/protein/$db; done
done
blastdbcmd -db tmp/protein/nr/nr -entry all > tmp/nr.fa
blastdbcmd -db tmp/protein/env_nr/env_nr -entry all > tmp/env_nr.fa
rm -rf tmp/protein
Download refseq_protein from NCBI RefSeq and extract annotation evidence:
for kingdom in archaea bacteria
do
curl ftp://ftp.ncbi.nlm.nih.gov/refseq/release/${kingdom}/ \
| grep '[^ ]*wp_protein.*.gz$' -o \
| xargs -P 48 -I {} wget -qN --show-progress https://ftp.ncbi.nlm.nih.gov/refseq/release/${kingdom}/{} -P tmp/refseq
done
cat tmp/refseq/*.faa.gz > tmp/refseq_protein.faa.gz
diamond makedb --in tmp/refseq_protein.faa.gz --db tmp/refseq_protein
rm -rf tmp/refseq_protein.faa.gz
python -c "
import gzip
import glob
import pandas as pd
from Bio import SeqIO
from tqdm.contrib.concurrent import process_map
def parser(file):
    lines = []
    with gzip.open(file, 'rt') as f:
        for record in SeqIO.parse(f, 'genbank'):
            header = record.description.rsplit(' [')[0].split('MULTISPECIES: ')[-1]
            if 'structured_comment' in record.annotations:
                annotation = record.annotations['structured_comment']['Evidence-For-Name-Assignment']
                identifier, evidence = annotation.get('Source Identifier', 'NA'), annotation.get('Evidence Accession', 'NA')
            else:
                identifier, evidence = 'NA', 'NA'
            gene = [x for x in record.features if x.type == 'gene']
            symbol = gene[0].qualifiers['gene'][0] if gene else 'NA'
            lines.append([record.id, header, evidence, symbol, identifier])
    pd.DataFrame(lines).to_csv(file.replace('.gpff.gz', '.tsv'), sep='\t', header=None, index=False)

r = process_map(parser, glob.glob('tmp/refseq/*.gpff.gz'), max_workers=48, chunksize=1)
"
NDARO needs to be downloaded manually from https://www.ncbi.nlm.nih.gov/pathogens/refgene/ (click Download for both the metadata refgenes.tsv and the reference protein sequences protein.faa). CARD can be obtained from https://card.mcmaster.ca/download. These files need to be unzipped and placed in the reference folder.
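The per-file evidence tables written by the RefSeq parser script earlier in this section are headerless and tab-separated, with five fields per protein. The sketch below shows how the description cleanup behaves and how such a table could be read back using only the standard library. The column names are assumptions chosen to match the order in which the parser appends values, and the example record is made up.

```python
import csv
import io

# Column order follows the list appended in parser(); the names themselves are assumptions.
COLUMNS = ["accession", "header", "evidence", "symbol", "identifier"]


def clean_header(description):
    """Keep the text before the first ' [' and drop a leading 'MULTISPECIES: ' tag."""
    return description.rsplit(" [")[0].split("MULTISPECIES: ")[-1]


def read_evidence(handle):
    """Read a headerless, tab-separated evidence table into a list of dicts."""
    reader = csv.reader(handle, delimiter="\t")
    return [dict(zip(COLUMNS, row)) for row in reader]


# Made-up record for illustration only.
example = "WP_000000001.1\tclass A beta-lactamase\tNA\tNA\tNA\n"
rows = read_evidence(io.StringIO(example))
print(clean_header("MULTISPECIES: class A beta-lactamase [Enterobacteriaceae]"))
print(rows[0]["accession"])
```

Proteins without a structured comment carry 'NA' in the evidence and identifier fields, so downstream steps may want to filter those rows out before matching evidence sources.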
We provide a series of Jupyter notebooks for step-wise construction of SARG+:
a0-parse-refs.ipynb
- Parses NDARO and CARD metadata and sequences to create a raw reference. Curates the reference according to sarg.json, producing the initial SARG+ reference database.
a1-standardize-headers.ipynb
- Standardizes the headers of SARG+ reference sequences according to nr.
b0-parse-evidence.ipynb
- Finds sequences annotated through the same evidence sources (BlastRules and Hidden Markov Models) as SARG+ reference sequences.
b1-remove-dups.ipynb
- Removes duplicated and cross-mapped sequences by clustering.
Note: Scripts z0 and z1 are used for testing and are not required for building SARG+.
- Summary File: misc/summary.tsv provides a summary of SARG+ reference sequences, including their types, subtypes, and sources.
- Evidence File: misc/evidence.tsv lists all evidence used for creating SARG+ extension.
- Counts: misc/sarg.txt, misc/sarg_ref.txt, and misc/sarg_ext.txt display the counts of different ARGs for SARG+, SARG+ reference, and SARG+ extension, respectively.
- Sequences:
  - sarg.fa: combined reference and extension sequences.
  - sarg_ref.fa: the reference component of SARG+.
  - sarg_ext.fa: the extension component of SARG+.
New ARG sequences can be integrated into SARG+ by editing sarg.json and reference/reference.fasta.
- sarg.json: Specifies the ARG type (class/family) for each new gene, relevant literature, rationales, and links to sequences.
- reference.fasta: Contains the protein sequences of these new genes.
For example:
// aminonucleoside
"aminonucleoside": {
    "added": {
        // "Molecular analysis of the pac gene encoding a puromycin N-acetyl transferase from Streptomyces alboniger"
        // Detoxification of puromycin.
        // https://www.uniprot.org/citations/2676728
        "sp|P13249|PUAC_STRAD": "pac", // Puromycin N-acetyltransferase OS=Streptomyces alboniger OX=132473 GN=pac PE=1 SV=2 | GNAT family N-acetyltransferase
        // "The pur8 gene from the pur cluster of Streptomyces alboniger encodes a highly hydrophobic polypeptide which confers resistance to puromycin"
        // May be involved in active puromycin efflux energized by a proton-dependent electrochemical gradient. In addition, it could be implicated in secreting N-acetylpuromycin, the last intermediate of the puromycin biosynthesis pathway, to the environment.
        // https://www.uniprot.org/citations/7916693
        "sp|P42670|PUR8_STRAD": "pur8", // Puromycin resistance protein pur8 OS=Streptomyces alboniger OX=132473 GN=pur8 PE=3 SV=1 | MFS transporter
        // "The ard1 gene from Streptomyces capreolus encodes a polypeptide of the ABC-transporters superfamily which confers resistance to the aminonucleoside antibiotic A201A"
        // The gene ard1 induced antibiotic resistance that was highly specific for A201A.
        // https://www.ncbi.nlm.nih.gov/nuccore/X84374
        "CAA59109.1": "ard1", // Ard1 protein [Saccharothrix mutabilis subsp. capreolus] | Ard1 protein
        // "The aminonucleoside antibiotic A201A is inactivated by a phosphotransferase activity from Streptomyces capreolus NRRL 3817, the producing organism. Isolation and molecular characterization of the relevant encoding gene and its DNA flanking regions"
        // A novel resistance determinant (ard2) to the aminonucleoside antibiotic A201A was cloned from Streptomyces capreolus NRRL 3817, the producing organism, and expressed in Streptomyces lividans.
        // https://www.ncbi.nlm.nih.gov/nuccore/X84374
        "CAD62197.1": "ard2", // Ard2 protein [Saccharothrix mutabilis subsp. capreolus] | Ard2 protein
    }
}
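Because every new gene is declared in sarg.json and its sequence stored in reference/reference.fasta, it can help to check that the two files agree before submitting changes. The following is a minimal sketch under one assumption that the source does not state: that the first token of each FASTA header equals the identifier used as the key in the "added" block. The helper names are hypothetical, the "added" mapping is written out by hand rather than parsed from the file, and the sequence is a placeholder, not a real protein.

```python
def fasta_ids(fasta_text):
    """Return the first whitespace-delimited token of every FASTA header."""
    return {
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    }


def missing_sequences(added, fasta_text):
    """List identifiers declared in 'added' that have no FASTA record."""
    present = fasta_ids(fasta_text)
    return sorted(key for key in added if key not in present)


# Two identifiers taken from the example entry; the sequence is a placeholder.
added = {"sp|P13249|PUAC_STRAD": "pac", "CAA59109.1": "ard1"}
fasta_text = ">sp|P13249|PUAC_STRAD Puromycin N-acetyltransferase\nMXXX\n"
print(missing_sequences(added, fasta_text))
```

In practice the mapping would come from parsing sarg.json with a JSON5-aware reader, since the file contains comments and trailing commas.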
If you spot any suspicious cases or wish to add sequences to SARG+, please consider creating a pull request by editing sarg.json and reference/reference.fasta, or opening an issue.
No, we exclude all ARGs related to point mutations in essential genes (mainly antibiotic targets). Examples include mutations in gyrA, parC, and rpoB.
We group highly similar ARG subtypes (genes) into clusters to reduce the chance of false identifications. For instance, blaOXA-1 and blaOXA-1024 differ by a single amino acid, and this subtle difference can be difficult to detect using reads. By default, we apply 95% identity and 95% query/subject coverage as cutoffs for subtype clustering.
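To make the default cutoffs concrete, the sketch below tests whether a pairwise alignment would place two subtypes in the same cluster. It is an illustration rather than the clustering code used to build SARG+: the function name is hypothetical, it assumes that both query and subject coverage must reach the cutoff, that the comparison is inclusive, and the protein length is chosen only for the example.

```python
def passes_cutoffs(identity, query_cover, subject_cover, min_identity=95.0, min_cover=95.0):
    """Return True when an alignment meets the identity and both coverage cutoffs (percentages)."""
    return (
        identity >= min_identity
        and query_cover >= min_cover
        and subject_cover >= min_cover
    )


# Illustrative numbers: a full-length alignment of two proteins of equal length
# (276 residues, chosen for illustration) that differ at a single position.
length = 276
identity = 100.0 * (length - 1) / length
print(passes_cutoffs(identity, 100.0, 100.0))
print(passes_cutoffs(90.0, 100.0, 100.0))
```

A single substitution leaves identity well above 95%, which is why near-identical subtypes end up grouped together.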
SARG+ is designed for read-based ARG profiling. Fused genes can create ambiguities when reads are aligned against them. For example, WP_071593228.1 (catB/aac(6')-I) likely results from the fusion of WP_264840997.1 (catB) and WP_033917551.1 (aac(6')-I). Reads, especially short ones, may not reliably distinguish between these genes, potentially leading to false identifications.
We omit regulators (e.g., activators, repressors) since they do not confer direct antibiotic resistance (with the exceptions of tipA and albAB, which function as self-regulated sequesters). We also remove genes that are putatively accessory, such as vanZ.
We standardize gene names to ensure all of them are identifiable through a combination of ARG type and subtype. For example:
- mdtP refers to both a component of RND efflux pumps in Escherichia coli and an MFS transporter in Bacillus subtilis. To avoid confusion, we rename the Bacillus version to mdt(P), aligning it with mdt(A) (also an MFS transporter).
- Some genes are mislabeled by RefSeq; for example, efrCD is misspelled as erfCD. We correct such cases.
- qacA and qacB are renamed to qacA/B due to their high sequence similarity.
Maintaining naming consistency is a key goal of SARG+, so we modify some gene names accordingly. For instance:
- tnrB2 and tnrB3 are renamed to tnrB-1 and tnrB-2 to reflect their two-component nature as ABC transporters.
- cap21 refers to orf21 of a biosynthetic gene cluster (AB476988) in Streptomyces griseus, which lacks a proper name.
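The context-free renames mentioned in the examples above can be pictured as a simple lookup table. This is only a sketch of the idea: the mapping lists the renames named in the text, `standardize_name` is a hypothetical helper, and mdtP is deliberately left out because its rename applies only to the Bacillus subtilis MFS transporter and so depends on more than the name.

```python
# Renames taken from the examples in this section.
RENAMES = {
    "erfCD": "efrCD",   # misspelling corrected
    "qacA": "qacA/B",   # merged due to high sequence similarity
    "qacB": "qacA/B",
    "tnrB2": "tnrB-1",  # two components of an ABC transporter
    "tnrB3": "tnrB-2",
}


def standardize_name(name):
    """Return the standardized gene name, or the input unchanged if no rename applies."""
    return RENAMES.get(name, name)


print(standardize_name("erfCD"))
print(standardize_name("vanA"))
```

Names that are ambiguous across organisms or ARG types would need the additional context as part of the lookup key.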