IPVC-2379: add necessary NCBI input files to download config #25

Merged: 5 commits, Apr 24, 2024

@bsgiles73 bsgiles73 commented Apr 18, 2024

This PR:

  • Moves the paths to the required NCBI files into their own config file
  • Adds the missing files and other genome builds
  • Updates the download-ncbi script to read paths from the config file
  • Updates docker-compose.yml to pass the config file parameter
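
For readers unfamiliar with the pattern, a config-driven download loop could look roughly like the sketch below. This is not the actual download-ncbi script: the config format (one remote path per line, with `#` comments), the function names `dest_dir_for` and `download_from_config`, the `RSYNC` override variable, and the rsync flags are all assumptions made for illustration. Only the host name and the "first path component becomes the destination subdirectory" convention are taken from the log output in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the real download-ncbi script.
# Assumed config format: one remote path per line; blank lines and '#' comments ignored.

# Map "gene/DATA/gene2refseq.gz" to "<download_dir>/gene".
dest_dir_for() {
    local download_dir="$1" remote_path="$2"
    printf '%s/%s\n' "$download_dir" "${remote_path%%/*}"
}

# Read paths from the config file and transfer each one.
# RSYNC defaults to rsync; set RSYNC=echo for a dry run (assumed convention).
download_from_config() {
    local config_file="$1" download_dir="$2" host="ftp.ncbi.nlm.nih.gov"
    local remote_path dest
    while IFS= read -r remote_path || [ -n "$remote_path" ]; do
        # Skip blank lines and comments.
        case "$remote_path" in ''|'#'*) continue ;; esac
        dest="$(dest_dir_for "$download_dir" "$remote_path")"
        echo "Downloading ${host}::${remote_path} to ${dest}"
        mkdir -p "$dest" || return 1
        # Stop at the first failed transfer so the caller sees a non-zero status.
        "${RSYNC:-rsync}" --archive --relative "${host}::${remote_path}" "$dest" || return 1
    done < "$config_file"
}
```

Keeping the path list in a separate file means a new genome build can be added without touching the script itself.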

To test these changes, I ran the following:

docker build --target uta -t uta-update .

mkdir ncbi

export UTA_ETL_NCBI_DIR=./ncbi

docker compose run ncbi-download

Output from docker compose:

Downloading files to /ncbi-dir
Downloading ftp.ncbi.nlm.nih.gov::gene/DATA/gene2refseq.gz to /ncbi-dir/gene
receiving incremental file list
DATA/
DATA/gene2refseq.gz
  1,700,651,030 100%   12.73MB/s    0:02:07 (xfr#1, to-chk=0/2)

sent 51 bytes  received 1,701,066,362 bytes  13,034,991.67 bytes/sec
total size is 1,700,651,030  speedup is 1.00
Downloading ftp.ncbi.nlm.nih.gov::gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz to /ncbi-dir/gene
receiving incremental file list
DATA/
DATA/GENE_INFO/
DATA/GENE_INFO/Mammalia/
DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz
      5,120,527 100%    4.81MB/s    0:00:01 (xfr#1, to-chk=0/4)

sent 59 bytes  received 5,121,994 bytes  1,138,234.00 bytes/sec
total size is 5,120,527  speedup is 1.00
...

The download took 8.8 GB of disk space, and the output directory had the following structure:

sgiles-MD6M:uta shane.giles$ tree ncbi
ncbi
├── gene
│   └── DATA
│       ├── GENE_INFO
│       │   └── Mammalia
│       │       └── Homo_sapiens.gene_info.gz
│       └── gene2refseq.gz
├── genomes
│   └── refseq
│       └── vertebrate_mammalian
│           └── Homo_sapiens
│               └── all_assembly_versions
│                   ├── GCF_000001405.25_GRCh37.p13
│                   │   ├── GCF_000001405.25_GRCh37.p13_assembly_report.txt
│                   │   ├── GCF_000001405.25_GRCh37.p13_genomic.fna.gz
│                   │   └── GCF_000001405.25_GRCh37.p13_genomic.gff.gz
│                   ├── GCF_000001405.40_GRCh38.p14
│                   │   ├── GCF_000001405.40_GRCh38.p14_assembly_report.txt
│                   │   ├── GCF_000001405.40_GRCh38.p14_genomic.fna.gz
│                   │   └── GCF_000001405.40_GRCh38.p14_genomic.gff.gz
│                   └── GCF_009914755.1_T2T-CHM13v2.0
│                       ├── GCF_009914755.1_T2T-CHM13v2.0_assembly_report.txt
│                       ├── GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
│                       └── GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz
└── refseq
    └── H_sapiens
        └── mRNA_Prot
            ├── human.1.protein.faa.gz
            ├── human.1.rna.fna.gz
            ├── human.1.rna.gbff.gz
            ├── human.10.protein.faa.gz
            ├── human.10.rna.fna.gz
            ├── human.10.rna.gbff.gz
            ├── human.11.protein.faa.gz
            ├── human.11.rna.fna.gz
            ├── human.11.rna.gbff.gz
...
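
A quick way to confirm that each genome build directory holds the same three files shown in the tree above is a small check like the following. It is only a sketch: the function name `check_assembly_dir` is hypothetical, and the assumption that every file is named `<directory name>_<suffix>` is inferred from the listing rather than from any NCBI guarantee.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify one assembly directory contains the three
# files seen in the tree listing (assembly report, genomic FASTA, genomic GFF).
check_assembly_dir() {
    local dir="$1" name suffix missing=0
    name="$(basename "$dir")"
    for suffix in assembly_report.txt genomic.fna.gz genomic.gff.gz; do
        if [ ! -f "${dir}/${name}_${suffix}" ]; then
            echo "missing: ${dir}/${name}_${suffix}" >&2
            missing=1
        fi
    done
    # Returns 0 when all three files exist, 1 otherwise.
    return "$missing"
}

# Example usage against the download directory from this PR:
#   for d in ncbi/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/*/; do
#       check_assembly_dir "${d%/}"
#   done
```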

UPDATE
I verified that if a file is not found by rsync, the script exits with an error code.

Downloading ftp.ncbi.nlm.nih.gov::genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.40_GRCh38.p14/RefSeq_historical_alignments/GCF_000001405.40-RS_2023_03_knownrefseq_alns.gff.gz to /ncbi-dir/genomes
receiving incremental file list
rsync: link_stat "/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.40_GRCh38.p14/RefSeq_historical_alignments/GCF_000001405.40-RS_2023_03_knownrefseq_alns.gff.gz" (in genomes) failed: No such file or directory (2)

sent 8 bytes  received 262 bytes  108.00 bytes/sec
total size is 0  speedup is 0.00
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1865) [Receiver=3.2.7]
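
The behavior verified above (a missing file makes the script exit with an error code) can be achieved in more than one way, for example with `set -e` or with an explicit status check. The sketch below shows the explicit variant. The wrapper name `run_transfer` is hypothetical and this is not taken from the actual script; the only detail drawn from the log is that rsync reported code 23 for the missing file.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a transfer command and pass its exit status
# through unchanged, logging a message when it fails.
run_transfer() {
    local status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "transfer failed with exit code ${status}: $*" >&2
    fi
    return "$status"
}

# Example usage (arguments are illustrative):
#   run_transfer rsync --archive --relative "host::module/path" /ncbi-dir/module || exit $?
```

Returning the original status instead of a generic 1 lets a caller, such as a CI job, distinguish a missing file (23 in the log above) from other failures.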

@bsgiles73 bsgiles73 merged commit c2af389 into main Apr 24, 2024
1 check passed
@bsgiles73 bsgiles73 deleted the IPVC-2379-add-files-to-download-config branch April 24, 2024 18:16