refseq download error #76

cmatKhan · 2021-10-07T23:17:20Z

I am trying to download a handful of yeast genomes from RefSeq with the following script:

library(biomartr)
library(tidyverse)

post_wgd_yeasts = c(
  "Naumovozyma castellii",
  "Naumovozyma dairenensis",
  "Tetrapisispora blattae",
  "Tetrapisispora phaffii",
  "Kazachstania africana",
  "Kazachstania naganishii"
)

pre_wgd_yeasts = c(
  "Torulaspora delbrueckii"
)

avail = map(post_wgd_yeasts, is.genome.available, db="refseq")

is.genome.available(db = "refseq", pre_wgd_yeasts[[1]])

getGenomeSet(
  db = "genbank",
  post_wgd_yeasts,
  path = "/media/chase/Seagate Backup Plus Drive/yeast_genomes"
)

Using either refseq or genbank as the db, I get the following error:

Content type 'unknown' length 678885 bytes (662 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_other/assembly_summary.txt'                                                                                                                            
Content type 'unknown' length 606909 bytes (592 KB)
==================================================
                                                                                                                                                                                                                         

Completed!
Now continue with species download ...
The FTP link: 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/012/850/905/GCA_012850905.1_ASM1285090v1/GCA_012850905.1_ASM1285090v1_genomic.fna.gz' seems not to be available at the moment. This might be due to an instable internet connection, a firewall issue, or wrong organism name. Could you please try to re-run the function to see whether it works now?
The FTP link: 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/012/850/905/GCA_012850905.1_ASM1285090v1/md5checksums.txt' seems not to be available at the moment. This might be due to an instable internet connection, a firewall issue, or wrong organism name. Could you please try to re-run the function to see whether it works now?
Genome download of Naumovozyma_castellii is completed!
The download session seems to have timed out at the FTP site 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/012/850/905/GCA_012850905.1_ASM1285090v1/GCA_012850905.1_ASM1285090v1_genomic.fna.gz'. This could be due to an overload of queries to the databases. Please restart this function to continue the data retrieval process or wait for a while before restarting this function in case your IP address was logged due to an query overload on the server side.
Error: Please provide a valid file path to your genome assembly file.                                                                                                               
In addition: Warning messages:
1: In download.file(url, ...) :
  downloaded length 114371376 != reported length 329358039
2: In download.file(url, ...) :
  URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt': Timeout of 60 seconds was reached
3: One or more parsing issues, see `problems()` for details 
4: One or more parsing issues, see `problems()` for details 
5: One or more parsing issues, see `problems()` for details 
6: More than one entry has been found for 'Naumovozyma castellii'. Only the first entry 'Naumovozyma castellii' has been used for subsequent genome retrieval. If you wish to download a different version, please use the NCBI accession ID when specifying the 'organism' argument. See ?is.genome.available for examples.

When I paste the FTP link into my browser, it downloads just fine.

Any suggestions?
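One thing that may be worth ruling out: warning 2 above reports a 60-second timeout, which matches the default of R's `timeout` option that download.file() consults. Below is a minimal sketch, assuming a hypothetical helper `with_timeout()` (not part of biomartr) and assuming the slow step actually honours this option, which may not hold for every download method:

```r
# Hypothetical helper (not part of biomartr): evaluate an expression with a
# longer download timeout, then restore the previous setting.
with_timeout <- function(seconds, expr) {
  old <- options(timeout = seconds)   # options() returns the old values
  on.exit(options(old), add = TRUE)   # restore on exit, even after an error
  force(expr)                         # expr is evaluated lazily, i.e. here
}

# Not run: wrap the original call, e.g.
# with_timeout(600, getGenomeSet(db = "refseq", post_wgd_yeasts, path = "yeast_genomes"))
```

The option is restored afterwards, so the rest of the session keeps its usual timeout.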


cmatKhan · 2021-10-08T00:27:15Z

I am also unsure why it says there is more than one entry. When I look manually, there is only one entry per organism:

> x = map(post_wgd_yeasts, is.genome.available, db="refseq", details=TRUE)
                                                                                                                           
> map(x, nrow)
[[1]]
[1] 1

[[2]]
[1] 1

[[3]]
[1] 1

[[4]]
[1] 1

[[5]]
[1] 1

cmatKhan · 2021-10-08T01:45:20Z

For what it is worth, this did work:

avail = map(post_wgd_yeasts, is.genome.available, db="refseq", details=TRUE)

test = function(ftp_row){
  accession = ftp_row[['assembly_accession']]
  asm_name = ftp_row[['asm_name']]
  ftp_addr = ftp_row[['ftp_path']]
  gtf_filename = paste(accession, 
                        asm_name,  
                        "genomic.gtf.gz", 
                        sep = "_")
  gtf_url = file.path(ftp_addr, gtf_filename)
  output = file.path("/media/chase/Seagate Backup Plus Drive/yeast_genomes", gtf_filename)

  download.file(
    url = gtf_url,
    destfile = output
  )
}

map(avail, apply, 1, test)

cmatKhan · 2021-10-08T14:15:39Z

OK -- I think I have a solution to this.

Here is my environment:

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2020.0.166/linux/mkl/lib/intel64_lin/libmkl_rt.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ape_5.5              forcats_0.5.1        stringr_1.4.0        dplyr_1.0.7          purrr_0.3.4          readr_2.0.2          tidyr_1.1.4          tibble_3.1.5         ggplot2_3.3.5        tidyverse_1.3.1      biomartr_1.0.2       rtracklayer_1.52.1  
[13] GenomicRanges_1.44.0 GenomeInfoDb_1.28.4  IRanges_2.26.0       S4Vectors_0.30.2     BiocGenerics_0.38.0  biomaRt_2.48.3      

loaded via a namespace (and not attached):
  [1] colorspace_2.0-2            rjson_0.2.20                ellipsis_0.3.2              rprojroot_2.0.2             htmlTable_2.2.1             XVector_0.32.0              fs_1.5.0                    base64enc_0.1-3             dichromat_2.0-0            
 [10] rstudioapi_0.13             bit64_4.0.5                 lubridate_1.8.0             AnnotationDbi_1.54.1        fansi_0.5.0                 xml2_1.3.2                  splines_4.1.1               cachem_1.0.6                knitr_1.36                 
 [19] jsonlite_1.7.2              Formula_1.2-4               Rsamtools_2.8.0             broom_0.7.9                 cluster_2.1.2               dbplyr_2.1.1                png_0.1-7                   BiocManager_1.30.16         compiler_4.1.1             
 [28] httr_1.4.2                  backports_1.2.1             assertthat_0.2.1            Matrix_1.3-4                fastmap_1.1.0               lazyeval_0.2.2              cli_3.0.1                   htmltools_0.5.2             prettyunits_1.1.1          
 [37] tools_4.1.1                 gtable_0.3.0                glue_1.4.2                  GenomeInfoDbData_1.2.6      rappdirs_0.3.3              Rcpp_1.0.7                  Biobase_2.52.0              cellranger_1.1.0            vctrs_0.3.8                
 [46] Biostrings_2.60.2           nlme_3.1-153                xfun_0.26                   rvest_1.0.1                 lifecycle_1.0.1             restfulr_0.0.13             XML_3.99-0.8                zlibbioc_1.38.0             scales_1.1.1               
 [55] vroom_1.5.5                 BSgenome_1.60.0             VariantAnnotation_1.38.0    hms_1.1.1                   MatrixGenerics_1.4.3        SummarizedExperiment_1.22.0 AnnotationFilter_1.16.0     RColorBrewer_1.1-2          yaml_2.2.1                 
 [64] curl_4.3.2                  memoise_2.0.0               gridExtra_2.3               downloader_0.4              rpart_4.1-15                latticeExtra_0.6-29         stringi_1.7.5               RSQLite_2.2.8               highr_0.9                  
 [73] BiocIO_1.2.0                checkmate_2.0.0             GenomicFeatures_1.44.2      filelock_1.0.2              BiocParallel_1.26.2         rlang_0.4.11                pkgconfig_2.0.3             matrixStats_0.61.0          bitops_1.0-7               
 [82] evaluate_0.14               lattice_0.20-45             GenomicAlignments_1.28.0    htmlwidgets_1.5.4           bit_4.0.4                   tidyselect_1.1.1            magrittr_2.0.1              R6_2.5.1                    generics_0.1.0             
 [91] Hmisc_4.6-0                 DelayedArray_0.18.0         DBI_1.1.1                   withr_2.4.2                 haven_2.4.3                 pillar_1.6.3                foreign_0.8-81              survival_3.2-13             KEGGREST_1.32.0            
[100] RCurl_1.98-1.5              nnet_7.3-16                 modelr_0.1.8                crayon_1.4.1                utf8_1.2.2                  BiocFileCache_2.0.0         rmarkdown_2.11              tzdb_0.1.2                  usethis_2.0.1              
[109] jpeg_0.1-9                  progress_1.2.2              readxl_1.3.1                data.table_1.14.2           blob_1.2.2                  reprex_2.0.1                digest_0.6.28               munsell_0.5.0

When I call the 'custom_download' function directly, like so:



#' @title Helper function to perform customized downloads
#' @description To achieve the most stable download experience,
#' ftp file downloads are customized for each operating system.
#' @param ... additional arguments that shall be passed to
#' \code{\link[downloader]{download}}
#' @author Hajk-Georg Drost
#' @noRd
custom_download <- function(url, ...) {

  if (RCurl::url.exists(url)) {
    operating_sys <- Sys.info()[1]

    if (operating_sys == "Darwin") {
      downloader::download(
        url = url, ...,
        method = "curl",
        extra = "--retry 3",
        cacheOK = FALSE,
        quiet = TRUE
      )

    }

    if (operating_sys == "Linux") {
      downloader::download(
        url = url, ...,
        method = "wget",
        extra = "--tries 3 --continue",
        cacheOK = FALSE,
        quiet = TRUE
      )
    }

    if (operating_sys == "Windows") {
      downloader::download(url = url, ...,
                           method = "internal",
                           cacheOK = FALSE,
                           quiet = TRUE)
    }
  } else {
    message(
      "The FTP link: '",url,"' seems not to be available at the moment. This might be due to an instable internet connection, a firewall issue, or wrong organism name. Could you please try to re-run the function to see whether it works now?"
    )
    return(FALSE)
  }
}

custom_download("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.gtf.gz",
                path = "data/test.gff")

Then I get this error:

Error in download.file(url, method = method, ...) : 
  formal argument "method" matched by multiple actual arguments

Which corresponds to these lines in the downloader::download() function

      suppressWarnings(download.file(url, method = method, 
        ...))
    }
    else {
      if (isR32 && capabilities("libcurl")) {
        method <- "libcurl"
      }
      else if (nzchar(Sys.which("wget")[1])) {
        method <- "wget"
      }
      else if (nzchar(Sys.which("curl")[1])) {
        method <- "curl"
        orig_extra_options <- getOption("download.file.extra")
        on.exit(options(download.file.extra = orig_extra_options))
        options(download.file.extra = paste("-L", orig_extra_options))
      }

So, there is an issue with setting method in the custom_download function. I don't have a good explanation of why.
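One possible explanation, consistent with the download.file(url, method = method, ...) call quoted above: if the wrapper sets `method` itself and also forwards `...`, then any `method` supplied by the caller arrives a second time through `...`. A minimal sketch with hypothetical stand-in functions (`inner()` and `wrapper()` are illustrations, not the real download.file() or downloader::download()):

```r
# Stand-in for download.file(): it has a formal argument named `method`.
inner <- function(url, method = "auto", ...) method

# Stand-in for a wrapper that chooses its own method and also forwards `...`.
wrapper <- function(url, ...) {
  method <- "libcurl"
  inner(url, method = method, ...)
}

# Works: `method` is supplied exactly once, by the wrapper.
ok <- wrapper("https://example.org/file.gz")

# Fails: `method` is supplied by the wrapper and again via `...`.
err <- tryCatch(
  wrapper("https://example.org/file.gz", method = "wget"),
  error = function(e) conditionMessage(e)
)
print(err)
```

Under that assumption, dropping `method` from the caller's side removes the duplicate and leaves the wrapper free to pick a method itself.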

However, if I simply remove that line such that the Linux download function looks like this:

    if (operating_sys == "Linux") {
      downloader::download(
        url = url, ...,
        extra = "--tries 3 --continue",
        cacheOK = FALSE,
        quiet = TRUE
      )
    }

It works just fine.

Re: Issue ropensci#76, remove the 'method' argument from custom_download. When this is included, the following error occurs: Error in download.file(url, method = method, ...) : formal argument "method" matched by multiple actual arguments

cmatKhan · 2021-10-08T14:45:50Z

Hm, that does work. But if I try to download the annotation files and genomes to the same directory, then there is a problem with the attempt to unzip.

This occurs if I call getGenomeSet after getGffSet with the same output directory:

Cleaning file names for more convenient downstream processing ...
Cleaning file names and unzipping files ...
Unzipping file Kafricana.gff' ...
Unzipping file Kazachstania_africana_genomic_refseq.fna.gz' ...
Unzipping file Kazachstania_naganishii_genomic_refseq.fna.gz' ...
Unzipping file Knaganishii.gff' ...
Error in decompressFile.default(filename = filename, ..., ext = ext, FUN = FUN) : 
  File already exists: /media/chase/Seagate Backup Plus Drive/yeast_genomes/KNA.fna

I could disable this by turning off the unzip function, but it is probably better to keep a list of the input files and unzip those rather than (I suspect) calling list.files and unzipping all files.
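The idea of unzipping only a known list of files could look roughly like this. It is a base-R sketch with a hypothetical helper `gunzip_listed()`; it is not how biomartr currently does it, and it skips existing outputs instead of raising an error:

```r
# Hypothetical helper: decompress only the .gz files passed in, rather than
# everything found in the output directory.
gunzip_listed <- function(gz_paths, overwrite = FALSE) {
  gz_paths <- gz_paths[grepl("\\.gz$", gz_paths)]  # ignore non-gz entries
  out_paths <- sub("\\.gz$", "", gz_paths)
  for (i in seq_along(gz_paths)) {
    # Skip files that were already decompressed by an earlier call.
    if (file.exists(out_paths[i]) && !overwrite) next
    con_in <- gzfile(gz_paths[i], open = "rb")
    con_out <- file(out_paths[i], open = "wb")
    repeat {
      chunk <- readBin(con_in, what = "raw", n = 1e6)
      if (length(chunk) == 0) break
      writeBin(chunk, con_out)
    }
    close(con_in)
    close(con_out)
  }
  invisible(out_paths)
}
```

A caller would collect the paths it just downloaded and pass exactly those, so files left over from a previous getGffSet run are never touched.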

Also, I disagree that it is good practice to rename the files to the species name. I suggest at least keeping the accession number in the filename, e.g. GCF_12345_scervisiae.gff.gz.
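A small sketch of what such a naming scheme might look like, using a hypothetical helper `accession_file_name()` (the exact pattern is only an illustration):

```r
# Hypothetical helper (not part of biomartr): keep the accession in the name.
accession_file_name <- function(accession, organism, ext) {
  # Replace runs of non-alphanumeric characters in the organism name with "_".
  organism <- gsub("[^A-Za-z0-9]+", "_", trimws(organism))
  paste0(accession, "_", organism, ".", ext)
}

accession_file_name("GCF_000146045.2", "Saccharomyces cerevisiae", "gff.gz")
# "GCF_000146045.2_Saccharomyces_cerevisiae.gff.gz"
```

With the accession in the name, two assemblies of the same species no longer collide in one directory.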

Further, my suggested fix above of removing the 'method' argument works in terms of getting the files to download. However, I am still getting this error message:

The download session seems to have timed out at the FTP site 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz'. This could be due to an overload of queries to the databases. Please restart this function to continue the data retrieval process or wait for a while before restarting this function in case your IP address was logged due to an query overload on the server side.
The genome of 'Saccharomyces_cerevisiae' has been downloaded to '/media/chase/Seagate Backup Plus Drive/yeast_genomes' and has been named 'Saccharomyces_cerevisiae_genomic_refseq.fna.gz'.

which is occurring due to some error handling try/catch statement that is apparently catching an error unrelated to the success of the download.
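Independently of which message gets printed, the download itself could be checked against the md5checksums.txt file that sits next to each assembly (the file named in the first log above). A minimal sketch, assuming a hypothetical helper `md5_matches()` and assuming the checksum file has lines of the form "<md5>  ./<file name>":

```r
# Hypothetical helper: TRUE if the local file's MD5 matches its entry in an
# md5checksums.txt-style listing, FALSE otherwise (including "no entry").
md5_matches <- function(file, checksum_file) {
  lines <- readLines(checksum_file)
  lines <- lines[nzchar(trimws(lines))]              # drop empty lines
  parts <- strsplit(trimws(lines), "[[:space:]]+")
  sums  <- vapply(parts, `[`, character(1), 1)       # first field: checksum
  files <- basename(vapply(parts, `[`, character(1), 2))  # second: file name
  expected <- sums[files == basename(file)]
  length(expected) == 1 &&
    identical(unname(tools::md5sum(file)), expected)
}
```

If the checksum matches, the timeout message can be treated as spurious for that file.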


HajkD · 2021-11-13T17:41:36Z

Hi @cmatKhan

Thank you so much for this amazing trouble-shooting and for finding the solutions!

I have now removed the method argument for Linux and macOS, so the issue should be resolved.

Does this work for you?

Regarding the unzip function when you download annotation into the same folder: I will look into this in detail.

Once again, thank you very much for this excellent work!

Best wishes,
Hajk

Roleren · 2023-09-27T10:33:40Z

I can confirm this fixes the issue.

This issue can now be closed.

HajkD added a commit that referenced this issue Nov 13, 2021

Fixing an issue in custom_download() where the method argument was causing issues when downloading from `https` directed `ftp` sites (Many thanks to @cmatKhan) #76

9f82a14

This was referenced Nov 13, 2021

R session aborted in downloading process #75

Closed

Error during genome retrieval #77

Closed

jarditi0011 mentioned this issue Jul 19, 2022

Refseq download failure for Windows #81

Closed

HajkD closed this as completed Sep 27, 2023
