refseq download error #76

Closed
cmatKhan opened this issue Oct 7, 2021 · 6 comments

cmatKhan commented Oct 7, 2021

I am trying to download a handful of yeast genomes from RefSeq with the following script:

library(biomartr)
library(tidyverse)

post_wgd_yeasts = c(
  "Naumovozyma castellii",
  "Naumovozyma dairenensis",
  "Tetrapisispora blattae",
  "Tetrapisispora phaffii",
  "Kazachstania africana",
  "Kazachstania naganishii"
)

pre_wgd_yeasts = c(
  "Torulaspora delbrueckii"
)

avail = map(post_wgd_yeasts, is.genome.available, db="refseq")

is.genome.available(db = "refseq", pre_wgd_yeasts[[1]])

getGenomeSet(
  db = "genbank",
  post_wgd_yeasts,
  path = "/media/chase/Seagate Backup Plus Drive/yeast_genomes"
)

Using either refseq or genbank, I get the following error:

Content type 'unknown' length 678885 bytes (662 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_other/assembly_summary.txt'                                                                                                                            
Content type 'unknown' length 606909 bytes (592 KB)
==================================================
                                                                                                                                                                                                                         

Completed!
Now continue with species download ...
The FTP link: 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/012/850/905/GCA_012850905.1_ASM1285090v1/GCA_012850905.1_ASM1285090v1_genomic.fna.gz' seems not to be available at the moment. This might be due to an instable internet connection, a firewall issue, or wrong organism name. Could you please try to re-run the function to see whether it works now?
The FTP link: 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/012/850/905/GCA_012850905.1_ASM1285090v1/md5checksums.txt' seems not to be available at the moment. This might be due to an instable internet connection, a firewall issue, or wrong organism name. Could you please try to re-run the function to see whether it works now?
Genome download of Naumovozyma_castellii is completed!
The download session seems to have timed out at the FTP site 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/012/850/905/GCA_012850905.1_ASM1285090v1/GCA_012850905.1_ASM1285090v1_genomic.fna.gz'. This could be due to an overload of queries to the databases. Please restart this function to continue the data retrieval process or wait for a while before restarting this function in case your IP address was logged due to an query overload on the server side.
Error: Please provide a valid file path to your genome assembly file.                                                                                                               
In addition: Warning messages:
1: In download.file(url, ...) :
  downloaded length 114371376 != reported length 329358039
2: In download.file(url, ...) :
  URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt': Timeout of 60 seconds was reached
3: One or more parsing issues, see `problems()` for details 
4: One or more parsing issues, see `problems()` for details 
5: One or more parsing issues, see `problems()` for details 
6: More than one entry has been found for 'Naumovozyma castellii'. Only the first entry 'Naumovozyma castellii' has been used for subsequent genome retrieval. If you wish to download a different version, please use the NCBI accession ID when specifying the 'organism' argument. See ?is.genome.available for examples. 

When I paste the FTP link into my browser, it downloads just fine.

Any suggestions?

Author

cmatKhan commented Oct 8, 2021

I am also unsure why it says there is more than one entry. When I look manually, there is only one entry per organism:

> x = map(post_wgd_yeasts, is.genome.available, db="refseq", details=TRUE)
                                                                                                                           
> map(x, nrow)
[[1]]
[1] 1

[[2]]
[1] 1

[[3]]
[1] 1

[[4]]
[1] 1

[[5]]
[1] 1

Author

cmatKhan commented Oct 8, 2021

For what it is worth, this did work:

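# One assembly-summary row (as a data frame) per species
avail = map(post_wgd_yeasts, is.genome.available, db="refseq", details=TRUE)

# Download the GTF annotation for a single assembly-summary row
test = function(ftp_row){
  accession = ftp_row[['assembly_accession']]
  asm_name = ftp_row[['asm_name']]
  ftp_addr = ftp_row[['ftp_path']]
  # NCBI file names follow <accession>_<asm_name>_genomic.gtf.gz
  gtf_filename = paste(accession, 
                        asm_name,  
                        "genomic.gtf.gz", 
                        sep = "_")
  gtf_url = file.path(ftp_addr, gtf_filename)
  output = file.path("/media/chase/Seagate Backup Plus Drive/yeast_genomes", gtf_filename)

  # Plain base R download, bypassing biomartr's download helper
  download.file(
    url = gtf_url,
    destfile = output
  )
}

# Apply the downloader to every row of every per-species data frame
map(avail, apply, 1, test)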

Author

cmatKhan commented Oct 8, 2021

OK -- I think I have a solution to this.

Here is my environment:

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2020.0.166/linux/mkl/lib/intel64_lin/libmkl_rt.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ape_5.5              forcats_0.5.1        stringr_1.4.0        dplyr_1.0.7          purrr_0.3.4          readr_2.0.2          tidyr_1.1.4          tibble_3.1.5         ggplot2_3.3.5        tidyverse_1.3.1      biomartr_1.0.2       rtracklayer_1.52.1  
[13] GenomicRanges_1.44.0 GenomeInfoDb_1.28.4  IRanges_2.26.0       S4Vectors_0.30.2     BiocGenerics_0.38.0  biomaRt_2.48.3      

loaded via a namespace (and not attached):
  [1] colorspace_2.0-2            rjson_0.2.20                ellipsis_0.3.2              rprojroot_2.0.2             htmlTable_2.2.1             XVector_0.32.0              fs_1.5.0                    base64enc_0.1-3             dichromat_2.0-0            
 [10] rstudioapi_0.13             bit64_4.0.5                 lubridate_1.8.0             AnnotationDbi_1.54.1        fansi_0.5.0                 xml2_1.3.2                  splines_4.1.1               cachem_1.0.6                knitr_1.36                 
 [19] jsonlite_1.7.2              Formula_1.2-4               Rsamtools_2.8.0             broom_0.7.9                 cluster_2.1.2               dbplyr_2.1.1                png_0.1-7                   BiocManager_1.30.16         compiler_4.1.1             
 [28] httr_1.4.2                  backports_1.2.1             assertthat_0.2.1            Matrix_1.3-4                fastmap_1.1.0               lazyeval_0.2.2              cli_3.0.1                   htmltools_0.5.2             prettyunits_1.1.1          
 [37] tools_4.1.1                 gtable_0.3.0                glue_1.4.2                  GenomeInfoDbData_1.2.6      rappdirs_0.3.3              Rcpp_1.0.7                  Biobase_2.52.0              cellranger_1.1.0            vctrs_0.3.8                
 [46] Biostrings_2.60.2           nlme_3.1-153                xfun_0.26                   rvest_1.0.1                 lifecycle_1.0.1             restfulr_0.0.13             XML_3.99-0.8                zlibbioc_1.38.0             scales_1.1.1               
 [55] vroom_1.5.5                 BSgenome_1.60.0             VariantAnnotation_1.38.0    hms_1.1.1                   MatrixGenerics_1.4.3        SummarizedExperiment_1.22.0 AnnotationFilter_1.16.0     RColorBrewer_1.1-2          yaml_2.2.1                 
 [64] curl_4.3.2                  memoise_2.0.0               gridExtra_2.3               downloader_0.4              rpart_4.1-15                latticeExtra_0.6-29         stringi_1.7.5               RSQLite_2.2.8               highr_0.9                  
 [73] BiocIO_1.2.0                checkmate_2.0.0             GenomicFeatures_1.44.2      filelock_1.0.2              BiocParallel_1.26.2         rlang_0.4.11                pkgconfig_2.0.3             matrixStats_0.61.0          bitops_1.0-7               
 [82] evaluate_0.14               lattice_0.20-45             GenomicAlignments_1.28.0    htmlwidgets_1.5.4           bit_4.0.4                   tidyselect_1.1.1            magrittr_2.0.1              R6_2.5.1                    generics_0.1.0             
 [91] Hmisc_4.6-0                 DelayedArray_0.18.0         DBI_1.1.1                   withr_2.4.2                 haven_2.4.3                 pillar_1.6.3                foreign_0.8-81              survival_3.2-13             KEGGREST_1.32.0            
[100] RCurl_1.98-1.5              nnet_7.3-16                 modelr_0.1.8                crayon_1.4.1                utf8_1.2.2                  BiocFileCache_2.0.0         rmarkdown_2.11              tzdb_0.1.2                  usethis_2.0.1              
[109] jpeg_0.1-9                  progress_1.2.2              readxl_1.3.1                data.table_1.14.2           blob_1.2.2                  reprex_2.0.1                digest_0.6.28               munsell_0.5.0              

When I use the 'custom_download' function directly like so:

#' @title Helper function to perform customized downloads
#' @description To achieve the most stable download experience,
#' ftp file downloads are customized for each operating system.
#' @param ... additional arguments that shall be passed to
#' \code{\link[downloader]{download}}
#' @author Hajk-Georg Drost
#' @noRd
custom_download <- function(url, ...) {

  if (RCurl::url.exists(url)) {
    operating_sys <- Sys.info()[1]

    if (operating_sys == "Darwin") {
      downloader::download(
        url = url, ...,
        method = "curl",
        extra = "--retry 3",
        cacheOK = FALSE,
        quiet = TRUE
      )

    }

    if (operating_sys == "Linux") {
      downloader::download(
        url = url, ...,
        method = "wget",
        extra = "--tries 3 --continue",
        cacheOK = FALSE,
        quiet = TRUE
      )
    }

    if (operating_sys == "Windows") {
      downloader::download(url = url, ...,
                           method = "internal",
                           cacheOK = FALSE,
                           quiet = TRUE)
    }
  } else {
    message(
      "The FTP link: '",url,"' seems not to be available at the moment. This might be due to an instable internet connection, a firewall issue, or wrong organism name. Could you please try to re-run the function to see whether it works now?"
    )
    return(FALSE)
  }
}

custom_download("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.gtf.gz",
                path = "data/test.gff")

Then I get this error:

Error in download.file(url, method = method, ...) : 
  formal argument "method" matched by multiple actual arguments

This corresponds to these lines in the downloader::download() function:

      suppressWarnings(download.file(url, method = method, 
        ...))
    }
    else {
      if (isR32 && capabilities("libcurl")) {
        method <- "libcurl"
      }
      else if (nzchar(Sys.which("wget")[1])) {
        method <- "wget"
      }
      else if (nzchar(Sys.which("curl")[1])) {
        method <- "curl"
        orig_extra_options <- getOption("download.file.extra")
        on.exit(options(download.file.extra = orig_extra_options))
        options(download.file.extra = paste("-L", orig_extra_options))
      }

So, there is an issue with setting 'method' in the custom_download function. I don't have a good explanation of why.

However, if I simply remove that line such that the Linux download function looks like this:

    if (operating_sys == "Linux") {
      downloader::download(
        url = url, ...,
        extra = "--tries 3 --continue",
        cacheOK = FALSE,
        quiet = TRUE
      )
    }

It works just fine.
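
A likely explanation, sketched here with hypothetical stand-in functions rather than the real packages: the excerpt above shows downloader::download() choosing `method` itself and then calling download.file(url, method = method, ...). If the caller also supplies `method` through `...`, the inner call receives the argument twice. The names `inner` and `wrapper` below are illustrative only, and the assumption that this branch is the one taken for https URLs is not confirmed in this thread.

```r
# Hypothetical stand-in for download.file(): it has a formal `method` argument.
inner <- function(url, destfile, method = "auto", ...) {
  paste("downloading", url, "with", method)
}

# Hypothetical stand-in for downloader::download(): it picks `method`
# internally and also forwards `...` to the inner function.
wrapper <- function(url, ...) {
  method <- "libcurl"                # chosen by the wrapper, not the caller
  inner(url, method = method, ...)   # `...` may carry a second `method`
}

# Works: the caller does not pass `method`, so it is supplied only once.
ok <- wrapper("https://example.org/f.gz", destfile = "f.gz")

# Fails: `method` arrives once from the wrapper and once via `...`.
err <- tryCatch(
  wrapper("https://example.org/f.gz", destfile = "f.gz", method = "wget"),
  error = function(e) conditionMessage(e)
)

print(ok)
print(err)
```

Under that assumption, dropping `method` from the outer call removes the duplicate, which would be consistent with the fix described above.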

cmatKhan added a commit to cmatKhan/biomartr that referenced this issue Oct 8, 2021
Re: Issue ropensci#76, remove the 'method' argument from custom_download. When this is included, the following error occurs:

Error in download.file(url, method = method, ...) : 
  formal argument "method" matched by multiple actual arguments
Author

cmatKhan commented Oct 8, 2021

Hm, that does work. But if I try to download the annotation files and genomes to the same directory, the unzip step fails.

This occurs if I call getGenomeSet after getGffSet to the same output directory:

Cleaning file names for more convenient downstream processing ...
Cleaning file names and unzipping files ...
Unzipping file Kafricana.gff' ...
Unzipping file Kazachstania_africana_genomic_refseq.fna.gz' ...
Unzipping file Kazachstania_naganishii_genomic_refseq.fna.gz' ...
Unzipping file Knaganishii.gff' ...
Error in decompressFile.default(filename = filename, ..., ext = ext, FUN = FUN) : 
  File already exists: /media/chase/Seagate Backup Plus Drive/yeast_genomes/KNA.fna

I could disable this by turning off the unzip function, but it is probably better to keep a list of the input files and unzip those rather than (I suspect) calling list.files and unzipping all files.
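
A minimal sketch of that idea, assuming the caller keeps a vector of the files it just downloaded. The helpers `gunzip_one` and `gunzip_tracked` are hypothetical, written in base R for illustration, and are not biomartr's actual implementation:

```r
# Hypothetical helper: stream-decompress one .gz file to `out` in chunks.
gunzip_one <- function(gz, out, chunk = 1e6) {
  src <- gzfile(gz, open = "rb")
  on.exit(close(src), add = TRUE)
  dst <- file(out, open = "wb")
  on.exit(close(dst), add = TRUE)
  repeat {
    buf <- readBin(src, what = "raw", n = chunk)
    if (length(buf) == 0) break
    writeBin(buf, dst)
  }
  invisible(out)
}

# Hypothetical helper: decompress only the files named in `files`,
# skipping any whose decompressed output already exists.
gunzip_tracked <- function(files) {
  outs <- sub("\\.gz$", "", files)
  for (i in seq_along(files)) {
    if (!file.exists(outs[i])) gunzip_one(files[i], outs[i])
  }
  outs
}

# Demo in a temporary directory that already holds output from an earlier run.
demo_dir <- tempfile("genomes_")
dir.create(demo_dir)
write_gz <- function(path, text) {
  con <- gzfile(path, open = "wt")
  writeLines(text, con)
  close(con)
}
old_gz <- file.path(demo_dir, "old.fna.gz")
new_gz <- file.path(demo_dir, "new.fna.gz")
write_gz(old_gz, ">old")
write_gz(new_gz, ">new")
invisible(file.create(file.path(demo_dir, "old.fna")))  # left by the earlier run

downloaded <- c(new_gz)                 # only what this call fetched
unzipped <- gunzip_tracked(downloaded)  # old.fna.gz is never touched
print(basename(unzipped))
```

Because only the tracked files are processed, files left in the directory by an earlier call cannot trigger a "File already exists" error.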

Also, I disagree that it is good practice to rename the files to the species name. I suggest at least keeping the accession number in the filename, e.g. GCF_12345_scervisiae.gff.gz.
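
One way such a naming scheme could look, as a sketch only: `accession_file_name` is a hypothetical helper, and the abbreviated species form (first letter of the genus plus the species epithet) is an assumption rather than anything biomartr defines.

```r
# Hypothetical helper: build "<accession>_<abbreviated species>.<ext>".
accession_file_name <- function(accession, organism, ext = "gff.gz") {
  parts <- strsplit(organism, " ", fixed = TRUE)[[1]]
  # First letter of the genus followed by the species epithet, lower-cased
  short <- tolower(paste0(substr(parts[1], 1, 1), parts[2]))
  paste0(accession, "_", short, ".", ext)
}

file_name <- accession_file_name("GCF_000146045.2", "Saccharomyces cerevisiae")
print(file_name)
```

Keeping the accession and its version suffix in the name means two assemblies of the same species can sit in one directory without overwriting each other.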

Further, my suggested fix above of removing the 'method' argument works in terms of getting the files to download. However, I am still getting this error message:

The download session seems to have timed out at the FTP site 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz'. This could be due to an overload of queries to the databases. Please restart this function to continue the data retrieval process or wait for a while before restarting this function in case your IP address was logged due to an query overload on the server side.
The genome of 'Saccharomyces_cerevisiae' has been downloaded to '/media/chase/Seagate Backup Plus Drive/yeast_genomes' and has been named 'Saccharomyces_cerevisiae_genomic_refseq.fna.gz'.     

This appears to come from an error-handling try/catch statement that is catching an error unrelated to whether the download succeeded.
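
A sketch of one alternative, not biomartr's actual logic: decide whether a download succeeded by inspecting the file on disk, and optionally comparing it with a checksum such as those listed in the md5checksums.txt file seen in the log above. The helper `download_ok` is hypothetical.

```r
# Hypothetical helper: TRUE if `destfile` exists, is non-empty, and
# (when `expected_md5` is given) matches that checksum.
download_ok <- function(destfile, expected_md5 = NULL) {
  if (!file.exists(destfile) || file.size(destfile) == 0) return(FALSE)
  if (is.null(expected_md5)) return(TRUE)
  identical(unname(tools::md5sum(destfile)), tolower(expected_md5))
}

# Demo with a local temporary file standing in for a downloaded genome.
tmp <- tempfile(fileext = ".fna")
writeLines(">seq1", tmp)
good_md5 <- unname(tools::md5sum(tmp))

print(download_ok(tmp))                          # file present and non-empty
print(download_ok(tmp, good_md5))                # checksum matches
print(download_ok(tmp, strrep("0", 32)))         # checksum mismatch
print(download_ok(tempfile(fileext = ".fna")))   # file was never written
```

With a check like this, a time-out message would only be shown when the file is actually missing or incomplete.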

HajkD added a commit that referenced this issue Nov 13, 2021
…s causing issues when downloading from `https` directed `ftp` sites (Many thanks to @cmatKhan) #76
Member

HajkD commented Nov 13, 2021

Hi @cmatKhan

Thank you so much for this amazing trouble-shooting and for finding the solutions!

I have now removed the method argument for Linux and macOS, so the issue should be resolved.

Does this work for you?

Regarding the unzip function when you download annotation into the same folder: I will look into this in detail.

Once again, thank you very much for this excellent work!

Best wishes,
Hajk

Contributor

Roleren commented Sep 27, 2023

I can confirm this fixes the issue.

This issue can now be closed.

HajkD closed this as completed Sep 27, 2023