Further generalization of the package internals; see the function biomartr:::supported_biotypes(db = "refseq"),
which will simplify a lot of downstream code. (#104)
Tests now run much faster, because biomartr::is.genome.available() (which is used almost everywhere) now reads files with data.table instead of readr. (#104)

Bug fixes

Fixing bug in is.genome.available() where the skip_bacteria argument was not passed on internally to is.genome.available.refseq.genbank() (#105)
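
The class of bug described above (a wrapper accepting an argument but not forwarding it) can be illustrated with a toy sketch. All function names below are hypothetical and are not biomartr's internals; the sketch only shows why the user's setting was silently ignored.

```r
# Hypothetical low-level check; the names are illustrative, not biomartr's.
check_low_level <- function(organism, skip_bacteria = TRUE) {
  if (skip_bacteria) "bacteria skipped" else "bacteria included"
}

# Buggy wrapper: accepts skip_bacteria but never forwards it,
# so the low-level default (TRUE) always wins.
check_buggy <- function(organism, skip_bacteria = TRUE) {
  check_low_level(organism)
}

# Fixed wrapper: forwards the argument explicitly.
check_fixed <- function(organism, skip_bacteria = TRUE) {
  check_low_level(organism, skip_bacteria = skip_bacteria)
}

buggy_result <- check_buggy("Escherichia coli", skip_bacteria = FALSE)
fixed_result <- check_fixed("Escherichia coli", skip_bacteria = FALSE)
```

With the buggy wrapper the caller's skip_bacteria = FALSE has no effect; the fixed wrapper honours it.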

04 Oct 14:35

HajkD

v1.0.5

42c0ae0

Genomic Data Retrieval with R

Package generalization

Over 5000 lines have been edited, most of them removed (#100), to generalize the package and make it safer for future
development. This process is still ongoing.

@Roleren is joining as package author and new core developer of biomartr.

New features

Ensembl genomes is no longer a different database compared to ensembl in biomaRt, since this split is artificial.
It is advised to use only "ensembl" as db from now on, but "ensemblgenomes" will still work.
Annotation retrieval previously meant GFF only, but it should cover both GFF and GTF, selected with a format specification. This is now fixed and generalized.
Added a new kingdom for ensembl: protists support with correct collection getters
The retrieval from the UniProt database is now updated to the new API/FTP path system. Users can now
retrieve proteomes using the functions getProteome(db = "uniprot", ...) and getProteomeSet(db = "uniprot", ...) (see #82)
New function getBioSet: Generic Bio data set extractor
New function getBio: A wrapper to all bio getters, selected with the 'type' argument
New function getUniProtSTATS(): Retrieve UniProt Database Information File (STATS)

Power user cache

The package now supports caching of back-end files, which used to be saved to the /tmp folder (i.e. lost on computer restart).
This makes it easy for power users to get higher speed. For more info, see ?cachedir_set

Bug fixes

Fixed many wrong URLs and non-working functions; more tests were added to make sure they work.
Fixed fungi collection accessor for ensembl

Contributors

Roleren

19 Jun 17:59

HajkD

v1.0.4

3b7549a

Genomic Data Retrieval with R

Patch release to fix a major bug where retrieval stopped due to parsing issues in `getAssemblySummary()`

New Features

in getSummaryFile() all columns of the assembly_summary.txt are now specified with names and correct data types (#92)

Bug Fixes

all get*() functions call the low-level function getKingdomAssemblySummary(). Because the assembly_summary.txt file for viruses stored the total gene count as character and not as integer (as is the case for all other assembly_summary.txt files), an error occurred stating that dplyr::bind_rows() cannot join column $X35 due to differences in data types. This has now been resolved by parsing the correct data types with readr (#92)
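
The type mismatch described above can be reproduced on toy data. The sketch below is not biomartr's internal code: the two inline tables and their column names are made up for illustration, and it assumes that dplyr and readr (version 2.0 or later, for I() and show_col_types) are installed.

```r
library(readr)
library(dplyr)

# Two toy "assembly summary" tables (hypothetical columns, inline data).
# In the second table the gene count holds a non-numeric entry, so readr
# guesses <character> for it, while the first is guessed as a number.
bacteria_txt <- "organism_name\tgene_count\nEscherichia coli\t4300\n"
virus_txt    <- "organism_name\tgene_count\nSome virus\tna\n"

guessed_bacteria <- read_tsv(I(bacteria_txt), show_col_types = FALSE)
guessed_virus    <- read_tsv(I(virus_txt), show_col_types = FALSE)

# Combining the guessed tables fails because the column types differ.
mismatch <- tryCatch(
  bind_rows(guessed_bacteria, guessed_virus),
  error = function(e) "type mismatch"
)

# Declaring the column types up front (and which strings mean "missing")
# makes both tables compatible.
spec <- cols(organism_name = col_character(), gene_count = col_integer())
missing_strings <- c("", "NA", "na")
typed_bacteria <- read_tsv(I(bacteria_txt), col_types = spec, na = missing_strings)
typed_virus    <- read_tsv(I(virus_txt), col_types = spec, na = missing_strings)

combined <- bind_rows(typed_bacteria, typed_virus)
```

Specifying the types once, rather than relying on per-file guessing, keeps every kingdom's table compatible regardless of what an individual file happens to contain.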

08 May 14:00

HajkD

v1.0.3

c1e1f6e

Genomic Data Retrieval with R

Minor maintenance fixes to ensure smooth installation on R versions >4.0.0.

adding pull request #88 which fixes issues with http to https curl requests (Many thanks to @Roleren)

Contributors

Roleren

22 Feb 16:38

HajkD

v1.0.2

4c18111

Genomic Data Retrieval with R

biomartr 1.0.2

Overall, this new version fixes a major internet connection issue with NCBI and ENSEMBL. Users can now
reinstall the new version from CRAN, and downloads that previously failed should now run
without any changes to their code.

New Functions

New function check_annotation_biomartr() helps to check whether downloaded GFF or GTF files are corrupt. Find more details here
new function getCollectionSet() allows users to retrieve a Collection: Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats of multiple species

Example:

# define scientific names of species for which
# collections shall be retrieved
organism_list <- c("Arabidopsis thaliana",
                   "Arabidopsis lyrata",
                   "Capsella rubella")
# download the collections of all listed species from refseq
# and store them in the folder 'set_collections'
getCollectionSet(db       = "refseq",
                 organism = organism_list,
                 path     = "set_collections")

New Features

the getGFF() function receives a new argument remove_annotation_outliers to enable users to remove corrupt lines from a GFF file
Example:

Ath_path <- biomartr::getGFF(organism = "Arabidopsis thaliana", remove_annotation_outliers = TRUE)

the getGFFSet() function receives a new argument remove_annotation_outliers to enable users to remove corrupt lines from a GFF file
the getGTF() function receives a new argument remove_annotation_outliers to enable users to remove corrupt lines from a GTF file
adding a new message system to biomartr::organismBM(), biomartr::organismAttributes(), and biomartr::organismFilters() so that large API queries don't seem so unresponsive
getCollection() receives new arguments release, remove_annotation_outliers, and gunzip that will now be passed on to downstream retrieval functions
the getGTF(), getGenome() and getGenomeSet() functions receive a new argument assembly_type = "toplevel" to enable users to choose between toplevel and primary assembly when using the ensembl database. Setting assembly_type = "primary_assembly" will save a lot of space on hard drives for people using large ensembl genomes.
all get*() functions with release argument now check if the ENSEMBL release is >45 (Many thanks to @Roleren #31 #61)
in all get*() functions, readr::write_tsv(path = ) was replaced with readr::write_tsv(file = ), since readr versions > 1.4.0 are deprecating the path argument.
tbl_df() was deprecated in dplyr 1.0.0 in favour of tibble::as_tibble(); organismBM() was adjusted accordingly
custom_download(), getGENOMEREPORT(), and other download functions now set withr::local_options(timeout = max(30000000, getOption("timeout"))), which extends the default 60 sec timeout to 30000000 sec
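
The timeout pattern from the last item can be sketched as follows. The helper name with_long_timeout() and the 3000 second value are assumptions chosen for illustration, not part of biomartr; the sketch only relies on withr::local_options(), which restores the previous option when the calling function exits.

```r
library(withr)

# Hypothetical helper: run a download-like task with a longer timeout.
# The 'timeout' option is restored automatically when the function exits.
with_long_timeout <- function(task, seconds = 3000) {
  local_options(list(timeout = max(seconds, getOption("timeout"))))
  task()
}

before <- getOption("timeout")
inside <- with_long_timeout(function() getOption("timeout"))
after  <- getOption("timeout")
```

Using max() means a user who has already configured an even longer timeout is never overridden with a shorter one.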

Bug Fixes

Fixing bug where genome availability check in getCollection() was only performed in NCBI RefSeq and not in other databases due to a constant used in is.genome.available() rather than a variable (Many thanks to Takahiro Yamada for catching the bug) #53
fixing an issue that caused the read_cds() function to fail in data.table mode (Many thanks to Clement Kent) #57
fixing an SSL bug that was found on Ubuntu 20.04 systems #66 (Many thanks to Håkon Tjeldnes)
fixing global variable issue that caused clean.retrieval() to fail when no documentation file was in a meta.retrieval() folder
The NCBI recently started adding NA values as FTP file paths in their species summary files for species without reference genomes. As a result, meta.retrieval() stopped working, because no FTP paths were found for some species. This issue has now been fixed by adding the filter rule !is.na(ftp_path) to all get*() functions (Many thanks to Ashok Kumar Sharma #34 and Dominik Merges #72 for making me aware of this issue)
Fixing an issue in custom_download() where the method argument was causing issues when downloading from https directed ftp sites (Many thanks to @cmatKhan) #76
Fixing issue when trying to combine multiple summary-stats files where NA's were present in the list item that was passed along for combination in meta.retrieval() #73 (Many thanks to Dominik Merges)
Fixing a bug in download.database.all() where the lack of removing listed file *-metadata.json caused corruption of the download process (Many thanks to Jaruwatana Lotharukpong)
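
The !is.na(ftp_path) filter rule mentioned in the bug fixes above can be illustrated on a toy table. The data below are made up and only the ftp_path column name comes from the release notes; this is a sketch of the filter itself, assuming dplyr is installed, not of biomartr's internal code.

```r
library(dplyr)

# Toy stand-in for an NCBI assembly summary table (values are made up).
summary_tbl <- tibble(
  organism_name = c("Species A", "Species B", "Species C"),
  ftp_path      = c("ftp://example.org/a", NA, "ftp://example.org/c")
)

# Keep only entries that actually provide an FTP path, so that later
# download steps never receive an NA path.
downloadable <- filter(summary_tbl, !is.na(ftp_path))
```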

Contributors

Roleren and cmatKhan

12 Dec 14:40

HajkD

v0.9.1

76f279b

Genomic Data Retrieval

Minor updates to comply with CRAN policy.

19 May 21:57

HajkD

v0.9.0

12fb224

Genomic Data Retrieval

Please be aware that as of April 2019, ENSEMBLGENOMES
was retired (see details here). Hence, all biomartr functions were updated
and won't support data retrieval from ENSEMBLGENOMES servers anymore.

New Functions

New function clean.retrieval() enables formatting and automatic unzipping of meta.retrieval output (find out more here: https://ropensci.github.io/biomartr/articles/MetaGenome_Retrieval.html#un-zipping-downloaded-files)
New function getGenomeSet() allows users to easily retrieve genomes of multiple specified species.
In addition, the genome summary statistics for all retrieved species will be stored as well to provide
users with insights regarding the genome assembly quality of each species. This file can be used as Supplementary Information file
in publications to facilitate reproducible research.
New function getProteomeSet() allows users to easily retrieve proteomes of multiple specified species
New function getCDSSet() allows users to easily retrieve coding sequences of multiple specified species
New function getGFFSet() allows users to easily retrieve GFF annotation files of multiple specified species
New function getRNASet() allows users to easily retrieve RNA sequences of multiple specified species
New function summary_genome() allows users to retrieve summary statistics for a genome assembly file to assess
the influence of genome assembly qualities when performing comparative genomics tasks
New function summary_cds() allows users to retrieve summary statistics for a coding sequence (CDS) file.
We noticed that many CDS files stored in NCBI or ENSEMBL databases contain sequences whose length is not divisible by 3 (division into codons).
This makes it difficult to divide CDS into codons for e.g. codon alignments or translation into protein sequences. In
addition, some CDS files contain a significant amount of sequences that do not start with AUG (start codon).
This function enables users to quantify how many of these sequences exist in a downloaded CDS file to process
these files according to the analyses at hand.
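
The two checks described for summary_cds() can be sketched in base R on a few made-up sequences. This is not the package's implementation and does not reproduce its output format; it also assumes the sequences use the DNA alphabet, so the start codon is tested as ATG rather than AUG.

```r
# Toy coding sequences (made up); real input would come from a CDS FASTA file.
cds <- c(
  gene1 = "ATGGCCTAA",  # 9 nt, starts with ATG
  gene2 = "ATGGCCTA",   # 8 nt, length not divisible by 3
  gene3 = "GCCATGTAA",  # 9 nt, does not start with ATG
  gene4 = "CTGAA"       # 5 nt, fails both checks
)

# Flag sequences that cannot be split cleanly into codons.
not_multiple_of_3 <- nchar(cds) %% 3 != 0

# Flag sequences that do not begin with the start codon (DNA alphabet assumed).
no_start_codon <- !startsWith(toupper(cds), "ATG")

cds_report <- c(
  total             = length(cds),
  not_multiple_of_3 = sum(not_multiple_of_3),
  no_start_codon    = sum(no_start_codon)
)
```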

New Features of Existing Functions

the default value of argument reference in meta.retrieval() changed from reference = TRUE to reference = FALSE.
This way all genomes (reference AND non-reference) will be downloaded by default. This is what users seem to prefer.
getCollection() now also retrieves GTF files when db = 'ensembl'
getAssemblyStats() now also performs md5 checksum test
all md5 checksum tests now retrieve the new md5checkfile format from NCBI RefSeq and Genbank
getGTF(): users can now specify the NCBI Taxonomy ID or Accession ID in addition to the scientific name in argument 'organism' to retrieve genome assemblies
getGFF(): users can now specify the NCBI Taxonomy ID or Accession ID for ENSEMBL in addition to the scientific name in argument 'organism' to retrieve genome assemblies
getMarts() will now throw an error when BioMart servers cannot be reached (#36)
getGenome() now also stores the genome summary statistics (see ?summary_genome()) for the retrieved species in the documentation folder to provide
users with insights regarding the genome assembly quality
In all get*() functions the default for argument reference changed from reference = TRUE to reference = FALSE (= new default)
all get*() functions now receive a new argument release which allows users to retrieve
specific release versions of genomes, proteomes, etc. from ENSEMBL and ENSEMBLGENOMES
all get*() functions receive two new arguments clean_retrieval and gunzip which
allow users to unzip the downloaded files directly in the get*() function call and rename
the file for more convenient downstream analyses

27 Jun 20:42

HajkD

v0.8.0

fa6dc0e

v0.8.0

tag new release

17 Jan 10:58

HajkD

v0.7.0

89d4c90

v0.7.0

Function changes:

the function meta.retrieval() will now pick up the download at the organism where it left off and will report which species have already been retrieved
all get*() functions and the meta.retrieval() function receive a new argument reference which allows users to retrieve non-reference or non-representative genome versions when downloading from NCBI RefSeq or NCBI Genbank
the argument order in meta.retrieval() changed from meta.retrieval(kingdom, group, db, ...) to meta.retrieval(db,kingdom, group, ...) to make the argument order more consistent with the get*() functions
the argument order in getGroups() changed from getGroups(kingdom, db) to getGroups(db, kingdom) to make the argument order more consistent with the get*() and meta.retrieval() functions

New Functions:

new internal functions existingOrganisms() and existingOrganisms_ensembl() which check the organisms that have already been downloaded

Releases: ropensci/biomartr