Releases: ropensci/biomartr
Genomic Data Retrieval with R
biomartr 1.0.7
New features
Generalization of Biomart database access #108
- Generalized biomart database interface (now uses https and port 443)
- added cache for biomart database overview
- added more unit tests for listGenomes() and biomart()
Bug fixes
Genomic Data Retrieval with R
biomartr 1.0.6
New features
- New generalization work, including the internal function biomartr:::supported_biotypes(db = "refseq"), which will simplify a lot of downstream code. (#104)
- Tests are now much quicker to run, because biomartr::is.genome.available() (which is used basically everywhere) now reads files with data.table instead of readr. (#104)
Bug fixes
- Fixing bug in is.genome.available() where the skip_bacteria argument was not passed on internally to is.genome.available.refseq.genbank() (#105)
Genomic Data Retrieval with R
Package generalization
Over 5000 lines have been edited, most of them removed (#100), to generalize the package and make it safer for future development. This work is still ongoing.
- @Roleren is joining as package author and new core developer of biomartr.
New features
- Ensembl Genomes is no longer treated as a different database from Ensembl in biomaRt, since this split is artificial. It is advised to use only "ensembl" as db from now on, but "ensemblgenomes" will still work.
- Annotation used to mean gff only, but the annotation getter should cover both gff and gtf, selected via a format specification; this is now fixed and generalized.
- Added a new kingdom for ensembl: protists are now supported, with correct collection getters
- The retrieval from the UniProt database is now updated to the new API/FTP path system. Users can now retrieve proteomes using the functions getProteome(db = "uniprot", ...) and getProteomeSet(db = "uniprot", ...) (see #82)
- new function getBioSet: Generic Bio data set extractor
- new function getBio: A wrapper to all bio getters, selected with 'type' argument
- a new function getUniProtSTATS(): Retrieve UniProt Database Information File (STATS)
Power user cache
The package now supports caching of back-end files, which used to be saved to the /tmp folder (i.e. lost on computer restart).
This makes repeated retrievals faster for power users. For more info, see ?cachedir_set
Bug fixes
- Fixed many wrong URLs and non-working functions; more tests were added to make sure they work.
- Fixed fungi collection accessor for ensembl
Genomic Data Retrieval with R
Patch release to fix a major bug where retrieval stopped due to parsing issues in getAssemblySummary()
New Features
- in getSummaryFile() all columns of the assembly_summary.txt are now specified with names and correct data types (#92)
Bug Fixes
- Whenever the low-level function getKingdomAssemblySummary() was called by the get*() functions, an error occurred stating that dplyr::bind_rows() cannot join column $X35 due to differences in data types. The cause was an error in the assembly_summary.txt file for viruses, where the total gene count was stored as character and not as integer (as is the case for all other assembly_summary.txt files). This has now been resolved by parsing the correct data types with readr (#92)
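The data-type mismatch described above can be reproduced in a few lines of base R. This is a minimal sketch, not the package's implementation: the two inline tables and the column name total_gene_count are made up for illustration, and base read.delim() with colClasses stands in for the readr column specification that the actual fix uses.

```r
# Two tiny tab-separated "assembly summaries" (hypothetical data).
# In the virus table the gene count is written as "na", so a reader
# that guesses types will treat that column as character.
bacteria_txt <- "organism\ttotal_gene_count\nE. coli\t4300\n"
virus_txt    <- "organism\ttotal_gene_count\nPhage X\tna\n"

# Letting the reader guess yields different column types per file.
guess_bacteria <- read.delim(text = bacteria_txt, stringsAsFactors = FALSE)
guess_virus    <- read.delim(text = virus_txt, stringsAsFactors = FALSE)

# Declaring the column types up front makes every file parse the same way.
read_summary <- function(txt) {
  read.delim(
    text = txt,
    stringsAsFactors = FALSE,
    colClasses = c(organism = "character", total_gene_count = "integer"),
    na.strings = c("NA", "na")
  )
}

# Both tables now share one schema and can be stacked safely.
combined <- rbind(read_summary(bacteria_txt), read_summary(virus_txt))
str(combined)
```

Fixing the schema at read time keeps the combining step simple, because no per-file type repair is needed afterwards.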
Genomic Data Retrieval with R
Minor maintenance fixes to ensure smooth installation on R versions >4.0.0.
- adding pull request #88 which fixes issues with http to https curl requests (Many thanks to @Roleren)
Genomic Data Retrieval with R
biomartr 1.0.2
Overall, this new version fixes a major internet connection issue with NCBI and ENSEMBL. Users can now
reinstall the new version from CRAN, and downloads that previously failed will run
without any changes to their code.
New Functions
- New function check_annotation_biomartr() helps to check whether downloaded GFF or GTF files are corrupt.
- New function getCollectionSet() allows users to retrieve a Collection (Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats) for multiple species
Example:
# define scientific names of species for which
# collections shall be retrieved
organism_list <- c("Arabidopsis thaliana",
                   "Arabidopsis lyrata",
                   "Capsella rubella")
# download the collections of all listed species from refseq
# and store them in the folder 'set_collections'
biomartr::getCollectionSet(db = "refseq",
                           organism = organism_list,
                           path = "set_collections")
New Features
- the getGFF() function receives a new argument remove_annotation_outliers to enable users to remove corrupt lines from a GFF file
Example:
Ath_path <- biomartr::getGFF(organism = "Arabidopsis thaliana", remove_annotation_outliers = TRUE)
- the getGFFSet() function receives a new argument remove_annotation_outliers to enable users to remove corrupt lines from a GFF file
- the getGTF() function receives a new argument remove_annotation_outliers to enable users to remove corrupt lines from a GTF file
- adding a new message system to biomartr::organismBM(), biomartr::organismAttributes(), and biomartr::organismFilters() so that large API queries don't seem so unresponsive
- getCollection() receives new arguments release, remove_annotation_outliers, and gunzip that will now be passed on to downstream retrieval functions
- the getGTF(), getGenome() and getGenomeSet() functions receive a new argument assembly_type = "toplevel" to enable users to choose between toplevel and primary assembly when using the ensembl database. Setting assembly_type = "primary_assembly" will save a lot of hard drive space for people using large ensembl genomes.
- all get*() functions with a release argument now check if the ENSEMBL release is >45 (Many thanks to @Roleren #31 #61)
- in all get*() functions, readr::write_tsv(path = ) was exchanged for readr::write_tsv(file = ), since the readr package version > 1.4.0 is deprecating the path argument.
- tbl_df() was deprecated in dplyr 1.0.0 in favour of tibble::as_tibble(); organismBM() was adjusted accordingly
- custom_download(), getGENOMEREPORT(), and other download functions now set withr::local_options(timeout = max(30000000, getOption("timeout"))), which extends the default 60 sec timeout to 30000000 sec
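The timeout change in the last item can be sketched without withr. The helper with_long_timeout() below is hypothetical and not part of biomartr; it only mirrors, in base R, the pattern of raising the timeout option for the duration of one call and restoring it afterwards.

```r
# Base-R sketch of temporarily raising the download timeout.
# with_long_timeout() is a hypothetical helper, not a biomartr function.
with_long_timeout <- function(expr, seconds = 30000000) {
  # Never lower a timeout the user has already raised beyond `seconds`.
  old <- options(timeout = max(seconds, getOption("timeout")))
  # Restore the previous value when the function exits, even on error.
  on.exit(options(old), add = TRUE)
  # `expr` is evaluated lazily, i.e. only now, with the raised timeout active.
  force(expr)
}

before <- getOption("timeout")
inside <- with_long_timeout(getOption("timeout"))
after  <- getOption("timeout")
```

Taking the maximum of the new and the current value means a user who has already configured a longer timeout keeps it.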
Bug Fixes
- Fixing bug where genome availability check in getCollection() was only performed in NCBI RefSeq and not in other databases due to a constant used in is.genome.available() rather than a variable (Many thanks to Takahiro Yamada for catching the bug) #53
- fixing an issue that caused the read_cds() function to fail in data.table mode (Many thanks to Clement Kent) #57
- fixing an SSL bug that was found on Ubuntu 20.04 systems #66 (Many thanks to Håkon Tjeldnes)
- fixing global variable issue that caused clean.retrieval() to fail when no documentation file was in a meta.retrieval() folder
- The NCBI recently started adding NA values as FTP file paths in their species summary files for species without reference genomes. As a result, meta.retrieval() stopped working, because no FTP paths were found for some species. This issue is now fixed by adding the filter rule !is.na(ftp_path) to all get*() functions (Many thanks to Ashok Kumar Sharma #34 and Dominik Merges #72 for making me aware of this issue)
- Fixing an issue in custom_download() where the method argument was causing issues when downloading from https directed ftp sites (Many thanks to @cmatKhan) #76
- Fixing issue when trying to combine multiple summary-stats files where NA's were present in the list item that was passed along for combination in meta.retrieval() #73 (Many thanks to Dominik Merges)
- Fixing a bug in download.database.all() where the listed file *-metadata.json was not removed, which corrupted the download process (Many thanks to Jaruwatana Lotharukpong)
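The filter rule !is.na(ftp_path) mentioned in the NCBI fix above can be illustrated on a toy table. This is only a sketch: the species names and FTP paths are made up, and the real summary files contain many more columns.

```r
# Hypothetical excerpt of a species summary table in which one species
# has no reference genome and therefore an NA FTP path.
summary_tbl <- data.frame(
  organism_name = c("Species A", "Species B", "Species C"),
  ftp_path = c("ftp://example.org/a", NA, "ftp://example.org/c"),
  stringsAsFactors = FALSE
)

# Keep only the rows that have a usable FTP path before any download starts.
usable <- summary_tbl[!is.na(summary_tbl$ftp_path), ]
print(usable$organism_name)
```

Dropping such rows up front means the download loop never sees an entry without a path.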
Genomic Data Retrieval
Minor updates to comply with CRAN policy.
Genomic Data Retrieval
Please be aware that as of April 2019, ENSEMBLGENOMES was retired. Hence, all biomartr functions were updated
and won't support data retrieval from ENSEMBLGENOMES servers anymore.
New Functions
- New function clean.retrieval() enables formatting and automatic unzipping of meta.retrieval output (find out more here: https://ropensci.github.io/biomartr/articles/MetaGenome_Retrieval.html#un-zipping-downloaded-files)
- New function getGenomeSet() allows users to easily retrieve genomes of multiple specified species. In addition, the genome summary statistics for all retrieved species will be stored as well to provide users with insights regarding the genome assembly quality of each species. This file can be used as a Supplementary Information file in publications to facilitate reproducible research.
- New function getProteomeSet() allows users to easily retrieve proteomes of multiple specified species
- New function getCDSSet() allows users to easily retrieve coding sequences of multiple specified species
- New function getGFFSet() allows users to easily retrieve GFF annotation files of multiple specified species
- New function getRNASet() allows users to easily retrieve RNA sequences of multiple specified species
- New function summary_genome() allows users to retrieve summary statistics for a genome assembly file to assess the influence of genome assembly qualities when performing comparative genomics tasks
- New function summary_cds() allows users to retrieve summary statistics for a coding sequence (CDS) file. We noticed that many CDS files stored in NCBI or ENSEMBL databases contain sequences whose length isn't divisible by 3 (division into codons). This makes it difficult to divide CDS into codons for e.g. codon alignments or translation into protein sequences. In addition, some CDS files contain a significant number of sequences that do not start with AUG (start codon). This function enables users to quantify how many of these sequences exist in a downloaded CDS file so that the file can be processed according to the analyses at hand.
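The two CDS checks that motivate summary_cds() can be sketched in base R on plain character strings. This is not the summary_cds() implementation; the three sequences are invented, and the start codon is tested as ATG, the DNA form of AUG, since CDS files store DNA sequences.

```r
# Three made-up coding sequences: one well-formed, one whose length is
# not a multiple of 3, and one that does not start with ATG.
cds <- c(seq1 = "ATGGCCTAA", seq2 = "ATGGCCTA", seq3 = "GTGGCCTAA")

# Sequences that cannot be split cleanly into codons.
not_multiple_of_3 <- nchar(cds) %% 3 != 0

# Sequences that lack the canonical start codon.
no_atg_start <- substr(toupper(cds), 1, 3) != "ATG"

# One row per sequence, so problematic entries can be filtered later.
cds_summary <- data.frame(
  id = names(cds),
  length = nchar(cds),
  not_multiple_of_3 = not_multiple_of_3,
  no_atg_start = no_atg_start,
  row.names = NULL
)
print(cds_summary)
```

Keeping one flag per sequence rather than only totals makes it straightforward to exclude the affected sequences before codon alignment or translation.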
New Features of Existing Functions
- the default value of argument reference in meta.retrieval() changed from reference = TRUE to reference = FALSE. This way all genomes (reference AND non-reference) will be downloaded by default. This is what users seem to prefer.
- getCollection() now also retrieves GTF files when db = 'ensembl'
- getAssemblyStats() now also performs md5 checksum test
- all md5 checksum tests now retrieve the new md5checkfile format from NCBI RefSeq and Genbank
- getGTF(): users can now specify the NCBI Taxonomy ID or Accession ID in addition to the scientific name in argument 'organism' to retrieve genome assemblies
- getGFF(): users can now specify the NCBI Taxonomy ID or Accession ID for ENSEMBL in addition to the scientific name in argument 'organism' to retrieve genome assemblies
- getMarts() will now throw an error when BioMart servers cannot be reached (#36)
- getGenome() now also stores the genome summary statistics (see ?summary_genome()) for the retrieved species in the documentation folder to provide users with insights regarding the genome assembly quality
- In all get*() functions the default for argument reference is now changed from reference = TRUE to reference = FALSE (= new default)
- all get*() functions received a new argument release which allows users to retrieve specific release versions of genomes, proteomes, etc. from ENSEMBL and ENSEMBLGENOMES
- all get*() functions received two new arguments clean_retrieval and gunzip which allow users to unzip the downloaded files directly in the get*() function call and rename the file for more convenient downstream analyses
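The md5 checksum test mentioned in the list above follows a simple pattern: compute the checksum of the downloaded file and compare it with the published value. The sketch below uses tools::md5sum() from base R on a temporary file. The helper verify_md5() is hypothetical, and the expected checksum is computed locally as a stand-in for a value that would normally come from the NCBI checksum file.

```r
# Create a small stand-in for a downloaded file.
tmp <- tempfile(fileext = ".txt")
writeLines("ACGT", tmp)

# Stand-in for the published checksum (normally read from a checksum file).
expected <- unname(tools::md5sum(tmp))

# verify_md5() is a hypothetical helper, not a biomartr function.
verify_md5 <- function(file, expected) {
  identical(unname(tools::md5sum(file)), expected)
}

ok <- verify_md5(tmp, expected)

# Simulate a corrupted download by changing the file content.
writeLines("ACGA", tmp)
corrupted_ok <- verify_md5(tmp, expected)
```

A failed comparison is the signal to discard the file and retry the download instead of passing a truncated file downstream.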
v0.8.0
v0.7.0
Function changes:
- the function meta.retrieval() will now pick up the download at the organism where it left off and will report which species have already been retrieved
- all get*() functions and the meta.retrieval() function receive a new argument reference which allows users to retrieve non-reference or non-representative genome versions when downloading from NCBI RefSeq or NCBI Genbank
- the argument order in meta.retrieval() changed from meta.retrieval(kingdom, group, db, ...) to meta.retrieval(db, kingdom, group, ...) to make the argument order more consistent with the get*() functions
- the argument order in getGroups() changed from getGroups(kingdom, db) to getGroups(db, kingdom) to make the argument order more consistent with the get*() and meta.retrieval() functions
New Functions:
- new internal functions existingOrganisms() and existingOrganisms_ensembl() which check the organisms that have already been downloaded
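The resume behaviour of meta.retrieval() described above amounts to comparing the requested organisms with what is already on disk. The following base-R sketch shows that idea only. It is not the code of existingOrganisms(); the helper already_downloaded() and the file naming scheme (organism name with underscores as file prefix) are assumptions made for illustration.

```r
# Organisms the user asked for.
requested <- c("Arabidopsis thaliana", "Arabidopsis lyrata", "Capsella rubella")

# Simulate a download folder in which one organism is already present.
download_dir <- tempfile("retrieval_")
dir.create(download_dir)
file.create(file.path(download_dir, "Arabidopsis_thaliana_genomic.fna.gz"))

# already_downloaded() is a hypothetical helper; it assumes that file names
# start with the organism name, with spaces replaced by underscores.
already_downloaded <- function(dir, organisms) {
  prefixes <- gsub(" ", "_", organisms)
  files <- list.files(dir)
  found <- vapply(prefixes, function(p) any(startsWith(files, p)), logical(1))
  organisms[found]
}

done <- already_downloaded(download_dir, requested)
todo <- setdiff(requested, done)

message("Already retrieved: ", paste(done, collapse = ", "))
message("Still to download: ", paste(todo, collapse = ", "))
```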