
Download all datasets contained in all R-packages #185

Closed
giuseppec opened this issue Mar 17, 2016 · 7 comments

Comments

@giuseppec
Member

We could do something like the following (admittedly ugly code) and then upload everything:

# install all packages
# pkg = available.packages()
# for(i in 1:nrow(pkg)) install.packages(pkgs = pkg[i,"Package"])

# get table of all available data sets from all available packages
rdat = data(package = .packages(all.available = TRUE))
rdat = as.data.frame(rdat$results)
rdat$Package.Version = sapply(rdat$Package, function(x) as.character(packageVersion(x)))
# remove data set names that contain parentheses, e.g. "beaver1 (beavers)"
rdat = rdat[!grepl("\\(|\\)", rdat$Item),]

rdat.unique = unique(rdat[,c("Package", "Package.Version")])
ret = setNames(vector("list", nrow(rdat.unique)), 
  paste0(rdat.unique$Package, "_", rdat.unique$Package.Version))

for(i in seq_along(rdat.unique$Package)) {
  # get all data set names from package 'i'
  dat.names = as.character(subset(rdat, Package == rdat.unique$Package[i])[,"Item"])
  data(list = dat.names, package = as.character(rdat.unique$Package[i]))

  ret[[i]] = setNames(lapply(dat.names, function(dn) {
    return(tryCatch(get(dn), error = function(e) e, warning = function(w) w))
  }), dat.names)

  loaded.pkg = setdiff(loadedNamespaces(),
    c("stats", "graphics", "grDevices", "utils", "datasets", "methods", "base", "tools"))
  lapply(loaded.pkg, function(x) try(unloadNamespace(x)))

  cat(as.character(rdat.unique$Package[i]), ": all data sets downloaded", fill = TRUE)
}
@HeidiSeibold
Member

I asked on Twitter if there are ways to do this without having to install the packages.

This is the best answer I got:
https://twitter.com/GaborCsardi/status/725776034910617600

Seems pretty promising 😃

@HeidiSeibold
Member

http://vincentarelbundock.github.io/Rdatasets/datasets.html

@jakobbossek
Collaborator

Thanks Heidi!
Tried the approach proposed by Gabor. It is pretty easy with his gh package and GitHub's fantastic code search API to obtain a list of all .rda files inside the cran mirror organization:

devtools::install_github("gaborcsardi/gh")
library(gh)
repos = gh("GET /search/code?q=user:cran+extension:rda")
cat(sprintf("#Repos: %i\n", repos$total_count))

This way we can download only the .rda files, e.g., via repos$items[[i]]$html_url. However, there is no easy way to access metadata for a data set, e.g., its description, default target feature, citations etc. One possibility is to download the corresponding Rd docs as well and parse them. Uploading data to OpenML without at least a meaningful description seems useless to me.
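For the download step, a minimal sketch could look like this. It assumes the html_url returned by the search API can be rewritten to the raw.githubusercontent.com form (an assumption about GitHub's URL layout, not part of the gh API itself):

```r
library(gh)

# search for .rda files in the read-only cran mirror organization
repos = gh("GET /search/code?q=user:cran+extension:rda")
item = repos$items[[1]]

# html_url points at the HTML viewer; rewrite it to the raw file URL
# (assumed URL scheme: github.com/<owner>/<repo>/blob/... -> raw.githubusercontent.com/<owner>/<repo>/...)
raw.url = sub("https://github.com/([^/]+/[^/]+)/blob/",
  "https://raw.githubusercontent.com/\\1/", item$html_url)

dest = file.path(tempdir(), item$name)
download.file(raw.url, destfile = dest, mode = "wb")

# load into a fresh environment so we can enumerate the contained objects
e = new.env()
obj.names = load(dest, envir = e)
print(obj.names)
```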

Another point: why should we avoid having the crawler download all packages? Is it because of time and memory? We could simply download each package, extract its data sets, upload them to OpenML and remove the package afterwards. The time aspect is unimportant; the crawler does not need to be fast.
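That download-extract-remove loop could be sketched as follows. The upload step is a hypothetical placeholder (upload_to_openml is not a real function here), and only a few packages are processed for illustration:

```r
pkgs = available.packages()[, "Package"]
tmp = tempdir()

for (p in pkgs[1:3]) {  # restrict to a few packages for illustration
  # download the source tarball without installing the package
  dl = download.packages(p, destdir = tmp, type = "source")
  tarball = dl[1, 2]

  # extract the tarball; data sets live in the package's data/ directory
  untar(tarball, exdir = tmp)
  rda.files = list.files(file.path(tmp, p, "data"),
    pattern = "\\.(rda|RData)$", full.names = TRUE)

  for (f in rda.files) {
    e = new.env()
    obj.names = load(f, envir = e)  # names of the objects in the file
    # upload_to_openml(mget(obj.names, envir = e))  # placeholder upload step
  }

  # clean up so disk usage stays bounded
  unlink(c(tarball, file.path(tmp, p)), recursive = TRUE)
}
```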

@jakobbossek
Collaborator

Started to work on a crawler which operates on the github cran repositories and reads 1) the data itself and 2) metadata from the corresponding Rd file. Works well so far. Just need to parallelize stuff and handle potential errors.
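The Rd metadata part can be done with base R's tools package. A minimal sketch, assuming a locally extracted Rd file at the hypothetical path man/airquality.Rd:

```r
library(tools)

# parse the Rd file into a list of top-level sections
rd = parse_Rd("man/airquality.Rd")
tags = sapply(rd, function(x) attr(x, "Rd_tag"))

# pull out the \title and \description sections as plain text
title = paste(unlist(rd[[which(tags == "\\title")]]), collapse = "")
descr = paste(unlist(rd[[which(tags == "\\description")]]), collapse = "")
cat("Title:", trimws(title), "\n")
```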

@cvitolo

cvitolo commented May 25, 2017

@jakobbossek I'd love to see the results of your crawler/experiment. Did you publish it?

@giuseppec
Member Author

A huge collection can be found here http://vincentarelbundock.github.io/Rdatasets/datasets.html

@giuseppec
Member Author

Closing, as this should not be part of the R package. It's a separate project, i.e., writing a bot that crawls data sets and uploads them to OpenML.
