Future of RDatasets.jl #141
Comments
I don't really know Julia, but in many languages the general approach is to create separate repositories for the new data and link them from the original, so that a downstream consumer can easily choose to get just one thing or everything. |
There are surely different approaches. It would 100% be possible to replace or supplement this package with one that lazily downloads data from CRAN. The advantage of packaging the data as is currently done is that no network connections are made at runtime, so it is more reliable. There is also no need for the hassle of cache directories and the like. We also don't have to worry about CRAN going down, accidentally hammering CRAN, or CRAN mirror selection. On the other hand, size would become a concern once there are lots of very large datasets. My suggestion here was more along the lines of a minimal move from the current approach to something which can also support other Julia orgs where there is a lot of relevant data on CRAN. |
I have made a minimal initial version of the lazy downloader at https://github.com/frankier/RDataGet.jl. I haven't yet registered the package. Is there interest in transferring this to JuliaStats (keeping me as maintainer)? There is also the possibility of combining this lazy approach with the multiple dataset repo approach, so that different domain orgs can keep repos of datasets without needing to hit CRAN, while everything on CRAN remains available for demos/examples. |
My preference, as a very light user, would be for the top [n] most commonly used datasets (e.g. penguins, iris, etc.) to be included, and for the package to lazily load the rest that are available. |
Interesting. I imagine we could support both "standard" datasets like now and also download additional datasets from CRAN. Have you considered using Pkg artifacts or DataDeps.jl for that? They sound like the right tool for this task. |
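The "standard datasets bundled, everything else fetched on demand" behaviour suggested above can be sketched in a language-agnostic way. The following is a minimal illustration in Python, not the RDatasets.jl or RDataGet.jl implementation: `DatasetResolver`, its `fetch` callback and the cache layout are all hypothetical names introduced here, and the actual download from CRAN is deliberately left to the caller.

```python
from pathlib import Path
from typing import Callable, Dict, Tuple

Key = Tuple[str, str]  # (R package name, dataset name)


class DatasetResolver:
    """Hypothetical resolver: bundled datasets first, lazy fetch otherwise."""

    def __init__(
        self,
        bundled: Dict[Key, bytes],
        fetch: Callable[[str, str], bytes],
        cache_dir: Path,
    ) -> None:
        self.bundled = bundled          # data shipped with the package
        self.fetch = fetch              # assumed downloader, supplied by the caller
        self.cache_dir = Path(cache_dir)

    def load(self, package: str, name: str) -> bytes:
        key = (package, name)
        # 1. Bundled data never touches the network.
        if key in self.bundled:
            return self.bundled[key]
        # 2. Previously downloaded data is served from the local cache.
        cached = self.cache_dir / package / f"{name}.bin"
        if cached.exists():
            return cached.read_bytes()
        # 3. Everything else is fetched once and then cached.
        data = self.fetch(package, name)
        cached.parent.mkdir(parents=True, exist_ok=True)
        cached.write_bytes(data)
        return data
```

With this shape, the reliability argument for bundled data still holds for the core set, and only the long tail depends on the network and a cache directory.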
Yes, I think this would be possible and the nicest default behavior. There are some edge cases: for example, CRAN packages have multiple versions whereas bundled data has only a single version. Still, I think a reasonable default would be to use the bundled version where there is one and otherwise fetch the newest version from CRAN, while making it possible to request a specific version of any dataset on CRAN if needed.
Okay, great to hear you are interested! I did take a look at both, but as I understand it, both are really about referring to a static/fixed set of resources. On the other hand, there is the potential for allowing users to specify caching periods beyond a single Julia session, in which case we need some place to store the dataset. Some searching turned up https://github.com/JuliaPackaging/Scratch.jl, which provides per-package data directories, and https://github.com/chengchingwen/OhMyArtifacts.jl, which allows dynamic artifacts to be stored in them. Would you be able to review a pull request adding this lazy downloading functionality to RDatasets.jl? Do you think this should build on Scratch.jl or OhMyArtifacts.jl? |
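The idea of a user-specified caching period that outlives a single session can be illustrated with a small sketch. This is Python rather than Julia and does not use Scratch.jl or OhMyArtifacts.jl; `cached_fetch`, `max_age_seconds` and the injectable `now` clock are hypothetical names chosen for the example, and the cache age is simply taken from the file's modification time.

```python
import time
from pathlib import Path
from typing import Callable, Optional


def cached_fetch(
    path: Path,
    fetch: Callable[[], bytes],
    max_age_seconds: Optional[float],
    now: Callable[[], float] = time.time,
) -> bytes:
    """Return the cached file if it is fresh enough, else re-fetch and store it.

    max_age_seconds=None means "cache forever".
    """
    if path.exists():
        age = now() - path.stat().st_mtime
        if max_age_seconds is None or age <= max_age_seconds:
            return path.read_bytes()
    # Missing or stale: download again and overwrite the cached copy.
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return data
```

Because the cache lives on disk, the period applies across sessions; the trade-off is that a refreshed copy may differ from the one used earlier.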
OhMyArtifacts.jl seems interesting. Scratch.jl is intended for data that is modified locally, which isn't the case here. Feel free to make a PR and I (or others) will try to review it. |
I came to this repo wondering if it had been "artifactized" yet and then found this discussion. Artifacts seem like a better fit in principle: these data sets are immutable, can be content-addressed, and can be shared with any packages that want to use them. Serving them as artifacts will also allow our package server system to cache and distribute them globally and ensure reproducibility in case the upstream data sets are modified over time. It's not uncommon for people hosting files to move them, delete them or silently modify them. Artifacts can also be marked as lazy, which will cause them to be downloaded on demand rather than eagerly. The only issue I can think of with artifacts is that they will not get "garbage collected" unless all the packages that refer to them get garbage collected by the package manager. That means that once you use a lazy dataset artifact referred to by RDatasets, it will stay on your system forever unless you manually go in and delete it from the […] I think the best option might be to just use a mix of eager and lazy artifacts (eager for datasets you want to download by default and lazy for ones you want to provide on demand) and provide an API from RDatasets for cleaning up artifacts. |
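To make "immutable and content-addressed" concrete, here is a minimal sketch of a content-addressed store with an explicit cleanup call. It is an illustration in Python only: `ContentStore` and its methods are hypothetical, SHA-256 of the raw bytes is used as a stand-in for the git tree hash that Julia artifacts are actually identified by, and nothing here reflects how Pkg stores or garbage-collects artifacts.

```python
import hashlib
from pathlib import Path
from typing import List, Set


class ContentStore:
    """Hypothetical store in which each blob is named by the hash of its bytes."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        # The identifier is derived from the content, so identical data
        # always maps to the same entry and entries never need updating.
        digest = hashlib.sha256(data).hexdigest()
        path = self.root / digest
        if not path.exists():
            path.write_bytes(data)
        return digest

    def get(self, digest: str) -> bytes:
        data = (self.root / digest).read_bytes()
        # Re-hash on read: silent modification is detected, not served.
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError(f"content of {digest} does not match its hash")
        return data

    def cleanup(self, keep: Set[str]) -> List[str]:
        """Delete every entry whose hash is not in `keep`; return what was removed."""
        removed = []
        for path in sorted(self.root.iterdir()):
            if path.name not in keep:
                path.unlink()
                removed.append(path.name)
        return removed
```

A modified upstream file produces a different hash rather than silently replacing the old data, which is the reproducibility property discussed above; `cleanup` plays the role of the user-facing cleanup API suggested for lazily downloaded datasets.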
cc @staticfloat since he might find this discussion interesting |
My understanding is also that we cannot realistically use standard artifacts for all datasets that live on CRAN given their number and the fact that they can be updated at any time: that would require updating Artifacts.toml and tagging new releases all the time, right? The distinction between default (eager) and additional (lazy) datasets, possibly handled using different mechanisms, seems more appropriate. |
It could be automated but I guess that's pretty annoying. It's unfortunate that this makes RDatasets inherently unreproducible since you can't know what version of a data set was used. |
It's not exactly unreproducible. In the current version of […] More generally, I see there being a few potential ways artifacts can be used in the context of a package like […]
In the last two cases, I believe […] For me an ideal scenario would be a mix between 1, 2, and 3. CRAN datasets are done using 1, while a manually prepared repository of non-CRAN datasets is dealt with using 2. For either of these, there would be some function intended to be used in the REPL which can import the artifact into a user's […] One wrinkle is that I have started experimenting in an RDataGet.jl branch with adding 1) to […] I would like to maybe resolve the […] |
Sorry for barging in, but I'm quite curious about the idea of serving datasets with the package server system. Wouldn't that be too much for the package server to cache? As an end user, what I care about most in a dataset version is that the format stays the same. I won't be worried if there are some additions or deletions of samples or other kinds of small changes. OTOH, I could generate multiple Wikipedia dump datasets by giving different timestamps, which would give different content hashes, but does it make sense to cache them all with the package server? |
We already serve a huge amount of traffic through the package server system so I'm not worried about serving some medium sized datasets. We limit artifact size to 2GB iirc. While the format may be all that matters to you, others want their code to be reproducible in the sense of getting the same results. Artifacts ensure that because they are immutable and content-addressed. |
This repository has quite a few issues asking for more data. There is a large amount of data on CRAN. It's not clear whether the current approach in this package is appropriate for reaching the "long tail" of datasets on CRAN.
As mentioned in issue #47 (comment), by one measure this package is already complete: it has some data which can be used for testing out Julia stats packages. By another measure it cannot be complete until it contains every dataset on CRAN.
Myself, I am rather interested in having more datasets from the fields of Educational Data Mining and Psychometrics, hence the recent spate of pull requests. One possibility for making sure everyone gets what they need from this package going forward would be to split out all code into RDatasetsBase and create RDatasets with just some "core" datasets. Then, specific domains can be taken care of by their respective Julia orgs, e.g. Ecology by EcoJulia and Psychometrics by a new Julia org, each of which would have its own package: REcoDatasets, RPsychometricsDatasets and so on.
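One way to picture the proposed split is a shared core that owns the loading logic while each data package only announces what it ships. The sketch below is in Python and purely illustrative: `DatasetRegistry`, `register` and `find` are hypothetical names, and the proposal does not specify how RDatasetsBase would actually discover domain packages.

```python
from typing import Dict, List, Optional


class DatasetRegistry:
    """Shared core, standing in for the proposed RDatasetsBase."""

    def __init__(self) -> None:
        # collection name -> {R package name -> [dataset names]}
        self._collections: Dict[str, Dict[str, List[str]]] = {}

    def register(self, collection: str, index: Dict[str, List[str]]) -> None:
        """Called once by each data package to announce what it ships."""
        if collection in self._collections:
            raise ValueError(f"collection {collection!r} is already registered")
        self._collections[collection] = index

    def find(self, package: str, name: str) -> Optional[str]:
        """Return the collection that ships (package, name), or None."""
        for collection, index in self._collections.items():
            if name in index.get(package, []):
                return collection
        return None
```

A "core" collection and any number of domain collections can then coexist, and a downstream user installs only the ones they need.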