
[ENH] Add a datasets API and cleanup existing datasets #1348

Closed
rlratzel opened this issue Jan 21, 2021 · 12 comments
Labels: improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)

rlratzel (Contributor) commented Jan 21, 2021

cuGraph could benefit from a new datasets API that allows users to easily create Graph objects from pre-defined datasets. Currently, users have to read a file using cuDF to create an edgelist, then create a Graph instance from that edgelist. This multi-step process is very common, yet not always easy to remember. Streamlining it will benefit users as well as our tests and benchmarks.

The datasets directory also needs to be cleaned up a bit and the README updated.

  • Consolidate redundant subdirs and remove unused: “ref”, “test/ref”, “highbench_small/1”
  • Update README (if necessary)
  • Update download script (if necessary)

For all of the above, the Python and C++ tests and benchmarks need to be checked to see if they reference any dataset that might be moved or renamed as part of the cleanup.

@rlratzel rlratzel added ? - Needs Triage Need team to review and classify tests labels Jan 21, 2021
@BradReesWork BradReesWork removed the ? - Needs Triage Need team to review and classify label Jan 26, 2021
@BradReesWork BradReesWork modified the milestones: 0.18, 0.19 Jan 26, 2021
@BradReesWork BradReesWork modified the milestones: 0.19, 0.20 Mar 29, 2021
@rlratzel rlratzel modified the milestones: 21.06, 21.08 May 26, 2021
@BradReesWork BradReesWork modified the milestones: 21.08, 21.10 Jul 28, 2021
@BradReesWork BradReesWork removed this from the 22.06 milestone Jun 2, 2022
@rlratzel rlratzel added this to the 22.08 milestone Jun 6, 2022
@rlratzel rlratzel changed the title [ENH] Initial dataset cleanup [ENH] Add a datasets API and cleanup existing datasets Jun 6, 2022
@rlratzel rlratzel added benchmarks improvement Improvement / enhancement to an existing function non-breaking Non-breaking change and removed inactive-30d labels Jun 6, 2022
betochimas (Contributor) commented Jun 7, 2022

Some tasks for this project:

oorliu (Contributor) commented Jun 7, 2022

I fleshed out the __init__ method for MetaData to open a JSON file and store its info.
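A minimal sketch of what that __init__ might look like, using only the stdlib and JSON. The attribute-per-key layout is an assumption for illustration; the real MetaData class lives in cugraph, and the design later moved to YAML metadata files.

```python
import json

class MetaData:
    """Stdlib-only sketch of a metadata holder (hypothetical layout)."""

    def __init__(self, meta_path):
        # Open the JSON file and store each top-level key as an attribute,
        # so e.g. md.number_of_nodes reads straight from the file.
        with open(meta_path) as f:
            for key, value in json.load(f).items():
                setattr(self, key, value)
```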

betochimas (Contributor) commented Jun 13, 2022

Some requirements for this API:

  • Dataset class takes in a metadata yaml file containing important information about the dataset, including a download URL, graph properties (number_of_nodes, is_directed, etc.), column names and types, and file type (CSV, with others to come)
  • Dataset object has two core methods, get_edgelist and get_graph, which retrieve the edgelist (as a cudf DataFrame) and a cugraph.Graph object, respectively. The latter calls the former, because get_edgelist may end up fetching the dataset from the URL (and storing it as a data file on disk if the fetch var is set to True)
  • Configuration options for fetching and downloading datasets will be set in datasets_config.yaml, found in experimental/datasets. Users can edit this file according to their needs, but they can also add their own config yaml file in cugraph
  • Default location will be datasets, found at the repository root, but it can be customized
  • If path is defined within a metadata file, cugraph will look there first for the corresponding dataset file. If it does not exist, cugraph will fetch the data from the URL (and download it depending on config settings). path will be overwritten if the dataset is downloaded (UPDATE: path is defined as an attribute of a Dataset object, and can be updated if the dataset is downloaded. This keeps the metadata yaml files read-only so they aren't overwritten, as path was the only reason we were dumping updated info to the yaml)

Example of use:

import cugraph
from cugraph.datasets import karate
G = karate.get_graph()
cugraph.algorithm(G)
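As a rough illustration of the two-method design, here is a stdlib-only sketch. It is not the real implementation: plain Python containers stand in for the cudf DataFrame and cugraph.Graph, and the metadata dict and fetch handling are simplified assumptions.

```python
from pathlib import Path

class Dataset:
    """Toy sketch mirroring the proposed get_edgelist/get_graph design."""

    def __init__(self, metadata):
        self.metadata = metadata  # dict stand-in for the metadata yaml
        self._edgelist = None

    def get_edgelist(self, fetch=False):
        # Lazily read the edge list; the real API would fetch from
        # metadata["url"] when the local file is missing and fetch=True.
        if self._edgelist is None:
            path = Path(self.metadata["path"])
            if not path.exists():
                raise RuntimeError(f"{path} not found; pass fetch=True")
            rows = path.read_text().strip().splitlines()
            self._edgelist = [tuple(r.split(",")[:2]) for r in rows]
        return self._edgelist

    def get_graph(self, fetch=False):
        # get_graph calls get_edgelist, as described above, then builds
        # an undirected adjacency structure from the edges.
        graph = {}
        for src, dst in self.get_edgelist(fetch=fetch):
            graph.setdefault(src, set()).add(dst)
            graph.setdefault(dst, set()).add(src)
        return graph
```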

oorliu (Contributor) commented Jun 14, 2022

Status update:

  • get_edgelist() works for the most part. The "path" field within the .yaml file will direct datasets.py where to read in the datafile. If it cannot find the file, it will fetch from the "url" specified within the metadata.
    • Raises RuntimeError if fetch=False and datafile does not exist.
  • get_graph() will check for the existence of an edgelist. If not found, it will call get_edgelist() and pass along the fetch variable.
  • Next step: provide datasets_configuration.yaml to specify default storage location when fetching from the web. The location that a datafile is written to will be updated in "path" so datasets.py will reuse local datasets.
    • Suggestion: Writing a method to load all data available from ./metadata, or to verify local data integrity?
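The look-local-then-fetch behavior described above might be sketched like this. The helper name is hypothetical, and the injectable downloader is an assumption to keep the sketch self-contained; the real code reads the "url" field from the metadata yaml and fetches over the network.

```python
from pathlib import Path

def load_datafile(path, url, fetch=False, downloader=None):
    """Return the datafile contents, preferring the local copy."""
    path = Path(path)
    if path.exists():
        return path.read_text()
    if not fetch:
        # Mirrors the RuntimeError when fetch=False and the file is missing.
        raise RuntimeError(f"{path} missing and fetch=False")
    # downloader is injectable here so the sketch stays testable; the
    # real code would fetch url with e.g. urllib.
    data = (downloader or (lambda u: ""))(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(data)  # cache on disk so later calls reuse it
    return data
```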

betochimas (Contributor) commented

@cjnolet @dantegd @wphicks
As of now, these are current requirements for the datasets API. The issue addressing the first set of changes is #2351
We appreciate any feedback as we work on making this tool compatible across cugraph and other libraries.

betochimas (Contributor) commented Jun 16, 2022

In today's sync, we agreed on ensuring that the metadata and configuration .yaml files are input-only (users provide them if they want to load custom datasets or modify configs) and are not writeable. The path to a dataset will instead be stored as an attribute on the Dataset object, and will not be part of the metadata in the .yaml file. We introduced load_all, a utility function outside the Dataset class that fetches all datafiles listed in the metadata files within the metadata directory. This can be used for validation, or for downloading all datasets at once.
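A toy sketch of the load_all idea, with a hypothetical signature: this version only enumerates the metadata files it finds, while the real utility also fetches each corresponding datafile.

```python
from pathlib import Path

def load_all(metadata_dir):
    """Return the dataset names found in metadata_dir, one per yaml file."""
    # The real load_all would parse each yaml and download its datafile;
    # here we just report what it would act on, in sorted order.
    return sorted(p.stem for p in Path(metadata_dir).glob("*.yaml"))
```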

oorliu (Contributor) commented Jun 21, 2022

Today, we decided to revamp the usage of the "path" field within the datasets API. Previously, metadata files had a read-only "path" field, which told the API to look for the datafile in that location. Now, datasets_config.yaml will specify a "download_dir", which is where all datafiles will be both saved to and read from.
Metadata objects are now given a local "path" attribute, which can be used to inform the user where the data is being stored.

betochimas (Contributor) commented Jun 25, 2022

Status update:

  • get_path returns the path to where the dataset is stored, in the form of a PosixPath object, if downloaded. Otherwise it will return None, indicating that the dataset could not be found
  • load_all is a utility function that loads all metadata yamls from within metadata and downloads the corresponding datafiles if missing. Custom metadata yamls can be dropped into that directory in addition to the defaults.
  • While a default datasets_config.yaml specifies the default storage location and whether to automatically fetch, these properties, shared across Dataset instances, can be overwritten if set_config is called with another configuration yaml file. Like load_all, this is not a method within the Dataset class.
  • An environment variable such as DOWNLOAD_DIR or IS_FETCH could be used to override the fetch/download_dir settings, similar to a custom config yaml. If so, the ranking of priority would be (highest to lowest): API call to set_config -> environment variable defaults -> datasets_config.yaml defaults
  • The "path" field within the individual metadata yamls has been moved to be an attribute of a Dataset instance
  • This PR is close to review and, while using .csv files only, will serve as the basis for future datasets API work
  • The notebooks PR by @oorliu will follow once [ENH] datasets API: Creating an Edgelist #2351 is closed; significant progress has been made on that subtask
  • The docstrings PR is still a WIP, and will uniformly revamp the docstrings with cleaner, demo-friendly examples
  • Other details: we assume the datasets to be downloaded are clean, as this API does not perform data cleaning; get_graph produces an undirected cugraph.Graph instance and does not support MG Graph creation, although the complexity of get_graph can be revisited
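The get_path semantics in the first bullet above could be sketched as follows. This is a hypothetical stand-in, where `dataset` is any object carrying a `path` attribute.

```python
from pathlib import Path

def get_path(dataset):
    """Return a Path to the downloaded file, or None if not downloaded."""
    path = getattr(dataset, "path", None)
    if path is not None and Path(path).exists():
        return Path(path)
    # None signals that the dataset could not be found locally.
    return None
```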

oorliu (Contributor) commented Jun 27, 2022

As of now, the following datasets are supported by the datasets API:

  • karate-data.csv
  • karate-asymmetric
  • karate-undirected
  • dolphins.csv
  • netscience.csv
  • polbooks.csv
  • cyber.csv
  • small_tree.csv
  • small_line.csv

oorliu (Contributor) commented Jul 6, 2022

Status Update:

  • get_path returns the absolute path to where the dataset is stored. It is set when the Dataset object reads in an edgelist. If this step has not been done, it will raise a RuntimeError
  • set_download_dir is an external helper method that allows a user to specify where to download datasets to
  • get_download_dir can be used to view what the current download location is set to
  • set_config is an external helper method with which a user can specify a custom config.yaml file, which will be read in and override any current settings of the API, e.g. download directory, default fetch behavior, etc.
  • The API now supports the creation of directories, including nested directories. If set_download_dir is called with /path/that/doesn't/exist, the API will attempt to create that path when downloading datasets.
    • There is currently an effort to specify through the configuration file whether directory creation should be allowed.
  • The API will check for a RAPIDS_DATASET_ROOT_DIR environment variable upon initialization. This is another way of setting the download directory of the datasets API.

The download location hierarchy:

  1. The user calls set_download_dir("some/directory")
  2. The environment variable RAPIDS_DATASET_ROOT_DIR is set
  3. The user calls set_config("config_file_location")
  4. Download to the user's home directory; specifically <HOME>/.cugraph/datasets
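The precedence list above can be sketched as a small resolver. The helper name and parameters are hypothetical; only RAPIDS_DATASET_ROOT_DIR and the default <HOME>/.cugraph/datasets location come from the thread.

```python
import os
from pathlib import Path

def resolve_download_dir(set_dir=None, config_dir=None):
    """Pick the download directory per the hierarchy above."""
    # 1. An explicit set_download_dir("some/directory") call wins.
    if set_dir is not None:
        return Path(set_dir)
    # 2. Then the RAPIDS_DATASET_ROOT_DIR environment variable.
    env = os.environ.get("RAPIDS_DATASET_ROOT_DIR")
    if env:
        return Path(env)
    # 3. Then a value taken from a set_config("config_file_location") call.
    if config_dir is not None:
        return Path(config_dir)
    # 4. Fall back to the user's home directory.
    return Path.home() / ".cugraph" / "datasets"
```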

rapids-bot bot pushed a commit that referenced this issue Jul 25, 2022
…es (#2367)

Addresses issue [#1348](https://nvidia.slack.com/archives/C01SCT7ELMR). A working version of the datasets API has been added under the "experimental" module of cuGraph. This API comes with the ability to import a handful of built-in datasets to create graphs and edge lists. Each dataset comes with its own metadata file in the format of a YAML file. These files contain general information about the dataset, as well as formatting information about their columns and datatypes.

Authors:
  - Dylan Chima-Sanchez (https://github.com/betochimas)
  - Ralph Liu (https://github.com/oorliu)

Approvers:
  - Rick Ratzel (https://github.com/rlratzel)
  - Joseph Nke (https://github.com/jnke2016)

URL: #2367
rlratzel (Contributor, Author) commented

#2367 closes this
