
[ENH] Add a datasets API and cleanup existing datasets #1348

Closed
rlratzel opened this issue Jan 21, 2021 · 12 comments
Labels: improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)

rlratzel (Contributor) commented Jan 21, 2021

cuGraph could benefit from a new datasets API that allows users to easily create Graph objects from pre-defined datasets. Currently, users have to read a file using cuDF to create an edgelist, then create a Graph instance from that edgelist. This multi-step process is very common, yet not always easy to remember. Streamlining it will benefit users as well as our tests and benchmarks.

The datasets directory also needs to be cleaned up a bit and the README updated.

  • Consolidate redundant subdirs and remove unused: “ref”, “test/ref”, “highbench_small/1”
  • Update README (if necessary)
  • Update download script (if necessary)

For all of the above, the Python and C++ tests and benchmarks need to be checked to see if they reference any dataset that might be moved or renamed as part of the cleanup.

@rlratzel rlratzel added ? - Needs Triage Need team to review and classify tests labels Jan 21, 2021
@BradReesWork BradReesWork removed the ? - Needs Triage Need team to review and classify label Jan 26, 2021
@BradReesWork BradReesWork modified the milestones: 0.18, 0.19 Jan 26, 2021
@BradReesWork BradReesWork modified the milestones: 0.19, 0.20 Mar 29, 2021
@rlratzel rlratzel modified the milestones: 21.06, 21.08 May 26, 2021
@BradReesWork BradReesWork modified the milestones: 21.08, 21.10 Jul 28, 2021
@BradReesWork BradReesWork removed this from the 22.06 milestone Jun 2, 2022
@rlratzel rlratzel added this to the 22.08 milestone Jun 6, 2022
@rlratzel rlratzel changed the title [ENH] Initial dataset cleanup [ENH] Add a datasets API and cleanup existing datasets Jun 6, 2022
@rlratzel rlratzel added benchmarks improvement Improvement / enhancement to an existing function non-breaking Non-breaking change and removed inactive-30d labels Jun 6, 2022
betochimas (Contributor) commented Jun 7, 2022

Some tasks for this project:

oorliu (Contributor) commented Jun 7, 2022

I fleshed out the __init__ method for MetaData to open a JSON file and store its info.
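A minimal sketch of what that __init__ might look like, using only the stdlib and JSON. The attribute-per-key layout is an assumption for illustration; the real MetaData class lives in cugraph, and the design later moved to YAML metadata files.

```python
import json

class MetaData:
    """Stdlib-only sketch of a metadata holder (hypothetical layout)."""

    def __init__(self, meta_path):
        # Open the JSON file and store each top-level key as an attribute,
        # so e.g. md.number_of_nodes reads straight from the file.
        with open(meta_path) as f:
            for key, value in json.load(f).items():
                setattr(self, key, value)
```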

betochimas (Contributor) commented Jun 13, 2022

Some requirements for this API:

  • Dataset class takes in a metadata yaml file containing important information about the dataset, including a download URL, graph properties (number_of_nodes, is_directed, etc.), column names and types, and file type (CSV, with others to come)
  • Dataset object has two core methods, get_edgelist and get_graph, which retrieve the edgelist (as a cudf DataFrame) and a cugraph.Graph object, respectively. The latter calls the former, because get_edgelist may end up fetching the dataset from the URL (and storing it as a data file on disk if the fetch var is set to True)
  • Configuration options for fetching and downloading datasets will be set in datasets_config.yaml, found in experimental/datasets. Users can edit this file according to their needs, but they can also add their own config yaml file in cugraph
  • Default location will be datasets, found at the repository root, but it can be customized
  • If path is defined within a metadata file, cugraph will look there first for the corresponding dataset file. If it does not exist, cugraph will fetch the data from the URL (and download it depending on config settings). path will be overwritten if the dataset is downloaded (UPDATE: path is defined as an attribute of a Dataset object, and can be updated if the dataset is downloaded. This keeps the metadata yaml files read-only so they aren't overwritten, as path was the only reason we were dumping updated info to the yaml)

Example of use:

import cugraph
from cugraph.datasets import karate
G = karate.get_graph()
cugraph.algorithm(G)
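As a rough illustration of the two-method design, here is a stdlib-only sketch. It is not the real implementation: plain Python containers stand in for the cudf DataFrame and cugraph.Graph, and the metadata dict and fetch handling are simplified assumptions.

```python
from pathlib import Path

class Dataset:
    """Toy sketch mirroring the proposed get_edgelist/get_graph design."""

    def __init__(self, metadata):
        self.metadata = metadata  # dict stand-in for the metadata yaml
        self._edgelist = None

    def get_edgelist(self, fetch=False):
        # Lazily read the edge list; the real API would fetch from
        # metadata["url"] when the local file is missing and fetch=True.
        if self._edgelist is None:
            path = Path(self.metadata["path"])
            if not path.exists():
                raise RuntimeError(f"{path} not found; pass fetch=True")
            rows = path.read_text().strip().splitlines()
            self._edgelist = [tuple(r.split(",")[:2]) for r in rows]
        return self._edgelist

    def get_graph(self, fetch=False):
        # get_graph calls get_edgelist, as described above, then builds
        # an undirected adjacency structure from the edges.
        graph = {}
        for src, dst in self.get_edgelist(fetch=fetch):
            graph.setdefault(src, set()).add(dst)
            graph.setdefault(dst, set()).add(src)
        return graph
```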

oorliu (Contributor) commented Jun 14, 2022

Status update:

  • get_edgelist() works for the most part. The "path" field within the .yaml file will direct datasets.py where to read in the datafile. If it cannot find the file, it will fetch from the "url" specified within the metadata.
    • Raises RuntimeError if fetch=False and datafile does not exist.
  • get_graph() will check for the existence of an edgelist. If not found, it will call get_edgelist() and pass along the fetch variable.
  • Next step: provide datasets_configuration.yaml to specify default storage location when fetching from the web. The location that a datafile is written to will be updated in "path" so datasets.py will reuse local datasets.
    • Suggestion: Writing a method to load all data available from ./metadata, or to verify local data integrity?
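The look-local-then-fetch behavior described above might be sketched like this. The helper name is hypothetical, and the injectable downloader is an assumption to keep the sketch self-contained; the real code reads the "url" field from the metadata yaml and fetches over the network.

```python
from pathlib import Path

def load_datafile(path, url, fetch=False, downloader=None):
    """Return the datafile contents, preferring the local copy."""
    path = Path(path)
    if path.exists():
        return path.read_text()
    if not fetch:
        # Mirrors the RuntimeError when fetch=False and the file is missing.
        raise RuntimeError(f"{path} missing and fetch=False")
    # downloader is injectable here so the sketch stays testable; the
    # real code would fetch url with e.g. urllib.
    data = (downloader or (lambda u: ""))(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(data)  # cache on disk so later calls reuse it
    return data
```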

betochimas (Contributor) commented

@cjnolet @dantegd @wphicks
As of now, these are current requirements for the datasets API. The issue addressing the first set of changes is #2351
We appreciate any feedback as we work on making this tool compatible across cugraph and other libraries.

betochimas (Contributor) commented Jun 16, 2022

In today's sync, we agreed on ensuring that the metadata and configuration .yaml files are input-only (users provide them if they want to load custom datasets or modify configs) and are not writeable. The path to a dataset will instead be stored as an attribute on the Dataset object, and will not be part of the metadata in the .yaml file. We introduced load_all, a utility function outside the Dataset class that fetches all datafiles listed in the metadata files within the metadata directory. This can be used for validation, or for downloading all datasets at once.
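A toy sketch of the load_all idea, with a hypothetical signature: this version only enumerates the metadata files it finds, while the real utility also fetches each corresponding datafile.

```python
from pathlib import Path

def load_all(metadata_dir):
    """Return the dataset names found in metadata_dir, one per yaml file."""
    # The real load_all would parse each yaml and download its datafile;
    # here we just report what it would act on, in sorted order.
    return sorted(p.stem for p in Path(metadata_dir).glob("*.yaml"))
```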

oorliu (Contributor) commented Jun 21, 2022

Today, we decided to revamp the usage of the "path" field within the datasets API. Previously, metadata files had a read-only "path" field, which told the API to look for the datafile in that location. Now, datasets_config.yaml will specify a "download_dir", which is where all datafiles will be both saved to and read from.
Metadata objects are now given a local "path" attribute, which can be used to inform the user where the data is being stored.

betochimas (Contributor) commented Jun 25, 2022

Status update:

  • get_path returns the path to where the dataset is stored, in the form of a PosixPath object, if downloaded. Otherwise it will return None, indicating that the dataset could not be found
  • load_all is a utility function that loads all metadata yamls from within metadata and downloads the corresponding datafiles if missing. Custom metadata yamls can be dropped into that directory in addition to the defaults.
  • While a default datasets_config.yaml specifies the default storage location and whether to automatically fetch, these properties, shared across Dataset instances, can be overwritten if set_config is called with another configuration yaml file. Like load_all, this is not a method within the Dataset class.
  • An environment variable such as DOWNLOAD_DIR or IS_FETCH could be used to override the fetch/download_dir settings, similar to a custom config yaml. If so, the ranking of priority would be (highest to lowest): API call to set_config -> environment variable defaults -> datasets_config.yaml defaults
  • The "path" field within the individual metadata yamls has been moved to be an attribute of a Dataset instance
  • This PR is close to review and, while using .csv files only, will serve as the basis for future datasets API work
  • The notebooks PR by @oorliu will follow once [ENH] datasets API: Creating an Edgelist #2351 is closed; significant progress has been made on that subtask
  • The docstrings PR is still a WIP, and will uniformly revamp the docstrings with cleaner, demo-friendly examples
  • Other details: we assume the datasets to be downloaded are clean, as this API does not perform data cleaning; get_graph produces an undirected cugraph.Graph instance and does not support MG Graph creation, although the complexity of get_graph can be revisited
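The get_path semantics in the first bullet above could be sketched as follows. This is a hypothetical stand-in, where `dataset` is any object carrying a `path` attribute.

```python
from pathlib import Path

def get_path(dataset):
    """Return a Path to the downloaded file, or None if not downloaded."""
    path = getattr(dataset, "path", None)
    if path is not None and Path(path).exists():
        return Path(path)
    # None signals that the dataset could not be found locally.
    return None
```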

oorliu (Contributor) commented Jun 27, 2022

As of now, the following datasets are supported by the datasets API:

  • karate-data.csv
  • karate-asymmetric
  • karate-undirected
  • dolphins.csv
  • netscience.csv
  • polbooks.csv
  • cyber.csv
  • small_tree.csv
  • small_line.csv

oorliu (Contributor) commented Jul 6, 2022

Status Update:

  • get_path returns the absolute path to where the dataset is stored. It is set when the Dataset object reads in an edgelist. If this step has not been done, it will raise a RuntimeError
  • set_download_dir is an external helper method that allows a user to specify where to download datasets to
  • get_download_dir can be used to view what the current download location is set to
  • set_config is an external helper method with which a user can specify a custom config.yaml file, which will be read in and override any current settings of the API, e.g. download directory, default fetch behavior, etc.
  • The API now supports the creation of directories, including nested directories. If set_download_dir is called with /path/that/doesn't/exist, the API will attempt to create that path when downloading datasets.
    • There is currently an effort to specify through the configuration file whether directory creation should be allowed.
  • The API will check for a RAPIDS_DATASET_ROOT_DIR environment variable upon initialization. This is another way of setting the download directory of the datasets API.

The download location hierarchy:

  1. The user calls set_download_dir("some/directory")
  2. The environment variable RAPIDS_DATASET_ROOT_DIR is set
  3. The user calls set_config("config_file_location")
  4. Download to the user's home directory; specifically <HOME>/.cugraph/datasets
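The precedence list above can be sketched as a small resolver. The helper name and parameters are hypothetical; only RAPIDS_DATASET_ROOT_DIR and the default <HOME>/.cugraph/datasets location come from the thread.

```python
import os
from pathlib import Path

def resolve_download_dir(set_dir=None, config_dir=None):
    """Pick the download directory per the hierarchy above."""
    # 1. An explicit set_download_dir("some/directory") call wins.
    if set_dir is not None:
        return Path(set_dir)
    # 2. Then the RAPIDS_DATASET_ROOT_DIR environment variable.
    env = os.environ.get("RAPIDS_DATASET_ROOT_DIR")
    if env:
        return Path(env)
    # 3. Then a value taken from a set_config("config_file_location") call.
    if config_dir is not None:
        return Path(config_dir)
    # 4. Fall back to the user's home directory.
    return Path.home() / ".cugraph" / "datasets"
```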

rapids-bot bot pushed a commit that referenced this issue Jul 25, 2022
…es (#2367)

Addresses issue [#1348](https://nvidia.slack.com/archives/C01SCT7ELMR). A working version of the datasets API has been added under the "experimental" module of cuGraph. This API comes with the ability to import a handful of built-in datasets to create graphs and edge lists. Each dataset comes with its own metadata file in the format of a YAML file. These files contain general information about the dataset, as well as formatting information about their columns and datatypes.

Authors:
  - Dylan Chima-Sanchez (https://github.com/betochimas)
  - Ralph Liu (https://github.com/oorliu)

Approvers:
  - Rick Ratzel (https://github.com/rlratzel)
  - Joseph Nke (https://github.com/jnke2016)

URL: #2367
rlratzel (Contributor, Author) commented

#2367 closes this
