Add datasets API to import graph data from configuration/metadata files #2367

Merged
merged 84 commits into from
Jul 25, 2022
Changes from 83 commits
37e111e
Created initial datasets classes and first json metadata
betochimas Jun 7, 2022
5abac2e
test comment
Jun 7, 2022
a4929df
init method for metadata can read json
Jun 7, 2022
ec6333d
create dolphin. add refs field
Jun 7, 2022
5a25495
add author field
Jun 7, 2022
9235652
Add initial small datasets to namespace
betochimas Jun 8, 2022
5fc06d4
minor detalis to json
Jun 8, 2022
d074c25
add temp url to karate. read metadata
Jun 8, 2022
06f4d04
Initial testing, plus comments
betochimas Jun 8, 2022
ee34f87
convert json to yaml
Jun 8, 2022
7432db4
change extention to yaml
Jun 8, 2022
9b671ee
remove unneeded test file
Jun 8, 2022
717dbbe
remove properties
Jun 8, 2022
6e32c66
remove metadata. fleshed out get_edgelist()
Jun 13, 2022
608517e
implement __download_csv
Jun 13, 2022
acc3a4a
Created datasets_config yaml file plus other edits
betochimas Jun 14, 2022
8251f97
Passing initial test, generalized some dataset variables
betochimas Jun 14, 2022
57e8a3a
update yaml datatypes. prevent yaml sorting. reading config and stori…
Jun 14, 2022
ac0ca61
Yaml edits plus adding new metadata files
betochimas Jun 15, 2022
c65a522
Removed a bug where graph was always set to directed=False
Jun 15, 2022
a927b96
writing method to load_all datasets from metadata/ dir
Jun 15, 2022
023aae5
Merge branch 'rapidsai:branch-22.08' into datasets-api-docs
betochimas Jun 16, 2022
1bcbc42
Created a general load_all() method to fetch all available datasets
Jun 16, 2022
b0df1ab
Merge branch 'rapidsai:branch-22.08' into datasets-api
betochimas Jun 16, 2022
89d2752
Merge branch 'rapidsai:branch-22.08' into datasets-api-docs
betochimas Jun 16, 2022
c22a3ab
Merge pull request #86 from betochimas/datasets-api-docs
betochimas Jun 16, 2022
8d95be7
Update docs
betochimas Jun 17, 2022
6fc469f
Metadata files are read-only, load_all works as long as target dir ex…
betochimas Jun 17, 2022
c8505d3
Modify setup.py to include config yaml files, wip
betochimas Jun 21, 2022
9e8cae4
Re-designed 'path' usage. Removed from Metadata files and stored loca…
Jun 21, 2022
1b31683
Build now accepts package data
betochimas Jun 21, 2022
0d5019b
Testing improvements, style check edits, added option to datasets con…
betochimas Jun 22, 2022
2bc230a
Merge branch 'datasets-api' into merge-22.08-datasets
betochimas Jun 22, 2022
2525333
Merge pull request #88 from betochimas/merge-22.08-datasets
betochimas Jun 22, 2022
3fa3c54
Passing basic testing, CI checks passing
betochimas Jun 22, 2022
3f07032
adding manual vs api graph comparisons
Jun 22, 2022
63dc448
adding cyber.yaml metadata
Jun 22, 2022
4bd61b3
style cleanups
Jun 22, 2022
168c367
Merge pull request #93 from betochimas/branch-22.08
betochimas Jun 23, 2022
7282bd9
Merge branch 'datasets-api-docs' into merge-22.08-api-docs
betochimas Jun 23, 2022
ad19d2e
Merge pull request #94 from betochimas/merge-22.08-api-docs
betochimas Jun 23, 2022
092512a
Merge branch 'datasets-api-docs' into datasets-api
betochimas Jun 23, 2022
294c854
Create other helpers
betochimas Jun 23, 2022
dbb4139
add support for custom config files. check for directory before fetching
Jun 23, 2022
d9ff2cb
Add custom yamls to configs and metadata dir
betochimas Jun 23, 2022
11d0ba3
building more support for absolute paths in load_all
Jun 24, 2022
4fe893b
style fix
Jun 24, 2022
16073e7
moving set_config. still wip
Jun 27, 2022
21e474c
Merge branch 'rapidsai:branch-22.08' into datasets-api
oorliu Jun 27, 2022
152d7f8
style fix
nv-rliu Jun 27, 2022
22e6158
Merge branch 'datasets-api' of https://github.com/betochimas/cugraph …
nv-rliu Jun 27, 2022
c10cd04
style fix
Jun 27, 2022
38460e4
fixed PosixPath object -> String conversion error
Jun 28, 2022
5900fe6
test
Jun 28, 2022
a4a280f
fix bfs docstring. add line #
Jun 28, 2022
0215ed0
Merge branch 'rapidsai:branch-22.08' into datasets-api
oorliu Jun 30, 2022
7a7066a
Adding more dataset lists. Fallback to previous Docstring. load_all f…
Jun 30, 2022
7207934
Adding updated unit test
Jun 30, 2022
8478602
Adding new datasets
Jun 30, 2022
1e589e6
Merge branch 'rapidsai:branch-22.08' into datasets-api
oorliu Jun 30, 2022
a1efb5d
Merge branch 'datasets-api' of https://github.com/betochimas/cugraph …
Jul 3, 2022
e34d32b
Adding two dataset metadata files
Jul 5, 2022
8c04714
Supporting new dataset objects
Jul 5, 2022
05cfa0d
Adding support for environment variables
Jul 5, 2022
65f2105
Add testing
Jul 5, 2022
8b03f8b
Style fix
Jul 5, 2022
899cfe2
Merge branch 'rapidsai:branch-22.08' into datasets-api
oorliu Jul 5, 2022
78c8823
Add Karate-Asymmetric dataset
Jul 5, 2022
9750f9d
Merge branch 'rapidsai:branch-22.08' into datasets-api
oorliu Jul 6, 2022
05f93e0
Merge branch 'datasets-api' of https://github.com/betochimas/cugraph …
Jul 6, 2022
c58d317
Allowing tests to fetch
Jul 6, 2022
ec47682
style fix
Jul 6, 2022
c891b3b
Fix test_get_path
Jul 7, 2022
d5a7d09
test_get_path Fix
Jul 7, 2022
e428ee7
test_get_path Fix
Jul 7, 2022
ae3a896
Merge branch 'rapidsai:branch-22.08' into datasets-api
oorliu Jul 11, 2022
20c948b
Add karate_asymmetric metadata
Jul 11, 2022
d68814e
Updating __init__.py
Jul 11, 2022
747a2ba
Updating tests: Add coverage for Conda install dirs
Jul 11, 2022
684f1cb
Update for PR
Jul 19, 2022
b0eb9e7
Remove _update_dl
Jul 19, 2022
713a6c8
Feedback updates
Jul 21, 2022
1f6459c
Style fix
Jul 21, 2022
bf8fdd9
Moving meta_path to __init__
Jul 24, 2022
4 changes: 3 additions & 1 deletion MANIFEST.in
@@ -1,2 +1,4 @@
include python/versioneer.py
include python/cugraph/_version.py
include python/cugraph/_version.py
include cugraph/experimental/datasets/*.yaml
include cugraph/experimental/datasets/metadata/*.yaml
2 changes: 2 additions & 0 deletions python/cugraph/MANIFEST.in
@@ -1,2 +1,4 @@
include versioneer.py
include cugraph/_version.py
include cugraph/experimental/datasets/*.yaml
include cugraph/experimental/datasets/metadata/*.yaml
2 changes: 2 additions & 0 deletions python/cugraph/cugraph/experimental/__init__.py
Expand Up @@ -39,3 +39,5 @@
find_bicliques = deprecated_warning_wrapper(
experimental_warning_wrapper(EXPERIMENTAL__find_bicliques)
)

from cugraph.experimental.datasets.dataset import Dataset
48 changes: 48 additions & 0 deletions python/cugraph/cugraph/experimental/datasets/__init__.py
@@ -0,0 +1,48 @@
# Copyright (c) 2022, NVIDIA CORPORATION.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


from cugraph.experimental.datasets.dataset import (
    Dataset,
    load_all,
    set_config,
    set_download_dir,
    get_download_dir,
    default_download_dir
)
from cugraph.experimental.datasets import metadata


karate = Dataset("metadata/karate.yaml")
karate_undirected = Dataset("metadata/karate_undirected.yaml")
karate_asymmetric = Dataset("metadata/karate_asymmetric.yaml")
dolphins = Dataset("metadata/dolphins.yaml")
polbooks = Dataset("metadata/polbooks.yaml")
netscience = Dataset("metadata/netscience.yaml")
cyber = Dataset("metadata/cyber.yaml")
small_line = Dataset("metadata/small_line.yaml")
small_tree = Dataset("metadata/small_tree.yaml")


# LARGE DATASETS
LARGE_DATASETS = [cyber]

# <10,000 lines
MEDIUM_DATASETS = [netscience, polbooks]

# <500 lines
SMALL_DATASETS = [karate, small_line, small_tree, dolphins]

# ALL
ALL_DATASETS = [karate, dolphins, netscience, polbooks, cyber,
                small_line, small_tree]
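
The size lists above are maintained by hand, with thresholds given only in comments (<500 lines for small, <10,000 lines for medium). As a rough sketch, the same grouping could be derived from the `number_of_lines` field in each metadata file. The helper `size_bucket` and the dictionary `line_counts` are hypothetical names introduced for illustration and are not part of this PR; the line counts are copied from the metadata files shown further down.

```python
def size_bucket(number_of_lines, small_max=500, medium_max=10_000):
    """Classify a dataset by line count, using the thresholds from the comments."""
    if number_of_lines < small_max:
        return "small"
    if number_of_lines < medium_max:
        return "medium"
    return "large"


# number_of_lines values copied from the metadata files in this PR,
# keyed by the module-level variable names used in __init__.py.
line_counts = {"karate": 156, "dolphins": 318, "cyber": 2546576}

buckets = {name: size_bucket(n) for name, n in line_counts.items()}
print(buckets)
```

Deriving the buckets this way would keep the lists consistent with the metadata, at the cost of reading every YAML file at import time.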
210 changes: 210 additions & 0 deletions python/cugraph/cugraph/experimental/datasets/dataset.py
@@ -0,0 +1,210 @@
# Copyright (c) 2022, NVIDIA CORPORATION.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import cugraph
import cudf
import yaml
import os
from pathlib import Path


class DefaultDownloadDir:
    """
    Maintains the path to the download directory used by Dataset instances.
    Instances of this class are typically shared by several Dataset instances
    in order to allow for the download directory to be defined and updated by
    a single object.
    """
    def __init__(self):
        self._path = Path(os.environ.get("RAPIDS_DATASET_ROOT_DIR",
                                         Path.home() / ".cugraph/datasets"))

    @property
    def path(self):
        """
        If `path` is not set, set it to the environment variable
        RAPIDS_DATASET_ROOT_DIR. If the variable is not set, default to
        `.cugraph/datasets` under the user's home directory.
        """
        if self._path is None:
            self._path = Path(os.environ.get("RAPIDS_DATASET_ROOT_DIR",
                                             Path.home() /
                                             ".cugraph/datasets"))
        return self._path

    @path.setter
    def path(self, new):
        self._path = Path(new)

    def clear(self):
        self._path = None


default_download_dir = DefaultDownloadDir()


class Dataset:
    """
    A Dataset Object, used to easily import edgelist data and cuGraph.Graph
    instances.

    Parameters
    ----------
    meta_data_file_name : yaml file
        The metadata file for the specific graph dataset, which includes
        information on the name, type, url link, data loading format, graph
        properties

    """
    def __init__(self, meta_data_file_name):
        self.__read_meta_data_file(meta_data_file_name)
        self._dl_path = default_download_dir
        self._edgelist = None
        self._graph = None
        self._path = None

    def __read_meta_data_file(self, meta_data_file):
        # Metadata paths are resolved relative to this module's directory
        metadata_path = Path(__file__).parent.absolute() / meta_data_file
        with open(metadata_path, 'r') as file:
            self.metadata = yaml.safe_load(file)

    def __download_csv(self, url):
        self._dl_path.path.mkdir(parents=True, exist_ok=True)

        filename = self.metadata['name'] + self.metadata['file_type']
        if self._dl_path.path.is_dir():
            df = cudf.read_csv(url)
            df.to_csv(self._dl_path.path / filename, index=False)

        else:
            raise RuntimeError(f"The directory {self._dl_path.path.absolute()}"
                               " does not exist")

    def get_edgelist(self, fetch=False):
        """
        Return an Edgelist

        Parameters
        ----------
        fetch : Boolean (default=False)
            Automatically fetch the dataset from the 'url' location within
            the YAML file.
        """

        if self._edgelist is None:
            full_path = self._dl_path.path / (self.metadata['name'] +
                                              self.metadata['file_type'])

            if not full_path.is_file():
                if fetch:
                    self.__download_csv(self.metadata['url'])
                else:
                    raise RuntimeError(f"The datafile {full_path} does not"
                                       " exist. Try get_edgelist(fetch=True)"
                                       " to download the datafile")

            self._edgelist = cudf.read_csv(full_path,
                                           delimiter=self.metadata['delim'],
                                           names=self.metadata['col_names'],
                                           dtype=self.metadata['col_types'])
            self._path = full_path

        return self._edgelist

    def get_graph(self, fetch=False):
        """
        Return a Graph object.

        Parameters
        ----------
        fetch : Boolean (default=False)
            Automatically fetch the dataset from the 'url' location within
            the YAML file.
        """
        if self._edgelist is None:
            self.get_edgelist(fetch)

        self._graph = cugraph.Graph(directed=self.metadata['is_directed'])
        self._graph.from_cudf_edgelist(self._edgelist, source='src',
                                       destination='dst')

        return self._graph

    def get_path(self):
        """
        Returns the location of the stored dataset file
        """
        if self._path is None:
            raise RuntimeError("Path to datafile has not been set." +
                               " Call get_edgelist or get_graph first")

        return self._path.absolute()


def load_all(force=False):
    """
    Looks in `metadata` directory and fetches all datafiles from the URLs
    provided in each YAML file.

    Parameters
    ----------
    force : Boolean (default=False)
        Overwrite any existing copies of datafiles.
    """
    default_download_dir.path.mkdir(parents=True, exist_ok=True)

    meta_path = Path(__file__).parent.absolute() / "metadata"
    for file in os.listdir(meta_path):
        meta = None
        if file.endswith('.yaml'):
            with open(meta_path / file, 'r') as metafile:
                meta = yaml.safe_load(metafile)

            if 'url' in meta:
                filename = meta['name'] + meta['file_type']
                save_to = default_download_dir.path / filename
                if not save_to.is_file() or force:
                    df = cudf.read_csv(meta['url'])
                    df.to_csv(save_to, index=False)


def set_config(cfgpath):
    """
    Read in a custom config file.

    Parameters
    ----------
    cfgpath : String
        Path to the custom config file. Its `download_dir` value overrides
        the default download directory.
    """
    with open(Path(cfgpath), 'r') as file:
        cfg = yaml.safe_load(file)
        default_download_dir.path = Path(cfg['download_dir'])


def set_download_dir(path):
    """
    Set the download directory for fetching datasets

    Parameters
    ----------
    path : String
        Location used to store datafiles. Passing None clears the current
        setting so that the default is resolved again on next access.
    """
    if path is None:
        default_download_dir.clear()
    else:
        default_download_dir.path = path


def get_download_dir():
    """
    Return the absolute path of the current download directory
    """
    return default_download_dir.path.absolute()
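
The lookup order implemented by `DefaultDownloadDir` and `set_download_dir` can be summarized in a small standalone sketch: an explicitly set path wins, then the `RAPIDS_DATASET_ROOT_DIR` environment variable, then `~/.cugraph/datasets`. The function `resolve_download_dir` and its `environ` parameter are hypothetical names used only for illustration; the PR itself keeps this state on a shared object rather than in a pure function.

```python
import os
from pathlib import Path

ENV_VAR = "RAPIDS_DATASET_ROOT_DIR"  # variable name taken from the diff


def resolve_download_dir(explicit=None, environ=None):
    """Return the directory datafiles would be stored in.

    Hypothetical helper mirroring the precedence in DefaultDownloadDir:
    1. an explicitly set path,
    2. the RAPIDS_DATASET_ROOT_DIR environment variable,
    3. ~/.cugraph/datasets.
    """
    # Allow a mapping to be injected so the logic can be exercised
    # without touching the real process environment.
    environ = os.environ if environ is None else environ
    if explicit is not None:
        return Path(explicit)
    return Path(environ.get(ENV_VAR, Path.home() / ".cugraph/datasets"))


print(resolve_download_dir("/tmp/graphs"))
print(resolve_download_dir(environ={ENV_VAR: "/data/rapids"}))
print(resolve_download_dir(environ={}))
```

Passing `None` to `set_download_dir` in the PR corresponds to calling this sketch without `explicit`: the environment variable and home-directory default are consulted again.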
@@ -0,0 +1,5 @@
---
fetch: "False"
force: "False"
# path where datasets will be downloaded to and stored
download_dir: "datasets"
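
One detail of this config worth noting: `fetch` and `force` are quoted, so a YAML loader returns them as the strings `"False"`, not as booleans, and a non-empty string is truthy in Python. In the diff, `set_config` only reads `download_dir`, so this has no effect yet. If the flags were consumed later, a coercion step along the following lines would be needed (or the quotes could be dropped in the YAML). `coerce_bool` is a hypothetical helper, and `cfg` is a hand-written dict standing in for the loader's output.

```python
def coerce_bool(value):
    """Interpret boolean-like config values that arrive as quoted strings."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text in ("true", "yes", "on", "1"):
            return True
        if text in ("false", "no", "off", "0"):
            return False
    raise ValueError(f"cannot interpret {value!r} as a boolean")


# Stand-in for what a YAML loader returns for the config above:
# quoted values stay strings.
cfg = {"fetch": "False", "force": "False", "download_dir": "datasets"}

naive_fetch = bool(cfg["fetch"])   # non-empty string, so truthy
fetch = coerce_bool(cfg["fetch"])  # parsed as an actual boolean

print(naive_fetch, fetch)
```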
@@ -0,0 +1,21 @@
name: cyber
file_type: .csv
author: N/A
url: https://raw.githubusercontent.com/rapidsai/cugraph/branch-22.08/datasets/cyber.csv
refs: N/A
col_names:
- idx
- src
- dst
col_types:
- int32
- str
- str
delim: ","
has_loop: true
is_directed: true
is_multigraph: false
is_symmetric: false
number_of_edges: 54
number_of_nodes: 314
number_of_lines: 2546576
@@ -0,0 +1,24 @@
name: dolphins
file_type: .csv
author: D. Lusseau
url: https://raw.githubusercontent.com/rapidsai/cugraph/branch-22.08/datasets/dolphins.csv
refs:
  D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson,
  The bottlenose dolphin community of Doubtful Sound features a large proportion of
  long-lasting associations, Behavioral Ecology and Sociobiology 54, 396-405 (2003).
col_names:
- src
- dst
- wgt
col_types:
- int32
- int32
- float32
delim: " "
has_loop: false
is_directed: true
is_multigraph: false
is_symmetric: false
number_of_edges: 318
number_of_nodes: 62
number_of_lines: 318
@@ -0,0 +1,21 @@
name: karate-data
file_type: .csv
author: Zachary W.
url: https://raw.githubusercontent.com/rapidsai/cugraph/branch-22.08/datasets/karate-data.csv
refs:
  W. W. Zachary, An information flow model for conflict and fission in small groups,
  Journal of Anthropological Research 33, 452-473 (1977).
delim: "\t"
col_names:
- src
- dst
col_types:
- int32
- int32
has_loop: true
is_directed: true
is_multigraph: false
is_symmetric: true
number_of_edges: 156
number_of_nodes: 34
number_of_lines: 156
@@ -0,0 +1,23 @@
name: karate-asymmetric
file_type: .csv
author: Zachary W.
url: https://raw.githubusercontent.com/rapidsai/cugraph/branch-22.08/datasets/karate-asymmetric.csv
refs:
  W. W. Zachary, An information flow model for conflict and fission in small groups,
  Journal of Anthropological Research 33, 452-473 (1977).
delim: "\t"
col_names:
- src
- dst
- wgt
col_types:
- int32
- int32
- float32
has_loop: true
is_directed: false
is_multigraph: false
is_symmetric: false
number_of_edges: 78
number_of_nodes: 34
number_of_lines: 78
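
To make the role of the `delim`, `col_names` and `col_types` fields concrete, here is a CPU-only sketch that parses delimited text according to a metadata dict shaped like the files above. In the PR these fields are passed to `cudf.read_csv`; this sketch swaps in Python's standard `csv` module so it runs without a GPU. The `CONVERTERS` mapping, the `parse_edgelist` function and the sample rows are all assumptions made for illustration, not data or code from the PR.

```python
import csv
import io

# Map the dtype names used in col_types to Python converters.
# This mapping is an assumption of the sketch, not something cudf defines.
CONVERTERS = {"int32": int, "float32": float, "str": str}


def parse_edgelist(text, meta):
    """Parse delimited text into a list of dicts keyed by meta['col_names']."""
    names = meta["col_names"]
    casts = [CONVERTERS[t] for t in meta["col_types"]]
    reader = csv.reader(io.StringIO(text), delimiter=meta["delim"])
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(casts):
            raise ValueError(
                f"expected {len(casts)} columns, got {len(fields)}: {fields}"
            )
        rows.append({n: c(v) for n, c, v in zip(names, casts, fields)})
    return rows


# Metadata fields copied from the karate-asymmetric file above.
meta = {
    "delim": "\t",
    "col_names": ["src", "dst", "wgt"],
    "col_types": ["int32", "int32", "float32"],
}
sample = "1\t2\t1.0\n1\t3\t1.0\n"  # made-up rows, not the real datafile
edges = parse_edgelist(sample, meta)
print(edges)
```

Keeping the column layout in metadata rather than in code is what lets a single `Dataset` class load files with different delimiters and column sets.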