[WIP] cuGraph Documentation Example Errors #1862

Closed
4 tasks
Nicholas-7 opened this issue Sep 30, 2021 · 1 comment · Fixed by #1866
@Nicholas-7

Describe the bug
Some of the examples in the cuGraph documentation contain errors, and some functionality is not represented in the documentation at all. Several examples rely on code that cannot be run without errors in the most recent RAPIDS releases. I've begun testing going back to 0.19, and there may have been changes to the API before that release. I'm not sure whether these changes were intentional or an uncaught bug.

Steps/Code to reproduce bug
Steps to reproduce the behavior:
1. Go to the RAPIDS cuGraph documentation website
2. Click on the cuGraph API Reference link
3. Run the examples found there, as illustrated below

Expected behavior
Several examples raise an error. Many examples omit details that could aid implementation. The code is a few commits behind the 21.08 repo.

Environment details (please complete the following information):

  • Environment location: Paperspace
  • Linux Distro/Architecture: Ubuntu 18.04 amd64
  • GPU Model/Driver: P5000
  • CUDA: 11.0
  • Method of cuDF & cuGraph install:
docker pull rapidsai/rapidsai:21.08-cuda11.0-runtime-ubuntu18.04-py3.7
docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 \
rapidsai/rapidsai:21.08-cuda11.0-runtime-ubuntu18.04-py3.7

Additional context
Examples of Discrepancies:
Example # 1

import cugraph
import cudf

M = cudf.read_csv('datasets/bipartite.csv', delimiter=' ',
                  dtype=['int32', 'int32', 'float32'], header=None)
G = cugraph.Graph()
G.from_cudf_edgelist(M, source='0', destination='1', edge_attr='2')
cost, df = cugraph.hungarian(G, workers)
print(cost, df)

Error thrown below:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-31-6556d4126987> in <module>
      1 M = cudf.read_csv('datasets/bipartite.csv', delimiter=' ',
----> 2                   dtype=['int32', 'int32', 'float32'], header=None)
      3 G = cugraph.Graph()
      4 G.from_cudf_edgelist(M, source='0', destination='1', edge_attr='2')
      5 cost, df = cugraph.hungarian(G, workers)

/opt/conda/envs/rapids/lib/python3.7/contextlib.py in inner(*args, **kwds)
     72         def inner(*args, **kwds):
     73             with self._recreate_cm():
---> 74                 return func(*args, **kwds)
     75         return inner
     76 

/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/io/csv.py in read_csv(filepath_or_buffer, lineterminator, quotechar, quoting, doublequote, header, mangle_dupe_cols, usecols, sep, delimiter, delim_whitespace, skipinitialspace, names, dtype, skipfooter, skiprows, dayfirst, compression, thousands, decimal, true_values, false_values, nrows, byte_range, skip_blank_lines, parse_dates, comment, na_values, keep_default_na, na_filter, prefix, index_col, **kwargs)
    100         na_filter=na_filter,
    101         prefix=prefix,
--> 102         index_col=index_col,
    103     )
    104 

cudf/_lib/csv.pyx in cudf._lib.csv.read_csv()

FileNotFoundError: [Errno 2] No such file or directory: 'datasets/bipartite.csv'

I am not able to determine whether this succeeds without a test .csv file.

A second instance of the same issue

M = cudf.read_csv('datasets/bipartite.csv', delimiter=' ',
                  dtype=['int32', 'int32', 'float32'], header=None)
G = cugraph.Graph()
G.from_cudf_edgelist(M, source='0', destination='1', edge_attr='2')
cost, df = cugraph.hungarian(G, workers)

I am not able to determine whether this succeeds without a test .csv file.

Error thrown below:

---------------------------------------------------------------------------
FileNotFoundError: [Errno 2] No such file or directory: 'edge_list.csv'

Example # 2

import pandas 
import cugraph 

df = pandas.read_csv('datasets/karate.csv', delimiter=' ',
                  dtype=['int32', 'int32', 'float32'], header=None)
G = cugraph.Graph()
G.from_pandas_edgelist(df, source='0', destination='1',
                         edge_attr='2', renumber=False)

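The example fails before .from_pandas_edgelist is reached: pandas.read_csv rejects the list passed as dtype and expects 2- or 3-tuples, although a list of type names such as 'int32' seemingly used to be allowed.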

Error thrown below:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-ddbc3d94932d> in <module>
      1 df = pandas.read_csv('datasets/karate.csv', delimiter=' ',
----> 2                   dtype=['int32', 'int32', 'float32'], header=None)
      3 G = cugraph.Graph()
      4 G.from_pandas_edgelist(df, source='0', destination='1',
      5                          edge_attr='2', renumber=False)

/opt/conda/envs/rapids/lib/python3.7/site-packages/pandas/io/parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    608     kwds.update(kwds_defaults)
    609 
--> 610     return _read(filepath_or_buffer, kwds)
    611 
    612 

/opt/conda/envs/rapids/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    460 
    461     # Create the parser.
--> 462     parser = TextFileReader(filepath_or_buffer, **kwds)
    463 
    464     if chunksize or iterator:

/opt/conda/envs/rapids/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    817             self.options["has_index_names"] = kwds["has_index_names"]
    818 
--> 819         self._engine = self._make_engine(self.engine)
    820 
    821     def close(self):

/opt/conda/envs/rapids/lib/python3.7/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
   1048             )
   1049         # error: Too many arguments for "ParserBase"
-> 1050         return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
   1051 
   1052     def _failover_to_python(self):

/opt/conda/envs/rapids/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1871 
   1872         try:
-> 1873             self._reader = parsers.TextReader(self.handles.handle, **kwds)
   1874         except Exception:
   1875             self.handles.close()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

/opt/conda/envs/rapids/lib/python3.7/site-packages/pandas/core/dtypes/common.py in pandas_dtype(dtype)
   1797     # raise a consistent TypeError if failed
   1798     try:
-> 1799         npdtype = np.dtype(dtype)
   1800     except SyntaxError as err:
   1801         # np.dtype uses `eval` which can raise SyntaxError

TypeError: Field elements must be 2- or 3-tuples, got ''int32''

Example # 3

Comms.initialize()
chunksize = dcg.get_chunksize(input_data_path)
ddf = dask_cudf.read_csv(input_data_path, chunksize=chunksize,
                             delimiter=' ',
                             names=['src', 'dst', 'weight'],
                             dtype=['int32', 'int32', 'float32'])
sym_ddf = cugraph.symmetrize_ddf(ddf, "src", "dst", "weight")
Comms.destroy()

Here symmetrize_ddf currently appears to require .values, among other fixes, where it seemingly was not required before.

Error thrown below:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-4-430eb168b36e> in <module>
----> 1 chunksize = dcg.get_chunksize(input_data_path)
      2 ddf = dask_cudf.read_csv(input_data_path, chunksize=chunksize,
      3                              delimiter=' ',
      4                              names=['src', 'dst', 'weight'],
      5                              dtype=['int32', 'int32', 'float32'])

NameError: name 'input_data_path' is not defined

After adding .values, I received the error below:

ValueError: Data must be 1-dimensional

Example # 4

import cugraph.dask as dcg
import cudf

chunksize = dcg.get_chunksize(input_data_path)
ddf = dask_cudf.read_csv(input_data_path, chunksize=chunksize,
                             delimiter=' ',
                             names=['src', 'dst', 'value'],
                             dtype=['int32', 'int32', 'float32'])
dg = cugraph.DiGraph()
dg.from_dask_cudf_edgelist(ddf, source='src', destination='dst',
                               edge_attr='value')
pr = dcg.katz_centrality(dg)

I am not able to determine whether this succeeds without an input_data_path.

Error thrown below:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-93-d59924d07d0d> in <module>
      2 #... Init a DASK Cluster
      3 #>>    see https://docs.rapids.ai/api/cugraph/stable/dask-cugraph.html
----> 4 chunksize = dcg.get_chunksize(input_data_path)
      5 ddf = dask_cudf.read_csv(input_data_path, chunksize=chunksize,
      6                              delimiter=' ',

NameError: name 'input_data_path' is not defined

A second instance of the same issue

import cugraph.dask as dcg
import cudf

chunksize = dcg.get_chunksize(input_data_path)
ddf = dask_cudf.read_csv(input_data_path, chunksize=chunksize,
                             delimiter=' ',
                             names=['src', 'dst', 'value'],
                             dtype=['int32', 'int32', 'float32'])
dg = cugraph.DiGraph()
dg.from_dask_cudf_edgelist(ddf, source='src', destination='dst',
                               edge_attr='value')
pr = dcg.katz_centrality(dg)

Error thrown below:

---------------------------------------------------------------------------
NameError: name 'input_data_path' is not defined

A third instance of the same issue

import cugraph.dask as dcg

chunksize = dcg.get_chunksize(input_data_path)
ddf = dask_cudf.read_csv('datasets/karate.csv', chunksize=chunksize,
                             delimiter=' ',
                             names=['src', 'dst', 'value'],
                             dtype=['int32', 'int32', 'float32'])
dg = cugraph.Graph()
dg.from_dask_cudf_edgelist(ddf, source='src', destination='dst',
                               edge_attr='value')
parts, modularity_score = dcg.louvain(dg)

Error thrown below:

---------------------------------------------------------------------------
NameError: name 'input_data_path' is not defined

A fourth instance of the same issue

import cugraph.dask as dcg

chunksize = dcg.get_chunksize(input_data_path)
ddf = dask_cudf.read_csv(input_data_path, chunksize=chunksize,
                             delimiter=' ',
                             names=['src', 'dst', 'value'],
                             dtype=['int32', 'int32', 'float32'])
dg = cugraph.DiGraph()
dg.from_dask_cudf_edgelist(ddf, source='src', destination='dst',
                               edge_attr='value')
pr = dcg.pagerank(dg)

Error thrown below:

---------------------------------------------------------------------------

NameError: name 'input_data_path' is not defined

A fifth instance of the same issue

import cugraph.dask as dcg

chunksize = dcg.get_chunksize(input_data_path)
ddf = dask_cudf.read_csv(input_data_path, chunksize=chunksize,
                             delimiter=' ',
                             names=['src', 'dst', 'value'],
                             dtype=['int32', 'int32', 'float32'])
dg = cugraph.DiGraph()
dg.from_dask_cudf_edgelist(ddf, 'src', 'dst')
df = dcg.bfs(dg, 0)

Error thrown below:

---------------------------------------------------------------------------

NameError: name 'input_data_path' is not defined

Example # 5

import cugraph
import cudf

gdf = cudf.read_csv('datasets/karate.csv', delimiter=' ',
                  dtype=['int32', 'int32', 'float32'], header=None)
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='0', destination='1')
pairs = cugraph.get_two_hop_neighbors(G)
df = cugraph.jaccard(G, pairs)

This example expects cugraph.get_two_hop_neighbors to work with the syntax above, but that function seemingly is not available.

Error thrown below:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-35-2dfc3353a55e> in <module>
      3 G = cugraph.Graph()
      4 G.from_cudf_edgelist(gdf, source='0', destination='1')
----> 5 pairs = cugraph.get_two_hop_neighbors(G)
      6 df = cugraph.jaccard(G, pairs)

AttributeError: module 'cugraph' has no attribute 'get_two_hop_neighbors'

Example # 6

import cugraph
import cudf

M = cudf.read_csv('datasets/karate.csv', delimiter=' ',
                  dtype=['int32', 'int32', 'float32'], header=None)
G = cugraph.Graph()
G.from_cudf_edgelist(M, source='0', destination='1')
df = cugraph.jaccard_w(G, M[2])

Here the cugraph.jaccard_w example currently produces an error.

Error thrown below:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-38-e0856c39a153> in <module>
      3 G = cugraph.Graph()
      4 G.from_cudf_edgelist(M, source='0', destination='1')
----> 5 df = cugraph.jaccard_w(G, M[2])

/opt/conda/envs/rapids/lib/python3.7/contextlib.py in inner(*args, **kwds)
     72         def inner(*args, **kwds):
     73             with self._recreate_cm():
---> 74                 return func(*args, **kwds)
     75         return inner
     76 

/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/dataframe.py in __getitem__(self, arg)
    679         """
    680         if _is_scalar_or_zero_d_array(arg) or isinstance(arg, tuple):
--> 681             return self._get_columns_by_label(arg, downcast=True)
    682 
    683         elif isinstance(arg, slice):

/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/dataframe.py in _get_columns_by_label(self, labels, downcast)
   1405         If downcast is True, try and downcast from a DataFrame to a Series
   1406         """
-> 1407         new_data = super()._get_columns_by_label(labels, downcast)
   1408         if downcast:
   1409             if is_scalar(labels):

/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/frame.py in _get_columns_by_label(self, labels, downcast)
    630 
    631         """
--> 632         return self._data.select_by_label(labels)
    633 
    634     def _get_columns_by_index(self, indices):

/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/column_accessor.py in select_by_label(self, key)
    344                 if any(isinstance(k, slice) for k in key):
    345                     return self._select_by_label_with_wildcard(key)
--> 346             return self._select_by_label_grouped(key)
    347 
    348     def select_by_index(self, index: Any) -> ColumnAccessor:

/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/column_accessor.py in _select_by_label_grouped(self, key)
    406 
    407     def _select_by_label_grouped(self, key: Any) -> ColumnAccessor:
--> 408         result = self._grouped_data[key]
    409         if isinstance(result, cudf.core.column.ColumnBase):
    410             return self.__class__({key: result})

KeyError: 2

A second instance of the same issue

M = cudf.read_csv('datasets/karate.csv', delimiter=' ',
                  dtype=['int32', 'int32', 'float32'], header=None)
G = cugraph.Graph()
G.from_cudf_edgelist(M, source='0', destination='1')
df = cugraph.overlap_w(G, M[2])

Error thrown below:

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)
<ipython-input-40-989087ed76b0> in <module>
      3 G = cugraph.Graph()
      4 G.from_cudf_edgelist(M, source='0', destination='1')
----> 5 df = cugraph.overlap_w(G, M[2])

/opt/conda/envs/rapids/lib/python3.7/contextlib.py in inner(*args, **kwds)
     72         def inner(*args, **kwds):
     73             with self._recreate_cm():
---> 74                 return func(*args, **kwds)
     75         return inner
     76 

/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/dataframe.py in __getitem__(self, arg)
    679         """
    680         if _is_scalar_or_zero_d_array(arg) or isinstance(arg, tuple):
--> 681             return self._get_columns_by_label(arg, downcast=True)
    682 
    683         elif isinstance(arg, slice):

/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/dataframe.py in _get_columns_by_label(self, labels, downcast)
   1405         If downcast is True, try and downcast from a DataFrame to a Series
   1406         """
-> 1407         new_data = super()._get_columns_by_label(labels, downcast)
   1408         if downcast:
   1409             if is_scalar(labels):

/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/frame.py in _get_columns_by_label(self, labels, downcast)
    630 
    631         """
--> 632         return self._data.select_by_label(labels)
    633 
    634     def _get_columns_by_index(self, indices):

/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/column_accessor.py in select_by_label(self, key)
    344                 if any(isinstance(k, slice) for k in key):
    345                     return self._select_by_label_with_wildcard(key)
--> 346             return self._select_by_label_grouped(key)
    347 
    348     def select_by_index(self, index: Any) -> ColumnAccessor:

/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/column_accessor.py in _select_by_label_grouped(self, key)
    406 
    407     def _select_by_label_grouped(self, key: Any) -> ColumnAccessor:
--> 408         result = self._grouped_data[key]
    409         if isinstance(result, cudf.core.column.ColumnBase):
    410             return self.__class__({key: result})

KeyError: 2

Example # 7
There is also an opportunity to include additional examples for functionality not represented on this page. A few examples of what could be missing are below:

  • rmat
  • Methods for getting degrees
  • A get_neighbors method using egonet
  • Examples of get_adjacency and view adjacency methods, e.g. one that returns a CuPy array

Desired outcome
Examples in the cuGraph documentation should be ready to replicate and implement with less effort. Examples should reflect the commits made to the repositories during each release cycle. cuGraph functions and models should work as expected.

Request impacts
Our cuGraph documentation is public and requires accurate information - Medium Priority

@BradReesWork @taureandyernv for awareness

Nicholas-7 added the "bug", "2 - In Progress", and "? - Needs Triage" labels on Sep 30, 2021
BradReesWork added the "Fix" label and removed the "? - Needs Triage" and "bug" labels on Sep 30, 2021
BradReesWork added this to the 21.12 milestone on Sep 30, 2021
@jnke2016
Contributor

jnke2016 commented Sep 30, 2021

Most, if not all, of the errors mentioned are docstring mistakes, which we are addressing in a PR. Below is a description of how to fix those errors until the PR is out.

Example 1:

FileNotFoundError: [Errno 2] No such file or directory: 'datasets/bipartite.csv'

This is due to a nonexistent directory. We currently do not have a bipartite.csv dataset in our repo, but it will be added in the PR.

Example 2:

TypeError: Field elements must be 2- or 3-tuples, got ''int32''

This can be corrected as shown here. Please refer to https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
The dtype parameter takes either a single type name or a dict mapping columns to types, not a list. Also make sure the path to the dataset exists.

df = pandas.read_csv('./datasets/karate.csv', delimiter=' ',
                     header=None,
                     names=["0", "1", "2"],
                     dtype={"0": "int32", "1": "int32", "2": "float32"})
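To check the dtype fix without having karate.csv on disk, here is a minimal sketch that reads a few space-delimited rows from an in-memory buffer. The three sample edges are made up for illustration and are not taken from the cuGraph datasets.

```python
import io

import pandas as pd

# Hypothetical sample rows in the same "src dst weight" layout as karate.csv.
sample = io.StringIO("1 2 1.0\n1 3 1.0\n2 3 1.0\n")

# dtype is a dict keyed by the column names supplied through `names`.
df = pd.read_csv(
    sample,
    delimiter=" ",
    header=None,
    names=["0", "1", "2"],
    dtype={"0": "int32", "1": "int32", "2": "float32"},
)

print(df.dtypes)
```

Because the columns are named "0", "1" and "2" as strings, the source='0' and destination='1' arguments used in the original example should then refer to existing columns.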

Example 3:
Similar to Example 1, this is due to a missing dataset path. Please download a dataset from https://github.com/rapidsai/cugraph/datasets/...
Furthermore, this is not the right way to call symmetrize_ddf. You should import it like this:
from cugraph.structure.symmetrize import symmetrize_ddf
and call it like this:
sym_ddf = symmetrize_ddf(ddf, "src", "dst", "weight")
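The Dask examples cannot run until input_data_path points at a real file. For a quick smoke test without the repository datasets, one option is to write a tiny edge list yourself. The sketch below uses only the standard library. The helper name write_edge_list and the four edges are hypothetical.

```python
import os
import tempfile


def write_edge_list(edges, directory=None):
    """Write (src, dst, weight) rows as a space-delimited file; return its path."""
    fd, path = tempfile.mkstemp(suffix=".csv", dir=directory, text=True)
    with os.fdopen(fd, "w") as handle:
        for src, dst, weight in edges:
            handle.write(f"{src} {dst} {weight}\n")
    return path


# Made-up edges, only meant to give the examples something to read.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 0, 1.0)]
input_data_path = write_edge_list(edges)
print(input_data_path)
```

The resulting path can then be used wherever the documentation examples reference input_data_path.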

Example 4:
Similar to Examples 1 and 3.

Example 5:
AttributeError: module 'cugraph' has no attribute 'get_two_hop_neighbors'

You should first create a cugraph Graph or DiGraph and then call get_two_hop_neighbors on it, like this: G.get_two_hop_neighbors()
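For readers unsure what the pairs passed on to jaccard represent, here is a plain-Python sketch of two-hop pairs on an undirected graph. It assumes a pair (u, w) qualifies when u and w are distinct and share at least one neighbor. This illustrates the idea only and is not cuGraph's implementation, whose output ordering and edge-case handling may differ.

```python
from collections import defaultdict


def two_hop_pairs(edges):
    """Return sorted (u, w) pairs of distinct vertices that share a neighbor."""
    adjacency = defaultdict(set)
    for src, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst].add(src)  # treat the graph as undirected

    pairs = set()
    for neighbors in adjacency.values():
        # Any two distinct neighbors of the same vertex are two hops apart.
        for u in neighbors:
            for w in neighbors:
                if u != w:
                    pairs.add((u, w))
    return sorted(pairs)


# Path graph 0 - 1 - 2 - 3 (made-up data).
print(two_hop_pairs([(0, 1), (1, 2), (2, 3)]))
# → [(0, 2), (1, 3), (2, 0), (3, 1)]
```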

Example 6:
The line df = cugraph.jaccard_w(G, M[2]) is wrong because the second argument should be a cudf.DataFrame containing the vertex identifiers with their corresponding weights. The same applies to df = cugraph.overlap_w(G, M[2]).
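To make the per-vertex weights concrete, here is a small plain-Python sketch. It assumes that weighted Jaccard sums vertex weights over the intersection and over the union of two neighbor sets and then takes the ratio. The function name, weights and neighbor sets are hypothetical, and this is not the cuGraph implementation.

```python
def weighted_jaccard(neighbors_u, neighbors_v, vertex_weight):
    """Ratio of summed vertex weights: intersection over union of neighbor sets."""
    shared = neighbors_u & neighbors_v
    combined = neighbors_u | neighbors_v
    union_weight = sum(vertex_weight[v] for v in combined)
    if union_weight == 0:
        return 0.0
    return sum(vertex_weight[v] for v in shared) / union_weight


# One weight per vertex (made-up values), not one weight per edge.
vertex_weight = {0: 1.0, 1: 2.0, 2: 1.0, 3: 4.0}

# Neighbor sets of two hypothetical vertices.
score = weighted_jaccard({0, 1, 2}, {1, 2, 3}, vertex_weight)
print(score)  # → 0.375
```

The point of the sketch is the shape of the input: the weights are keyed by vertex, which is why passing the edge-weight column M[2] does not fit.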

Example 7:
We will provide docstrings as well as use-case examples in a PR that will be out shortly.

rapids-bot bot pushed a commit that referenced this issue Oct 18, 2021
Update cuGraph documentation and examples
closes #1862

Authors:
  - Joseph Nke (https://github.com/jnke2016)

Approvers:
  - Brad Rees (https://github.com/BradReesWork)
  - Rick Ratzel (https://github.com/rlratzel)

URL: #1866