Commit

moving notebook changes from 22.06 -> 22.08
nv-rliu committed Jun 1, 2022
1 parent 4922450 commit 0a1d86a
Showing 19 changed files with 49 additions and 56 deletions.
6 changes: 3 additions & 3 deletions notebooks/centrality/Betweenness.ipynb
@@ -6,7 +6,7 @@
"source": [
"# Betweenness Centrality\n",
"\n",
"In this notebook, we will compute the Betweenness centrality for both vertices and edges in our test datase using cuGraph and NetworkX. The NetworkX and cuGraph processes will be interleaved so that each step can be compared.\n",
"In this notebook, we will compute the Betweenness centrality for both vertices and edges in our test database using cuGraph and NetworkX. The NetworkX and cuGraph processes will be interleaved so that each step can be compared.\n",
"\n",
"Notebook Credits\n",
"* Original Authors: Bradley Rees\n",
@@ -25,7 +25,7 @@
"metadata": {},
"source": [
"## Introduction\n",
"Betweenness centrality is a measure of the relative importance based on measuring the number of shortest paths that pass through each vertex or over each edge . High betweenness centrality vertices have a greater number of path cross through the vertex. Likewise, high centrality edges have more shortest paths that pass over the edge.\n",
"Betweenness centrality is a measure of the relative importance based on measuring the number of shortest paths that pass through each vertex or over each edge. High betweenness centrality vertices have a greater number of path cross through the vertex. Likewise, high centrality edges have more shortest paths that pass over the edge.\n",
"\n",
"See [Betweenness on Wikipedia](https://en.wikipedia.org/wiki/Betweenness_centrality) for more details on the algorithm.\n",
"\n"
@@ -244,7 +244,7 @@
"metadata": {},
"outputs": [],
"source": [
"print_top_scores(vertex_bc, \"top vertice centrality scores\")\n",
"print_top_scores(vertex_bc, \"top vertex centrality scores\")\n",
"print_top_scores(edge_bc, \"top edge centrality scores\")"
]
},
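As an aside for readers skimming this file's changes: a minimal sketch of the vertex and edge betweenness calls the notebook describes might look like the following. The file path, delimiter, and column names here are illustrative assumptions, not code taken from the diff.

```python
import cudf
import cugraph

# Load the test edge list into a GPU DataFrame (path/delimiter are assumptions)
gdf = cudf.read_csv('../data/karate-data.csv', delimiter='\t',
                    names=['src', 'dst'], dtype=['int32', 'int32'])

G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

# Betweenness centrality for vertices and for edges
vertex_bc = cugraph.betweenness_centrality(G)
edge_bc = cugraph.edge_betweenness_centrality(G)

# Top vertex centrality scores
print(vertex_bc.sort_values(by='betweenness_centrality', ascending=False).head(5))
```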
6 changes: 3 additions & 3 deletions notebooks/centrality/Centrality.ipynb
@@ -53,7 +53,7 @@
" <img src=\"https://latex.codecogs.com/png.latex?c_B(v)&space;=\\sum_{s,t&space;\\in&space;V}&space;\\frac{\\sigma(s,&space;t|v)}{\\sigma(s,&space;t)}\" title=\"c_B(v) =\\sum_{s,t \\in V} \\frac{\\sigma(s, t|v)}{\\sigma(s, t)}\" />\n",
"</center>\n",
"\n",
"To speedup runtime of betweenness centrailty, the metric can be computed on a limited number of nodes (randomly selected) and then used to estimate the other scores. For this example, the graphs are relatively smalled (under 5,000 nodes) so betweenness on every node will be computed.\n",
"To speedup runtime of betweenness centrailty, the metric can be computed on a limited number of nodes (randomly selected) and then used to estimate the other scores. For this example, the graphs are relatively small (under 5,000 nodes) so betweenness on every node will be computed.\n",
"\n",
"___Eigenvector Centrality - coming soon___ <br>\n",
"Eigenvectors can be thought of as the balancing points of a graph, or center of gravity of a 3D object. High centrality means that more of the graph is balanced around that node.\n",
@@ -128,7 +128,7 @@
"outputs": [],
"source": [
"# Compute Centrality\n",
"# the centrality calls are very straight forward with the graph being the primary argument\n",
"# the centrality calls are very straightforward with the graph being the primary argument\n",
"# we are using the default argument values for all centrality functions\n",
"\n",
"def compute_centrality(_graph) :\n",
@@ -257,7 +257,7 @@
"metadata": {},
"source": [
"### Results\n",
"Typically, analyst look just at the top 10% of results. Basically just those vertices that are the most central or important. \n",
"Typically, analysts just look at the top 10% of results. Basically just those vertices that are the most central or important. \n",
"The karate data has 32 vertices, so let's round a little and look at the top 5 vertices"
]
},
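The `compute_centrality(_graph)` helper touched above is not shown in full in this diff; a rough guess at its shape, using only calls whose defaults match the comment about default arguments, might be:

```python
import cugraph

def compute_centrality(_graph):
    # each call takes the graph as its primary argument and uses default parameters
    bc = cugraph.betweenness_centrality(_graph)   # betweenness on every node (no sampling)
    katz = cugraph.katz_centrality(_graph)        # Katz centrality
    pr = cugraph.pagerank(_graph)                 # PageRank
    return bc, katz, pr
```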
2 changes: 1 addition & 1 deletion notebooks/centrality/Katz.ipynb
@@ -214,7 +214,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Call the Karz algorithm"
"### Call the Katz algorithm"
]
},
{
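The heading fixed in this hunk ("Call the Katz algorithm") sits above a call along these lines; the alpha value, dataset, and graph construction are illustrative assumptions:

```python
import cudf
import cugraph

gdf = cudf.read_csv('../data/karate-data.csv', delimiter='\t',
                    names=['src', 'dst'], dtype=['int32', 'int32'])
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

# Katz centrality with an explicit attenuation factor (value chosen for illustration)
katz_df = cugraph.katz_centrality(G, alpha=0.1)
print(katz_df.sort_values(by='katz_centrality', ascending=False).head(5))
```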
2 changes: 1 addition & 1 deletion notebooks/community/Louvain.ipynb
@@ -351,7 +351,7 @@
}
],
"source": [
"# How many Lieden partitions where found\n",
"# How many Leiden partitions were found\n",
"part_ids_l = df_l[\"partition\"].unique()\n",
"print(\"Leiden found \" + str(len(part_ids_l)) + \" partitions\")"
]
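For context, the partition-counting cell fixed above follows Louvain/Leiden calls roughly like this (a sketch; the dataset and graph construction are assumptions):

```python
import cudf
import cugraph

gdf = cudf.read_csv('../data/karate-data.csv', delimiter='\t',
                    names=['src', 'dst'], dtype=['int32', 'int32'])
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

# Both calls return a (partition DataFrame, modularity score) pair
df_lv, louvain_mod = cugraph.louvain(G)
df_l, leiden_mod = cugraph.leiden(G)

# How many Leiden partitions were found
part_ids_l = df_l["partition"].unique()
print("Leiden found " + str(len(part_ids_l)) + " partitions")
```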
6 changes: 3 additions & 3 deletions notebooks/community/Spectral-Clustering.ipynb
@@ -187,7 +187,7 @@
"metadata": {},
"outputs": [],
"source": [
"# The algorithm requires that there are edge weights. In this case all the weights are being ste to 1\n",
"# The algorithm requires that there are edge weights. In this case all the weights are being set to 1\n",
"gdf[\"data\"] = cudf.Series(np.ones(len(gdf), dtype=np.float32))"
]
},
@@ -197,7 +197,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Look at the first few data records - the output should be two colums src and dst\n",
"# Look at the first few data records - the output should be two columns: 'src' and 'dst'\n",
"gdf.head()"
]
},
@@ -234,7 +234,7 @@
"metadata": {},
"source": [
"----\n",
"#### Define and print function, but adjust vertex ID so that they match the illustration"
"#### Define and print function, but adjust vertex IDs so that they match the illustration"
]
},
{
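A minimal sketch of the weight-setting step above and the clustering call it feeds; the cluster count and dataset are assumptions for illustration:

```python
import numpy as np
import cudf
import cugraph

gdf = cudf.read_csv('../data/karate-data.csv', delimiter='\t',
                    names=['src', 'dst'], dtype=['int32', 'int32'])

# The algorithm requires edge weights; here they are all set to 1
gdf['data'] = cudf.Series(np.ones(len(gdf), dtype=np.float32))

G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst', edge_attr='data')

# Balanced-cut spectral clustering into an illustrative number of clusters
sc = cugraph.spectralBalancedCutClustering(G, num_clusters=2)
print(sc.head())
```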
2 changes: 1 addition & 1 deletion notebooks/community/Triangle-Counting.ipynb
@@ -184,7 +184,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's seet how that compares to cuGraph\n",
"Let's see how that compares to cuGraph\n",
"\n",
"----"
]
2 changes: 1 addition & 1 deletion notebooks/components/ConnectedComponents.ipynb
@@ -144,7 +144,7 @@
"# Test file\n",
"datafile='../data/netscience.csv'\n",
"\n",
"# the datafile contains three columns,but we only want to use the first two. \n",
"# the datafile contains three columns, but we only want to use the first two. \n",
"# We will use the \"usecols' feature of read_csv to ignore that column\n",
"\n",
"gdf = cudf.read_csv(datafile, delimiter=' ', names=['src', 'dst', 'wgt'], dtype=['int32', 'int32', 'float32'], usecols=['src', 'dst'])\n",
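The `usecols` read shown above feeds a connected-components call roughly like this; the components call and its output column names are assumptions based on the cuGraph API, not lines from the diff:

```python
import cudf
import cugraph

datafile = '../data/netscience.csv'

# read only the first two of the three columns, as the comment describes
gdf = cudf.read_csv(datafile, delimiter=' ',
                    names=['src', 'dst', 'wgt'],
                    dtype=['int32', 'int32', 'float32'],
                    usecols=['src', 'dst'])

G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

# weakly connected components; 'labels' identifies each vertex's component
wcc = cugraph.weakly_connected_components(G)
print(wcc['labels'].nunique(), 'components')
```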
2 changes: 1 addition & 1 deletion notebooks/cores/kcore.ipynb
@@ -220,7 +220,7 @@
"metadata": {},
"source": [
"### Just for fun\n",
"Let's try specifying a K value. Looking at the original network picture, it is easy to see that most vertices has at least degree two. \n",
"Let's try specifying a K value. Looking at the original network picture, it is easy to see that most vertices have at least degree two. \n",
"If we specify k = 2 then only one vertex should be dropped "
]
},
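A short sketch of the k=2 experiment the fixed sentence describes (dataset and graph construction are assumptions):

```python
import cudf
import cugraph

gdf = cudf.read_csv('../data/karate-data.csv', delimiter='\t',
                    names=['src', 'dst'], dtype=['int32', 'int32'])
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

# k-core with an explicit k: vertices of degree < 2 are peeled away
kcore2 = cugraph.k_core(G, k=2)
print(G.number_of_vertices(), '->', kcore2.number_of_vertices(), 'vertices')
```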
2 changes: 1 addition & 1 deletion notebooks/demo/batch_betweenness.ipynb
@@ -16,7 +16,7 @@
"metadata": {},
"source": [
"## Introduction\n",
"Betweennes Centrality can be slow to compute on large graphs, in order to speed up the process we can leverage multiple GPUs.\n",
"Betweenness Centrality can be slow to compute on large graphs, in order to speed up the process we can leverage multiple GPUs.\n",
"In this notebook we will showcase how it would have been done with a Single GPU approach, then we will show how it can be done using multiple GPUs."
]
},
2 changes: 1 addition & 1 deletion notebooks/link_analysis/HITS.ipynb
@@ -185,7 +185,7 @@
"metadata": {},
"source": [
"Running NetworkX is that easy. \n",
"Let's seet how that compares to cuGraph\n",
"Let's see how that compares to cuGraph\n",
"\n",
"----"
]
10 changes: 5 additions & 5 deletions notebooks/link_analysis/Pagerank.ipynb
@@ -160,7 +160,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Read the data, this also created a NetworkX Graph \n",
"# Read the data, this also creates a NetworkX Graph \n",
"file = open(datafile, 'rb')\n",
"Gnx = nx.read_edgelist(file)"
]
@@ -232,7 +232,7 @@
"metadata": {},
"source": [
"Running NetworkX is that easy. \n",
"Let's seet how that compares to cuGraph\n",
"Let's see how that compares to cuGraph\n",
"\n",
"----"
]
@@ -335,7 +335,7 @@
],
"source": [
"# Find the most important vertex using the scores\n",
"# This methods should only be used for small graph\n",
"# These methods should only be used for small graph\n",
"bestScore = gdf_page['pagerank'][0]\n",
"bestVert = gdf_page['vertex'][0]\n",
"\n",
@@ -351,7 +351,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The top PageRank vertex and socre match what was found by NetworkX"
"The top PageRank vertex and score match what was found by NetworkX"
]
},
{
@@ -360,7 +360,7 @@
"metadata": {},
"outputs": [],
"source": [
"# A better way to do that would be to find the max and then use that values in a query\n",
"# A better way to do that would be to find the max and then use the values in a query\n",
"pr_max = gdf_page['pagerank'].max()"
]
},
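The max-then-query pattern the fixed comment recommends looks roughly like this; the graph construction is an assumption, and the query line mirrors the notebook's intent:

```python
import cudf
import cugraph

gdf = cudf.read_csv('../data/karate-data.csv', delimiter='\t',
                    names=['src', 'dst'], dtype=['int32', 'int32'])
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

gdf_page = cugraph.pagerank(G)

# find the max score, then use that value in a query
pr_max = gdf_page['pagerank'].max()
print(gdf_page.query('pagerank == @pr_max'))
```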
12 changes: 6 additions & 6 deletions notebooks/link_prediction/Jaccard-Similarity.ipynb
@@ -253,7 +253,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Look at the first few data records - the output should be two colums src and dst\n",
"# Look at the first few data records - the output should be two columns: 'src' and 'dst'\n",
"gdf.head()"
]
},
@@ -311,7 +311,7 @@
"outputs": [],
"source": [
"#%%time\n",
"# Call cugraph.nvJaccard \n",
"# Call cugraph.nvJaccard\n",
"jdf = cugraph.jaccard(G)"
]
},
@@ -424,7 +424,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Call Pagerank on the graph to get weights to use:\n",
"# Call PageRank on the graph to get weights to use:\n",
"pr_df = cugraph.pagerank(G)"
]
},
@@ -434,7 +434,7 @@
"metadata": {},
"outputs": [],
"source": [
"# take a peek at the page rank values\n",
"# take a peek at the PageRank values\n",
"pr_df.head()"
]
},
@@ -451,8 +451,8 @@
"metadata": {},
"outputs": [],
"source": [
"pr_df.rename(columns={'pagerank': 'weight'}, inplace=True)",
"# Call weighted Jaccard using the Pagerank scores as weights:\n",
"pr_df.rename(columns={'pagerank': 'weight'}, inplace=True)\n",
"# Call weighted Jaccard using the PageRank scores as weights:\n",
"wdf = cugraph.jaccard_w(G, pr_df)"
]
},
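Putting the pieces from these hunks together, the weighted-Jaccard flow is roughly the following; this is a sketch that assumes the karate edge list, while the rename and jaccard_w call mirror the cells above:

```python
import cudf
import cugraph

gdf = cudf.read_csv('../data/karate-data.csv', delimiter='\t',
                    names=['src', 'dst'], dtype=['int32', 'int32'])
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

# plain (unweighted) Jaccard
jdf = cugraph.jaccard(G)

# weighted Jaccard, using PageRank scores as the per-vertex weights
pr_df = cugraph.pagerank(G)
pr_df.rename(columns={'pagerank': 'weight'}, inplace=True)
wdf = cugraph.jaccard_w(G, pr_df)
print(wdf.sort_values(by='jaccard_coeff', ascending=False).head(5))
```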
4 changes: 2 additions & 2 deletions notebooks/link_prediction/Overlap-Similarity.ipynb
@@ -271,7 +271,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Look at the first few data records - the output should be two colums src and dst\n",
"# Look at the first few data records - the output should be two columns: 'src' and 'dst'\n",
"gdf.head()"
]
},
@@ -467,7 +467,7 @@
"outputs": [],
"source": [
"# print all similarities over a threshold, in this case 0.5\n",
"#also, drop duplicates\n",
"# also, drop duplicates\n",
"odf_s2 = ol2.query('source < destination').sort_values(by='overlap_coeff', ascending=False)\n",
"\n",
"print_overlap_threshold(odf_s2, 0.74)"
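The dedup-and-threshold step fixed above fits into a flow roughly like this; the threshold value and dataset are illustrative:

```python
import cudf
import cugraph

gdf = cudf.read_csv('../data/karate-data.csv', delimiter='\t',
                    names=['src', 'dst'], dtype=['int32', 'int32'])
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

ol = cugraph.overlap(G)

# keep one direction of each pair, sort, then print everything over a threshold
odf_s = ol.query('source < destination').sort_values(by='overlap_coeff', ascending=False)
print(odf_s.query('overlap_coeff > 0.74'))
```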
7 changes: 0 additions & 7 deletions notebooks/sampling/RandomWalk.ipynb
@@ -162,13 +162,6 @@
"\n",
"Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
6 changes: 3 additions & 3 deletions notebooks/structure/Renumber-2.ipynb
@@ -133,17 +133,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The data has 2.5 million edges that span a range of 3,758,096,389 \n",
"The data has 2.5 million edges that span a range of 3,758,096,389.\n",
"Even if every vertex ID was unique per edge, that would only be 5 million values versus the 3.7 billion that is currently there. \n",
"In the curret state, the produced matrix would 3.7 billion by 3.7 billion - that is a lot of wasted space."
"In the current state, the produced matrix would 3.7 billion by 3.7 billion - that is a lot of wasted space."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Time to Renumber\n",
"One good best practice is to have the returned edge pairs appended to the original dataframe. That will help merge results back into the datasets"
"One good best practice is to have the returned edge pairs appended to the original Dataframe. That will help merge results back into the datasets"
]
},
{
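The memory argument above is easiest to see with automatic renumbering at graph-construction time; a tiny sketch, with vertex values made up to mimic the sparse-ID situation the notebook describes:

```python
import cudf
import cugraph

# vertex IDs that span a huge, sparse range (illustrative values)
gdf = cudf.DataFrame({'src': [2884587520, 167772161, 3232235777],
                      'dst': [167772161, 3232235777, 2884587520]})

# renumber=True maps the sparse IDs to a dense internal range,
# so the graph is sized by unique vertices rather than by the largest ID
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst', renumber=True)
print(G.number_of_vertices())   # 3, not billions
```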
18 changes: 9 additions & 9 deletions notebooks/structure/Renumber.ipynb
@@ -8,9 +8,9 @@
"\n",
"In this notebook, we will use the _renumber_ function to compute new vertex IDs.\n",
"\n",
"Under the covers, cuGraph represents a graph as a matrix in Compressed Sparse Row format (see https://en.wikipedia.org/wiki/Sparse_matrix). The problem with a matrix representation is that there is a column and row for every possible vertex ID. Therefore, if the data contains vertex IDs that are non-contiguious, or which start at a large initial value, then there is a lot of empty space that uses up memory. \n",
"Under the covers, cuGraph represents a graph as a matrix in Compressed Sparse Row format (see https://en.wikipedia.org/wiki/Sparse_matrix). The problem with a matrix representation is that there is a column and row for every possible vertex ID. Therefore, if the data contains vertex IDs that are non-contiguous, or which start at a large initial value, then there is a lot of empty space that uses up memory. \n",
"\n",
"An alternative case is using renumbering to convert from one data type down to a contiguious sequence of integer IDs. This is useful when the dataset contain vertex IDs that are not integers. \n",
"An alternative case is using renumbering to convert from one data type down to a contiguous sequence of integer IDs. This is useful when the dataset contain vertex IDs that are not integers. \n",
"\n",
"\n",
"Notebook Credits\n",
@@ -28,19 +28,19 @@
"\n",
"Demonstrate creating a graph with renumbering.\n",
"\n",
"Most cugraph algorithms operate on a CSR representation of a graph. A CSR representation requires an indices array that is as long as the number of edges and an offsets array that is as 1 more than the largest vertex id. This makes the memory utilization entirely dependent on the size of the largest vertex id. For data sets that have a sparse range of vertex ids, the size of the CSR can be unnecessarily large. It is easy to construct an example where the amount of memory required for the offsets array will exceed the amount of memory in the GPU (not to mention the performance cost of having a large number of offsets that are empty but still have to be read to be skipped).\n",
"Most cuGraph algorithms operate on a CSR representation of a graph. A CSR representation requires an indices array that is as long as the number of edges and an offsets array that is as 1 more than the largest vertex id. This makes the memory utilization entirely dependent on the size of the largest vertex id. For data sets that have a sparse range of vertex ids, the size of the CSR can be unnecessarily large. It is easy to construct an example where the amount of memory required for the offsets array will exceed the amount of memory in the GPU (not to mention the performance cost of having a large number of offsets that are empty but still have to be read to be skipped).\n",
"\n",
"The renumbering feature allows us to generate unique identifiers for every vertex identified in the input data frame.\n",
"\n",
"Renumbering can happen automatically as part of graph generation. It can also be done explicitely by the caller, this notebook will provide examples using both techniques.\n",
"Renumbering can happen automatically as part of graph generation. It can also be done explicitly by the caller, this notebook will provide examples using both techniques.\n",
"\n",
"The fundamental requirement for the user of the renumbering software is to specify how to identify a vertex. We will refer to this as the *external* vertex identifier. This will typically be done by specifying a cuDF DataFrame, and then identifying which columns within the DataFrame constitute source vertices and which columns specify destination columns.\n",
"\n",
"Let us consider that a vertex is uniquely defined as a tuple of elements from the rows of a cuDF DataFrame. The primary restriction is that the number of elements in the tuple must be the same for both source vertices and destination vertices, and that the types of each element in the source tuple must be the same as the corresponding element in the destination tuple. This restriction is a natural restriction and should be obvious why this is required.\n",
"\n",
"Renumbering takes the collection of tuples that uniquely identify vertices in the graph, eliminates duplicates, and assigns integer identifiers to the unique tuples. These integer identifiers are used as *internal* vertex identifiers within the cuGraph software.\n",
"\n",
"One of the features of the renumbering function is that it maps vertex ids of any size and structure down into a range that fits into 32-bit integers. The current cugraph algorithms are limited to 32-bit signed integers as vertex ids. and the renumbering feature will allow the caller to translate ids that are 64-bit (or strings, or complex data types) into a densly packed 32-bit array of ids that can be used in cugraph algorithms. Note that if there are more than 2^31 - 1 unique vertex ids then the renumber method will fail with an error indicating that there are too many vertices to renumber into a 32-bit signed integer."
"One of the features of the renumbering function is that it maps vertex ids of any size and structure down into a range that fits into 32-bit integers. The current cuGraph algorithms are limited to 32-bit signed integers as vertex ids. and the renumbering feature will allow the caller to translate ids that are 64-bit (or strings, or complex data types) into a densely packed 32-bit array of ids that can be used in cuGraph algorithms. Note that if there are more than 2^31 - 1 unique vertex ids then the renumber method will fail with an error indicating that there are too many vertices to renumber into a 32-bit signed integer."
]
},
{
@@ -99,7 +99,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Create our GPU data frame"
"# Create our GPU Dataframe"
]
},
{
@@ -246,11 +246,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Convert vertex ids back\n",
"# Convert vertex IDs back\n",
"\n",
"To be relevant, we probably want the vertex ids converted back into the original ids. This can be done by the NumberMap object.\n",
"\n",
"Note again, the unrenumber call does not guarantee order. If order matters you would need to do something to regenerate the desired order."
"Note again, the un-renumber call does not guarantee order. If order matters you would need to do something to regenerate the desired order."
]
},
{
@@ -268,7 +268,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Try to run jaccard\n",
"# Try to run Jaccard\n",
"\n",
"Not at all an interesting result, but it demonstrates a more complicated case. Jaccard returns a coefficient for each edge. In order to show the original ids we need to add columns to the data frame for each column that contains one of renumbered vertices. In this case, the columns source and destination contain renumbered vertex ids."
]
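Finally, a rough end-to-end sketch of the explicit renumber / unrenumber flow this notebook walks through. The renumbered output column names ('renumbered_src', 'renumbered_dst'), the use of PageRank as the algorithm, and the data values are all assumptions for illustration and should be checked against the cuGraph release in use:

```python
import cudf
import cugraph
from cugraph.structure import NumberMap

# edge list keyed by large, sparse 64-bit IDs (illustrative values)
df = cudf.DataFrame({'src': [2**40 + 1, 2**40 + 2, 2**40 + 3],
                     'dst': [2**40 + 2, 2**40 + 3, 2**40 + 1]})

# explicit renumbering: external IDs -> contiguous 32-bit internal IDs,
# plus a NumberMap that can translate results back later
renumbered_df, numbering = NumberMap.renumber(df, 'src', 'dst')

G = cugraph.Graph()
G.from_cudf_edgelist(renumbered_df,
                     source='renumbered_src',        # assumed output column name
                     destination='renumbered_dst',   # assumed output column name
                     renumber=False)

# run an algorithm on the internal IDs, then convert the IDs back
pr = cugraph.pagerank(G)
pr = numbering.unrenumber(pr, 'vertex')
print(pr.head())
```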

0 comments on commit 0a1d86a
