Commit

moving notebook changes from 22.06 -> 22.08
nv-rliu committed Jun 1, 2022
1 parent 4922450 commit 0a1d86a
Showing 19 changed files with 49 additions and 56 deletions.
6 changes: 3 additions & 3 deletions notebooks/centrality/Betweenness.ipynb
@@ -6,7 +6,7 @@
"source": [
"# Betweenness Centrality\n",
"\n",
"In this notebook, we will compute the Betweenness centrality for both vertices and edges in our test datase using cuGraph and NetworkX. The NetworkX and cuGraph processes will be interleaved so that each step can be compared.\n",
"In this notebook, we will compute the Betweenness centrality for both vertices and edges in our test database using cuGraph and NetworkX. The NetworkX and cuGraph processes will be interleaved so that each step can be compared.\n",
"\n",
"Notebook Credits\n",
"* Original Authors: Bradley Rees\n",
@@ -25,7 +25,7 @@
"metadata": {},
"source": [
"## Introduction\n",
"Betweenness centrality is a measure of the relative importance based on measuring the number of shortest paths that pass through each vertex or over each edge . High betweenness centrality vertices have a greater number of path cross through the vertex. Likewise, high centrality edges have more shortest paths that pass over the edge.\n",
"Betweenness centrality is a measure of the relative importance based on measuring the number of shortest paths that pass through each vertex or over each edge. High betweenness centrality vertices have a greater number of path cross through the vertex. Likewise, high centrality edges have more shortest paths that pass over the edge.\n",
"\n",
"See [Betweenness on Wikipedia](https://en.wikipedia.org/wiki/Betweenness_centrality) for more details on the algorithm.\n",
"\n"
@@ -244,7 +244,7 @@
"metadata": {},
"outputs": [],
"source": [
"print_top_scores(vertex_bc, \"top vertice centrality scores\")\n",
"print_top_scores(vertex_bc, \"top vertex centrality scores\")\n",
"print_top_scores(edge_bc, \"top edge centrality scores\")"
]
},
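As an aside for readers skimming this file's changes: a minimal sketch of the vertex and edge betweenness calls the notebook describes might look like the following. The file path, delimiter, and column names here are illustrative assumptions, not code taken from the diff.

```python
import cudf
import cugraph

# Load the test edge list into a GPU DataFrame (path/delimiter are assumptions)
gdf = cudf.read_csv('../data/karate-data.csv', delimiter='\t',
                    names=['src', 'dst'], dtype=['int32', 'int32'])

G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

# Betweenness centrality for vertices and for edges
vertex_bc = cugraph.betweenness_centrality(G)
edge_bc = cugraph.edge_betweenness_centrality(G)

# Top vertex centrality scores
print(vertex_bc.sort_values(by='betweenness_centrality', ascending=False).head(5))
```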
6 changes: 3 additions & 3 deletions notebooks/centrality/Centrality.ipynb
@@ -53,7 +53,7 @@
" <img src=\"https://latex.codecogs.com/png.latex?c_B(v)&space;=\\sum_{s,t&space;\\in&space;V}&space;\\frac{\\sigma(s,&space;t|v)}{\\sigma(s,&space;t)}\" title=\"c_B(v) =\\sum_{s,t \\in V} \\frac{\\sigma(s, t|v)}{\\sigma(s, t)}\" />\n",
"</center>\n",
"\n",
"To speedup runtime of betweenness centrailty, the metric can be computed on a limited number of nodes (randomly selected) and then used to estimate the other scores. For this example, the graphs are relatively smalled (under 5,000 nodes) so betweenness on every node will be computed.\n",
"To speedup runtime of betweenness centrailty, the metric can be computed on a limited number of nodes (randomly selected) and then used to estimate the other scores. For this example, the graphs are relatively small (under 5,000 nodes) so betweenness on every node will be computed.\n",
"\n",
"___Eigenvector Centrality - coming soon___ <br>\n",
"Eigenvectors can be thought of as the balancing points of a graph, or center of gravity of a 3D object. High centrality means that more of the graph is balanced around that node.\n",
@@ -128,7 +128,7 @@
"outputs": [],
"source": [
"# Compute Centrality\n",
"# the centrality calls are very straight forward with the graph being the primary argument\n",
"# the centrality calls are very straightforward with the graph being the primary argument\n",
"# we are using the default argument values for all centrality functions\n",
"\n",
"def compute_centrality(_graph) :\n",
@@ -257,7 +257,7 @@
"metadata": {},
"source": [
"### Results\n",
"Typically, analyst look just at the top 10% of results. Basically just those vertices that are the most central or important. \n",
"Typically, analysts just look at the top 10% of results. Basically just those vertices that are the most central or important. \n",
"The karate data has 32 vertices, so let's round a little and look at the top 5 vertices"
]
},
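The `compute_centrality(_graph)` helper touched above is not shown in full in this diff; a rough guess at its shape, using only calls whose defaults match the comment about default arguments, might be:

```python
import cugraph

def compute_centrality(_graph):
    # each call takes the graph as its primary argument and uses default parameters
    bc = cugraph.betweenness_centrality(_graph)   # betweenness on every node (no sampling)
    katz = cugraph.katz_centrality(_graph)        # Katz centrality
    pr = cugraph.pagerank(_graph)                 # PageRank
    return bc, katz, pr
```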
2 changes: 1 addition & 1 deletion notebooks/centrality/Katz.ipynb
@@ -214,7 +214,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Call the Karz algorithm"
"### Call the Katz algorithm"
]
},
{
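The heading fixed in this hunk ("Call the Katz algorithm") sits above a call along these lines; the alpha value, dataset, and graph construction are illustrative assumptions:

```python
import cudf
import cugraph

gdf = cudf.read_csv('../data/karate-data.csv', delimiter='\t',
                    names=['src', 'dst'], dtype=['int32', 'int32'])
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

# Katz centrality with an explicit attenuation factor (value chosen for illustration)
katz_df = cugraph.katz_centrality(G, alpha=0.1)
print(katz_df.sort_values(by='katz_centrality', ascending=False).head(5))
```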
2 changes: 1 addition & 1 deletion notebooks/community/Louvain.ipynb
@@ -351,7 +351,7 @@
}
],
"source": [
"# How many Lieden partitions where found\n",
"# How many Leiden partitions were found\n",
"part_ids_l = df_l[\"partition\"].unique()\n",
"print(\"Leiden found \" + str(len(part_ids_l)) + \" partitions\")"
]
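For context, the partition-counting cell fixed above follows Louvain/Leiden calls roughly like this (a sketch; the dataset and graph construction are assumptions):

```python
import cudf
import cugraph

gdf = cudf.read_csv('../data/karate-data.csv', delimiter='\t',
                    names=['src', 'dst'], dtype=['int32', 'int32'])
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

# Both calls return a (partition DataFrame, modularity score) pair
df_lv, louvain_mod = cugraph.louvain(G)
df_l, leiden_mod = cugraph.leiden(G)

# How many Leiden partitions were found
part_ids_l = df_l["partition"].unique()
print("Leiden found " + str(len(part_ids_l)) + " partitions")
```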
6 changes: 3 additions & 3 deletions notebooks/community/Spectral-Clustering.ipynb
@@ -187,7 +187,7 @@
"metadata": {},
"outputs": [],
"source": [
"# The algorithm requires that there are edge weights. In this case all the weights are being ste to 1\n",
"# The algorithm requires that there are edge weights. In this case all the weights are being set to 1\n",
"gdf[\"data\"] = cudf.Series(np.ones(len(gdf), dtype=np.float32))"
]
},
@@ -197,7 +197,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Look at the first few data records - the output should be two colums src and dst\n",
"# Look at the first few data records - the output should be two columns: 'src' and 'dst'\n",
"gdf.head()"
]
},
@@ -234,7 +234,7 @@
"metadata": {},
"source": [
"----\n",
"#### Define and print function, but adjust vertex ID so that they match the illustration"
"#### Define and print function, but adjust vertex IDs so that they match the illustration"
]
},
{
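A minimal sketch of the weight-setting step above and the clustering call it feeds; the cluster count and dataset are assumptions for illustration:

```python
import numpy as np
import cudf
import cugraph

gdf = cudf.read_csv('../data/karate-data.csv', delimiter='\t',
                    names=['src', 'dst'], dtype=['int32', 'int32'])

# The algorithm requires edge weights; here they are all set to 1
gdf['data'] = cudf.Series(np.ones(len(gdf), dtype=np.float32))

G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst', edge_attr='data')

# Balanced-cut spectral clustering into an illustrative number of clusters
sc = cugraph.spectralBalancedCutClustering(G, num_clusters=2)
print(sc.head())
```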
2 changes: 1 addition & 1 deletion notebooks/community/Triangle-Counting.ipynb
@@ -184,7 +184,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's seet how that compares to cuGraph\n",
"Let's see how that compares to cuGraph\n",
"\n",
"----"
]
2 changes: 1 addition & 1 deletion notebooks/components/ConnectedComponents.ipynb
@@ -144,7 +144,7 @@
"# Test file\n",
"datafile='../data/netscience.csv'\n",
"\n",
"# the datafile contains three columns,but we only want to use the first two. \n",
"# the datafile contains three columns, but we only want to use the first two. \n",
"# We will use the \"usecols' feature of read_csv to ignore that column\n",
"\n",
"gdf = cudf.read_csv(datafile, delimiter=' ', names=['src', 'dst', 'wgt'], dtype=['int32', 'int32', 'float32'], usecols=['src', 'dst'])\n",
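The `usecols` read shown above feeds a connected-components call roughly like this; the components call and its output column names are assumptions based on the cuGraph API, not lines from the diff:

```python
import cudf
import cugraph

datafile = '../data/netscience.csv'

# read only the first two of the three columns, as the comment describes
gdf = cudf.read_csv(datafile, delimiter=' ',
                    names=['src', 'dst', 'wgt'],
                    dtype=['int32', 'int32', 'float32'],
                    usecols=['src', 'dst'])

G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

# weakly connected components; 'labels' identifies each vertex's component
wcc = cugraph.weakly_connected_components(G)
print(wcc['labels'].nunique(), 'components')
```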
2 changes: 1 addition & 1 deletion notebooks/cores/kcore.ipynb
@@ -220,7 +220,7 @@
"metadata": {},
"source": [
"### Just for fun\n",
"Let's try specifying a K value. Looking at the original network picture, it is easy to see that most vertices has at least degree two. \n",
"Let's try specifying a K value. Looking at the original network picture, it is easy to see that most vertices have at least degree two. \n",
"If we specify k = 2 then only one vertex should be dropped "
]
},
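A short sketch of the k=2 experiment the fixed sentence describes (dataset and graph construction are assumptions):

```python
import cudf
import cugraph

gdf = cudf.read_csv('../data/karate-data.csv', delimiter='\t',
                    names=['src', 'dst'], dtype=['int32', 'int32'])
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

# k-core with an explicit k: vertices of degree < 2 are peeled away
kcore2 = cugraph.k_core(G, k=2)
print(G.number_of_vertices(), '->', kcore2.number_of_vertices(), 'vertices')
```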
2 changes: 1 addition & 1 deletion notebooks/demo/batch_betweenness.ipynb
@@ -16,7 +16,7 @@
"metadata": {},
"source": [
"## Introduction\n",
"Betweennes Centrality can be slow to compute on large graphs, in order to speed up the process we can leverage multiple GPUs.\n",
"Betweenness Centrality can be slow to compute on large graphs, in order to speed up the process we can leverage multiple GPUs.\n",
"In this notebook we will showcase how it would have been done with a Single GPU approach, then we will show how it can be done using multiple GPUs."
]
},
2 changes: 1 addition & 1 deletion notebooks/link_analysis/HITS.ipynb
@@ -185,7 +185,7 @@
"metadata": {},
"source": [
"Running NetworkX is that easy. \n",
"Let's seet how that compares to cuGraph\n",
"Let's see how that compares to cuGraph\n",
"\n",
"----"
]
10 changes: 5 additions & 5 deletions notebooks/link_analysis/Pagerank.ipynb
@@ -160,7 +160,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Read the data, this also created a NetworkX Graph \n",
"# Read the data, this also creates a NetworkX Graph \n",
"file = open(datafile, 'rb')\n",
"Gnx = nx.read_edgelist(file)"
]
@@ -232,7 +232,7 @@
"metadata": {},
"source": [
"Running NetworkX is that easy. \n",
"Let's seet how that compares to cuGraph\n",
"Let's see how that compares to cuGraph\n",
"\n",
"----"
]
@@ -335,7 +335,7 @@
],
"source": [
"# Find the most important vertex using the scores\n",
"# This methods should only be used for small graph\n",
"# These methods should only be used for small graph\n",
"bestScore = gdf_page['pagerank'][0]\n",
"bestVert = gdf_page['vertex'][0]\n",
"\n",
@@ -351,7 +351,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The top PageRank vertex and socre match what was found by NetworkX"
"The top PageRank vertex and score match what was found by NetworkX"
]
},
{
@@ -360,7 +360,7 @@
"metadata": {},
"outputs": [],
"source": [
"# A better way to do that would be to find the max and then use that values in a query\n",
"# A better way to do that would be to find the max and then use the values in a query\n",
"pr_max = gdf_page['pagerank'].max()"
]
},
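The max-then-query pattern the fixed comment recommends looks roughly like this; the graph construction is an assumption, and the query line mirrors the notebook's intent:

```python
import cudf
import cugraph

gdf = cudf.read_csv('../data/karate-data.csv', delimiter='\t',
                    names=['src', 'dst'], dtype=['int32', 'int32'])
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

gdf_page = cugraph.pagerank(G)

# find the max score, then use that value in a query
pr_max = gdf_page['pagerank'].max()
print(gdf_page.query('pagerank == @pr_max'))
```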
12 changes: 6 additions & 6 deletions notebooks/link_prediction/Jaccard-Similarity.ipynb
@@ -253,7 +253,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Look at the first few data records - the output should be two colums src and dst\n",
"# Look at the first few data records - the output should be two columns: 'src' and 'dst'\n",
"gdf.head()"
]
},
@@ -311,7 +311,7 @@
"outputs": [],
"source": [
"#%%time\n",
"# Call cugraph.nvJaccard \n",
"# Call cugraph.nvJaccard\n",
"jdf = cugraph.jaccard(G)"
]
},
@@ -424,7 +424,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Call Pagerank on the graph to get weights to use:\n",
"# Call PageRank on the graph to get weights to use:\n",
"pr_df = cugraph.pagerank(G)"
]
},
@@ -434,7 +434,7 @@
"metadata": {},
"outputs": [],
"source": [
"# take a peek at the page rank values\n",
"# take a peek at the PageRank values\n",
"pr_df.head()"
]
},
@@ -451,8 +451,8 @@
"metadata": {},
"outputs": [],
"source": [
"pr_df.rename(columns={'pagerank': 'weight'}, inplace=True)",
"# Call weighted Jaccard using the Pagerank scores as weights:\n",
"pr_df.rename(columns={'pagerank': 'weight'}, inplace=True)\n",
"# Call weighted Jaccard using the PageRank scores as weights:\n",
"wdf = cugraph.jaccard_w(G, pr_df)"
]
},
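Putting the pieces from these hunks together, the weighted-Jaccard flow is roughly the following; this is a sketch that assumes the karate edge list, while the rename and jaccard_w call mirror the cells above:

```python
import cudf
import cugraph

gdf = cudf.read_csv('../data/karate-data.csv', delimiter='\t',
                    names=['src', 'dst'], dtype=['int32', 'int32'])
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

# plain (unweighted) Jaccard
jdf = cugraph.jaccard(G)

# weighted Jaccard, using PageRank scores as the per-vertex weights
pr_df = cugraph.pagerank(G)
pr_df.rename(columns={'pagerank': 'weight'}, inplace=True)
wdf = cugraph.jaccard_w(G, pr_df)
print(wdf.sort_values(by='jaccard_coeff', ascending=False).head(5))
```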
4 changes: 2 additions & 2 deletions notebooks/link_prediction/Overlap-Similarity.ipynb
@@ -271,7 +271,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Look at the first few data records - the output should be two colums src and dst\n",
"# Look at the first few data records - the output should be two columns: 'src' and 'dst'\n",
"gdf.head()"
]
},
@@ -467,7 +467,7 @@
"outputs": [],
"source": [
"# print all similarities over a threshold, in this case 0.5\n",
"#also, drop duplicates\n",
"# also, drop duplicates\n",
"odf_s2 = ol2.query('source < destination').sort_values(by='overlap_coeff', ascending=False)\n",
"\n",
"print_overlap_threshold(odf_s2, 0.74)"
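The dedup-and-threshold step fixed above fits into a flow roughly like this; the threshold value and dataset are illustrative:

```python
import cudf
import cugraph

gdf = cudf.read_csv('../data/karate-data.csv', delimiter='\t',
                    names=['src', 'dst'], dtype=['int32', 'int32'])
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

ol = cugraph.overlap(G)

# keep one direction of each pair, sort, then print everything over a threshold
odf_s = ol.query('source < destination').sort_values(by='overlap_coeff', ascending=False)
print(odf_s.query('overlap_coeff > 0.74'))
```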
7 changes: 0 additions & 7 deletions notebooks/sampling/RandomWalk.ipynb
@@ -162,13 +162,6 @@
"\n",
"Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
6 changes: 3 additions & 3 deletions notebooks/structure/Renumber-2.ipynb
@@ -133,17 +133,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The data has 2.5 million edges that span a range of 3,758,096,389 \n",
"The data has 2.5 million edges that span a range of 3,758,096,389.\n",
"Even if every vertex ID was unique per edge, that would only be 5 million values versus the 3.7 billion that is currently there. \n",
"In the curret state, the produced matrix would 3.7 billion by 3.7 billion - that is a lot of wasted space."
"In the current state, the produced matrix would 3.7 billion by 3.7 billion - that is a lot of wasted space."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Time to Renumber\n",
"One good best practice is to have the returned edge pairs appended to the original dataframe. That will help merge results back into the datasets"
"One good best practice is to have the returned edge pairs appended to the original Dataframe. That will help merge results back into the datasets"
]
},
{
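The memory argument above is easiest to see with automatic renumbering at graph-construction time; a tiny sketch, with vertex values made up to mimic the sparse-ID situation the notebook describes:

```python
import cudf
import cugraph

# vertex IDs that span a huge, sparse range (illustrative values)
gdf = cudf.DataFrame({'src': [2884587520, 167772161, 3232235777],
                      'dst': [167772161, 3232235777, 2884587520]})

# renumber=True maps the sparse IDs to a dense internal range,
# so the graph is sized by unique vertices rather than by the largest ID
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst', renumber=True)
print(G.number_of_vertices())   # 3, not billions
```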
18 changes: 9 additions & 9 deletions notebooks/structure/Renumber.ipynb
@@ -8,9 +8,9 @@
"\n",
"In this notebook, we will use the _renumber_ function to compute new vertex IDs.\n",
"\n",
"Under the covers, cuGraph represents a graph as a matrix in Compressed Sparse Row format (see https://en.wikipedia.org/wiki/Sparse_matrix). The problem with a matrix representation is that there is a column and row for every possible vertex ID. Therefore, if the data contains vertex IDs that are non-contiguious, or which start at a large initial value, then there is a lot of empty space that uses up memory. \n",
"Under the covers, cuGraph represents a graph as a matrix in Compressed Sparse Row format (see https://en.wikipedia.org/wiki/Sparse_matrix). The problem with a matrix representation is that there is a column and row for every possible vertex ID. Therefore, if the data contains vertex IDs that are non-contiguous, or which start at a large initial value, then there is a lot of empty space that uses up memory. \n",
"\n",
"An alternative case is using renumbering to convert from one data type down to a contiguious sequence of integer IDs. This is useful when the dataset contain vertex IDs that are not integers. \n",
"An alternative case is using renumbering to convert from one data type down to a contiguous sequence of integer IDs. This is useful when the dataset contain vertex IDs that are not integers. \n",
"\n",
"\n",
"Notebook Credits\n",
@@ -28,19 +28,19 @@
"\n",
"Demonstrate creating a graph with renumbering.\n",
"\n",
"Most cugraph algorithms operate on a CSR representation of a graph. A CSR representation requires an indices array that is as long as the number of edges and an offsets array that is as 1 more than the largest vertex id. This makes the memory utilization entirely dependent on the size of the largest vertex id. For data sets that have a sparse range of vertex ids, the size of the CSR can be unnecessarily large. It is easy to construct an example where the amount of memory required for the offsets array will exceed the amount of memory in the GPU (not to mention the performance cost of having a large number of offsets that are empty but still have to be read to be skipped).\n",
"Most cuGraph algorithms operate on a CSR representation of a graph. A CSR representation requires an indices array that is as long as the number of edges and an offsets array that is as 1 more than the largest vertex id. This makes the memory utilization entirely dependent on the size of the largest vertex id. For data sets that have a sparse range of vertex ids, the size of the CSR can be unnecessarily large. It is easy to construct an example where the amount of memory required for the offsets array will exceed the amount of memory in the GPU (not to mention the performance cost of having a large number of offsets that are empty but still have to be read to be skipped).\n",
"\n",
"The renumbering feature allows us to generate unique identifiers for every vertex identified in the input data frame.\n",
"\n",
"Renumbering can happen automatically as part of graph generation. It can also be done explicitely by the caller, this notebook will provide examples using both techniques.\n",
"Renumbering can happen automatically as part of graph generation. It can also be done explicitly by the caller, this notebook will provide examples using both techniques.\n",
"\n",
"The fundamental requirement for the user of the renumbering software is to specify how to identify a vertex. We will refer to this as the *external* vertex identifier. This will typically be done by specifying a cuDF DataFrame, and then identifying which columns within the DataFrame constitute source vertices and which columns specify destination columns.\n",
"\n",
"Let us consider that a vertex is uniquely defined as a tuple of elements from the rows of a cuDF DataFrame. The primary restriction is that the number of elements in the tuple must be the same for both source vertices and destination vertices, and that the types of each element in the source tuple must be the same as the corresponding element in the destination tuple. This restriction is a natural restriction and should be obvious why this is required.\n",
"\n",
"Renumbering takes the collection of tuples that uniquely identify vertices in the graph, eliminates duplicates, and assigns integer identifiers to the unique tuples. These integer identifiers are used as *internal* vertex identifiers within the cuGraph software.\n",
"\n",
"One of the features of the renumbering function is that it maps vertex ids of any size and structure down into a range that fits into 32-bit integers. The current cugraph algorithms are limited to 32-bit signed integers as vertex ids. and the renumbering feature will allow the caller to translate ids that are 64-bit (or strings, or complex data types) into a densly packed 32-bit array of ids that can be used in cugraph algorithms. Note that if there are more than 2^31 - 1 unique vertex ids then the renumber method will fail with an error indicating that there are too many vertices to renumber into a 32-bit signed integer."
"One of the features of the renumbering function is that it maps vertex ids of any size and structure down into a range that fits into 32-bit integers. The current cuGraph algorithms are limited to 32-bit signed integers as vertex ids. and the renumbering feature will allow the caller to translate ids that are 64-bit (or strings, or complex data types) into a densely packed 32-bit array of ids that can be used in cuGraph algorithms. Note that if there are more than 2^31 - 1 unique vertex ids then the renumber method will fail with an error indicating that there are too many vertices to renumber into a 32-bit signed integer."
]
},
{
@@ -99,7 +99,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Create our GPU data frame"
"# Create our GPU Dataframe"
]
},
{
@@ -246,11 +246,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Convert vertex ids back\n",
"# Convert vertex IDs back\n",
"\n",
"To be relevant, we probably want the vertex ids converted back into the original ids. This can be done by the NumberMap object.\n",
"\n",
"Note again, the unrenumber call does not guarantee order. If order matters you would need to do something to regenerate the desired order."
"Note again, the un-renumber call does not guarantee order. If order matters you would need to do something to regenerate the desired order."
]
},
{
@@ -268,7 +268,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Try to run jaccard\n",
"# Try to run Jaccard\n",
"\n",
"Not at all an interesting result, but it demonstrates a more complicated case. Jaccard returns a coefficient for each edge. In order to show the original ids we need to add columns to the data frame for each column that contains one of renumbered vertices. In this case, the columns source and destination contain renumbered vertex ids."
]
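Finally, a rough end-to-end sketch of the explicit renumber / unrenumber flow this notebook walks through. The renumbered output column names ('renumbered_src', 'renumbered_dst'), the use of PageRank as the algorithm, and the data values are all assumptions for illustration and should be checked against the cuGraph release in use:

```python
import cudf
import cugraph
from cugraph.structure import NumberMap

# edge list keyed by large, sparse 64-bit IDs (illustrative values)
df = cudf.DataFrame({'src': [2**40 + 1, 2**40 + 2, 2**40 + 3],
                     'dst': [2**40 + 2, 2**40 + 3, 2**40 + 1]})

# explicit renumbering: external IDs -> contiguous 32-bit internal IDs,
# plus a NumberMap that can translate results back later
renumbered_df, numbering = NumberMap.renumber(df, 'src', 'dst')

G = cugraph.Graph()
G.from_cudf_edgelist(renumbered_df,
                     source='renumbered_src',        # assumed output column name
                     destination='renumbered_dst',   # assumed output column name
                     renumber=False)

# run an algorithm on the internal IDs, then convert the IDs back
pr = cugraph.pagerank(G)
pr = numbering.unrenumber(pr, 'vertex')
print(pr.head())
```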

0 comments on commit 0a1d86a
