Fix pyspark docs (#48)

* Fix pyspark docs
* revert blog
* update readme version

---------

Co-authored-by: ohad <[email protected]>
apache · Dec 23, 2024 · c962d5e
1 parent 6fd6fc4

Showing 2 changed files with 17 additions and 21 deletions.
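
Reviewer note: both changed files document `dedup_with_order`, and the site guide now also shows `dedup_top_n`. For readers checking the updated examples, the behaviour these calls are expected to have can be sketched in plain Python, assuming `dedup_with_order` keeps the first row of each group after ordering and `dedup_top_n` keeps the first `n`. The function names and signatures below are hypothetical and are not datafu's implementation; the rows are only those visible in the hunks, not the full list from the docs.

```python
# Pure-Python sketch of the assumed semantics; NOT datafu's implementation.
# The *_sketch names and their parameters are hypothetical.

def dedup_with_order_sketch(rows, group_key, order_key, descending=True):
    """Keep one row per group: the first one after sorting by order_key."""
    first_per_group = {}
    for row in sorted(rows, key=order_key, reverse=descending):
        # setdefault only stores the first row seen for each group
        first_per_group.setdefault(group_key(row), row)
    return sorted(first_per_group.values(), key=group_key)


def dedup_top_n_sketch(rows, n, group_key, order_key, descending=True):
    """Keep up to n rows per group, taken in sorted order."""
    kept = {}
    for row in sorted(rows, key=order_key, reverse=descending):
        bucket = kept.setdefault(group_key(row), [])
        if len(bucket) < n:
            bucket.append(row)
    flat = [row for bucket in kept.values() for row in bucket]
    # sorted() is stable, so rows keep their order within each group
    return sorted(flat, key=group_key)


# Only the rows visible in the diff hunks (id, name, age)
people = [
    ("a", "Alice", 34),
    ("a", "Sara", 33),
    ("b", "Bob", 36),
    ("c", "Zoey", 36),
]

# Group by id; order by age descending, then name descending
by_id = lambda row: row[0]
by_age_then_name = lambda row: (row[2], row[1])

dedup = dedup_with_order_sketch(people, by_id, by_age_then_name)
top_2 = dedup_top_n_sketch(people, 2, by_id, by_age_then_name)

for row in dedup:
    print(row)
```

On these four rows the first sketch keeps Alice for group `a` (34 is greater than 33), while the top-2 variant keeps both Alice and Sara; groups `b` and `c` have a single row each and are unchanged.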
diff --git a/datafu-spark/README.md b/datafu-spark/README.md
@@ -34,19 +34,17 @@ In order to call the datafu-spark API's from Pyspark, you can do the following (
 First, call pyspark with the following parameters
 
 ```bash
-export PYTHONPATH=datafu-spark_2.11-1.8.0.jar
+export PYTHONPATH=datafu-spark_2.12-2.0.0.jar
 
-pyspark --jars datafu-spark_2.11-1.8.0.jar --conf spark.executorEnv.PYTHONPATH=datafu-spark_2.11-1.8.0.jar
+pyspark --jars datafu-spark_2.12-2.0.0.jar --conf spark.executorEnv.PYTHONPATH=datafu-spark_2.12-2.0.0.jar
 ```
 
 The following is an example of calling the Spark version of the datafu _dedup_ method
 
 ```python
-from pyspark_utils.df_utils import PySparkDFUtils
+from pyspark_utils import df_utils
 
-df_utils = PySparkDFUtils()
-
-df_people = sqlContext.createDataFrame([
+df_people = spark.createDataFrame([
      ("a", "Alice", 34),
      ("a", "Sara", 33),
      ("b", "Bob", 36),
@@ -57,12 +55,9 @@ df_people = sqlContext.createDataFrame([
      ("c", "Zoey", 36)],
      ["id", "name", "age"])
 
-func_dedup_res = df_utils.dedup_with_order(dataFrame=df_people, groupCol=df_people.id,
-                                orderCols=[df_people.age.desc(), df_people.name.desc()])
-
-func_dedup_res.registerTempTable("dedup")
-
-func_dedup_res.show()
+df_dedup = df_utils.dedup_with_order(df=df_people, group_col=df_people.id,
+                                     order_cols=[df_people.age.desc(), df_people.name.desc()])
+df_dedup.show()
 ```
 
 This should produce the following output

diff --git a/site/source/docs/spark/guide.html.markdown.erb b/site/source/docs/spark/guide.html.markdown.erb
@@ -43,11 +43,9 @@ pyspark --jars datafu-spark_2.12-<%= current_page.data.version %>-SNAPSHOT.jar -
 The following is an example of calling the Spark version of the datafu _dedup_ method
 
 ```python
-from pyspark_utils.df_utils import PySparkDFUtils
+from pyspark_utils import df_utils
 
-df_utils = PySparkDFUtils()
-
-df_people = sqlContext.createDataFrame([
+df_people = spark.createDataFrame([
      ("a", "Alice", 34),
      ("a", "Sara", 33),
      ("b", "Bob", 36),
@@ -58,12 +56,15 @@ df_people = sqlContext.createDataFrame([
      ("c", "Zoey", 36)],
      ["id", "name", "age"])
 
-func_dedup_res = df_utils.dedup_with_order(dataFrame=df_people, groupCol=df_people.id,
-                                orderCols=[df_people.age.desc(), df_people.name.desc()])
-
-func_dedup_res.registerTempTable("dedup")
+df_dedup = df_utils.dedup_with_order(df=df_people, group_col=df_people.id,
+                                     order_cols=[df_people.age.desc(), df_people.name.desc()])
+df_dedup.show()
 
-func_dedup_res.show()
+# or with activate()
+df_utils.activate()
+df_dedup_top_n = df_people.dedup_top_n(n=2, group_col=df_people.id,
+                                       order_cols=[df_people.age.desc(), df_people.name.desc()])
+df_dedup_top_n.show()
 ```
 
 This should produce the following output