Skip to content

Commit

Permalink
Fix pyspark docs (#48)
Browse files Browse the repository at this point in the history
* Fix pyspark docs

* revert blog

* update readme version

---------

Co-authored-by: ohad <[email protected]>
  • Loading branch information
uzadude and ohad-definity authored Dec 23, 2024
1 parent 6fd6fc4 commit c962d5e
Show file tree
Hide file tree
Showing 2 changed files with 17 additions and 21 deletions.
19 changes: 7 additions & 12 deletions datafu-spark/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,19 +34,17 @@ In order to call the datafu-spark API's from Pyspark, you can do the following (
First, call pyspark with the following parameters

```bash
export PYTHONPATH=datafu-spark_2.11-1.8.0.jar
export PYTHONPATH=datafu-spark_2.12-2.0.0.jar

pyspark --jars datafu-spark_2.11-1.8.0.jar --conf spark.executorEnv.PYTHONPATH=datafu-spark_2.11-1.8.0.jar
pyspark --jars datafu-spark_2.12-2.0.0.jar --conf spark.executorEnv.PYTHONPATH=datafu-spark_2.12-2.0.0.jar
```

The following is an example of calling the Spark version of the datafu _dedup_ method

```python
from pyspark_utils.df_utils import PySparkDFUtils
from pyspark_utils import df_utils

df_utils = PySparkDFUtils()

df_people = sqlContext.createDataFrame([
df_people = spark.createDataFrame([
("a", "Alice", 34),
("a", "Sara", 33),
("b", "Bob", 36),
Expand All @@ -57,12 +55,9 @@ df_people = sqlContext.createDataFrame([
("c", "Zoey", 36)],
["id", "name", "age"])

func_dedup_res = df_utils.dedup_with_order(dataFrame=df_people, groupCol=df_people.id,
orderCols=[df_people.age.desc(), df_people.name.desc()])

func_dedup_res.registerTempTable("dedup")

func_dedup_res.show()
df_dedup = df_utils.dedup_with_order(df=df_people, group_col=df_people.id,
order_cols=[df_people.age.desc(), df_people.name.desc()])
df_dedup.show()
```

This should produce the following output
Expand Down
19 changes: 10 additions & 9 deletions site/source/docs/spark/guide.html.markdown.erb
Original file line number Diff line number Diff line change
Expand Up @@ -43,11 +43,9 @@ pyspark --jars datafu-spark_2.12-<%= current_page.data.version %>-SNAPSHOT.jar -
The following is an example of calling the Spark version of the datafu _dedup_ method

```python
from pyspark_utils.df_utils import PySparkDFUtils
from pyspark_utils import df_utils

df_utils = PySparkDFUtils()

df_people = sqlContext.createDataFrame([
df_people = spark.createDataFrame([
("a", "Alice", 34),
("a", "Sara", 33),
("b", "Bob", 36),
Expand All @@ -58,12 +56,15 @@ df_people = sqlContext.createDataFrame([
("c", "Zoey", 36)],
["id", "name", "age"])

func_dedup_res = df_utils.dedup_with_order(dataFrame=df_people, groupCol=df_people.id,
orderCols=[df_people.age.desc(), df_people.name.desc()])

func_dedup_res.registerTempTable("dedup")
df_dedup = df_utils.dedup_with_order(df=df_people, group_col=df_people.id,
order_cols=[df_people.age.desc(), df_people.name.desc()])
df_dedup.show()

func_dedup_res.show()
# or with activate()
df_utils.activate()
df_dedup_top_n = df_people.dedup_top_n(n=2, group_col=df_people.id,
order_cols=[df_people.age.desc(), df_people.name.desc()])
df_dedup_top_n.show()
```

This should produce the following output
Expand Down

0 comments on commit c962d5e

Please sign in to comment.