[WIP] [Documentation] UDF Guide #416
base: main
Thank you for writing this up! Left some initial comments.
@@ -0,0 +1,43 @@
# User-Defined Functions - C#
This documentation contains user-defined function (UDF) examples. It shows how to define UDFs and how to use UDFs with Row objects as examples.
This documentation contains user-defined function (UDF) examples. It shows how to define UDFs and how to use UDFs with Row objects as examples.
A user-defined function, or UDF, is a routine that can take in parameters, perform some sort of calculation, and then return a result. This document explains how to construct UDFs in C# and includes example functions, such as how to use UDFs with Row objects.
Are UDFs applicable to any C# app, or just .NET for Spark apps? If they're just used in .NET for Spark apps, I'd add a sentence or two explaining how UDFs apply to/are useful in .NET for Spark.
Is there a reason we're focusing on Row object examples? Could we include other examples and then make this intro more general (i.e. "...This document explains how to construct UDFs in C# and includes example functions.")?
> Are UDFs applicable to any C# app, or just .NET for Spark apps? If they're just used in .NET for Spark apps, I'd add a sentence or two explaining how UDFs apply to/are useful in .NET for Spark.

I think we're talking about UDFs used within .NET for Spark here.
## Pre-requisites:
Install Microsoft.Spark.Worker. When you want to execute a C# UDF, Spark needs to understand how to launch the .NET CLR to execute this UDF. Microsoft.Spark.Worker provides a collection of classes to Spark that enable this functionality. Please see more details at [how to install Microsoft.Spark.Worker](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/get-started#5-install-net-for-apache-spark) and [how to deploy worker and UDF binaries](https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/deploy-worker-udf-binaries).

## UDF that takes in Row objects
Why are we only showing examples of UDFs with Row objects? It seems like it'd be valuable to have this document explain how to write any UDF and show examples of all (or at least more types) of UDFs?
Or is the goal of this doc to only show Row-based UDFs (in this case, we should change the title and intro of the doc to reflect that, because right now it seems like it should explain all UDFs)?
I think the purpose of this doc is to serve as a `using UDF with Row objects` readme file. This goes with the recent PR that exposes UDFs that return `Row` objects. I think we can add more types later. @imback82 what do you think?
We can start with UDFs with `Row` since there are a few gotchas with them, and we can expand this.
> We can start with UDFs with `Row` since there are a few gotchas with them, and we can expand this.

Sounds good!
## UDF that takes in Row objects
Can we add some sentences providing context to this example?
For instance, as a reader, I have the following questions:
- When would I use a UDF that takes in Row objects (as opposed to other types of UDFs)?
- Do all UDFs just take in or return Row objects (since that's all that is shown in this doc)?
- What is the goal of this code? What calculation or filtering is it performing and why?
- What would be the output of this code?
- Is this the only way to define UDFs (using `Func<> myUdf = Udf<>(...)`)? What about `spark.Udf().Register<>...`?
Thanks for your suggestion! I was looking at UDF docs here. I am not sure how much detail we want to go into in this intro guide. Should we consider this a `using UDF with Row objects` readme file or a UDF tutorial? This also relates to your previous question.
```

## UDF that returns Row objects
Please note that `GenericRow` objects need to be used here.
Same questions as above, so I think it'd be great to provide some additional context here. Also, why does `GenericRow` need to be used here?
df.Select(udf(df["id"])).Show();
```

## Chained UDF with Row objects
As above, it would be great to add some context/explanation. What is a scenario when I'd need to chain UDFs? What does this code do?
```csharp
// Chained UDF using udf1 and udf2 defined above.
df.Select(udf1(udf2(df["id"]))).Show();
```
Adding a `Next Steps`, `Resources`, or `Wrap Up` section at the end could be really helpful, e.g., "If you'd like to see more examples of UDFs in action, check out our XYZ examples in the .NET for Apache Spark GitHub repo."
Co-Authored-By: Brigit Murtaugh <[email protected]>
This PR adds a user-defined function (UDF) guide that uses `Row` objects as examples (implemented via #376), including how to define UDFs and how to use UDFs with a DataFrame.