Skip to content

Files

Latest commit

69ad319 · Oct 26, 2017

History

History
45 lines (33 loc) · 1.98 KB

README.md

File metadata and controls

45 lines (33 loc) · 1.98 KB

spark-udwf-session

a spark custom window function example, to generate session IDs

Usage

Implements a custom window function to create session IDs on user activity.

Sessionization is a common calculation when processing user activity. We want to mark all events belonging to a session if between them there's no time gap greater than T.

Will continue existing session if there is one in the data, or create one if the next event is outside the session duration interval.

// Window specification
val specs = Window.partitionBy(f.col("user")).orderBy(f.col("ts").asc)
// create the session
val res = df.withColumn( "newsession", 
   calculateSession(f.col("ts"), f.col("session"), 
        f.lit(30*60**1000) over specs)  // window duration in ms

result similar to this (UUIDs are randomly generated).

+-----+-------------+------------------------------------+------------------------------------+
|user |ts           |session                             |newsession                          |
+-----+-------------+------------------------------------+------------------------------------+
|user1|1509036478537|f237e656-1e53-4a24-9ad5-2b4576a4125d|f237e656-1e53-4a24-9ad5-2b4576a4125d|
|user1|1509037078537|null                                |f237e656-1e53-4a24-9ad5-2b4576a4125d|
|user1|1509037378537|null                                |f237e656-1e53-4a24-9ad5-2b4576a4125d|
|user1|1509044878537|null                                |9b17a92c-9a0b-430a-bf97-41034e5b6c6c|
|user1|1509046078537|null                                |9b17a92c-9a0b-430a-bf97-41034e5b6c6c|
|user2|1509036778537|null                                |5f4f0005-52f2-41f9-ab7b-ffc69ab1353f|
|user2|1509037378537|null                                |5f4f0005-52f2-41f9-ab7b-ffc69ab1353f|
+-----+-------------+------------------------------------+------------------------------------+

See http://blog.nuvola-tech.com/2017/10/spark-custom-window-function-for-sessionization/ for a detailed explanation.