
udf schema inference should not set column nullable=false #1885

Closed
vkrot-exos opened this issue Nov 4, 2020 · 1 comment · Fixed by #1897
Labels
bug Something isn't working

Comments

@vkrot-exos

When a udf is applied to a dataframe and no return schema hint is provided, the schema is inferred from a sample of the data, which leads to a wrong assumption about column nullability: if all values in a column of the sample happen to be non-null, the resulting schema marks that column nullable=false.
But this is a wrong assumption, because only a sample of the data was examined.
Sample code leading to a runtime error:

```python
import databricks.koalas as ks
import numpy as np
from pyspark.sql import SparkSession

# `spark` is predefined in Databricks notebooks; create it explicitly elsewhere.
spark = SparkSession.builder.getOrCreate()

data = list()
for i in range(1, 10000):
  data.append((str(i % 100), np.nan if i % 9999 == 0 else float(i),))

sdf = spark.createDataFrame(data, 'a string, b float').repartition(10)
kdf = sdf.to_koalas()

def f(df):
  return df

df = kdf\
  .groupby('a')\
  .apply(f)

df.to_spark().printSchema()
df[df['b'] == -1]
```

printSchema output:

```
root
 |-- a: string (nullable = false)
 |-- b: float (nullable = false)
```

Running this code with koalas version 1.3 throws an error:

```
Job aborted due to stage failure: Task 0 in stage 67.0 failed 4 times, most recent failure: Lost task 0.3 in stage 67.0 (TID 435, 10.147.224.152, executor 0): java.lang.IllegalStateException: Value at index is null
```
@HyukjinKwon
Member

As a workaround, you can specify the schema in the type hints (https://koalas.readthedocs.io/en/latest/user_guide/typehints.html#type-hinting-with-names), which should also be more performant.

ueshin pushed a commit that referenced this issue Nov 9, 2020
This PR proposes to use a nullable schema when applying a function. We take a sample of the data to infer the schema; the types usually match, but null or NaN values are often sparse and may not appear in the sample. This behaviour is consistent with Spark's JSON schema inference as well.

```python
from pyspark.sql import SparkSession
import databricks.koalas as ks
import numpy as np

spark = SparkSession.builder.getOrCreate()
data = list()
for i in range(1, 10000):
  data.append((str(i % 100), np.nan if i % 9999 == 0 else float(i),))

sdf = spark.createDataFrame(data, 'a string, b float').repartition(10)
kdf = sdf.to_koalas()

def f(df):
  return df

df = kdf\
  .groupby('a')\
  .apply(f)

df.to_spark().printSchema()
df[df['b'] == -1]
```

**Before:**

```
root
 |-- a: string (nullable = false)
 |-- b: float (nullable = false)

java.lang.IllegalStateException: Value at index is null
...
```

**After:**

```
root
 |-- a: string (nullable = true)
 |-- b: float (nullable = true)

Empty DataFrame
Columns: [a, b]
Index: []
```

Resolves #1885