
udf schema inference should not set column nullable=false #1885

Closed
vkrot-exos opened this issue Nov 4, 2020 · 1 comment · Fixed by #1897
Labels
bug Something isn't working

Comments

@vkrot-exos

When a udf is applied to a dataframe and no return schema hint is provided, the schema is inferred from a sample of the data, which leads to a wrong assumption about column nullability: if all values in a column of the sample happen to be non-null, the resulting schema marks that column nullable=false.
But this is a wrong assumption, because only a sample of the data was examined.
Sample code leading to a runtime error:

```python
import databricks.koalas as ks
import numpy as np
from pyspark.sql import SparkSession

# `spark` is predefined in Databricks notebooks; create it explicitly elsewhere.
spark = SparkSession.builder.getOrCreate()

data = list()
for i in range(1, 10000):
  data.append((str(i % 100), np.nan if i % 9999 == 0 else float(i),))

sdf = spark.createDataFrame(data, 'a string, b float').repartition(10)
kdf = sdf.to_koalas()

def f(df):
  return df

df = kdf\
  .groupby('a')\
  .apply(f)

df.to_spark().printSchema()
df[df['b'] == -1]
```

printSchema output:

```
root
 |-- a: string (nullable = false)
 |-- b: float (nullable = false)
```

Running this code with koalas version 1.3 throws an error:

```
Job aborted due to stage failure: Task 0 in stage 67.0 failed 4 times, most recent failure: Lost task 0.3 in stage 67.0 (TID 435, 10.147.224.152, executor 0): java.lang.IllegalStateException: Value at index is null
```
@HyukjinKwon
Member

As a workaround, you can specify the schema in the type hints (https://koalas.readthedocs.io/en/latest/user_guide/typehints.html#type-hinting-with-names), which should also be more performant.

ueshin pushed a commit that referenced this issue Nov 9, 2020
This PR proposes to use a nullable schema when applying a function. We take a sample of the data to infer the schema; the types usually match, but null or NaN values are often sparse and may not appear in the sample. This behaviour is consistent with Spark's JSON schema inference as well.

```python
from pyspark.sql import SparkSession
import databricks.koalas as ks
import numpy as np

spark = SparkSession.builder.getOrCreate()
data = list()
for i in range(1, 10000):
  data.append((str(i % 100), np.nan if i % 9999 == 0 else float(i),))

sdf = spark.createDataFrame(data, 'a string, b float').repartition(10)
kdf = sdf.to_koalas()

def f(df):
  return df

df = kdf\
  .groupby('a')\
  .apply(f)

df.to_spark().printSchema()
df[df['b'] == -1]
```

**Before:**

```
root
 |-- a: string (nullable = false)
 |-- b: float (nullable = false)

java.lang.IllegalStateException: Value at index is null
...
```

**After:**

```
root
 |-- a: string (nullable = true)
 |-- b: float (nullable = true)

Empty DataFrame
Columns: [a, b]
Index: []
```

Resolves #1885