When a UDF is applied to a DataFrame and no return schema hint is provided, the schema is inferred from a sample of the data, and the inference makes a wrong assumption about column nullability. It looks like if all sampled values in a column are non-null, the resulting schema has nullable=false for that column.
But this assumption is wrong, because only a sample of the data was examined.
Sample code leading to a runtime error:
import databricks.koalas as ks
import numpy as np

data = list()
for i in range(1, 10000):
    data.append((str(i % 100), np.nan if i % 9999 == 0 else float(i),))
sdf = spark.createDataFrame(data, 'a string, b float').repartition(10)
kdf = sdf.to_koalas()

def f(df):
    return df

df = kdf\
    .groupby('a')\
    .apply(f)
df.to_spark().printSchema()
df[df['b'] == -1]
Running this code with Koalas version 1.3 throws the following error:
Job aborted due to stage failure: Task 0 in stage 67.0 failed 4 times, most recent failure: Lost task 0.3 in stage 67.0 (TID 435, 10.147.224.152, executor 0): java.lang.IllegalStateException: Value at index is null
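To make the failure mode concrete, here is a minimal pure-Python sketch of sample-based nullability inference. It is not Koalas' actual implementation: the helper `infer_nullability`, the sample size of 1000, and the use of `None` as a stand-in for the missing value are all assumptions made for illustration.

```python
from typing import Dict, Optional, Sequence, Tuple

Row = Tuple[str, Optional[float]]
NAMES = ["a", "b"]


def infer_nullability(rows: Sequence[Row], sample_size: int) -> Dict[str, bool]:
    """Hypothetical helper: mark a column nullable only if the sample contains a None."""
    sample = rows[:sample_size]
    return {
        name: any(row[i] is None for row in sample)
        for i, name in enumerate(NAMES)
    }


# Same shape as the report: 9,999 rows with a single missing value in the last row.
rows = [(str(i % 100), None if i % 9999 == 0 else float(i)) for i in range(1, 10000)]

# The sample never sees the missing value, so both columns look non-nullable.
inferred = infer_nullability(rows, sample_size=1000)

# Columns declared non-nullable that nevertheless contain a None in the full data.
violations = [
    name
    for i, name in enumerate(NAMES)
    if not inferred[name] and any(row[i] is None for row in rows)
]
print(inferred, violations)
```

Column `b` ends up declared non-nullable even though the full data contains a missing value, which is the kind of mismatch that would surface later as a runtime error when the value is read.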
This PR proposes to use a nullable schema when applying a function. We take a sample of the data to infer the schema. The types usually match, but null or NaN values are often sparse and may be missing from the sample. This behaviour matches Spark's JSON schema inference as well.
```python
from pyspark.sql import SparkSession
import databricks.koalas as ks
import numpy as np
spark = SparkSession.builder.getOrCreate()
data = list()
for i in range(1, 10000):
    data.append((str(i % 100), np.nan if i % 9999 == 0 else float(i),))
sdf = spark.createDataFrame(data, 'a string, b float').repartition(10)
kdf = sdf.to_koalas()

def f(df):
    return df

df = kdf\
    .groupby('a')\
    .apply(f)
df.to_spark().printSchema()
df[df['b'] == -1]
```
**Before:**
```
root
|-- a: string (nullable = false)
|-- b: float (nullable = false)
java.lang.IllegalStateException: Value at index is null
...
```
**After:**
```
root
|-- a: string (nullable = true)
|-- b: float (nullable = true)
Empty DataFrame
Columns: [a, b]
Index: []
```
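The idea behind the change can be sketched as relaxing every inferred field to nullable, whatever the sample suggested. This is only an illustration under stated assumptions: `Field` and `as_nullable` are hypothetical names, and the lightweight tuple stands in for a real Spark schema rather than reproducing the code in this PR.

```python
from typing import List, NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for one column of an inferred schema."""
    name: str
    dtype: str
    nullable: bool


def as_nullable(schema: List[Field]) -> List[Field]:
    """Return a copy of the schema with every field marked nullable."""
    return [field._replace(nullable=True) for field in schema]


# What sample-based inference produced in the "Before" output.
inferred = [Field("a", "string", False), Field("b", "float", False)]

# Relax the schema instead of trusting the sample.
relaxed = as_nullable(inferred)

for field in relaxed:
    print(f"|-- {field.name}: {field.dtype} (nullable = {str(field.nullable).lower()})")
```

Marking everything nullable gives up a guarantee the sample could never establish, which seems a reasonable trade for avoiding the runtime failure.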
Resolves #1885