Release Version 0.16.0 · databricks/koalas

Firstly, we introduced a new mode that enables operations on different DataFrames (#633). This mode is enabled by setting the OPS_ON_DIFF_FRAMES environment variable to true. With the mode enabled, operations such as the following work:

>>> import databricks.koalas as ks
>>>
>>> kdf1 = ks.range(5)
>>> kdf2 = ks.DataFrame({'id': [5, 4, 3]})
>>> (kdf1 - kdf2).sort_index()
    id
0 -5.0
1 -3.0
2 -1.0
3  NaN
4  NaN
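
The NaN rows in the output above come from index alignment: labels 3 and 4 exist only in kdf1, so there is nothing to subtract for them. The sketch below illustrates that semantics in plain Python. It is not how Koalas implements the operation (Koalas does this work on Spark), and `subtract_aligned` is a hypothetical helper introduced only for illustration.

```python
import math


def subtract_aligned(left, right):
    """Subtract two index->value mappings on the union of their index labels.

    Labels missing from either side produce NaN, mirroring the output above.
    """
    result = {}
    for label in sorted(set(left) | set(right)):
        if label in left and label in right:
            result[label] = float(left[label] - right[label])
        else:
            result[label] = math.nan
    return result


# Plain-Python stand-ins for the 'id' columns of kdf1 and kdf2.
kdf1_id = {i: i for i in range(5)}   # mirrors ks.range(5)
kdf2_id = {0: 5, 1: 4, 2: 3}         # mirrors ks.DataFrame({'id': [5, 4, 3]})

result = subtract_aligned(kdf1_id, kdf2_id)
for label, value in result.items():
    print(label, value)
```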

>>> import databricks.koalas as ks
>>>
>>> kdf = ks.range(5)
>>> kdf['new_col'] = ks.Series([1, 2, 3, 4])
>>> kdf
   id  new_col
0   0      1.0
1   1      2.0
3   3      4.0
2   2      3.0
4   4      NaN
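
The examples above do not show the step that turns the mode on. One way to set the variable from Python is sketched below. It assumes the variable is read from the process environment; setting it before importing Koalas is the conservative choice, since these notes do not say when the value is read. `ops_on_diff_frames_enabled` is a hypothetical helper, not part of Koalas.

```python
import os

# Enable operations on different DataFrames for this process.
# Assumption: do this before importing Koalas, in case the value is read at import time.
os.environ["OPS_ON_DIFF_FRAMES"] = "true"


def ops_on_diff_frames_enabled() -> bool:
    """Hypothetical helper: report whether the flag is currently set to true."""
    return os.environ.get("OPS_ON_DIFF_FRAMES", "false").lower() == "true"


print(ops_on_diff_frames_enabled())
# import databricks.koalas as ks  # import Koalas only after the variable is set
```

The same variable can also be exported in the shell that launches the Python process.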

Secondly, we introduced a default index and disallowed Koalas DataFrames with no index internally (#639)(#655). For example, if you create a Koalas DataFrame from a Spark DataFrame, the default index is used. The default index implementation can be configured by setting DEFAULT_INDEX to one of three types:

(default) one-by-one: It implements a one-by-one sequence using a Window function without
specifying a partition. Because this can move all the data into a single partition, this
index type should be avoided when the data is large.
```
>>> ks.range(3)
   id
0   0
1   1
2   2
```
distributed-one-by-one: It implements a one-by-one sequence using a group-by and
group-map approach. It still generates a globally sequential one-by-one index.
If the default index must be a one-by-one sequence on a large dataset, this
index can be used.
```
>>> ks.range(3)
   id
0   0
1   1
2   2
```
distributed: It implements a monotonically increasing sequence simply by using
Spark's monotonically_increasing_id function. If the index does not have to be
a one-by-one sequence, this index can be used. Performance-wise, this index
has almost no penalty compared to the other index types.
```
>>> ks.range(3)
             id
25769803776   0
60129542144   1
94489280512   2
```
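
The large index values in the distributed example are not arbitrary. Spark documents that monotonically_increasing_id puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The sketch below decodes the three values shown above under that layout; `decode_monotonic_id` is a hypothetical helper written for this illustration, not a Spark or Koalas function.

```python
RECORD_BITS = 33  # lower 33 bits hold the record number within a partition


def decode_monotonic_id(value: int) -> tuple:
    """Split an ID into (partition_id, record_number) using the documented bit layout."""
    partition_id = value >> RECORD_BITS
    record_number = value & ((1 << RECORD_BITS) - 1)
    return partition_id, record_number


# Index values taken from the distributed example above.
index_values = [25769803776, 60129542144, 94489280512]
decoded = [decode_monotonic_id(v) for v in index_values]

for value, (partition_id, record_number) in zip(index_values, decoded):
    print(value, "-> partition", partition_id, "record", record_number)
```

Each of the three rows decodes to record 0 of a different partition, which is why the values increase but are far from consecutive.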

Thirdly, we implemented several plot APIs in Series:

plot.pie() (#669)
plot.area() (#670)
plot.line() (#671)
plot.barh() (#673)

See the example below:

import databricks.koalas as ks

ks.range(10).id.plot.pie()

Fourthly, we continued to improve support for multi-index columns. Multi-index columns are now supported in the following APIs:

DataFrame.sort_index()(#637)
GroupBy.diff()(#653)
GroupBy.rank()(#653)
Series.any()(#652)
Series.all()(#652)
DataFrame.any()(#652)
DataFrame.all()(#652)
DataFrame.assign()(#657)
DataFrame.drop()(#658)
DataFrame.reindex()(#659)
Series.quantile()(#663)
Series.transform()(#663)
DataFrame.select_dtypes()(#662)
DataFrame.transpose()(#664)

Lastly, we added new functionality over the past weeks, especially around groupby. We added the following features:

koalas.DataFrame:

duplicated() (#569)
fillna() (#640)
bfill() (#640)
pad() (#640)
ffill() (#640)

koalas.groupby.GroupBy:

diff() (#622)
nunique() (#617)
nlargest() (#654)
nsmallest() (#654)
idxmax() (#649)
idxmin() (#649)

Along with the following improvements:

Add a basic infrastructure for configurations. (#645)
Always use column_index. (#648)
Allow omitting type hints in GroupBy.transform, filter, and apply. (#646)
