Version 0.16.0
Firstly, we introduced a new mode that enables operations on different DataFrames (#633). This mode can be enabled by setting the OPS_ON_DIFF_FRAMES environment variable to true, as below:
>>> import databricks.koalas as ks
>>>
>>> kdf1 = ks.range(5)
>>> kdf2 = ks.DataFrame({'id': [5, 4, 3]})
>>> (kdf1 - kdf2).sort_index()
    id
0 -5.0
1 -3.0
2 -1.0
3  NaN
4  NaN
>>> import databricks.koalas as ks
>>>
>>> kdf = ks.range(5)
>>> kdf['new_col'] = ks.Series([1, 2, 3, 4])
>>> kdf
   id  new_col
0   0      1.0
1   1      2.0
3   3      4.0
2   2      3.0
4   4      NaN
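The subtraction example above can be read as an index-aligned operation: rows are matched by index label, and labels present on only one side yield NaN. The following is a minimal plain-Python sketch of that behavior, not Koalas code. It assumes the flag is read from the process environment and therefore sets it before Koalas would be imported; the helper name `aligned_subtract` is hypothetical.

```python
import math
import os

# Flag named in the release notes. Assumption: it is read from the process
# environment, so it is set before Koalas is imported or used.
os.environ["OPS_ON_DIFF_FRAMES"] = "true"


def aligned_subtract(left, right):
    """Subtract two {index: value} mappings, aligning on the index.

    Hypothetical plain-Python stand-in for `kdf1 - kdf2`.
    """
    result = {}
    for key in sorted(set(left) | set(right)):
        if key in left and key in right:
            result[key] = float(left[key] - right[key])
        else:
            # No matching row on the other side.
            result[key] = math.nan
    return result


kdf1_id = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}  # mirrors ks.range(5)
kdf2_id = {0: 5, 1: 4, 2: 3}              # mirrors ks.DataFrame({'id': [5, 4, 3]})
diff = aligned_subtract(kdf1_id, kdf2_id)
```

Index labels 3 and 4 exist only in the first mapping, which is why the corresponding rows come out as NaN.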
Secondly, we introduced a default index and disallowed Koalas DataFrames with no index internally (#639)(#655). For example, if you create a Koalas DataFrame from a Spark DataFrame, the default index is used. The default index implementation can be configured by setting DEFAULT_INDEX to one of three types:
- one-by-one (default): It implements a one-by-one sequence using a Window function without specifying a partition. This index type should be avoided when the data is large.

  >>> ks.range(3)
     id
  0   0
  1   1
  2   2

- distributed-one-by-one: It implements a one-by-one sequence using a group-by and group-map approach. It still generates a one-by-one sequential index globally. If the default index must be a one-by-one sequence in a large dataset, this index can be used.

  >>> ks.range(3)
     id
  0   0
  1   1
  2   2

- distributed: It implements a monotonically increasing sequence simply by using Spark's monotonically_increasing_id function. If the index does not have to be a one-by-one sequence, this index can be used. Performance-wise, this index has almost no penalty compared to the other index types.

  >>> ks.range(3)
               id
  25769803776   0
  60129542144   1
  94489280512   2
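The large index values in the distributed example follow from how Spark documents monotonically_increasing_id: the partition ID goes in the upper 31 bits and the row number within the partition in the lower 33 bits. A minimal sketch of that layout in plain Python follows. The function name `monotonic_id` is hypothetical, and the partition numbers 3, 7 and 11 are inferred from the sample output rather than stated in these notes.

```python
ROW_BITS = 33  # lower 33 bits hold the row number within a partition


def monotonic_id(partition_id: int, row_in_partition: int) -> int:
    """Compose an ID using the bit layout Spark documents for
    monotonically_increasing_id (hypothetical helper, not a Spark API)."""
    return (partition_id << ROW_BITS) + row_in_partition


# Assumption: each of the three rows is the first row (row 0) of
# partitions 3, 7 and 11 respectively.
ids = [monotonic_id(p, 0) for p in (3, 7, 11)]
print(ids)
```

Because each ID is computed locally from the partition ID and a per-partition counter, no global ordering step is needed, which is consistent with the low overhead described for this index type.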
Thirdly, we implemented many plot APIs in Series. See the example below:
import databricks.koalas as ks
ks.range(10).to_pandas().id.plot.pie()
Fourthly, we continued to rapidly improve multi-index columns support. Multi-index columns are now supported in the following APIs:
- DataFrame.sort_index() (#637)
- GroupBy.diff() (#653)
- GroupBy.rank() (#653)
- Series.any() (#652)
- Series.all() (#652)
- DataFrame.any() (#652)
- DataFrame.all() (#652)
- DataFrame.assign() (#657)
- DataFrame.drop() (#658)
- DataFrame.reindex() (#659)
- Series.quantile() (#663)
- Series.transform() (#663)
- DataFrame.select_dtypes() (#662)
- DataFrame.transpose() (#664)
Lastly, we added new functionality over the past weeks, especially for groupby. We added the following features:
koalas.DataFrame
koalas.groupby.GroupBy:
Along with the following improvements: