Version 0.16.0

@HyukjinKwon HyukjinKwon released this 22 Aug 06:35
· 1105 commits to master since this release

Firstly, we introduced a new mode that enables operations on different DataFrames (#633). Enable it by setting the OPS_ON_DIFF_FRAMES environment variable to true, as shown below:

>>> import databricks.koalas as ks
>>>
>>> kdf1 = ks.range(5)
>>> kdf2 = ks.DataFrame({'id': [5, 4, 3]})
>>> (kdf1 - kdf2).sort_index()
    id
0 -5.0
1 -3.0
2 -1.0
3  NaN
4  NaN

>>> import databricks.koalas as ks
>>>
>>> kdf = ks.range(5)
>>> kdf['new_col'] = ks.Series([1, 2, 3, 4])
>>> kdf
   id  new_col
0   0      1.0
1   1      2.0
3   3      4.0
2   2      3.0
4   4      NaN
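
The snippet below is a minimal pure-Python sketch of two things: one way the flag might be set from Python, and the index alignment that explains the NaN rows in the first example. Setting the variable through `os.environ` before any Koalas operation runs is an assumption about how the flag is picked up, and `subtract_aligned` is a hypothetical helper, not a Koalas API.

```python
import math
import os

# Assumption: the flag is read from the process environment, so it is set
# here before any Koalas operation would run.
os.environ["OPS_ON_DIFF_FRAMES"] = "true"


def subtract_aligned(left, right):
    """Subtract two {index_label: value} mappings, aligning on index labels.

    Hypothetical helper for illustration only. Labels that are missing from
    either side produce NaN, as in the (kdf1 - kdf2) output above.
    """
    result = {}
    for label in sorted(set(left) | set(right)):
        if label in left and label in right:
            result[label] = float(left[label] - right[label])
        else:
            result[label] = math.nan
    return result


kdf1_id = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}  # mirrors ks.range(5)
kdf2_id = {0: 5, 1: 4, 2: 3}              # mirrors ks.DataFrame({'id': [5, 4, 3]})

aligned = subtract_aligned(kdf1_id, kdf2_id)
for label, value in aligned.items():
    print(label, value)
```

Labels 3 and 4 exist only in the first frame, so their results are NaN, and the whole column becomes floating point.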

Secondly, we introduced a default index and internally disallowed Koalas DataFrames with no index (#639)(#655). For example, if you create a Koalas DataFrame from a Spark DataFrame, the default index is used. The default index implementation can be configured by setting DEFAULT_INDEX to one of three types:

  • (default) one-by-one: It implements a one-by-one sequence using a Window function
    without specifying a partition. Avoid this index type when the data is large,
    because an unpartitioned window moves all rows into a single partition.

    >>> ks.range(3)
       id
    0   0
    1   1
    2   2
  • distributed-one-by-one: It implements a one-by-one sequence using a group-by and
    group-map approach. It still generates a globally sequential one-by-one index.
    Use this index if the default index must be a one-by-one sequence in a large
    dataset.

    >>> ks.range(3)
       id
    0   0
    1   1
    2   2
  • distributed: It implements a monotonically increasing sequence simply by using
    Spark's monotonically_increasing_id function. Use this index if the index does
    not have to be a one-by-one sequence. Performance-wise, it has almost no
    penalty compared to the other index types.

    >>> ks.range(3)
                 id
    25769803776   0
    60129542144   1
    94489280512   2
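
The large labels in the distributed example follow from how Spark documents monotonically_increasing_id: the partition ID goes in the upper 31 bits and the record number within the partition in the lower 33 bits. The pure-Python sketch below decodes the values shown above. It also illustrates, under assumptions, how a globally sequential index can be derived from per-partition row counts. All helper names are hypothetical, and `sequential_index` shows only the general offset idea, not Koalas's actual group-by and group-map implementation.

```python
ROW_BITS = 33  # lower 33 bits hold the record number within a partition


def monotonic_id(partition_id, row_in_partition):
    """Compose an ID the way Spark documents monotonically_increasing_id."""
    return (partition_id << ROW_BITS) | row_in_partition


def decode_monotonic_id(value):
    """Split an ID back into (partition_id, row_in_partition)."""
    return value >> ROW_BITS, value & ((1 << ROW_BITS) - 1)


def sequential_index(partition_sizes):
    """Assign global 0..n-1 labels per partition from per-partition row counts.

    Illustration only: each partition starts at the running total of the
    sizes of the partitions before it.
    """
    labels, offset = [], 0
    for size in partition_sizes:
        labels.append(list(range(offset, offset + size)))
        offset += size
    return labels


# Decode the index labels from the distributed example above.
for value in (25769803776, 60129542144, 94489280512):
    print(value, decode_monotonic_id(value))

# Three partitions holding 2, 0 and 1 rows still get the labels 0, 1, 2.
print(sequential_index([2, 0, 1]))
```

Each decoded label is the first record (row 0) of a different partition, which is why the values are increasing but far from consecutive.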

Thirdly, we implemented many plot APIs in Series. See the example below:

import databricks.koalas as ks

ks.range(10).to_pandas().id.plot.pie()

Fourthly, we continued to rapidly improve multi-index column support. Multi-index columns are now supported in the following APIs:

  • DataFrame.sort_index()(#637)
  • GroupBy.diff()(#653)
  • GroupBy.rank()(#653)
  • Series.any()(#652)
  • Series.all()(#652)
  • DataFrame.any()(#652)
  • DataFrame.all()(#652)
  • DataFrame.assign()(#657)
  • DataFrame.drop()(#658)
  • DataFrame.reindex()(#659)
  • Series.quantile()(#663)
  • Series.transform()(#663)
  • DataFrame.select_dtypes()(#662)
  • DataFrame.transpose()(#664)

Lastly, we added new functionality over the past weeks, especially groupby-related features. We added the following features:

koalas.DataFrame:

koalas.groupby.GroupBy:

Along with the following improvements:

  • Add a basic infrastructure for configurations. (#645)
  • Always use column_index. (#648)
  • Allow omitting the type hint in GroupBy.transform, filter, and apply. (#646)
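
To make the last item concrete, the sketch below shows how a library can tell whether a user function carries a return type hint, using only the standard `inspect` module. `has_return_hint` and the two sample functions are hypothetical names for illustration. The sketch does not show what Koalas does internally when the hint is omitted.

```python
import inspect


def has_return_hint(func):
    """Return True if func declares a return type annotation."""
    return inspect.signature(func).return_annotation is not inspect.Signature.empty


def plus_one_hinted(x) -> int:
    # Carries a return type hint.
    return x + 1


def plus_one_unhinted(x):
    # Same logic, with the return type hint omitted.
    return x + 1


print(has_return_hint(plus_one_hinted))
print(has_return_hint(plus_one_unhinted))
```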