Version 1.4.0
Better type support
We improved the type mapping between pandas and Koalas (#1870, #1903). We added more types and string expressions for specifying data types, and fixed mismatches between pandas and Koalas.
Here are some examples:
- Added np.float32 and "float32" (matched to FloatType)
      >>> ks.Series([10]).astype(np.float32)
      0    10.0
      dtype: float32
      >>> ks.Series([10]).astype("float32")
      0    10.0
      dtype: float32
- Added np.datetime64 and "datetime64[ns]" (matched to TimestampType)
      >>> ks.Series(["2020-10-26"]).astype(np.datetime64)
      0   2020-10-26
      dtype: datetime64[ns]
      >>> ks.Series(["2020-10-26"]).astype("datetime64[ns]")
      0   2020-10-26
      dtype: datetime64[ns]
- Fixed np.int to match LongType, not IntegerType.
      >>> pd.Series([100]).astype(np.int)
      0    100
      dtype: int64
      >>> ks.Series([100]).astype(np.int)
      0    100
      dtype: int32  # Previous behavior; this is now `int64`.
- Fixed np.float to match DoubleType, not FloatType.
      >>> pd.Series([100]).astype(np.float)
      0    100.0
      dtype: float64
      >>> ks.Series([100]).astype(np.float)
      0    100.0
      dtype: float32  # Previous behavior; this is now `float64`.
We also added a document that describes which pandas data types are supported or unsupported, and how pandas data types map to PySpark data types. See: Type Support In Koalas.
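The mappings listed in this section can be summarized as a small lookup table. The following is only an illustrative sketch in plain Python: the names `SPARK_TYPE_BY_DTYPE` and `spark_type_name` are hypothetical, the table covers only the four cases mentioned above, and it does not reflect how Koalas performs the conversion internally.

```python
# Hypothetical lookup table that only restates the mappings described in
# these notes. It is not the Koalas implementation.
SPARK_TYPE_BY_DTYPE = {
    "float32": "FloatType",
    "datetime64[ns]": "TimestampType",
    "int64": "LongType",      # dtype now produced by astype(np.int)
    "float64": "DoubleType",  # dtype now produced by astype(np.float)
}


def spark_type_name(dtype: str) -> str:
    """Return the Spark type name listed for a dtype string."""
    try:
        return SPARK_TYPE_BY_DTYPE[dtype]
    except KeyError:
        # Anything outside the four listed cases is out of scope here.
        raise TypeError(f"No Spark type listed for dtype {dtype!r}") from None


print(spark_type_name("float32"))         # FloatType
print(spark_type_name("datetime64[ns]"))  # TimestampType
print(spark_type_name("int64"))           # LongType
```

For the full list of supported and unsupported types, the Type Support In Koalas document remains the reference.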
Return type annotations for major Koalas objects
To improve Koalas' auto-completion in various editors and avoid misuse of APIs, we added return type annotations to major Koalas objects. These objects include DataFrame, Series, Index, GroupBy, Window objects, etc. (#1852, #1857, #1859, #1863, #1871, #1882, #1884, #1889, #1892, #1894, #1898, #1899, #1900, #1902).
The return type annotations help auto-completion libraries, such as Jedi, infer the actual data type and provide proper suggestions.
[Figure: auto-completion suggestions before and after adding the annotations]
They also let mypy run static analysis over the method bodies.
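As a rough illustration of why return annotations matter to such tools, here is a minimal sketch using two hypothetical classes, `MiniFrame` and `MiniSeries`, which are stand-ins and not Koalas classes. Because `column` declares its return type, a tool can tell that `.sum()` is available on the result without running the code. `typing.get_type_hints` exposes the same information at runtime.

```python
from typing import Dict, List, get_type_hints


class MiniSeries:
    """Hypothetical stand-in for a Series-like object."""

    def __init__(self, values: List[float]) -> None:
        self.values = values

    def sum(self) -> float:
        return float(sum(self.values))


class MiniFrame:
    """Hypothetical stand-in for a DataFrame-like object."""

    def __init__(self, columns: Dict[str, List[float]]) -> None:
        self.columns = columns

    def column(self, name: str) -> MiniSeries:
        # The annotation tells tools that the result is a MiniSeries,
        # so they can suggest MiniSeries methods after `.column(...)`.
        return MiniSeries(self.columns[name])


frame = MiniFrame({"a": [1.0, 2.0, 3.0]})
total = frame.column("a").sum()
print(total)  # 6.0

# Static tools read the annotation from source; this reads it at runtime.
return_hint = get_type_hints(MiniFrame.column)["return"]
print(return_hint.__name__)  # MiniSeries
```

Without the `-> MiniSeries` annotation, a completion engine would have to infer the return type from the method body, which is not always possible.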
pandas 1.1.4 support
We verified the behaviors of pandas 1.1.4 in Koalas.
As pandas 1.1.4 introduced a behavior change related to MultiIndex.is_monotonic (MultiIndex.is_monotonic_increasing) and MultiIndex.is_monotonic_decreasing (pandas-dev/pandas#37220), Koalas also changes its behavior to match (#1881).
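For readers unfamiliar with these properties, a multi-level index can be thought of as a sequence of tuples, and monotonicity is checked on those tuples in lexicographic order. The sketch below is plain Python with hypothetical helper functions. It assumes all levels are comparable and contain no missing values, and it does not reproduce the specific behavior change referenced above.

```python
from typing import Sequence, Tuple


def is_monotonic_increasing(keys: Sequence[Tuple]) -> bool:
    """True if every tuple is >= the previous one (lexicographic order)."""
    return all(prev <= cur for prev, cur in zip(keys, keys[1:]))


def is_monotonic_decreasing(keys: Sequence[Tuple]) -> bool:
    """True if every tuple is <= the previous one (lexicographic order)."""
    return all(prev >= cur for prev, cur in zip(keys, keys[1:]))


print(is_monotonic_increasing([("a", 1), ("a", 2), ("b", 1)]))  # True
print(is_monotonic_increasing([("a", 2), ("a", 1), ("b", 1)]))  # False
print(is_monotonic_decreasing([("b", 1), ("a", 2), ("a", 1)]))  # True
```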
Other new features and improvements
We added the following new features:
DataFrame:
- __neg__ (#1847)
- rename_axis (#1843)
- spark.repartition (#1864)
- spark.coalesce (#1873)
- spark.checkpoint (#1877)
- spark.local_checkpoint (#1878)
- reindex_like (#1880)
Series:
Index:
- intersection (#1747)
MultiIndex:
- intersection (#1747)
Other improvements and bug fixes
- Use SF.repeat in series.str.repeat (#1844)
- Remove the warning when cache is used in a context manager (#1848)
- Support a non-string name in Series' boxplot (#1849)
- Calculate fliers correctly in Series.plot.box (#1846)
- Show type name rather than type class in error messages (#1851)
- Fix DataFrame.spark.hint to reflect internal changes. (#1865)
- DataFrame.reindex supports named columns index (#1876)
- Separate InternalFrame.index_map into index_spark_column_names and index_names. (#1879)
- Fix DataFrame.xs to handle internal changes properly. (#1896)
- Explicitly disallow empty list as index_spark_column_names and index_names. (#1895)
- Use nullable inferred schema in function apply (#1897)
- Introduce InternalFrame.index_level. (#1890)
- Remove InternalFrame.index_map. (#1901)
- Force the use of Spark's system default precision and scale when the inferred data type contains DecimalType. (#1904)
- Upgrade PyArrow from 1.0.1 to 2.0.0 in CI (#1860)
- Fix read_excel to support squeeze argument. (#1905)
- Fix to_csv to avoid duplicated option 'path' for DataFrameWriter. (#1912)