Version 2.0.0
Version 2.0.0 of Awkward Array
The Awkward Array version 2 project started in June of 2021 and has been developed alongside version 1 updates. For most of that time, it was available as a submodule, awkward._v2, so that it could be tested with the same tests as version 1 and could be experimented upon by early adopters.
The usual list of pull request titles would not be useful as release notes because the changes from 1.10.2 to 2.0.0 are too extensive. But here's a list of their PR numbers:
#884, #895, #896, #914, #957, #958, #959, #962, #977, #1025, #1031, #1036, #1045, #1059, #1063, #1072, #1073, #1074, #1079, #1082, #1092, #1099, #1101, #1109, #1110, #1111, #1116, #1117, #1119, #1121, #1122, #1123, #1124, #1125, #1130, #1131, #1132, #1134, #1135, #1137, #1138, #1140, #1141, #1142, #1143, #1145, #1146, #1147, #1148, #1149, #1150, #1153, #1154, #1156, #1159, #1160, #1161, #1162, #1164, #1165, #1183, #1184, #1201, #1203, #1204, #1206, #1207, #1211, #1214, #1215, #1217, #1218, #1219, #1220, #1221, #1222, #1225, #1226, #1227, #1228, #1229, #1233, #1234, #1240, #1242, #1245, #1248, #1259, #1270, #1276, #1279, #1289, #1290, #1292, #1293, #1294, #1296, #1297, #1300, #1301, #1304, #1306, #1307, #1309, #1312, #1317, #1321, #1327, #1329, #1338, #1340, #1346, #1347, #1351, #1352, #1354, #1355, #1356, #1359, #1360, #1364, #1365, #1367, #1368, #1369, #1370, #1372, #1373, #1374, #1376, #1378, #1380, #1381, #1383, #1384, #1385, #1387, #1390, #1392, #1393, #1394, #1395, #1397, #1398, #1399, #1401, #1404, #1407, #1408, #1409, #1410, #1412, #1413, #1415, #1416, #1418, #1419, #1421, #1422, #1425, #1426, #1427, #1428, #1429, #1430, #1431, #1432, #1433, #1434, #1435, #1437, #1440, #1443, #1444, #1445, #1446, #1447, #1449, #1455, #1456, #1457, #1458, #1462, #1464, #1465, #1467, #1468, #1469, #1470, #1474, #1475, #1476, #1478, #1484, #1485, #1486, #1487, #1490, #1491, #1492, #1493, #1494, #1496, #1497, #1498, #1499, #1502, #1503, #1505, #1508, #1510, #1513, #1514, #1515, #1516, #1518, #1519, #1520, #1521, #1523, #1524, #1527, #1531, #1532, #1533, #1535, #1536, #1537, #1538, #1539, #1540, #1541, #1542, #1543, #1544, #1550, #1555, #1556, #1559, #1560, #1561, #1562, #1564, #1565, #1566, #1567, #1568, #1572, #1573, #1576, #1579, #1581, #1589, #1593, #1597, #1598, #1602, #1603, #1604, #1605, #1607, #1609, #1610, #1613, #1614, #1615, #1616, #1617, #1618, #1619, #1620, #1621, #1625, #1627, #1629, #1632, #1636, #1641, #1642, #1645, #1649, #1650, #1651, #1652, #1653, #1661, #1665, 
#1666, #1671, #1673, #1674, #1675, #1677, #1679, #1689, #1691, #1692, #1695, #1698, #1699, #1700, #1706, #1708, #1712, #1715, #1716, #1717, #1721, #1722, #1723, #1725, #1730, #1731, #1732, #1733, #1739, #1740, #1743, #1744, #1746, #1748, #1749, #1750, #1751, #1752, #1754, #1757, #1758, #1759, #1760, #1761, #1763, #1768, #1769, #1770, #1773, #1774, #1776, #1777, #1779, #1781, #1783, #1787, #1788, #1795, #1796, #1797, #1798, #1800, #1801, #1803, #1804, #1811, #1812, #1813, #1815, #1816, #1822, #1825, #1826, #1827, #1829, #1830, #1831, #1832, #1835, #1836, #1837, #1838, #1841, #1844, #1845, #1848, #1851, #1852, #1853, #1854, #1856, #1857, #1858, #1859, #1860, #1861, #1863, #1867, #1869, #1871, #1873, #1876, #1877, #1878, #1880, #1881, #1891, #1892, #1894, #1895, #1897, #1898, #1900, #1905, #1907, #1908, #1911, #1912, #1913, #1915, #1919, #1920, #1921, #1922, #1928, #1930, #1934, #1938, #1939, #1940, #1942, #1943, #1946, #1948, #1949, #1950, #1951, #1952, #1953, #1954, #1955, #1956, #1959, #1960, #1962, #1965, #1966, #1968, #1970, #1971, #1972, #1974, #1976, #1977, #1979, #1981, #1982, #1983, #1985, #1986
Full Changelog: v1.10.2...v2.0.0
Despite the long list of PRs, the high-level interface changes from version 1 to version 2 were kept to a minimum. For the most part, the Awkward 1.x API is fine, but the internal implementation needed an overhaul to prevent technical debt.
The work was done by the Awkward Array developers:
- @agoose77
- @henryiii
- @ianna
- @ioanaif
- @jpivarski
- @ManasviGoyal
- @swishdiff
In particular, most of the translation from version 1 to version 2 was the work of @ioanaif, the build/deployment was from @henryiii and @agoose77, the Awkward-RDataFrame bridge and other C++ interface from @ianna, GrowableBuffer/LayoutBuilder from @ManasviGoyal, and the CUDA and JAX foundations were laid by @swishdiff.
Additionally, we had help from:
Summary of changes
Nearly all of the code is written in Python now. Exceptions are the "kernel" functions, GrowableBuffer, LayoutBuilder, ArrayBuilder, AwkwardForth, and dynamically generated C++ code for RDataFrame.
Performance is maintained because any algorithm that scales with the size of an array is implemented in a compiled "kernel" function.
Split into two packages: awkward-cpp for the C++ part (infrequently updated, binary distribution for most platforms and Python versions) and awkward, the Python part (frequently updated).
Virtual arrays and Partitions (collectively, "lazy arrays") have been removed in favor of dask-awkward.
Awkward Arrays can be converted to and from RDataFrame, generating C++ for ROOT to JIT-compile so that iteration over Awkward Array input is fast (adapted from the Numba implementation).
Auto-differentiation of functions on Awkward Arrays using JAX. (But not JAX JIT-compilation.)
Suite of header-only C++ that does not depend on Awkward Arrays, but can be used to produce them and quickly get them from C++ to Python. The header-only suite includes GrowableBuffer and LayoutBuilder.
New documentation website (https://awkward-array.org/), based on JupyterBooks and following the NumPy/SciPy/Pandas style and organization, with a notebook that can be executed in your web browser.
More expressive error messages, highlighting the ak.* function that was in progress when the error occurred, with its arguments. (That is, highlighting ak.* functions as the granularity of feedback to users of Awkward Array, rather than making you search through the stack trace to the hand-off from your code to ours.)
Brackets are always balanced in the console representation of an array:
>>> ak.Array([
... [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}],
... [],
... [{"x": 3.3, "y": [1, 2, 3]}],
... ])
<Array [[{x: 1.1, y: [1]}, {...}], ...] type='3 * var * {x: float64, y: var...'>
as opposed to
<Array [[{x: 1.1, y: [1]}, ... y: [1, 2, 3]}]] type='3 * var * {"x": float64, "y...'>
in version 1. There are also show methods for values
[[{x: 1.1, y: [1]}, {x: 2.2, y: [1, 2]}],
[],
[{x: 3.3, y: [1, 2, 3]}]]
and types
3 * var * {
x: float64,
y: var * int64
}
This extended show output is the default representation in Jupyter.
Round-trip fidelity in ak.to_arrow/ak.from_arrow: no Awkward Array metadata is lost. Same for ak.to_parquet/ak.from_parquet, to the extent that pyarrow can read and write Parquet.
Parquet column selection using wildcards.
Data exported with version 1 ak.to_buffers can be imported by version 2 ak.from_buffers, with custom buffer_keys.
The majority of version 1 tests have been ported to version 2, to ensure that the interface and functionality don't change, except where intended (e.g. organizing naming conventions).
Consistent handling of date-time and time-delta types (matches NumPy's system).
Improved ak.to_json/ak.from_json arguments (for converting the non-JSON types NaN, infinity, and complex numbers), and ak.from_json can use a known JSONSchema to run faster. Removed ambiguities about newline-delimited JSON (it now requires an explicit argument).
The world's fastest Avro file reader in Python, ak.from_avro_file (uses AwkwardForth).
"nan" versions of NumPy functions, such as np.nansum, np.nanmean, np.nanstd.
Renamed ak.to_pandas → ak.to_dataframe, to clarify distinction from awkward-pandas.
Type and Form objects are better organized and more consistent.
Clear specification of NumPy dtypes that can be used in Awkward Arrays (bool, numbers, including complex, and date-time/time-delta).
Organized naming conventions throughout the codebase, such as keys versus fields versus recordlookup.
Carefully examined the public API (all modules, functions, classes, and methods that don't start with an underscore) to be sure that we can support it going forward. Any removal or change of an interface will require a minor version number increase and a deprecation cycle, on the order of months. (New features and bug-fixes can be immediate, on patch releases.)
Flags and "configuration" function arguments are now keyword-only (order independent).
Started adding Python type hints (nowhere near complete, but started).
Removed the Identities from array nodes. They were never fully implemented; they were a placeholder for a feature that won't be developed within Awkward Array (SQL-style JOINs).
TypeTracerArray does a "dry run" of a calculation to predict its type at the end. Used to build a computation graph for dask-awkward.
Equivalent but ungainly type combinations, such as "option-type of option-type of X" or "union-type containing union-types," have been outlawed with tools to squash them into a canonical layout. Operations on the data now have fewer possibilities to worry about.
Simplified the semantics of nbytes.
Clarified how ak.ravel and ak.flatten treat missing data.
Added missing ArrayBuilder methods in Numba.
Set up framework for performing ak.* operations in CUDA, using CuPy JIT-compilation to get the right interface, with error message passing through asynchronous GPU computations. The CUDA implementation is not complete, but the groundwork has been laid.
Parquet reading/writing uses fsspec for remote connections and handles the dataset protocol correctly.
Public interface to our power-tools: ak.transform unifies our internal recursively_apply and broadcast_and_apply.
All ak.* functions were rewritten; some were redesigned in a simpler way using the above-mentioned power tools. There are likely fewer undiscovered bugs in these functions now.
Use of records in numerical functions is more restricted, to prevent misapplication of ufuncs like + and reducers like np.sum on meaningful records such as Vectors.
All docstrings updated for version 2, including the code examples.