diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index 4fe4b935f144b..3ae34f4bb6ebb 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -8,38 +8,41 @@ - Author: [Philippe THOMY](https://github.com/loco-philippe) - Revision: 3 +##### Summary -#### Summary - [Abstract](./0012-compact-and-reversible-JSON-interface.md/#Abstract) - - [Problem description](./0012-compact-and-reversible-JSON-interface.md/#Problem-description) - - [Feature Description](./0012-compact-and-reversible-JSON-interface.md/#Feature-Description) + - [Problem description](./0012-compact-and-reversible-JSON-interface.md/#Problem-description) + - [Feature Description](./0012-compact-and-reversible-JSON-interface.md/#Feature-Description) - [Scope](./0012-compact-and-reversible-JSON-interface.md/#Scope) - [Motivation](./0012-compact-and-reversible-JSON-interface.md/#Motivation) - - [Why is it important to have a compact and reversible JSON interface ?](./0012-compact-and-reversible-JSON-interface.md/#Why-is-it-important-to-have-a-compact-and-reversible-JSON-interface-?) - - [Is it relevant to take an extended type into account ?](./0012-compact-and-reversible-JSON-interface.md/#Is-it-relevant-to-take-an-extended-type-into-account-?) - - [Is this only useful for pandas ?](./0012-compact-and-reversible-JSON-interface.md/#Is-this-only-useful-for-pandas-?) + - [Why is it important to have a compact and reversible JSON interface ?](./0012-compact-and-reversible-JSON-interface.md/#Why-is-it-important-to-have-a-compact-and-reversible-JSON-interface-?) + - [Is it relevant to take an extended type into account ?](./0012-compact-and-reversible-JSON-interface.md/#Is-it-relevant-to-take-an-extended-type-into-account-?) 
+ - [Is this only useful for pandas ?](./0012-compact-and-reversible-JSON-interface.md/#Is-this-only-useful-for-pandas-?) - [Description](./0012-compact-and-reversible-JSON-interface.md/#Description) - - [Data typing](./0012-compact-and-reversible-JSON-interface.md/#Data-typing) - - [Correspondence between TableSchema and pandas](./panda0012-compact-and-reversible-JSON-interfaces_PDEP.md/#Correspondence-between-TableSchema-and-pandas) - - [JSON format](./0012-compact-and-reversible-JSON-interface.md/#JSON-format) - - [Conversion](./0012-compact-and-reversible-JSON-interface.md/#Conversion) + - [Data typing](./0012-compact-and-reversible-JSON-interface.md/#Data-typing) + - [Correspondence between TableSchema and pandas](./panda0012-compact-and-reversible-JSON-interfaces_PDEP.md/#Correspondence-between-TableSchema-and-pandas) + - [JSON format](./0012-compact-and-reversible-JSON-interface.md/#JSON-format) + - [Conversion](./0012-compact-and-reversible-JSON-interface.md/#Conversion) - [Usage and impact](./0012-compact-and-reversible-JSON-interface.md/#Usage-and-impact) - - [Usage](./0012-compact-and-reversible-JSON-interface.md/#Usage) - - [Compatibility](./0012-compact-and-reversible-JSON-interface.md/#Compatibility) - - [Impacts on the pandas framework](./0012-compact-and-reversible-JSON-interface.md/#Impacts-on-the-pandas-framework) - - [Risk to do / risk not to do](./0012-compact-and-reversible-JSON-interface.md/#Risk-to-do-/-risk-not-to-do) + - [Usage](./0012-compact-and-reversible-JSON-interface.md/#Usage) + - [Compatibility](./0012-compact-and-reversible-JSON-interface.md/#Compatibility) + - [Impacts on the pandas framework](./0012-compact-and-reversible-JSON-interface.md/#Impacts-on-the-pandas-framework) + - [Risk to do / risk not to do](./0012-compact-and-reversible-JSON-interface.md/#Risk-to-do-/-risk-not-to-do) - [Implementation](./0012-compact-and-reversible-JSON-interface.md/#Implementation) - - 
[Modules](./0012-compact-and-reversible-JSON-interface.md/#Modules) - - [Implementation options](./0012-compact-and-reversible-JSON-interface.md/#Implementation-options) + - [Modules](./0012-compact-and-reversible-JSON-interface.md/#Modules) + - [Implementation options](./0012-compact-and-reversible-JSON-interface.md/#Implementation-options) - [F.A.Q.](./0012-compact-and-reversible-JSON-interface.md/#F.A.Q.) - [Synthesis](./0012-compact-and-reversible-JSON-interface.md/Synthesis) - [Core team decision](./0012-compact-and-reversible-JSON-interface.md/#Core-team-decision) - [Timeline](./0012-compact-and-reversible-JSON-interface.md/#Timeline) - [PDEP history](./0012-compact-and-reversible-JSON-interface.md/#PDEP-history) + ------------------------- + ## Abstract ### Problem description + The `dtype` and "Python type" are not explicitly taken into account in the current JSON interface. So, the JSON interface is not always reversible and has inconsistencies related to the consideration of the `dtype`. @@ -48,15 +51,17 @@ Another consequence is the partial application of the Table Schema specification Some JSON-interface problems are detailed in the [linked NoteBook](https://nbviewer.org/github/loco-philippe/ntv-pandas/blob/main/example/example_json_pandas.ipynb#Current-Json-interface) - ### Feature Description + To have a simple, compact and reversible solution, I propose to use the [JSON-NTV format (Named and Typed Value)](https://github.com/loco-philippe/NTV#readme) - which integrates the notion of type - and its JSON-TAB variation for tabular data (the JSON-NTV format is defined in an [IETF Internet-Draft](https://datatracker.ietf.org/doc/draft-thomy-json-ntv/) (not yet an RFC !!) ). 
This solution allows to include a large number of types (not necessarily pandas `dtype`) which allows to have: + - a Table Schema JSON interface (`orient="table"`) which respects the Table Schema specification (going from 6 types to 20 types), - a global JSON interface for all pandas data formats. #### Global JSON interface example + In the example below, a DataFrame with several data types is converted to JSON. The DataFrame resulting from this JSON is identical to the initial DataFrame (reversibility). @@ -65,7 +70,8 @@ With the existing JSON interface, this conversion is not possible. This example uses `ntv_pandas` module defined in the [ntv-pandas repository](https://github.com/loco-philippe/ntv-pandas#readme). -*data example* +Data example: + ```python In [1]: from shapely.geometry import Point from datetime import date @@ -94,7 +100,7 @@ Out[4]: dates::date value value32 res coord::point names unique 600 2022-01-21 30 32 30 POINT (5 6) maria True ``` -*JSON representation* +JSON representation ```python In [5]: df_to_json = npd.to_json(df) @@ -109,16 +115,18 @@ Out[5]: {':tab': {'coord::point': [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0 'value32::int32': [12, 12, 22, 22, 32, 32]}} ``` -*Reversibility* +Reversibility ```python In [5]: df_from_json = npd.read_json(df_to_json) print('df created from JSON is equal to initial df ? ', df_from_json.equals(df)) Out[5]: df created from JSON is equal to initial df ? True ``` + Several other examples are provided in the [linked NoteBook](https://nbviewer.org/github/loco-philippe/ntv-pandas/blob/main/example/example_ntv_pandas.ipynb) #### Table Schema JSON interface example + In the example below, a DataFrame with several Table Schema data types is converted to JSON. The DataFrame resulting from this JSON is identical to the initial DataFrame (reversibility). 
@@ -142,7 +150,7 @@ Out[3]: end february::date coordinates::point contact::email 2 2025-02-28 POINT (4.9 45.8) walter.white@breaking.com ``` -*JSON representation* +JSON representation ```python In [4]: df_to_table = npd.to_json(df, table=True) @@ -158,16 +166,18 @@ Out[4]: {'schema': {'fields': [{'name': 'index', 'type': 'integer'}, {'index': 2, 'end february': '2025-02-28', 'coordinates': [4.9, 45.8], 'contact': 'walter.white@breaking.com'}]} ``` -*Reversibility* +Reversibility ```python In [5]: df_from_table = npd.read_json(df_to_table) print('df created from JSON is equal to initial df ? ', df_from_table.equals(df)) Out[5]: df created from JSON is equal to initial df ? True ``` + Several other examples are provided in the [linked NoteBook](https://nbviewer.org/github/loco-philippe/ntv-pandas/blob/main/example/example_table_pandas.ipynb) ## Scope + The objective is to make available the proposed JSON interface for any type of data and for `orient="table"` option or a new option `orient="ntv"`. The proposed interface is compatible with existing data. @@ -175,32 +185,38 @@ The proposed interface is compatible with existing data. ## Motivation ### Why extend the `orient=table` option to other data types? + - The Table Schema specification defines 24 data types, 6 are taken into account in the pandas interface ### Why is it important to have a compact and reversible JSON interface ? + - a reversible interface provides an exchange format. - a textual exchange format facilitates exchanges between platforms (e.g. OpenData) - a JSON exchange format can be used at API level ### Is it relevant to take an extended type into account ? + - it avoids the addition of an additional data schema - it increases the semantic scope of the data processed by pandas - it is an answer to several issues (e.g. 
#12997, #14358, #16492, #35420, #35464, #36211, #39537, #49585, #50782, #51375, #52595, #53252)
 - the use of a complementary type avoids having to modify the pandas data model

 ### Is this only useful for pandas ?
+
 - the JSON-TAB format is applicable to tabular data and multi-dimensional data.
 - this JSON interface can therefore be used for any application using tabular or multi-dimensional data. This would allow for example reversible data exchanges between pandas - DataFrame and Xarray - DataArray (Xarray issue under construction) [see example DataFrame / DataArray](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#Multidimensional-data).

 ## Description

 The proposed solution is based on several key points:
+
 - data typing
 - correspondence between TableSchema and pandas
 - JSON format for tabular data
 - conversion to and from JSON format

 ### Data typing
+
 Data types are defined and managed in the NTV project (name, JSON encoder and decoder).

 Pandas `dtype` are compatible with NTV types :
@@ -217,6 +233,7 @@ Pandas `dtype` are compatible with NTV types :
 | boolean | boolean |

 Note:
+
 - datetime with timezone is a single NTV type (string ISO8601)
 - `CategoricalDtype` and `SparseDtype` are included in the tabular JSON format
 - `object` `dtype` is depending on the context (see below)
@@ -234,12 +251,14 @@ JSON types (implicit or explicit) are converted in `dtype` following pandas JSON
 | null | NaT / NaN / None |

 Note:
+
 - if an NTV type is defined, the `dtype` is adjusted accordingly
 - the consideration of null type data needs to be clarified

 The other NTV types are associated with `object` `dtype`.

 ### Correspondence between TableSchema and pandas
+
 The TableSchema typing is carried by two attributes `format` and `type`.

 The table below shows the correspondence between TableSchema format / type and pandas NTVtype / dtype:
@@ -264,10 +283,12 @@ The table below shows the correspondence between TableSchema format / type and p
 | default / geojson | geojson / object |

 Note:
+
 - other TableSchema format are defined and are to be studied (uuid, binary, topojson, specific format for geopoint and datation)
 - the first six lines correspond to the existing

 ### JSON format
+
 The JSON format for the TableSchema interface is the existing.

 The JSON format for the Global interface is defined in [JSON-TAB](https://github.com/loco-philippe/NTV/blob/main/documentation/JSON-TAB-standard.pdf) specification.
@@ -275,22 +296,27 @@ It includes the naming rules originally defined in the [JSON-ND project](https:/
 The specification have to be updated to include sparse data.

 ### Conversion
+
 When data is associated with a non-`object` `dtype`, pandas conversion methods are used.
 Otherwise, NTV conversion is used.

 #### pandas -> JSON
+
 - `NTV type` is not defined : use `to_json()`
 - `NTV type` is defined and `dtype` is not `object` : use `to_json()`
 - `NTV type` is defined and `dtype` is `object` : use NTV conversion (if pandas conversion does not exist)

 #### JSON -> pandas
+
 - `NTV type` is compatible with a `dtype` : use `read_json()`
 - `NTV type` is not compatible with a `dtype` : use NTV conversion (if pandas conversion does not exist)

 ## Usage and Impact

 ### Usage
+
 It seems to me that this proposal responds to important issues:
+
 - having an efficient text format for data exchange

 The alternative CSV format is not reversible and obsolete (last revision in 2005). Current CSV tools do not comply with the standard.
@@ -300,21 +326,26 @@ It seems to me that this proposal responds to important issues:
 - having a complete Table Schema interface

 ### Compatibility
+
 Interface can be used without NTV type (compatibility with existing data - [see examples](https://nbviewer.org/github/loco-philippe/ntv-pandas/blob/main/example/example_ntv_pandas.ipynb#Appendix-:-Series-tests))

 If the interface is available, throw a new `orient` option in the JSON interface, the use of the feature is decoupled from the other features.

 ### Impacts on the pandas framework
+
 Initially, the impacts are very limited:
+
 - modification of the `name` of `Series` or `DataFrame columns` (no functional impact),
 - added an option in the Json interface (e.g. `orient='ntv'`) and added associated methods (no functional interference with the other methods)

 In later stages, several developments could be considered:
+
 - validation of the `name` of `Series` or `DataFrame columns` ,
 - management of the NTV type as a "complementary-object-dtype"
 - functional extensions depending on the NTV type

 ### Risk to do / risk not to do
+
 The JSON-NTV format and the JSON-TAB format are not (yet) recognized and used formats.
 The risk for pandas is that this function is not used (no functional impacts).
 On the other hand, the early use by pandas will allow a better consideration of the expectations and needs of pandas as well as a reflection on the evolution of the types supported by pandas.
@@ -322,6 +353,7 @@ On the other hand, the early use by pandas will allow a better consideration of
 ## Implementation

 ### Modules
+
 Two modules are defined for NTV:

 - json-ntv
@@ -335,6 +367,7 @@ Two modules are defined for NTV:
 The pandas integration of the JSON interface requires importing only the json-ntv module.

 ### Implementation options
+
 The interface can be implemented as NTV connector (`SeriesConnector` and `DataFrameConnector`) and as a new pandas JSON interface `orient` option.

Several pandas implementations are possible: @@ -366,26 +399,28 @@ Several pandas implementations are possible: **A**: In principle, yes, this option takes into account the notion of type. But this is very limited (see examples added in the [Notebook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb)) : + - **Types and Json interface** - - the only way to keep the types in the json interface is to use the `orient='table'` option - - few dtypes are not allowed in json-table interface : period, timedelta64, interval - - allowed types are not always kept in json-table interface - - data with 'object' dtype is kept only id data is string - - with categorical dtype, the underlying dtype is not included in json interface + - the only way to keep the types in the json interface is to use the `orient='table'` option + - few dtypes are not allowed in json-table interface : period, timedelta64, interval + - allowed types are not always kept in json-table interface + - data with 'object' dtype is kept only id data is string + - with categorical dtype, the underlying dtype is not included in json interface - **Data compactness** - - json-table interface is not compact (in the example in the [Notebook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#data-compactness))the size is triple or quadruple the size of the compact format + - json-table interface is not compact (in the example in the [Notebook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#data-compactness))the size is triple or quadruple the size of the compact format - **Reversibility** - - Interface is reversible only with few dtypes : int64, float64, bool, string, datetime64 and partially categorical + - Interface is reversible only with few dtypes : int64, float64, bool, string, datetime64 and partially categorical - **External types** - - the interface does not accept external types - - Table-schema 
defines 20 data types but the `orient="table"` interface takes into account 5 data types (see [table](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#Converting-table-schema-type-to-pandas-dtype)) - - to integrate external types, it is necessary to first create ExtensionArray and ExtensionDtype objects + - the interface does not accept external types + - Table-schema defines 20 data types but the `orient="table"` interface takes into account 5 data types (see [table](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#Converting-table-schema-type-to-pandas-dtype)) + - to integrate external types, it is necessary to first create ExtensionArray and ExtensionDtype objects The current interface is not compatible with the data structure defined by table-schema. For this to be possible, it is necessary to integrate a "type extension" like the one proposed (this has moreover been partially achieved with the notion of `extDtype` found in the interface for several formats). **Q: In general, we should only have 1 `"table"` format for pandas in read_json/to_json. There is also the issue of backwards compatibility if we do change the format. The fact that the table interface is buggy is not a reason to add a new interface (I'd rather fix those bugs). Can the existing format be adapted in a way that fixes the type issues/issues with roundtripping?** **A**: I will add two additional remarks: + - the types defined in Tableschema are partially taken into account (examples of types not taken into account in the interface: string-uri, array, date, time, year, geopoint, string-email): - the `read_json()` interface works too with the following data: `{'simple': [1,2,3] }` (contrary to what is indicated in the documentation) but it is impossible with `to_json()` to recreate this simple json. 
@@ -405,15 +440,20 @@ Regarding the underlying JSON-NTV format, its impact is quite low for tabular da
 Nevertheless, the question is relevant: The JSON-NTV format ([IETF Internet-Draft](https://datatracker.ietf.org/doc/draft-thomy-json-ntv/)) is a shared, documented, supported and implemented format, but indeed the community support is for the moment reduced but it only asks to expand !!

 ## Synthesis
+
 To conclude,
+
 - if it is important (or strategic) to have a reversible JSON interface for any type of data, the proposal can be allowed,
 - if not, a third-party package listed in the [ecosystem](https://pandas.pydata.org/community/ecosystem.html) that reads/writes this format to/from pandas DataFrames should be considered

 ## Core team decision
+
 Vote was open from september-11 to setpember-26:
+
 - Final tally is 0 approvals, 5 abstentions, 7 disapprove. The quorum has been met. The PDEP fails.

 **Disapprove comments** :
+
 - 1 Given the newness of the proposed JSON NTV format, I would support (as described in the PDEP): "if not, a third-party package listed in the ecosystem that reads/writes this format to/from pandas DataFrames should be considered"
 - 2 Same reason as -1-, this should be a third party package for now
 - 3 Not mature enough, and not clear what the market size would be.
@@ -423,10 +463,12 @@ Vote was open from september-11 to setpember-26:
 - 7 while I do think having a more comprehensive JSON format would be worthwhile, making a new format part of pandas means an implicit endorsement of a standard that is still being reviewed by the broader community.

 **Decision**:
+
 - add the `ntv-pandas` package in the [ecosystem](https://pandas.pydata.org/community/ecosystem.html)
 - revisit again this PDEP at a later stage, for example in 1/2 to 1 year (based on the evolution of the Internet draft [JSON semantic format (JSON-NTV)](https://www.ietf.org/archive/id/draft-thomy-json-ntv-01.html) and the usage of the [ntv-pandas](https://github.com/loco-philippe/ntv-pandas#readme))

 ## Timeline
+
 Not applicable

 ## PDEP History