Optimize memory usage (cre-dev#6)
Parsing:
* Add iterative parsing as an optional behavior
* Customized hash function

Inserting:
* Add optional max_lines in insert

Model config:
* Add configurable metadata
* Allow custom indices to be added
cre-os authored Jun 27, 2024
1 parent b547653 commit ec5ef05
Showing 42 changed files with 1,331 additions and 1,044 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/python-package.yml
@@ -17,7 +17,7 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: ["3.9", "3.10", "3.11"]
python-version: ["3.9", "3.10", "3.11", "3.12"]

steps:
- uses: actions/checkout@v3
16 changes: 11 additions & 5 deletions docs/api/overview.md
@@ -28,15 +28,22 @@
## *Advanced use:* loading data into the database

The flow chart below presents data conversions used to load an XML file into the database, showing the functions used
for lower level steps. It can be useful for advanced use case if you want for instance to transform the data in
intermediate steps.
for lower level steps. It can be useful for advanced use cases, for instance:

* transforming the data in intermediate steps,
* adding logging,
* limiting concurrent access to the database within a multiprocess setup, etc.

For those scenarios you can easily reimplement
[`Document.insert_into_target_tables`](document.md/#xml2db.document.Document.insert_into_target_tables) to suit your
needs, using lower level functions.

```mermaid
flowchart TB
subgraph "<a href='../data_model/#xml2db.model.DataModel.parse_xml' style='color:var(--md-code-fg-color)'>DataModel.parse_xml</a>"
direction TB
A[XML file]-- "<a href='../xml_converter/#xml2db.xml_converter.XMLConverter.parse_xml' style='color:var(--md-code-fg-color)'>XMLConverter.parse_xml</a>" -->B[Document tree]
B-- "Document._compute_records_hashes\n<a href='../document/#xml2db.document.Document.doc_tree_to_flat_data' style='color:var(--md-code-fg-color)'>Document.doc_tree_to_flat_data</a>" -->C[Flat data model]
B-- "<a href='../document/#xml2db.document.Document.doc_tree_to_flat_data' style='color:var(--md-code-fg-color)'>Document.doc_tree_to_flat_data</a>" -->C[Flat data model]
end
C -.- D
subgraph "<a href='../document/#xml2db.document.Document.insert_into_target_tables' style='color:var(--md-code-fg-color)'>Document.insert_into_target_tables</a>"
@@ -49,8 +56,7 @@ flowchart TB
## *Advanced use:* get data from the database back to XML

The flow chart below presents data conversions used to get back data from the database into XML, showing the functions
used for lower level steps. It can be useful for advanced use case if you want for instance to transform the data in
intermediate steps.
used for lower level steps.

```mermaid
flowchart TB
89 changes: 72 additions & 17 deletions docs/configuring.md
@@ -16,9 +16,62 @@ The column types can also be configured to override the default type mapping, us
diagram (see the [Getting started](getting_started.md) page for directions on how to visualize data models) and
then adapt the configuration if need be.

Configuration options are described below.
Configuration options are described below. Some options can be set at the model level, others at the table level, and
others at the field level. The general structure of the configuration dict is the following:

```py title="Model config general structure" linenums="1"
{
    "document_tree_hook": None,
    "document_tree_node_hook": None,
    "row_numbers": False,
    "as_columnstore": False,
    "metadata_columns": None,
    "tables": {
        "table1": {
            "reuse": True,
            "choice_transform": False,
            "as_columnstore": False,
            "fields": {
                "my_column": {
                    "type": None,  # default type
                }
            },
            "extra_args": [],
        }
    }
}
```

## Model configuration

## Field level config
The following options can be passed as top-level keys of the model configuration `dict`:

* `document_tree_hook` (`Callable`): sets a hook function which can modify the data extracted from the XML. It gives direct
access to the underlying tree data structure just before it is extracted to be loaded to the database. This can be used,
for instance, to prune or modify some parts of the document tree before loading it into the database. The document tree
should of course stay compatible with the data model.
* `document_tree_node_hook` (`Callable`): sets a hook function which can modify the data extracted from the XML. It is
similar to `document_tree_hook`, but it is called as soon as a node is completed, without waiting for the entire parsing
to finish. It is especially useful if you intend to filter out some nodes and reduce the memory footprint while parsing.
* `row_numbers` (`bool`): adds `xml2db_row_number` columns either to `n-n` relationships tables, or directly to data tables
when row deduplication is disabled. This allows recording the original order of elements in the source XML, which is not
always preserved otherwise. It was implemented primarily for round-trip tests, but could serve other purposes. The
default value is `False` (disabled).
* `as_columnstore` (`bool`): for MS SQL Server, create clustered columnstore indexes on all tables. This can also be set
at the table level for each table. However, for `n-n` relationships tables, this option is the only way to configure the
clustered columnstore indexes. The default value is `False` (disabled).
* `metadata_columns` (`list`): a list of extra columns that you want to add to the root table of your model. This is
useful for instance to add the name of the file which has been parsed, or a timestamp, etc. Columns should be specified
as dicts, the only required keys are `name` and `type` (a SQLAlchemy type object); other keys will be passed directly
as keyword arguments to `sqlalchemy.Column`. Actual values need to be passed to
[`Document.insert_into_target_tables`](api/document.md#xml2db.document.Document.insert_into_target_tables) for each
parsed document, as a `dict`, using the `metadata` argument.
* `record_hash_column_name`: the column name used to store record hashes (defaults to `xml2db_record_hash`).
* `record_hash_constructor`: a function used to build a hash, with a signature similar to `hashlib` constructor
functions (defaults to `hashlib.sha1`).
* `record_hash_size`: the byte size of the record hash (defaults to 20, which is the size of a `sha-1` hash).
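A minimal sketch combining some of these options (the hook body, its return convention, and the hash choice are assumptions for illustration, not defaults):

```python
import hashlib
import logging

logger = logging.getLogger("xml2db_demo")

# Hypothetical hook: log each completed node and return it unchanged.
# The exact node structure passed by xml2db is not shown here.
def log_node(node):
    logger.debug("parsed node: %r", node)
    return node

model_config = {
    # called for each completed node while parsing
    "document_tree_node_hook": log_node,
    # store 32-byte sha-256 record hashes instead of the default 20-byte sha-1
    "record_hash_constructor": hashlib.sha256,
    "record_hash_size": 32,
    "record_hash_column_name": "xml2db_record_hash",
}
```

Note that `record_hash_size` must match the digest size of the chosen constructor (32 bytes for sha-256).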

## Fields configuration

These configuration options are defined for a specific field of a specific table. A "field" refers to a column in the
table, or a child table.
@@ -140,7 +193,7 @@ timeInterval_end[1, 1]: string
}
```

## Table level config
## Tables configuration

### Simplify "choice groups"

@@ -226,20 +279,22 @@ With MS SQL Server database backend, `xml2db` can create
on tables. However, for `n-n` relationships tables, this option needs to be set globally (see below). The default value
is `False` (disabled).

Configuration: `"as_columnstore":` `False` (default) or `True`
### Extra arguments

## Global options
Extra arguments can be passed to `sqlalchemy.Table` constructors, for instance if you want to customize indexes. These
can be passed in an iterable (e.g. `tuple` or `list`) which will simply be unpacked into the `sqlalchemy.Table`
constructor when building the table.

These options can be passed as a top-level keys of the model configuration `dict`:
Configuration: `"extra_args": []` (default)

* `document_tree_hook` (`Callable`): sets a hook function which can modify the data extracted from the XML. It gives direct
access to the underlying tree data structure just before it is extracted to be loaded to the database. This can be used,
for instance, to prune or modify some parts of the document tree before loading it into the database. The document tree
should of course stay compatible with the data model.
* `row_numbers` (`bool`): adds `xml2db_row_number` columns either to `n-n` relationships tables, or directly to data tables when
deduplication of rows is opted out. This allows recording the original order of elements in the source XML, which is not
always respected otherwise. It was implemented primarily for round-trip tests, but could serve other purposes. The
default value is `False` (disabled).
* `as_columnstore` (`bool`): for MS SQL Server, create clustered columnstore indexes on all tables. This can be also set up at
the table level for each table. However, for `n-n` relationships tables, this option is the only way to configure the
clustered columnstore indexes. The default value is `False` (disabled).
!!! example
Adding an index on a specific column:
``` python
model_config = {
    "tables": {
        "my_table": {
            "extra_args": [sqlalchemy.Index("my_index", "my_column1", "my_column2")],
        }
    }
}
```
8 changes: 3 additions & 5 deletions docs/getting_started.md
@@ -117,11 +117,9 @@ Please read the [How it works](how_it_works.md) page to learn more about the pro
troubleshooting if need be.

!!! note
`xml2db` saves metadata for all loaded XML files. These are currently not configurable and create two additional
columns in the root table:

* `xml2db_input_file_path`: the file path provided to `DataModel.parse_xml`,
* `xml2db_processed_at`: the timestamp at which `DataModel.parse_xml` was called.
`xml2db` can save metadata for each loaded XML file. These can be configured using the
[`metadata_columns` option](configuring.md#model-configuration) and create additional columns in the root table.
This can be used, for instance, to save the file name or a loading timestamp.
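For instance, under the assumption that two metadata columns named `input_file_name` and `loaded_at` have been configured, the per-document values could be built like this:

```python
from datetime import datetime, timezone

# Values for the extra columns declared via the `metadata_columns` option.
# The column names below are illustrative; they must match your configuration.
metadata = {
    "input_file_name": "orders_2024-06-27.xml",
    "loaded_at": datetime.now(timezone.utc),
}

# Passed for each parsed document when inserting (not runnable here without
# a database):
# document.insert_into_target_tables(..., metadata=metadata)
```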

## Getting back the data into XML

8 changes: 5 additions & 3 deletions docs/how_it_works.md
@@ -151,8 +151,8 @@ in memory makes the processing way simpler and faster. We handle files with a si

### Computing hashes

We compute tree hashes (`sha-1`) recursively by adding to each node's hash the hashes of its children element, be it
simple types, attributes or complex types. Children are processed in the specific order they appeared in the XSD schema,
We compute tree hashes recursively by adding to each node's hash the hashes of its child elements, be they simple
types, attributes or complex types. Children are processed in the specific order in which they appear in the XSD schema,
so that hashing is fully deterministic.
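The idea can be sketched as follows (the node shape is illustrative, not xml2db's internal structure):

```python
import hashlib

# Minimal sketch of deterministic recursive tree hashing. Each node is a
# (tag, value, children) tuple; children are hashed in schema order, so two
# trees with the same content in the same order get the same hash.
def node_hash(node, constructor=hashlib.sha1):
    tag, value, children = node
    h = constructor()
    h.update(tag.encode())
    h.update(repr(value).encode())
    for child in children:
        h.update(node_hash(child, constructor))
    return h.digest()
```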

Right after this step, a hook function is called if provided in the configuration (top-level `document_tree_hook` option
@@ -187,7 +187,9 @@ We keep the primary keys from the flat data model created at the previous stage,
The last step is to merge the temporary tables data into the target tables, while enforcing deduplication, keeping
relationships, etc.

This is done by issuing a sequence of `update` and `insert` SQL statements using `sqlalchemy`, in a single transaction.
This is done by issuing a sequence of `update` and `insert` SQL statements using `sqlalchemy`, in a single transaction
(default) or in multiple transactions.

The process boils down to:

* inserting missing records into the target tables,
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "xml2db"
version = "0.10.1"
version = "0.11.0"
authors = [
{ name="Commission de régulation de l'énergie", email="[email protected]" },
]
36 changes: 17 additions & 19 deletions requirements.txt
@@ -1,18 +1,18 @@
Babel==2.14.0
certifi==2024.2.2
Babel==2.15.0
certifi==2024.6.2
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
elementpath==4.4.0
exceptiongroup==1.2.0
exceptiongroup==1.2.1
ghp-import==2.1.0
greenlet==3.0.3
griffe==0.45.0
idna==3.6
griffe==0.45.2
idna==3.7
iniconfig==2.0.0
Jinja2==3.1.3
Jinja2==3.1.4
lxml==5.1.0
Markdown==3.5.2
Markdown==3.6
MarkupSafe==2.1.5
mergedeep==1.3.4
mkdocs==1.6.0
@@ -25,22 +25,20 @@ mkdocstrings-python==1.10.2
packaging==24.0
paginate==0.5.6
pathspec==0.12.1
platformdirs==4.2.0
pluggy==1.4.0
psycopg2==2.9.9
Pygments==2.17.2
pymdown-extensions==10.7.1
pyodbc==5.1.0
pytest==8.1.1
platformdirs==4.2.2
pluggy==1.5.0
Pygments==2.18.0
pymdown-extensions==10.8.1
pytest==8.2.2
python-dateutil==2.9.0.post0
PyYAML==6.0.1
pyyaml_env_tag==0.1
regex==2023.12.25
requests==2.31.0
regex==2024.5.15
requests==2.32.3
six==1.16.0
SQLAlchemy==2.0.28
SQLAlchemy==2.0.30
tomli==2.0.1
typing_extensions==4.10.0
typing_extensions==4.12.1
urllib3==2.2.1
watchdog==4.0.0
watchdog==4.0.1
xmlschema==3.1.0
6 changes: 3 additions & 3 deletions src/xml2db/__init__.py
@@ -1,6 +1,6 @@
from xml2db.model import DataModel
from xml2db.document import Document
from xml2db.table import (
from .model import DataModel
from .document import Document
from .table import (
DataModelTable,
DataModelTableReused,
DataModelTableDuplicated,
