Optimize memory usage (cre-dev#6)
Parsing:
* Add iterative parsing as an optional behavior
* Customized hash function

Inserting:
* Add optional max_lines in insert

Model config:
* Add configurable metadata
* Allow custom indices to be added
cre-os authored Jun 27, 2024
1 parent b547653 commit ec5ef05
Showing 42 changed files with 1,331 additions and 1,044 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/python-package.yml
@@ -17,7 +17,7 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: ["3.9", "3.10", "3.11"]
python-version: ["3.9", "3.10", "3.11", "3.12"]

steps:
- uses: actions/checkout@v3
16 changes: 11 additions & 5 deletions docs/api/overview.md
@@ -28,15 +28,22 @@
## *Advanced use:* loading data into the database

The flow chart below presents data conversions used to load an XML file into the database, showing the functions used
for lower level steps. It can be useful for advanced use case if you want for instance to transform the data in
intermediate steps.
for lower level steps. It can be useful for advanced use cases, for instance:

* transforming the data in intermediate steps,
* adding logging,
* limiting concurrent access to the database within a multiprocess setup, etc.

For those scenarios you can easily reimplement
[`Document.insert_into_target_tables`](document.md/#xml2db.document.Document.insert_into_target_tables) to suit your
needs, using lower level functions.

```mermaid
flowchart TB
subgraph "<a href='../data_model/#xml2db.model.DataModel.parse_xml' style='color:var(--md-code-fg-color)'>DataModel.parse_xml</a>"
direction TB
A[XML file]-- "<a href='../xml_converter/#xml2db.xml_converter.XMLConverter.parse_xml' style='color:var(--md-code-fg-color)'>XMLConverter.parse_xml</a>" -->B[Document tree]
B-- "Document._compute_records_hashes\n<a href='../document/#xml2db.document.Document.doc_tree_to_flat_data' style='color:var(--md-code-fg-color)'>Document.doc_tree_to_flat_data</a>" -->C[Flat data model]
B-- "<a href='../document/#xml2db.document.Document.doc_tree_to_flat_data' style='color:var(--md-code-fg-color)'>Document.doc_tree_to_flat_data</a>" -->C[Flat data model]
end
C -.- D
subgraph "<a href='../document/#xml2db.document.Document.insert_into_target_tables' style='color:var(--md-code-fg-color)'>Document.insert_into_target_tables</a>"
@@ -49,8 +56,7 @@ flowchart TB
## *Advanced use:* get data from the database back to XML

The flow chart below presents data conversions used to get back data from the database into XML, showing the functions
used for lower level steps. It can be useful for advanced use case if you want for instance to transform the data in
intermediate steps.
used for lower level steps.

```mermaid
flowchart TB
89 changes: 72 additions & 17 deletions docs/configuring.md
@@ -16,9 +16,62 @@ The column types can also be configured to override the default type mapping, us
diagram (see the [Getting started](getting_started.md) page for directions on how to visualize data models) and
then adapt the configuration if need be.

Configuration options are described below.
Configuration options are described below. Some options can be set at the model level, others at the table level, and
others at the field level. The general structure of the configuration dict is the following:

```py title="Model config general structure" linenums="1"
{
    "document_tree_hook": None,
    "document_tree_node_hook": None,
    "row_numbers": False,
    "as_columnstore": False,
    "metadata_columns": None,
    "tables": {
        "table1": {
            "reuse": True,
            "choice_transform": False,
            "as_columnstore": False,
            "fields": {
                "my_column": {
                    "type": None,  # default type
                }
            },
            "extra_args": [],
        }
    }
}
```

## Model configuration

## Field level config
The following options can be passed as top-level keys of the model configuration `dict`:

* `document_tree_hook` (`Callable`): sets a hook function which can modify the data extracted from the XML. It gives direct
access to the underlying tree data structure just before it is extracted to be loaded to the database. This can be used,
for instance, to prune or modify some parts of the document tree before loading it into the database. The document tree
should of course stay compatible with the data model.
* `document_tree_node_hook` (`Callable`): sets a hook function which can modify the data extracted from the XML. It is
similar to `document_tree_hook`, but it is called as soon as a node is completed, without waiting for the entire parsing
to finish. It is especially useful if you intend to filter out some nodes and reduce the memory footprint while parsing.
* `row_numbers` (`bool`): adds `xml2db_row_number` columns either to `n-n` relationships tables, or directly to data tables
when row deduplication is disabled. This allows recording the original order of elements in the source XML, which is not
always preserved otherwise. It was implemented primarily for round-trip tests, but could serve other purposes. The
default value is `False` (disabled).
* `as_columnstore` (`bool`): for MS SQL Server, create clustered columnstore indexes on all tables. This can also be set
at the table level for each table. However, for `n-n` relationships tables, this option is the only way to configure the
clustered columnstore indexes. The default value is `False` (disabled).
* `metadata_columns` (`list`): a list of extra columns that you want to add to the root table of your model. This is
useful for instance to add the name of the file which has been parsed, or a timestamp, etc. Columns should be specified
as dicts, the only required keys are `name` and `type` (a SQLAlchemy type object); other keys will be passed directly
as keyword arguments to `sqlalchemy.Column`. Actual values need to be passed to
[`Document.insert_into_target_tables`](api/document.md#xml2db.document.Document.insert_into_target_tables) for each
parsed document, as a `dict`, using the `metadata` argument.
* `record_hash_column_name`: the column name used to store record hashes (defaults to `xml2db_record_hash`).
* `record_hash_constructor`: a function used to build a hash, with a signature similar to `hashlib` constructor
functions (defaults to `hashlib.sha1`).
* `record_hash_size`: the byte size of the record hash (defaults to 20, which is the size of a `sha-1` hash).
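A minimal sketch combining some of these options (the hook body, its return convention, and the hash choice are assumptions for illustration, not defaults):

```python
import hashlib
import logging

logger = logging.getLogger("xml2db_demo")

# Hypothetical hook: log each completed node and return it unchanged.
# The exact node structure passed by xml2db is not shown here.
def log_node(node):
    logger.debug("parsed node: %r", node)
    return node

model_config = {
    # called for each completed node while parsing
    "document_tree_node_hook": log_node,
    # store 32-byte sha-256 record hashes instead of the default 20-byte sha-1
    "record_hash_constructor": hashlib.sha256,
    "record_hash_size": 32,
    "record_hash_column_name": "xml2db_record_hash",
}
```

Note that `record_hash_size` must match the digest size of the chosen constructor (32 bytes for sha-256).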

## Fields configuration

These configuration options are defined for a specific field of a specific table. A "field" refers to a column in the
table, or a child table.
@@ -140,7 +193,7 @@ timeInterval_end[1, 1]: string
}
```

## Table level config
## Tables configuration

### Simplify "choice groups"

@@ -226,20 +279,22 @@ With MS SQL Server database backend, `xml2db` can create
on tables. However, for `n-n` relationships tables, this option needs to be set globally (see below). The default value
is `False` (disabled).

Configuration: `"as_columnstore":` `False` (default) or `True`
### Extra arguments

## Global options
Extra arguments can be passed to `sqlalchemy.Table` constructors, for instance if you want to customize indexes. These
can be passed in an iterable (e.g. `tuple` or `list`) which will simply be unpacked into the `sqlalchemy.Table`
constructor when building the table.

These options can be passed as a top-level keys of the model configuration `dict`:
Configuration: `"extra_args": []` (default)

* `document_tree_hook` (`Callable`): sets a hook function which can modify the data extracted from the XML. It gives direct
access to the underlying tree data structure just before it is extracted to be loaded to the database. This can be used,
for instance, to prune or modify some parts of the document tree before loading it into the database. The document tree
should of course stay compatible with the data model.
* `row_numbers` (`bool`): adds `xml2db_row_number` columns either to `n-n` relationships tables, or directly to data tables when
deduplication of rows is opted out. This allows recording the original order of elements in the source XML, which is not
always respected otherwise. It was implemented primarily for round-trip tests, but could serve other purposes. The
default value is `False` (disabled).
* `as_columnstore` (`bool`): for MS SQL Server, create clustered columnstore indexes on all tables. This can be also set up at
the table level for each table. However, for `n-n` relationships tables, this option is the only way to configure the
clustered columnstore indexes. The default value is `False` (disabled).
!!! example
Adding an index on a specific column:
``` python
model_config = {
    "tables": {
        "my_table": {
            "extra_args": [sqlalchemy.Index("my_index", "my_column1", "my_column2")],
        }
    }
}
```
8 changes: 3 additions & 5 deletions docs/getting_started.md
@@ -117,11 +117,9 @@ Please read the [How it works](how_it_works.md) page to learn more about the pro
troubleshooting if need be.

!!! note
`xml2db` saves metadata for all loaded XML files. These are currently not configurable and create two additional
columns in the root table:

* `xml2db_input_file_path`: the file path provided to `DataModel.parse_xml`,
* `xml2db_processed_at`: the timestamp at which `DataModel.parse_xml` was called.
`xml2db` can save metadata for each loaded XML file. These can be configured using the
[`metadata_columns` option](configuring.md#model-configuration) and create additional columns in the root table.
This can be used, for instance, to save the file name or a loading timestamp.
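For instance, under the assumption that two metadata columns named `input_file_name` and `loaded_at` have been configured, the per-document values could be built like this:

```python
from datetime import datetime, timezone

# Values for the extra columns declared via the `metadata_columns` option.
# The column names below are illustrative; they must match your configuration.
metadata = {
    "input_file_name": "orders_2024-06-27.xml",
    "loaded_at": datetime.now(timezone.utc),
}

# Passed for each parsed document when inserting (not runnable here without
# a database):
# document.insert_into_target_tables(..., metadata=metadata)
```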

## Getting back the data into XML

8 changes: 5 additions & 3 deletions docs/how_it_works.md
@@ -151,8 +151,8 @@ in memory makes the processing way simpler and faster. We handle files with a si

### Computing hashes

We compute tree hashes (`sha-1`) recursively by adding to each node's hash the hashes of its children element, be it
simple types, attributes or complex types. Children are processed in the specific order they appeared in the XSD schema,
We compute tree hashes recursively by adding to each node's hash the hashes of its child elements, be they simple
types, attributes or complex types. Children are processed in the specific order in which they appear in the XSD schema,
so that hashing is fully deterministic.
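The idea can be sketched as follows (the node shape is illustrative, not xml2db's internal structure):

```python
import hashlib

# Minimal sketch of deterministic recursive tree hashing. Each node is a
# (tag, value, children) tuple; children are hashed in schema order, so two
# trees with the same content in the same order get the same hash.
def node_hash(node, constructor=hashlib.sha1):
    tag, value, children = node
    h = constructor()
    h.update(tag.encode())
    h.update(repr(value).encode())
    for child in children:
        h.update(node_hash(child, constructor))
    return h.digest()
```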

Right after this step, a hook function is called if provided in the configuration (top-level `document_tree_hook` option
@@ -187,7 +187,9 @@ We keep the primary keys from the flat data model created at the previous stage,
The last step is to merge the temporary tables data into the target tables, while enforcing deduplication, keeping
relationships, etc.

This is done by issuing a sequence of `update` and `insert` SQL statements using `sqlalchemy`, in a single transaction.
This is done by issuing a sequence of `update` and `insert` SQL statements using `sqlalchemy`, in a single transaction
(default) or in multiple transactions.

The process boils down to:

* inserting missing records into the target tables,
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "xml2db"
version = "0.10.1"
version = "0.11.0"
authors = [
{ name="Commission de régulation de l'énergie", email="[email protected]" },
]
36 changes: 17 additions & 19 deletions requirements.txt
@@ -1,18 +1,18 @@
Babel==2.14.0
certifi==2024.2.2
Babel==2.15.0
certifi==2024.6.2
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
elementpath==4.4.0
exceptiongroup==1.2.0
exceptiongroup==1.2.1
ghp-import==2.1.0
greenlet==3.0.3
griffe==0.45.0
idna==3.6
griffe==0.45.2
idna==3.7
iniconfig==2.0.0
Jinja2==3.1.3
Jinja2==3.1.4
lxml==5.1.0
Markdown==3.5.2
Markdown==3.6
MarkupSafe==2.1.5
mergedeep==1.3.4
mkdocs==1.6.0
@@ -25,22 +25,20 @@ mkdocstrings-python==1.10.2
packaging==24.0
paginate==0.5.6
pathspec==0.12.1
platformdirs==4.2.0
pluggy==1.4.0
psycopg2==2.9.9
Pygments==2.17.2
pymdown-extensions==10.7.1
pyodbc==5.1.0
pytest==8.1.1
platformdirs==4.2.2
pluggy==1.5.0
Pygments==2.18.0
pymdown-extensions==10.8.1
pytest==8.2.2
python-dateutil==2.9.0.post0
PyYAML==6.0.1
pyyaml_env_tag==0.1
regex==2023.12.25
requests==2.31.0
regex==2024.5.15
requests==2.32.3
six==1.16.0
SQLAlchemy==2.0.28
SQLAlchemy==2.0.30
tomli==2.0.1
typing_extensions==4.10.0
typing_extensions==4.12.1
urllib3==2.2.1
watchdog==4.0.0
watchdog==4.0.1
xmlschema==3.1.0
6 changes: 3 additions & 3 deletions src/xml2db/__init__.py
@@ -1,6 +1,6 @@
from xml2db.model import DataModel
from xml2db.document import Document
from xml2db.table import (
from .model import DataModel
from .document import Document
from .table import (
DataModelTable,
DataModelTableReused,
DataModelTableDuplicated,
