Skip to content

Easy importing of tabular data in Semantic Knowledge Graphs databases

License

Notifications You must be signed in to change notification settings

oncodash/ontoweaver

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

OntoWeaver

Overview

OntoWeaver is a tool for importing table data in Semantic Knowledge Graphs (SKG) databases.

OntoWeaver allows writing a simple declarative mapping to express how columns from a Pandas table are to be converted as typed nodes or edges in an SKG.

It provides a simple layer of abstraction on top of Biocypher, which remains responsible for doing the ontology alignment, supporting several graph database backends, and allowing reproducible & configurable builds.

With a pure Biocypher approach, you would have to write a whole adapter by hand, with OntoWeaver, you just have to express a mapping in YAML, looking like:

row: # The meaning of an entry in the input table.
   map:
      column: <column name in your CSV>
      to_subject: <ontology node type to use for representing a row>

transformers: # How to map cells to nodes and edges.
    - map: # Map a column to a node.
        column: <column name>
        to_object: <ontology node type to use for representing a column>
        via_relation: <edge type for linking subject and object nodes>
    - map: # Map a column to a property.
        column: <another name>
        to_property: <property name>
        for_object: <type holding the property>

metadata: # Optional properties added to every node and edge.
    - source: "My OntoWeaver adapter"
    - version: "v1.2.3"

Installation and quick setup guide

Python Module

The project is written in Python and uses Poetry. You can install the necessary dependencies in a virtual environment like this:

git clone https://github.com/oncodash/ontoweaver.git
cd ontoweaver
poetry install

Poetry will create a virtual environment according to your configuration (either centrally or in the project folder). You can activate it by running poetry shell inside the project directory.

Output Database

Theoretically, the graph can be imported in any [graph] database supported by BioCypher (Neo4j, ArangoDB, CSV, RDF, PostgreSQL, SQLite, NetworkX, … see BioCypher's documentation).

Graph visualization with Neo4j

Neo4j is a popular graph database management system that offers a flexible and efficient way to store, query, and manipulate complex, interconnected data. Cypher is the query language used to interact with Neo4j databases. In order to visualize graphs extracted from databases using OntoWeaver and BioCypher, you can download the Neo4j Graph Database Self-Managed for your operating system. It has been extensively tested with the Community edition.

To create a global variable to Neo4j, add the path to neo4j-admin to your PATH and PYTHONPATH. In order to use the Neo4j browser, you will need to install the correct Java version, depending on the Neo4j version you are using, and add the path to JAVA_HOME. OntoWeaver and BioCypher support versions 4 and 5 of Neo4j.

To run Neo4j (version 5+), use the command neo4j-admin server start after importing your results via the neo4j import sequence provided in the ./biocypher-out/ directory. Use neo4j-admin server stop to disconnect the local server.

Tests

Tests are located in the tests/ subdirectory and may be a good starting point to see OntoWeaver in practice. You may start with tests/test_simplest.py which shows the simplest example of mapping tabular data through BioCypher.

To run tests, use pytest:

poetry run pytest

or, alternatively:

poetry shell
pytest

Usage

OntoWeaver actually automatically provides a working adapter for BioCypher, without you having to do it.

The output of the execution of the adapter is thus what BioCypher is providing (see BioCypher's documentation). In a nutshell, the output is a script file that, when executed, will populate the configured database. By default, the output script file is saved in a subdirectory of ./biocypher-out/, which name is a timestamp from when the adapter has been executed.

To configure your data mapping, you will have to first define the mapping that you want to apply on your data. Then, you will need a BioCypher configuration file (which mainly indiciate your ontologoy and backend), and a schema configuration file (indicating which node and edge types you want).

To actually do something, you need to run OntoWeaver mapping onto your data. We provide a command line interface to do so, called ontoweave.

If you use some default config file (usually biocypher_config.yaml) and schema (usually schema_config.yaml), the simplest call would be:

ontoweave my_data.csv:my_mapping.yaml

If you want to indicate your own configuration files, pass their name as options:

ontoweave --config biocypher_config.yaml --schema schema_config.yaml data-1.1.csv:map-1.yaml data-1.2.csv:map-1.yaml data-A.csv:map-A.yaml

note that you can use the same mapping on several data files, and/or several mappings.

To actually insert data in an SKG database, you need to run the import files that are prepared by the previous command. Either you ask ontoweave to run it for you:

ontoweave my_data.csv:my_mapping.yaml --import-script-run

or you can capture the import script path and run it yourself:

script=$(ontoweave my_data.csv:my_mapping.yaml) # Capture.
$script # Run.

You will find more options by running the help command:

ontoweave --help

Mapping API

OntoWeaver essentially creates a Biocypher adapter from the description of a mapping from a table to ontology types. As such, its core input is a dictionary, that takes the form of a YAML file. This configuration file indicates:

  • to which (node) type to map each line of the table,
  • to which (node) type to map columns of the table,
  • with which (edge) types to map relationships between nodes.

The following explanations assume that you are familiar with Biocypher's configuration, notably how it handles ontology alignment with schema configuration.

Common Mapping

The minimal configuration would be to map lines and one column, linked with a single-edge type.

For example, if you have the following CSV table of phenotypes/patients:

phenotype,patient
0,A
1,B

and if you target the Biolink ontology, using a schema configuration (i.e. subset of types), defined in your shcema_config.yaml file, as below:

phenotypic feature:
    represented_as: node
    label_in_input: phenotype
case:
    represented_as: node
    label_in_input: case
case to phenotypic feature association:
    represented_as: edge
    label_in_input: case_to_phenotype
    source: phenotypic feature
    target: case

you may write the following mapping:

row:
   rowIndex:
      # No column is indicated, but OntoWeaver will map the indice of the row to the node name.
      to_subject: phenotype
transformers:
    - map:
        column: patient # Name of the column in the table.
        to_object: case # Node type to export to (most probably the same as in the ontology).
        via_relation: case_to_phenotype # Edge type to export to.

This configuration will end in creating a node for each phenotype, a node for each patient, and an edge for each phenotype-patient pair:

          case to phenotypic
          feature association
                    ↓
           ╭───────────────────╮
           │              ╔════╪════╗
           │              ║pati│ent ║
           │              ╠════╪════╣
╭──────────┴──────────╮   ║╭───┴───╮║
│phenotypic feature: 0│   ║│case: A│║
╰─────────────────────╯   ║╰───────╯║
                          ╠═════════╣
╭─────────────────────╮   ║╭───────╮║
│          1          │   ║│   B   │║
╰──────────┬──────────╯   ║╰───┬───╯║
           │              ╚════╪════╝
           ╰───────────────────╯

Available Transformers

If you want to transform a data cell before exporting it as one or several nodes, you will use other transformers than the "map" one.

map

The map transformer simply extracts the value of the cell defined, and is the most common way of mapping cell values.

For example:

    - map:
        column: patient
        to_object: case

Although the examples usually define a mapping of cell values to nodes, the transformers can also used to map cell values to properties of nodes and edges. For example:

    - map:
        column: version
        to_property: version
        for_objects:
            - patient # Node type.
            - variant
            - patient_has_variant # Edge type.

split

The split transformer separates a string on a separator, into several items, and then inserts a node for each element of the list.

For example, if you have a list of treatments separated by a semicolon, you may write:

row:
   map:
      to_subject: phenotype
transformers:
    - map:
        column: variant
        to_object: variant
        via_relation: phenotype_to_variant
    - split:
        column: treatments
        from_subject: variant
        to_object: drug
        via_relation: variant_to_drug
        separator: ";"
     phenotype to variant      variant to drug
             ↓                       ↓
       ╭───────────────╮   ╭────────────────╮
       │         ╔═════╪═══╪═╦══════════════╪═════╗
       │         ║ vari│ant│ ║  treatments  │     ║
       │         ╠═════╪═══╪═╬══════════════╪═════╣
       │         ║     │   │ ║variant       │     ║
       │         ║     │   │ ║to drug       │     ║
╭──────┴─────╮   ║╭────┴───┴╮║  ↓    ╭──╮ ╭─┴────╮║
│phenotype: 0│   ║│variant:A├╫───────┤ X│;│drug:Y│║
╰────────────╯   ║╰─────────╯║       ╰┬─╯ ╰──────╯║
                 ╠═══════════╬════════╪═══════════╣
╭────────────╮   ║╭─────────╮║       ╭│ ╮ ╭──╮    ║
│      1     │   ║│    B    ├╫────────╯X ;│ Z│    ║
╰──────┬─────╯   ║╰────┬───┬╯║       ╰  ╯ ╰─┬╯    ║
       │         ╚═════╪═══╪═╩══════════════╪═════╝
       ╰───────────────╯   ╰────────────────╯

cat

The cat transformer concatenates the values cells of the defined columns and then inserts a single node. For example, the mapping below would result in the concatenation of cell values from the columns variant_id, and disease, to the node type variant. The values are concatenated in the order written in the columns section.

row:
   cat:
      columns: # List of columns whose cell values are to be concatenated
        - variant_id
        - disease
      to_subject: variant # The ontology type to map to

cat_format

The user can also define the order and format of concatenation by creating a format_string field, which defines the format of the concatenation. For example:

row:
   cat_format:
      columns: # List of columns whose cell values are to be concatenated
        - variant_id
        - disease
      to_subject: variant # The ontology type to map to
      # Enclose column names in brackets where you want their content to be:
      format_string: "{disease}_____{variant_id}"

string

The string transformer allows mapping the same pre-defined static string to properties of some nodes or edge types.

It only needs the string value, and then a regular property mapping:

    - string:
        value: "This may be useful"
        to_property: comment
        for_objects:
            - patient
            - variant

translate

The translate transformer changes the targeted cell value from the one contained in the input table to another one, as configured through (another) mapping, extracted from (another) table.

This is useful to reconciliate two sources of data using two different references for the identifiers of the same object. The translate transformer helps you translate one of the identifiers to the other reference, so that the resulting graph only uses one reference, and there is no duplicated information at the end.

For instance, let's say that you have two input tables providing information about the same gene, but one is using the HGCN names, and the other the Ensembl gene IDs:

Name Source
BRCA2 PMID:11207365
Gene Organism
ENSG00000139618 Mus musculus

Then, to map a gene from the second table (the one using Ensembl), you would do:

    - translate:
        column: Gene
        to_object: gene
        translations:
            ENSG00000139618: BRCA2

Of course, there could be hundreds of thousands of translations to declare, and you don't want to declare them by hand in the mapping file. Fortunately, you have access to another table in a CSV file, showing which one corresponds to the other:

Ensembl HGCN Status
ENSG00000139618 BRCA2 Approved

Then, to declare a translation using this table, you would do:

    - translate:
        column: Gene
        to_object: gene
        translations_file: <myfile.csv>
        translate_from: Ensembl
        translate_to: HGCN

To load the translation file, OntoWeaver uses Pandas' read_csv function. You may pass additional string arguments in the mapping section, they will be passed directly as read_csv arguments. For example:

    - translate:
        column: Gene
        to_object: gene
        translations_file: <myfile.csv.zip>
        translate_from: Ensembl
        translate_to: HGCN
        sep: ;
        compression: zip
        decimal: ,
        encoding: latin-1

replace

The replace transformer allows the removal of forbidden characters from the values extracted from cells of the data frame. The pattern matching the characters that are forbidden characters should be passed to the transformer as a regular expression. For example:

    - replace:
        columns:
            - treatment
        to_object: drug
        via_relation: alteration_biomarker_for_drug
        forbidden: '[^0-9]' # Pattern matching all characters that are not numeric. 
        # Therefore, you only allow numeric characters. 
        substitute: "_" # Substitute all removed characters with an underscore, in case they are  
        # located inbetween allowed_characters.

Here we define that the transformer should only allow numeric characters in the values extracted from the treatment column. All other characters will be removed and substituted with an underscore, in case they are located inbetween allowed characters.

By default, the transformer will allow alphanumeric characters (A-Z, a-z, 0-9), underscore (_), backtick (`), dot (.), and parentheses (), and the substitute will be an empty string. If you wish to use the default settings, you can write:

    - replace:
        columns:
            - treatment
        to_object: drug
        via_relation: alteration_biomarker_for_drug

Let's assume we want to map a table consisting of contact IDs and phone numbers.

id phone_number
Jennifer 01/23-45-67

We want to map the id column to the node type id and the phone_number column to the node type phone_number, but we want to remove all characters that are not numeric, using the default substitute (""), meaning the forbidden characters will only be removed, and not replaced by another character. The mapping would look like this:

row:
    map:
        column: id
        to_subject: id
transformers:
    - replace:
        column: phone_number
        to_object: phone_number
        via_relation: phone_number_of_person
        forbidden: '[^0-9]'

The result of this mapping would be a node of type phone_number, with the id of the node being 01234567, connected to a node of type id with the id Jennifer, via an edge of type phone_number_of_person.

User-defined Transformers

It is easy to create your own transformer, if you want to operate complex data transformations, but still have them referenced in the mapping.

This may even be a good idea if you do some pre-processing on the input table, as it keeps it obvious for anyone able to read the mapping (while it may be difficult to read the pre-processing code itself).

A user-defined transformer takes the form of a Python class inheriting from ontoweaver.base.Transformer:

class my_transformer(ontoweaver.base.Transformer):

    # The constructor is called when parsing the YAML mapping.
    def __init__(self, target, properties_of, edge=None, columns=None, **kwargs):

        # All the arguments passed to the super class are available as member variables.
        super().__init__(target, properties_of, edge, columns, **kwargs)

        # If you want user-defined parameters, you may get them from
        # the corresponding member variables (e.g. `self.my_param`).
        # However, if you want to have a default value if they are not declared
        # by the user in the mapping, you have to get them from kwargs:
        self.my_param = kwargs.get("my_param", None) # Defaults to None.

    # The call interface is called when processing a row.
    def __call__(self, row, index):

        # You should take care of your parameters:
        if not self.my_param:
            raise ValueError("You forgot the `my_param` keyword")

        # The columns declared by the user (with the "column(s)" keyword)
        # are available as a member variable:
        for col in self.columns:
            # Some methods of base.Transformer may be useful, like `valid`
            # which checks whether a cell value is something useful.
            if self.valid(row[col]):
                result = row[col]
                # […] Do something of your own with row[col] […]
                # You are finally required to yield a string:
                yield str(result)

Once your transformer class is implemented, you should make it available to the ontoweaver module which will process the mapping:

ontoweaver.transformer.register(my_transformer)

You can have a look at the transformers provided by OntoWeaver to get inspiration for your own implementation: ontoweaver/src/ontoweaver/transformer.py

Keywords Synonyms

Because several communities gathered around semantic knowledge graphs, several terms can be used (more or less) interchangeably.

OntoWeaver thus allows you to use your favorite vocabulary to write down the mapping configurations.

Here is the list of available synonyms:

  • subject = row = entry = line = source
  • column = columns = fields
  • to_object = to_target = to_node
  • from_subject = from_source
  • via_relation = via_edge = via_predicate
  • to_property = to_properties
  • for_object = for_objects

How To

How to Add Properties to Nodes and Edges

If you do not need to create a new node, but simply attach some data to an existing node, use the to_property predicate, for example:

row:
   map:
      to_subject: phenotype
transformers:
    - map:
        column: patient
        to_object: case
        via_relation: case_to_phenotype
    - map:
        column: age
        to_property: patient_age
        for_object: case

This will add a "patient_age" property to nodes of type "case".

Note that you can add the same property value to several property fields of several node types:

    - map:
        column: age
        to_properties:
            - patient_age
            - age_patient
        for_object:
            - case
            - phenotype

How to Extract Additional Edges

Edges can be extracted from the mapping configuration, by defining a from_subject and to_object in the mapping configuration, where the from_subject is the node type from which the edge will start, and the to_object is the node type to which the edge will end.

For example, consider the following mapping configuration for the sample dataset below:

id	patient	        sample
0	patient1	sample1
1	patient2	sample2
2	patient3	sample3
3	patient4	sample4
row:
    map:
        column: id
        to_subject: variant
transformers:
    - map:
          column: patient
          to_object: patient
          via_relation: patient_has_variant
    - map:
          column: sample
          to_object: sample
          via_relation: variant_in_sample

If the user would like to extract an additional edge from the node type patient to the node type sample, they would need to add the following section to the transformers in the mapping configuration:

    - map:
        column: patient
        from_subject: sample
        to_object: patient
        via_relation: sample_to_patient

How to add the same metadata properties to all nodes and edges

Metadata can be added to nodes and edges by defining a metadata section in the mapping configuration. You can specify all the property keys and values that you wish to add to your nodes and edges in a metadata section. For example:

metadata:
        - name: oncokb
        - url: https://oncokb.org/
        - license: CC BY-NC 4.0
        - version: 0.1

The metadata defined in the metadata section will be added to all nodes and edges created during the mapping process.

How to add the column of origin as a property to all nodes

In addition to the user-defined metadata, a property field add_source_column_names_as is also available. It allows to indicate the column name in which the data was found, as a property to each node. Note that this is not added to edges, because they are not mapped from a column per se.

For example, if the label of a node is extracted from the "indication" column, and you indicate add_source_column_name_as: source_column, the node will have a property: source_column: indication.

This can be added to the metadata section as follows:

metadata:
        - name: oncokb
        - url: https://oncokb.org/
        - license: CC BY-NC 4.0
        - version: 0.1
        - add_source_column_names_as: sources

Now each of the nodes contains a property sources that contains the names of the source columns from which it was extracted. Be sure to include all the added node properties in the schema configuration file, to ensure that the properties are correctly added to the nodes.

How to create user-defined adapters

You may manually define your own adapter class, inheriting from the OntoWeaver's class that manages tabular mappings.

For example:

class MYADAPTER(ontoweaver.tabular.PandasAdapter):

    def __init__(self,
        df: pd.DataFrame,
        config: dict,
        type_affix: Optional[ontoweaver.tabular.TypeAffixes] = ontoweaver.tabular.TypeAffixes.prefix,
        type_affix_sep: Optional[str] = "//",
    ):
        # Default mapping as a simple config.
        from . import types
        parser = ontoweaver.tabular.YamlParser(config, types)
        mapping = parser()

        super().__init__(
            df,
            *mapping,
        )

When manually defining adapter classes, be sure to define the affix type and separator you wish to use in the mapping. Unless otherwise defined, the affix type defaults to suffix, and the separator defaults to :. In the example above, the affix type is defined as prefix and the separator is defined as //. If you wish to define affix as none, you should use type_affix: Optional[ontoweaver.tabular.TypeAffixes] = ontoweaver.tabular.TypeAffixes.none, and if you wish to define affix type as suffix, use type_affix: Optional[ontoweaver.tabular.TypeAffixes] = ontoweaver.tabular.TypeAffixes.suffix.

How to access dynamic Node and Edge Types

OntoWeaver relies a lot on meta-programming, as it actually creates Python types while parsing the mapping configuration. By default, those classes are dynamically created into the ontoweaver.types module.

You may manually define your own types, derivating from ontoweaver.base.Node or ontoweaver.base.Edge.

The ontoweaver.types module automatically gathers the list of available types in the ontoweaver.types.all submodule. This allows accessing the list of node and edge types:

node_types  = types.all.nodes()
edge_types  = types.all.edges()

Parallel Processing

OntoWeaver provides a way to parallelize the extraction of nodes and edges from the provided database, with the aim of reducing the runtime of the extraction process. By default, the parallel processing is disabled, and the data frame is processed in a sequential manner. To enable parallel processing, the user can pass the maximum number of workers to the extract_all function.

For example, to enable parallel processing with 16 workers, the user can call the function as follows:

adapter = ontoweaver.tabular.extract_all(table, mapping, parallel_mapping = 16)

To enable parallel processing with a good default working on any machine, you can use the approach suggested by the concurrent module.

import os
adapter = ontoweaver.tabular.extract_all(table, mapping, parallel_mapping = min(32, (os.process_cpu_count() or 1) + 4))

Information Fusion

When integrating several sources of information to create your own SKG, you will inevitably face a case where two sources provide different information for the same object. If you process each source with a separate mapping applied to separate input tables, then each will provide the same node, albeit with different properties.

This is an issue, as BioCypher does not provide a way to fuse both nodes in a single one, while keeping all the properties. As of version 0.5, it will use the last seen node, and discard the first one(s), effectively loosing information (albeit with a warning). With a raw Biocypher adapter, the only way to solve this problem is to implement a single adapter, which reconciliate the data before producing nodes, which makes the task difficult and the adapter code even harder to understand.

Reconciliation

OntoWeaver provides a way to solve the reconciliation problem with its high-level information fusion feature. The fusion features allow to reconciliate the nodes and edges produced by various independent adapters, by adding a final step on the aggregated list of nodes and edges.

The generic workflow is to first produce nodes and edges —as usual— then call the fusion.reconciliate function on the produced nodes and edges:

# Call the mappings:
adapter_A = ontoweaver.tabular.extract_all(input_table_A, mapping_A)
adapter_B = ontoweaver.tabular.extract_all(input_table_B, mapping_B)

# Aggregate the nodes and edges:
nodes = adapter_A.nodes + adapter_B.nodes
edges = adapter_A.edges + adapter_B.edges

# Reconciliate:
fused_nodes, fused_edges = ontoweaver.fusion.reconciliate(nodes, edges, separator=";")

# Then you can pass those to biocypher.write_nodes and biocypher.write_edges...

High-level Interface

OntoWeaver provides the fusion.reconciliate function, that implements a sane default reconciliation of nodes. It merges nodes having the same identifier and the same type, taking care of not losing any property. When nodes have the same property field showing different values, it aggregates the values in a list.

This means that if the two following nodes come from two different sources:

# From source A:
("id_1", "type_A", {"prop_1": "x"}),
("id_1", "type_A", {"prop_2": "y"}),

# From source B:
("id_1", "type_A", {"prop_1": "z"})

Then, the result of the reconciliation step above would be:

# Note how "x" and "z" are separated by separator=";".
("id_1", "type_A", {"prop_1": "x;z", "prop_2": "y"})

Mid-level Interfaces

The simplest approach to fusion is to define how to:

  1. decide that two nodes are identical,
  2. fuse two identifiers,
  3. fuse two type labels, and
  4. fuse two properties dictionaries, and then
  5. let OntoWeaver browse the nodes by pairs, until everything is fused.

For step 1, OntoWeaver provides the serialize module, which allows to extract the part of a node or an edge) that should be used when checking equality.

A node being composed of an identifier, a type label, and a properties dictionary, the serialize module provides function objects reflecting the useful combinations of those components:

  • ID (only the identifier)
  • IDLabel (the identifier and the type label)
  • All (the identifier, the type label, and the properties)

The user can instantiate those function objects, and pass them to the congregate module, to find which nodes are duplicates of each other. For example:

on_ID = serialize.ID() # Instantiation.
congregater = congregate.Nodes(on_ID) # Instantiation.
congregater(my_nodes) # Actual processing call.
# congregarter now holds a dictionary of duplicated nodes.

For steps 2 to 4, OntoWeaver provides the merge module, which provides ways to merge two nodes' components into a single one. It is separated into two submodules, depending on the type of the component:

  • string for components that are strings (i.e. identifier and type label),
  • dictry for components that are dictionaries (i.e. properties).

The string submodule provides:

  • UseKey: replace the identifier with the serialization used at the congregation step,
  • UseFirst/UseLast: replace the type label with the first/last one seen,
  • EnsureIdentical: if two nodes' components are not equal, raise an error,
  • OrderedSet: aggregate all the components of all the seen nodes into a single, lexicographically ordered list (joined by a user-defined separator).

The dictry submodule provides:

  • Append: merge all seen dictionaries in a single one, and aggregate all the values of all the duplicated fields into a single lexicographically ordered list (joined by a user-defined separator).

For example, to fuse "congregated" nodes, one can do:

    # How to merge two components:
    use_key    = merge.string.UseKey() # Instantiation.
    identicals = merge.string.EnsureIdentical()
    in_lists   = merge.dictry.Append(separator)

    # Assemble those function objects in an object that knows
    # how to apply them members by members:
    fuser = fuse.Members(base.Node,
            merge_ID    = use_key,    # How to merge two identifiers.
            merge_label = identicals, # How to merge two type labels.
            merge_prop  = in_lists,   # How to merge two properties dictionaries.
        )

    # Apply a "reduce" step (browsing pairs of nodes, until exhaustion):
    fusioner = Reduce(fuser) # Instantiation.
    fusioned_nodes = fusioner(congregater) # Call on the previously found duplicates.

Once this fusion step is done, is it possible that the edges that were defined by the initial adapters refer to node IDs that do not exist anymore. Fortunately, the fuser keeps track of which ID was replaced by which one. And this can be used to remap the edges' target and source identifiers:

remaped_edges = remap_edges(edges, fuser.ID_mapping)

Finally, the same fusion step can be done on the resulting edges (some of which are now duplicates, because they were remapped):

    # Find duplicates:
    on_STL = serialize.edge.SourceTargetLabel()
    edges_congregater = congregate.Edges(on_STL)
    edges_congregater(edges)

    # How to fuse them:
    set_of_ID       = merge.string.OrderedSet(separator)
    identicals      = merge.string.EnsureIdentical()
    in_lists        = merge.dictry.Append(separator)
    use_last_source = merge.string.UseLast()
    use_last_target = merge.string.UseLast()
    edge_fuser = fuse.Members(base.GenericEdge,
            merge_ID     = set_of_ID,
            merge_label  = identicals,
            merge_prop   = in_lists,
            merge_source = use_last_source,
            merge_target = use_last_target
        )

    # Fuse them:
    edges_fusioner = Reduce(edge_fuser)
    fusioned_edges = edges_fusioner(edges_congregater)

Because all those steps are performed onto OntoWeaver's internal classes, they need to be converted back to Biocypher's tuples:

    return [n.as_tuple() for n in fusioned_nodes], [e.as_tuple() for e in fusioned_edges]

Low-level Interfaces

Each of the steps mentioned in the previous section involves a functor class that implements a step of the fusion process. Users may provide their own implementation of those interfaces, and make them interact with the others.

The first function interface is the congregate.Congregater, whose role is to consume a list of Biocypher tuples, find duplicated elements, and store them in a dictionary mapping a key to a list of elements.

This allows to implementation of a de-duplication algorithm with a time complexity in O(n·log n).

A Congregater is instantiated with a serialize.Serializer, indicating which part of an element is to be considered when checking for equality.

The highest level fusion interface is fusion.Fusioner, whose role is to process a congregate.Congregater and return a set of fusioned nodes.

OntoWeaver provides fusion.Reduce as an implementation of Fusioner, which itself relies on an interface whose role is to fuse two elements: fuse.Fuser.

OntoWeaver provides a fuse.Members as an implementation, which itself relies on merge.Merger, which role is to fuse two elements' components.

So, from the lower to the higher level, the following three interfaces can be implemented:

  1. merge.Merger —(used by)→ fuse.Members —(used by)→ fusion.Reduce
  2. fuse.Fuser —(used by)→ fusion.Reduce
  3. fusion.Fusioner

For instance, if you need a different way to merge elements components, you should implement your own merge.Merger and use it when instantiating fuse.Members.

If you need a different way to fuse two elements (for instance for deciding their type based on their properties), implement a fuse.Fuser and use it when instantiating a fusion.Reduce.

If you need to decide how to fuse whole sets of duplicated nodes (for instance if you need to know all duplicated nodes before deciding which type to set), implement a fusion.Fusioner directly.