diff --git a/docs/figures/.DS_Store b/docs/figures/.DS_Store deleted file mode 100644 index 5008ddf..0000000 Binary files a/docs/figures/.DS_Store and /dev/null differ diff --git a/docs/figures/incremental-1.png b/docs/figures/incremental-1.png deleted file mode 100644 index 3b42ca4..0000000 Binary files a/docs/figures/incremental-1.png and /dev/null differ diff --git a/docs/figures/incremental-2.png b/docs/figures/incremental-2.png deleted file mode 100644 index 05cb0ba..0000000 Binary files a/docs/figures/incremental-2.png and /dev/null differ diff --git a/docs/figures/incremental-3.png b/docs/figures/incremental-3.png deleted file mode 100644 index a1d7387..0000000 Binary files a/docs/figures/incremental-3.png and /dev/null differ diff --git a/docs/incremental.md b/docs/incremental.md deleted file mode 100644 index 1a63a35..0000000 --- a/docs/incremental.md +++ /dev/null @@ -1,174 +0,0 @@ -## Incremental network construction using `hivnetworkcsv` - -> Sergei L Kosakovsky Pond (spond@temple.edu) - -> 2020-01-09 v.1 - -For cases when a transmission network is built periodically from surveillance data, with a subsequent network (network `T`, for today) being (for the most part, see below) an expansion of a previous network (network `Y`, for yesterday), `hivnetworkcsv` provides a **previous network** mode, which accomplishes the following, **assuming the `-j` (JSON output) option is selected** - -1. Clusters in network `T` are matched, whenever practical, with clusters in network `Y`, so that _consistent_ cluster naming is maintained. -2. Clusters in network `T` are **annotated** to explain how they compare to clusters in network `Y`. Annotation options for cluster `c` in network `T` are (see examples below). The annotation information is stored in the output JSON dictionary under the key `Cluster description` - 1. `existing` : cluster `c` is identical to a cluster in network `Y` 2. 
`expanded` : cluster `c` expands (adds nodes to) a cluster in network `Y`; this also includes the case of a contraction if some nodes were deleted from network `Y`. 3. `merged` : cluster `c` merges two or more clusters from network `Y`, and possibly adds/removes nodes from them; example - 4. `new`: cluster `c` consists of nodes that did not appear in clusters of network `Y`; this also includes the case when a cluster in network `Y` split into two or more smaller clusters in network `T`. -3. Nodes in network `T` are **annotated** to explain how they relate to nodes in network `Y` (note that these attributes are **not** mutually exclusive) - 1. `new_node` : this node does not appear in network `Y` - 2. `new_cluster` : this node belongs to a cluster in network `T` which has no match in network `Y` - 3. `moved_clusters`: this node was present in network `Y`, is present in network `T`, but moved to a different cluster - 4. *no attribute*: this node was present in network `Y`, is present in network `T`, and remains in the matched cluster -4. Edges in network `T` are **annotated** to explain how they relate to edges in network `Y` - 1. `added-to-prior`: an edge in network `T` was not present in network `Y`. - -### Example - -The `Y` (original) network is created by calling - -``` -hivnetworkcsv -t 0.015 -f plain -j -i tests/incremental/original.csv \ --O tests/incremental/original.json -``` - -It consists of **5** clusters, with node names reflecting which cluster they belong to (e.g. `A1` belongs to cluster `1`) - -**Figure 1** The existing network. - -![Network `Y`](figures/incremental-1.png) - -The `T` network is created in the incremental mode by calling 
- -``` -hivnetworkcsv -t 0.015 -f plain -j -i tests/incremental/modified.csv \ --P tests/incremental/original.json -O tests/incremental/modified.json -``` - -> This example adds nodes to **and** removes nodes from the existing network - - -The following diagnostic messages are written to `stderr` - -``` -[WARNING] Removed 2 nodes from the previous network, E2, A1 -Cluster 1 matches previous cluster 1 -Cluster 2 matches previous cluster 1 -[WARNING] Cluster 1 from the existing network is mapped to multiple new clusters / singletons (nodes E1, F1) -Cluster 3 MERGES 3, 4, None -Cluster 4 matches previous cluster 2 -[WARNING] Incostistent singletons, nodes previously clustered became unclustered ({4}) -Cluster 5 matches previous cluster 5 -Cluster 6 is a new cluster -Added 4 edges compared to the prior network -``` - -**Figure 2** The updated network. - -![Network `T`](figures/incremental-2.png) - -**Figure 3** The changes from `Y` to `T`. - -![Network `T`](figures/incremental-3.png) - -1. Cluster `1` split into two clusters because node `A1` was deleted. The larger remaining cluster (3 nodes) inherited the cluster name (`1`), and the second remaining cluster (2 nodes, E1 and F1) was given a new name (`7`) and the nodes in that cluster were labeled as `moved`. In the output JSON this is annotated with 2. Cluster `2` lost one node (E2) due to deletion -3. Cluster `3` added a node (`E3`) and merged with cluster `4` through a link between the new node in cluster `4` (E4), and cluster `3`. Cluster `3` also lost a node because the link between `B4` and `D4` was removed from network `T` (e.g., because the sequence for `D3` was revised). The resulting merged cluster inherited the ID of cluster `3`; because the existing clusters were of the same size and had to date information, the cluster with the lexicographically lowest node name (`A3`) inherited the name. Note that cluster ID `4` has been retired, and will not appear in subsequent networks. -4. Cluster `5` is unchanged. -5. 
Cluster `6` is new, i.e., none of its nodes were clustered in network `Y`. - - -### Additional information in JSON - -Relevant sections of the output JSON are shown below - -``` -"Cluster description": { - "1": { - "type": "extended", - "size": 3, - "old_size": 6 - }, - "2": { - "type": "extended", - "size": 4, - "old_size": 5 - }, - "3": { - "type": "merged", - "size": 9, - "old_clusters": [ - 3, - 4 - ], - "sizes": [ - 4, - 4 - ], - "old_size": 8, - "new_nodes": 2 - }, - "5": { - "type": "existing" - }, - "6": { - "type": "new", - "size": 2, - "new_nodes": 2 - }, - "7": { - "type": "new", - "size": 2, - "moved": 2 - } - }, -... - -"Nodes" : [ -... - { - "id": "A6", - "cluster": 6, - "attributes": [ - "new_node", - "new_cluster" - ] - }, -... - { - "id": "F1", - "cluster": 7, - "attributes": [ - "moved_clusters", - "new_cluster" - ] - }, -... - - { - "id": "E4", - "cluster": 3, - "attributes": [ - "new_node" - ] - }, -... -], - -"Edges" : [ - ... - { - "source": 7, - "target": 8, - "directed": false, - "length": 0.01, - "support": 0, - "removed": false, - "sequences": [ - "D3", - "E3" - ], - "attributes": [ - "BULK", - "added-to-prior" - ] - } - ... -] - -``` - diff --git a/docs/subclusters.md b/docs/subclusters.md deleted file mode 100644 index 442a44b..0000000 --- a/docs/subclusters.md +++ /dev/null @@ -1,149 +0,0 @@ -## Subcluster construction using `hivnetworkcsv` - -> Sergei L Kosakovsky Pond (spond@temple.edu) - -> 2020-01-09 v.1 - -Starting with version `0.6`, `hivnetworkcsv` allows the specification of multiple, comma-separated distance thresholds via the `-t` argument. This functionality permits inference of **nested** networks at different distance thresholds, i.e., the computation of subclusters of more "closely" linked nodes within clusters defined at a more permissive distance threshold. - -> ###This feature only affects runs where the JSON output (`-j` option) is specified. 
- -To illustrate this feature, we will use the dataset `tests/subclusters/pirc.csv` which is a distance file constructed from `tests/subclusters/pirc.msa` -- sequences from [PMID 24901437](https://www.ncbi.nlm.nih.gov/pubmed/24901437). - -Build the network, extracting date information from sequence names, and limiting the network to sequences from up to year 2009 - -``` -hivnetworkcsv -j -i tests/subclusters/pirc.csv -t 0.02,0.01,0.005 -f regexp \ --p 0 "^([^|]+)\|([0-9]+)" -p 0 "^([^|]+)\|([0-9]+)" --before 20090101 \ --O tests/subclusters/pirc-2009.json -``` - -`stderr` output reports high-level diagnostics. - -``` -At threshold 0.01 there were 77 subclusters -At threshold 0.005 there were 54 subclusters -``` - - -This command builds the transmission network using the maximal threshold specified in the `-t` argument (0.02), and then creates further annotation of networks at 0.01 and 0.005 that are contained in the larger network. By construction, using the same input data, the network at a **lower** distance threshold will be a subset of the network inferred at a higher threshold. - -Additional annotation in the JSON file takes the following form. - -1. A record in the `Settings` object
-"Settings": { - "threshold": 0.02, - "edge-filtering": null, - "contaminants": null, - "compact_json": false, - "created": "2020-01-09T22:10:33.833333+00:00", - "additional thresholds": [ - 0.01, - 0.005 - ] - } --2. A dictionary `Subclusters` which, for each threshold, reports clusters at that threshold. A cluster record is of the form
-"ID" : [node1 index, node2 index, ... nodek index] -where `node_index` entries index into the `Node` array, also a part of the JSON output. For example,
-"Subclusters": { - "0.01": { - "2.1": [ - 24, - 25, - 26, - 27, - 28, - 29, - 30, - 31, - 32, - 33, - 34, - 35, - 36, - 37, - 38, - 39, - 40, - 41, - 42 - ], - "19.1": [ - 135, - 137 - ], - "3.1": [ - 43, - 44, - 45, - 46, - 47, - 48, - 49, - 50 - ], - ... -- 3. Individual node records in the `Node` array which belong to subclusters, are annotated as follows
- { - "id": "KJ723207", - "cluster": 9, - "subcluster": [ - [ - 0.01, - "9.2" - ], - [ - 0.005, - "9.2" - ] - ] - ... - } -This node, in addition to belonging to cluster `9` at `t=0.02`, belongs to cluster `9.2` (a part of the larger cluster `9`) at both of the smaller thresholds.
-{ - "id": "KJ723391", - "cluster": 14, - "baseline": [ - 2002, - 1, - 1, - 0, - 0, - 0, - 1, - 1, - -1 - ], - "subcluster": [ - [ - 0.01, - "14.1" - ] - ] - } -This node belongs to cluster `14` at `t=0.02`, cluster `14.1` at `t=0.01`, and no cluster at `t=0.005` - -## Combining subcluster construction with incremental network building. - -(See `incremental.md` for background on incremental construction). - -When the `-P` option is provided with `-t` and multiple thresholds, `hivnetworkcsv` will attempt to map subclusters from the current network to the subclusters from the previous network. The rule for matching subclusters is as follows. - -If the current network has a subcluster, `C`, at distance threshold, `t`, **and** the previous network has a subcluster `Cp` at the same distance threshold, then `C` will be matched with `Cp` *if and only if* - -0. The subclusters each belong to the respective matched clusters. -1. The intersection of `C` and `Cp` is non-empty (i.e. **some** of the nodes from `C` are in `Cp`). -2. The intersection of `C` with any other previous network subcluster from the same cluster is empty, i.e. `C` does not overlap with any other existing subclusters. -3. `Cp` does not overlap with any current network subclusters other than `C` - -For example, - -``` -hivnetworkcsv -i tests/subclusters/pirc.csv -t 0.02,0.01,0.005 \ --f regexp -p 0 "^([^|]+)\|([0-9]+)" -p 0 "^([^|]+)\|([0-9]+)" \ --P tests/subclusters/pirc-2009.json -j -O tests/subclusters/pirc.json -``` - -
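The subcluster matching rule stated above (conditions 1 to 3) can be sketched in Python as follows. This is an illustrative sketch only, not the code used by `hivnetworkcsv`: the function name `match_subclusters`, the dictionary layout, and the subcluster and node labels in the usage example are all hypothetical. Condition 0 is assumed to hold already, i.e. both inputs describe subclusters of one pair of matched clusters at a single distance threshold.

```python
from typing import Dict, Set


def match_subclusters(
    current: Dict[str, Set[str]],
    previous: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Map current subcluster IDs to previous subcluster IDs.

    Both arguments map a (hypothetical) subcluster ID to the set of node
    names it contains, for one pair of matched clusters at one threshold.
    """
    matches: Dict[str, str] = {}
    for cur_id, cur_nodes in current.items():
        # Conditions 1 and 2: C overlaps exactly one previous subcluster.
        overlapping_prev = [
            prev_id
            for prev_id, prev_nodes in previous.items()
            if cur_nodes & prev_nodes
        ]
        if len(overlapping_prev) != 1:
            continue
        prev_id = overlapping_prev[0]
        # Condition 3: Cp overlaps no current subcluster other than C.
        overlapping_cur = [
            other_id
            for other_id, other_nodes in current.items()
            if other_nodes & previous[prev_id]
        ]
        if overlapping_cur == [cur_id]:
            matches[cur_id] = prev_id
    return matches


# Hypothetical data: "x" extends "9.1"; "9.2" split into "y" and "z";
# "w" consists of nodes not seen in any previous subcluster.
previous = {"9.1": {"a", "b"}, "9.2": {"c", "d"}, "9.3": {"g"}}
current = {"x": {"a", "b", "e"}, "y": {"c"}, "z": {"d"}, "w": {"h"}}
result = match_subclusters(current, previous)
print(result)
```

In this example only `x` is matched (to `9.1`). The previous subcluster `9.2` overlaps two current subclusters, so condition 3 fails for both `y` and `z`, and `w` fails condition 1 because it shares no nodes with any previous subcluster.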