# Version 0.2: The big Pydantic migration

This commit inaugurates the transition to sinn v0.2. It is a near complete rewrite of the entire stack, based on the lessons learned from v0.1. The because fundamental changes are: - Integration of `Pydantic <https://pydantic-docs.helpmanual.io>`_ throughout. - Extension of a few, dispersed axis functions into a coherent, full-featured axis & indexing module. - Consistent automatic compilation - Addition of unit tests - A documentation skeleton ## Integration of Pydantic Every core object is now a Pydantic `BaseModel`. This has benefits in terms of specifying and validating intialization inputs, and allows a declarative specification of parameters. Declarative specifications are especially especially natural for `Model`, where composability and extensability are important features. The definition of model components (`History`, `Axis`, `Kernel`) as subclasses of `BaseModel` allows us to make full use of this composability. However, the biggest advantages of using Pydantic are when embedding models within larger workflows. Pydantic objects can be easily (de)serialized, making it simple and reproducible to read/write parameter sets, model components or entire models to disk. Any special packing/unpacking code is within the object itself, allowing standardized codes for passing parameters and results between data processing stages. ## Axis module The axis module was created to address a few issues: 1. Sinn v0.1 defined different incompatible axis concepts, for manipulating time arrays (`History`, simulations) and rescaling, rebinning data (`AxisData`, Bayesian inference, data analysis). 2. The time axis was fully enmeshed in the `History` class. It started as a simple second array, but grew into multiple specialized methods as needs arose. 3. Conversions between time axes of different histories was unintuitive and error-prone. 4. Being able to index a history by value instead of index was really useful: it allows to index different histories consistently without worrying about padding, and is convenient in interactive analysis sessions. However conversion to and from values brings a whole host of numerical rounding issues in simulation code. (In interactive sessions these weren't too important.) The *axis.py* module collects all the functionality for time and data axes into a single consistent class. The more focussed module allowed for more precisely defined behaviour, and lightened the `History` class which had grown too large. The collection of pre-existing functionality represents maybe 20% of the new *axis.py* module. The rest of the 80% adds the important new functionality of `AxisIndex` classes. Axis indices are structured in a hierarchy of types: similar to how Python's :mod:`datetime` defines *time* and *timedelta* objects, we have *axis* and *axisdelta* objects. We also distinguish between numeric and symbolic indices, a requirement for the redesigned Theano compilation algorithm. Finally, each axis defines its own index types ((absolute, delta) x (numeric, symbolic)) – this allows histories to transparently convert indices as required, solving point 3 above. The complex relationships between all axis index types is captured with an abstract class hierarchy, enabling easy checking of index type at any level of granularity (any delta index type vs any index type corresponding to *this* axis). The concepts of “axis index” and “data index” are now clearly separated. For an axis index, there is no wrapping: the position at `-1` is simply more to the left than `0`, provided it exists. This allowed us to avoid the issues with numerical errors, without losing the convenience of value indexing in interactive sessions. ## Consistent automatic compilation Version 0.1 of sinn saw multiple iterations of algorithms for constructing the Theano functions declaratively defined in models. This culminated in the pattern of “anchored computational graphs“ and connected/disconnected histories. In version 0.2, all obsolete compilation code was removed, and the pattern made coherent throughout. This lead to massive simplifications to the code underlying `Model.advance()`, and huge improvements in the reliability of that code. ## Unit tests All new code is now covered by unit tests. This both massively improves reliability, and provides for a basic form of documentation of the API. ## Documentation Documentation is now a an incoherent set of pages written with varying levels of completeness. This is a big improvement over the previous state of no documentation to speak of.
mackelab · Jun 15, 2020 · 2a6c072 · 2a6c072
1 parent a00f850
commit 2a6c072
Show file tree

Hide file tree

Showing 68 changed files with 16,595 additions and 8,194 deletions.
diff --git a/.gitignore b/.gitignore
@@ -14,12 +14,21 @@ debuglog*
 data
 TMPNOTES
 venv
+snippets
 tmp*
 *.log*
 *.cprof
 _build
+_autosummary
+*.coverage
 *.html
 *.cache
 *.compilecache
+*???final?.dat
+*???final?.dir
+  # Theano graph cache files
 activate
   # symbolic link to environment
+*.Rproj*
+.Rprofile
+.RData
diff --git a/README.md b/README.md
@@ -1,35 +1,38 @@
 # Sinn
 
-*sinn* is a library for both *S*imulation and *IN*ference of dynamical systems. It seeks to provide a flexible framework for building complex mathematical models which are fully compatible with machine learning libraries, allowing them to be differentiated through and compiled to C code.
+*sinn* is a library for both *S*imulation and *IN*ference of dynamical systems. It provides a flexible framework for building complex mathematical models which are fully compatible with machine learning libraries, allowing almost arbitrary cost functions to be differentiated all the way back to the model parameters. Both the simulation model and cost function are compiled and run as C code.
 
 ## Motivation
 
-Optimization frameworks like Theano and TensorFlow provide powerful capabilities for fitting models to data. However, they are most tailored to fitting neural networks – implementing the type of dynamic mechanistic models often found in physics or applied mathematics within these frameworks remains error-prone and labour-intensive. Sinn provides a set of high-level constructs designed to integrate well with lower-level optimization libraries.
+Optimization frameworks like TensorFlow and PyTorch provide powerful capabilities for fitting models to data. However, they are most tailored to fitting neural networks, and implementing the type of dynamic mechanistic models often found in physics or applied mathematics within these frameworks remains error-prone and labour-intensive. *sinn* provides a set of high-level constructs designed to bridge the gap between the mathematical language of dynamical systems and the interface of a machine-learning library.
 
 Sinn was originally developed in order to infer a mesoscopic neuron model (René et al., in press, Neural Computation; [arXiv](https://arxiv.org/abs/1910.01618)).
 
+## Documentation
+
+Partial documentation can be found on [Read the Docs](https://sinn.readthedocs.io/en/latest/). This will improve as development continues.
+
 ## Features
 
-- Automatic differentiation and C-compilation provided by Theano.
+- Automatic differentiation and C-compilation provided by Theano, although one [could](#dependence-on-a-machine-learning-library) in theory use other use other machine-learning frameworks.
 
 - Compatible with PyMC3
 
   Make your model probabilistic with a few extra lines of code, for easy implementation of Bayesian inference and Monte Carlo sampling.
 
-- Use optimization library only when desired.
+- Use the optimization library only when desired.
 
   No code change is required to run models with either Numpy or Theano – the single line `shim.load('theano')` suffices to load the optimization library.
-  Since a pure Numpy model does not require compilation every time it is run, allowing you to first develop your model faster with more easily traceable errors, and then benefit from the C-acceleration and automatic differentiation by loading the optimization library.
+  Since a pure Numpy model does not require compilation every time it is run, this allows you to first develop your model faster with more easily traceable errors, and then benefit from the C-acceleration and automatic differentiation by loading the optimization library.
 
 - Data structures which map naturally to the mathematical models
 
   + `Axis`: Unit-aware structure for continuous quantities such as time, space, temperature…
-  + `DataAxes`: combining *n*-dimensional data with *n* axes.
-    A development goal is to allow easier translation to Pandas' analogous `DataFrame` (the main difference between a frame and an axis being that the latter is continuous by design)
-  + `History`: A `DataAxes` instance where one axis is time.
-
-  **Note** This organization of `Axis`, `DataAxes` and `History` is still incomplete work in progress and subject to change.
+  + `DynArray`: combining *n*-dimensional data with *n* axes.
+    A development goal is to allow easier translation to PyData's analogous `DataArray` (the main difference being that a `DynArray` is intended for data generation, and is associated to a function which can fill the entries *dynamically*).
+  + `History`: A `DynArray` instance where one axis is time.
 
+  **Note** This organization of `Axis`, `DynArray` and `History` is still work in progress and subject to change.
 
 - Dynamic programming, aka lazy evaluation.
 
@@ -40,12 +43,49 @@ Sinn was originally developed in order to infer a mesoscopic neuron model (René
 
     and then compute either *x* or *y* at any point *k<sup>\*</sup>*, without worrying¹ about the fact that f<sub>x</sub> and f<sub>y</sub> both depend on the arrays *x* and *y*, and without unnecessary calculations for points beyond *k<sup>\*</sup>*.
 
+- Fully serializable models
+  Models are implemented as [Pydantic](https://pydantic-docs.helpmanual.io) models, and can be easily exported as dictionaries or serialized JSON:
+
+      mymodel.dict()
+      mymodel.json()
+
+  This is especially useful for repeating part of a workflow with different models or parameters, or deploying a model to a remote machine. Archiving the exact parameterization of a model is also a key component in a reproducible science workflow.
+
 ¹ Unless of course there is a circular dependency between f<sub>x</sub> and f<sub>y</sub>.
 
+## Dependence on a machine-learning library
+
+The choice of Theano as an underlying ML library is largely historical, and while I would likely make a different choice today, I currently have no plans to change this because:
+
+  - It works ;-)
+  - I like the functional style of Theano. It is more “math-like” then the declarative and imperative approaches used by TensorFlow and PyTorch respectively. My (absolutely untested) assumption is that in theory this makes fewer corner case with regard to automatic differentation, and I care more about robust differentiation than easy specification of for loops. I suspect this is an observation shared by the [JaX](https://jax.readthedocs.io) development team.
+  - Static graph optimization at least in theory makes for faster executation.
+  - Everything Theano related is routed through the [theano_shim](https://github.com/mackelab/theano_shim) backend, which removes much of the pain of debugging Theano code.
+
+This last point also means that to use a different ML framework one would only need to port `theano_shim`. Most of the code in the backend is of the form
+
+    import theano.tensor as tt
+    def exp(x):
+        if symbolic(x):
+            return tt.exp
+        else:
+            return np.exp
+
+(The required changes for a TensorFlow-compatible function are left as an exercise to the reader ;-). ) While not a negligible undertaking, porting `theano_shim` is thus certainly feasible.
+
 ## Development status
 
-At present Sinn is at a pre-alpha stage of development. Some of the core functionality is being reorganized to reduce code duplication and make model specification more intuitive. These changes should stabilize the API, but for now one should expect to run into the occasional bug.
-In any case users should treat this library as any fallible tool and check that it performs as expected in their situation.
+At present *sinn* is at a pre-alpha stage of development. Version 0.2 should settle the core API (everything related to the `History` and `Model` classes), but less mature elements may still see some changes.
+
+### v0.2dev release
+
+The current version is a near-complete rewrite of the library, with focus on eliminating stale and duplicated code, more natural model definitions, proper unit testing, and simpler integration into larger workflows. In particular, the use of *Pydantic* throughout means that model objects and parameters can be trivially saved and loaded from disk. Although some [planned changes](https://github.com/mackelab/sinn/issues/1) are still work in progress, the update of the core functionality is complete and already better tested than in the previous version.
+
+These changes were motivated by my own frustrations with v0.1, with regard to managing large numbers of simulations and workflows with multiple steps.
+
+### Disclaimer
+
+Although *sinn* tries hard to protect users from their own mistakes, users should still treat it as any fallible tool and check that it performs as expected in their situation.
 
 ## Installation
 
@@ -72,5 +112,17 @@ In any case users should treat this library as any fallible tool and check that
 
   As usual, if you want to be able to modify the code, add the `-e` flag to obtain a development installation.
 
+## Running tests
+
+  Install the development packages
+
+      pip install sinn[dev]
+
+  This will install `pytest` and `pytest-xdist`. Now run the test suite as
+
+      pytest --forked
+
+  The `--forked` option ensures that each test is run in a separate process. This is required to fully test *sinn* both with and without the auto-diff library loaded.
+
 
 Copyright (c) 2017-2020 Alexandre René
diff --git a/docs/axis.rst b/docs/axis.rst
@@ -0,0 +1,154 @@
+*********
+Data axes
+*********
+
+The `Axis` objects serve as the link between physical and digital dimensions.
+They are comparable to the `dims` and `coords` attributes
+of a PyData_ :class:`xarray.DataArray`, and similarly form the metadata
+frame around a data object. However, in contrast to the PyData objects, they
+may or may not be underlined by a NumPy array, and have specialized functions
+for index arithmetic.
+
+.. _PyData: http://xarray.pydata.org/en/stable/data-structures.html
+
+Definitions
+===========
+
+.. glossary::
+
+Axis
+  Physical axis, generally with associated units.
+
+Data index
+  *Index*, which is shifted such that its smallest value is always zero.
+  This is the value to use to index the underlying data.
+
+Digital dimension
+  A dimension consistent with the computer's storage; the basic example is
+  an axis of a software array.
+
+Index
+  Integer value representing a position along a digital dimension. This may
+  differ from the *data index* if an axis includes padding.
+
+Index delta
+  Difference between two indices. / Relative index.
+
+Mapping
+  Strictly-positive monotone function mapping an integer in digital space to
+  a stop in physical space.
+
+Padding
+  Extra stops added before (after) the *x:sub:`0`* (*x:sub:`n`*) position of a digital axis.
+  These may be used e.g. to initialize a dynamical system, or allow the
+  calculation of convolutions at axis bounds.
+  Adding left-padding will not affect the *index*, but **will** affect the
+  *data index* (by shifting it by an amount equal to the added padding).
+
+Stops
+  Physical values on an axis (one may think of ticks on a plot).
+
+x:sub:`0`
+  Position corresponding to the start of a discretized axis. If the axis
+  corresponds to time, this would usually be the start of the simulation.
+
+x:sub:`n`
+  Position corresponding to the end of a discretized axis. If the axis
+  corresponds to time, this would usually be the end of the simulation.
+
+.. _sinn-indexing:
+
+Indexing
+========
+
+Indexing in `sinn` works a little differently than what you may expect (in fact, that is the reason why in most cases, it *looks* like normal indexing). The first thing to note is that `sinn` does not recognize negative indices (or at least not the way you may expect them to). This is because a core task of `sinn` is to do index arithmetic for us, and it is *way* too easy to accidentally calculate a negative index (for example by trying to retrieve the time point before the origin). After one-too-many times of shooting myself in the foot, I realized that the small convenience of negative indexing (``axis[-i]`` vs. ``axis[axis.xnidx - i]``) was simply more troubled then it was worth and disallowed it entirely.
+
+Later I realized the reason for these issues: that I was using “index” to refer to two related, but conceptually different things: a “data index” and an “axis index”. A *data* index is tied to a structure in computer memory; index ``0`` is always the first element of this structure, and since it has fixed length ``L``, mapping ``-i`` to ``L-i`` is unambiguous. An *axis* index, on the other hand, is simply a position on an arbitrarily defined discretization of that axis. Index `0` is most naturally associated with the origin, *but this need not be the first discretized position*. Correspondingly, a negative index should then be associated with a position left of the origin.
+
+The thing to remember is this: because the structures defined by `sinn` are mathematical structure, *so too is indexing done in axis space*. This includes the interpretation of negative indices as being “more to the left” than 0. The translation into data indices is done internally and should be transparent. As an added bonus, this avoids hard-to-optimize conditionals in the computational graph.
+
+Throughout this library, a “position along a discretized axis” is referred to as a “stop”.
+
+.. Note:: Always use axis indices when passing as arguments between functions or methods; this includes internal functions. Resolve to data indices only when indexing the underlying data. (Remember that axis indices are also monotone and 1-to-1 mapped to data indices, and so just as informational as data indices.)
+
+Operations between indices
+==========================
+
+TODO
+
+.. Note::
+   Operations with plain integers are allowed, but since it is impossible to know whether a plain integer is an absolute or a relative index, they are necessarily ambiguous. The convention in sinn is to treat integers as absolute when *indexing* and testing *equality*, but as relative (deltas) when performing *operations*. This allows ``hist[5]``, ``hist.cur_tidx == 5`` and ``hist[i+1]`` to work as generally expected.
+
+   If this convention is ill-suited, or to remove any ambiguity, use ``hist.Index`` or ``hist.Index.Delta`` to cast the value to the appropriate::
+
+      hist[hist.Index(5)]
+      hist.cur_tidx == hist.Index(5)
+      hist[i + hist.Index.Delta(1)]
+
+Main classes
+============
+
+Axis types
+----------
+
+`Axis`
+  Base class for all axis, which represent a *physical* axis. Contains methods
+  for converting/checking physical units and converting to a transformed
+  space (if the axis was defined with a bijection).
+  Does **not** include any indexing functionality; this is a purely
+  “physical” concept.
+
+`DiscretizedAxis`  (Virtual)
+  Subclass of `Axis` which adds indexing. This is a virtual class because
+  it does not specify the structure used for indexing.
+
+`MapAxis`
+  A `DiscretizedAxis` where the indexing is provided by `SequenceMapping`.
+
+`RangeAxis`
+  A `DiscretizedAxis` where the indexing is provided by `RangeMapping`.
+
+`ArrayAxis`
+  A `DiscretizedAxis` where the indexing is provided by `ArrayMapping`.
+
+Mapping types
+-------------
+
+`SequenceMapping`
+  An object implementing an arbitrary mapping between indices and stops.
+  Is responsible for basic index arithmetic, including computing padding.
+  A `SequenceMapping` is iterable and supports indexed access.
+
+`RangeMapping`
+  A `SequenceMapping` where the indexing is implemented by a memory-efficient
+  `range` object and the stops are spaced regularly. Stops are computed with
+  index arithmetic, in the same way as `range`.
+
+`ArrayMapping`
+  A `SequenceMapping` where the stops are simply stored as an array.
+
+.. Note:: Part of the goal of the *padding* constructs is to make padding as invisible as possible. Consequently, when indexing a `*Mapping` with square brackets, one uses the *unpadded* index. (With regards to the definitions above, brackets use the *index* rather than the *data index*.)
+
+.. Warning:: Negative indices are explicitly disallowed on :py:class:`*Mapping` (or more specifically, they are just treated as "more left" than zero). With large amounts of index arithmetic, I found it too easy to accidentally obtain negative indices, and the hard-to-track bugs they introduce don't justify the minor convenience they bring. As a bonus, negative indices are sometimes useful to add padding stops without changing the position of '0'.
+
+Index types
+-----------
+
+`AxisIndex` and `AxisIndexDelta` are very similar, but are treated
+somewhat differently in operations. For
+example, instances of `AxisIndex` cannot be added together, while instances
+of `AxisIndexDelta` can. (c.f. operations table).
+
+One should not normally create these classes directly.
+
+`AxisIndex`
+  Absolute index. A new `AxisIndex` *class* is created dynamically every time
+  an `*Mapping` object is instantiated. This allows operations between indices
+  to check whether they refer to the same `Axis`.
+  The index reference is *x:sub:`0`*; to index into the data, use the
+  `.data_index` method to correct for padding.
+  Technically a subclass of `AxisIndexDelta`.
+
+`AxisIndexDelta`
+  Relative index. Created alongside `AxisIndex` when a `*Mapping` object is
+  instantiated.