Skip to content

Commit

Permalink
# Version 0.2: The big Pydantic migration
Browse files Browse the repository at this point in the history
This commit inaugurates the transition to sinn v0.2. It is a near 
complete rewrite of the entire stack, based on the lessons learned from 
v0.1. The because fundamental changes are:

  - Integration of `Pydantic <https://pydantic-docs.helpmanual.io>`_ 
throughout.
  - Extension of a few, dispersed axis functions into a coherent, 
full-featured axis & indexing module.
  - Consistent automatic compilation
  - Addition of unit tests
  - A documentation skeleton

## Integration of Pydantic

Every core object is now a Pydantic `BaseModel`. This has benefits in 
terms of specifying and validating intialization inputs, and allows a 
declarative specification of parameters. Declarative specifications are 
especially especially natural for `Model`, where composability and 
extensability are important features. The definition of model components 
(`History`, `Axis`, `Kernel`) as subclasses of `BaseModel` allows us to 
make full use of this composability.

However, the biggest advantages of using Pydantic are when embedding 
models within larger workflows. Pydantic objects can be easily 
(de)serialized, making it simple and reproducible to read/write 
parameter sets, model components or entire models to disk. Any special 
packing/unpacking code is within the object itself, allowing 
standardized codes for passing parameters and results between data 
processing stages.

## Axis module

The axis module was created to address a few issues:

  1. Sinn v0.1 defined different incompatible axis concepts, for 
manipulating time arrays (`History`, simulations) and rescaling, 
rebinning data (`AxisData`, Bayesian inference, data analysis).
  2. The time axis was fully enmeshed in the `History` class. It started 
as a simple second array, but grew into multiple specialized methods as 
needs arose.
  3. Conversions between time axes of different histories was 
unintuitive and error-prone.
  4. Being able to index a history by value instead of index was really 
useful: it allows to index different histories consistently without 
worrying about padding, and is convenient in interactive analysis 
sessions. However conversion to and from values brings a whole host of 
numerical rounding issues in simulation code. (In interactive sessions 
these weren't too important.)

The *axis.py* module collects all the functionality for time and data 
axes into a single consistent class. The more focussed module allowed 
for more precisely defined behaviour, and lightened the `History` class 
which had grown too large. The collection of pre-existing functionality 
represents maybe 20% of the new *axis.py* module.

The rest of the 80% adds the important new functionality of `AxisIndex` 
classes. Axis indices are structured in a hierarchy of types: similar to 
how Python's :mod:`datetime` defines *time* and *timedelta* objects, we 
have *axis* and *axisdelta* objects. We also distinguish between numeric 
and symbolic indices, a requirement for the redesigned Theano 
compilation algorithm. Finally, each axis defines its own index types 
((absolute, delta) x (numeric, symbolic)) – this allows histories to 
transparently convert indices as required, solving point 3 above.

The complex relationships between all axis index types is captured with 
an abstract class hierarchy, enabling easy checking of index type at any 
level of granularity (any delta index type vs any index type 
corresponding to *this* axis).

The concepts of “axis index” and “data index” are now clearly separated. 
For an axis index, there is no wrapping: the position at `-1` is simply 
more to the left than `0`, provided it exists. This allowed us to avoid 
the issues with numerical errors, without losing the convenience of 
value indexing in interactive sessions.

## Consistent automatic compilation

Version 0.1 of sinn saw multiple iterations of algorithms for 
constructing the Theano functions declaratively defined in models. This 
culminated in the pattern of “anchored computational graphs“ and 
connected/disconnected histories. In version 0.2, all obsolete 
compilation code was removed, and the pattern made coherent throughout. 
This lead to massive simplifications to the code underlying 
`Model.advance()`, and huge improvements in the reliability of that 
code.

## Unit tests

All new code is now covered by unit tests. This both massively improves 
reliability, and provides for a basic form of documentation of the API.

## Documentation

Documentation is now a an incoherent set of pages written with varying 
levels of completeness. This is a big improvement over the previous 
state of no documentation to speak of.
  • Loading branch information
alcrene committed Jun 15, 2020
1 parent a00f850 commit 2a6c072
Show file tree
Hide file tree
Showing 68 changed files with 16,595 additions and 8,194 deletions.
9 changes: 9 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -14,12 +14,21 @@ debuglog*
data
TMPNOTES
venv
snippets
tmp*
*.log*
*.cprof
_build
_autosummary
*.coverage
*.html
*.cache
*.compilecache
*???final?.dat
*???final?.dir
# Theano graph cache files
activate
# symbolic link to environment
*.Rproj*
.Rprofile
.RData
76 changes: 64 additions & 12 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,35 +1,38 @@
# Sinn

*sinn* is a library for both *S*imulation and *IN*ference of dynamical systems. It seeks to provide a flexible framework for building complex mathematical models which are fully compatible with machine learning libraries, allowing them to be differentiated through and compiled to C code.
*sinn* is a library for both *S*imulation and *IN*ference of dynamical systems. It provides a flexible framework for building complex mathematical models which are fully compatible with machine learning libraries, allowing almost arbitrary cost functions to be differentiated all the way back to the model parameters. Both the simulation model and cost function are compiled and run as C code.

## Motivation

Optimization frameworks like Theano and TensorFlow provide powerful capabilities for fitting models to data. However, they are most tailored to fitting neural networks implementing the type of dynamic mechanistic models often found in physics or applied mathematics within these frameworks remains error-prone and labour-intensive. Sinn provides a set of high-level constructs designed to integrate well with lower-level optimization libraries.
Optimization frameworks like TensorFlow and PyTorch provide powerful capabilities for fitting models to data. However, they are most tailored to fitting neural networks, and implementing the type of dynamic mechanistic models often found in physics or applied mathematics within these frameworks remains error-prone and labour-intensive. *sinn* provides a set of high-level constructs designed to bridge the gap between the mathematical language of dynamical systems and the interface of a machine-learning library.

Sinn was originally developed in order to infer a mesoscopic neuron model (René et al., in press, Neural Computation; [arXiv](https://arxiv.org/abs/1910.01618)).

## Documentation

Partial documentation can be found on [Read the Docs](https://sinn.readthedocs.io/en/latest/). This will improve as development continues.

## Features

- Automatic differentiation and C-compilation provided by Theano.
- Automatic differentiation and C-compilation provided by Theano, although one [could](#dependence-on-a-machine-learning-library) in theory use other use other machine-learning frameworks.

- Compatible with PyMC3

Make your model probabilistic with a few extra lines of code, for easy implementation of Bayesian inference and Monte Carlo sampling.

- Use optimization library only when desired.
- Use the optimization library only when desired.

No code change is required to run models with either Numpy or Theano – the single line `shim.load('theano')` suffices to load the optimization library.
Since a pure Numpy model does not require compilation every time it is run, allowing you to first develop your model faster with more easily traceable errors, and then benefit from the C-acceleration and automatic differentiation by loading the optimization library.
Since a pure Numpy model does not require compilation every time it is run, this allows you to first develop your model faster with more easily traceable errors, and then benefit from the C-acceleration and automatic differentiation by loading the optimization library.

- Data structures which map naturally to the mathematical models

+ `Axis`: Unit-aware structure for continuous quantities such as time, space, temperature…
+ `DataAxes`: combining *n*-dimensional data with *n* axes.
A development goal is to allow easier translation to Pandas' analogous `DataFrame` (the main difference between a frame and an axis being that the latter is continuous by design)
+ `History`: A `DataAxes` instance where one axis is time.

**Note** This organization of `Axis`, `DataAxes` and `History` is still incomplete work in progress and subject to change.
+ `DynArray`: combining *n*-dimensional data with *n* axes.
A development goal is to allow easier translation to PyData's analogous `DataArray` (the main difference being that a `DynArray` is intended for data generation, and is associated to a function which can fill the entries *dynamically*).
+ `History`: A `DynArray` instance where one axis is time.

**Note** This organization of `Axis`, `DynArray` and `History` is still work in progress and subject to change.

- Dynamic programming, aka lazy evaluation.

Expand All @@ -40,12 +43,49 @@ Sinn was originally developed in order to infer a mesoscopic neuron model (René

and then compute either *x* or *y* at any point *k<sup>\*</sup>*, without worrying¹ about the fact that f<sub>x</sub> and f<sub>y</sub> both depend on the arrays *x* and *y*, and without unnecessary calculations for points beyond *k<sup>\*</sup>*.

- Fully serializable models
Models are implemented as [Pydantic](https://pydantic-docs.helpmanual.io) models, and can be easily exported as dictionaries or serialized JSON:

mymodel.dict()
mymodel.json()

This is especially useful for repeating part of a workflow with different models or parameters, or deploying a model to a remote machine. Archiving the exact parameterization of a model is also a key component in a reproducible science workflow.

¹ Unless of course there is a circular dependency between f<sub>x</sub> and f<sub>y</sub>.

## Dependence on a machine-learning library

The choice of Theano as an underlying ML library is largely historical, and while I would likely make a different choice today, I currently have no plans to change this because:

- It works ;-)
- I like the functional style of Theano. It is more “math-like” then the declarative and imperative approaches used by TensorFlow and PyTorch respectively. My (absolutely untested) assumption is that in theory this makes fewer corner case with regard to automatic differentation, and I care more about robust differentiation than easy specification of for loops. I suspect this is an observation shared by the [JaX](https://jax.readthedocs.io) development team.
- Static graph optimization at least in theory makes for faster executation.
- Everything Theano related is routed through the [theano_shim](https://github.com/mackelab/theano_shim) backend, which removes much of the pain of debugging Theano code.

This last point also means that to use a different ML framework one would only need to port `theano_shim`. Most of the code in the backend is of the form

import theano.tensor as tt
def exp(x):
if symbolic(x):
return tt.exp
else:
return np.exp

(The required changes for a TensorFlow-compatible function are left as an exercise to the reader ;-). ) While not a negligible undertaking, porting `theano_shim` is thus certainly feasible.

## Development status

At present Sinn is at a pre-alpha stage of development. Some of the core functionality is being reorganized to reduce code duplication and make model specification more intuitive. These changes should stabilize the API, but for now one should expect to run into the occasional bug.
In any case users should treat this library as any fallible tool and check that it performs as expected in their situation.
At present *sinn* is at a pre-alpha stage of development. Version 0.2 should settle the core API (everything related to the `History` and `Model` classes), but less mature elements may still see some changes.

### v0.2dev release

The current version is a near-complete rewrite of the library, with focus on eliminating stale and duplicated code, more natural model definitions, proper unit testing, and simpler integration into larger workflows. In particular, the use of *Pydantic* throughout means that model objects and parameters can be trivially saved and loaded from disk. Although some [planned changes](https://github.com/mackelab/sinn/issues/1) are still work in progress, the update of the core functionality is complete and already better tested than in the previous version.

These changes were motivated by my own frustrations with v0.1, with regard to managing large numbers of simulations and workflows with multiple steps.

### Disclaimer

Although *sinn* tries hard to protect users from their own mistakes, users should still treat it as any fallible tool and check that it performs as expected in their situation.

## Installation

Expand All @@ -72,5 +112,17 @@ In any case users should treat this library as any fallible tool and check that

As usual, if you want to be able to modify the code, add the `-e` flag to obtain a development installation.

## Running tests

Install the development packages

pip install sinn[dev]

This will install `pytest` and `pytest-xdist`. Now run the test suite as

pytest --forked

The `--forked` option ensures that each test is run in a separate process. This is required to fully test *sinn* both with and without the auto-diff library loaded.


Copyright (c) 2017-2020 Alexandre René
154 changes: 154 additions & 0 deletions docs/axis.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,154 @@
*********
Data axes
*********

The `Axis` objects serve as the link between physical and digital dimensions.
They are comparable to the `dims` and `coords` attributes
of a PyData_ :class:`xarray.DataArray`, and similarly form the metadata
frame around a data object. However, in contrast to the PyData objects, they
may or may not be underlined by a NumPy array, and have specialized functions
for index arithmetic.

.. _PyData: http://xarray.pydata.org/en/stable/data-structures.html

Definitions
===========

.. glossary::

Axis
Physical axis, generally with associated units.

Data index
*Index*, which is shifted such that its smallest value is always zero.
This is the value to use to index the underlying data.

Digital dimension
A dimension consistent with the computer's storage; the basic example is
an axis of a software array.

Index
Integer value representing a position along a digital dimension. This may
differ from the *data index* if an axis includes padding.

Index delta
Difference between two indices. / Relative index.

Mapping
Strictly-positive monotone function mapping an integer in digital space to
a stop in physical space.

Padding
Extra stops added before (after) the *x:sub:`0`* (*x:sub:`n`*) position of a digital axis.
These may be used e.g. to initialize a dynamical system, or allow the
calculation of convolutions at axis bounds.
Adding left-padding will not affect the *index*, but **will** affect the
*data index* (by shifting it by an amount equal to the added padding).

Stops
Physical values on an axis (one may think of ticks on a plot).

x:sub:`0`
Position corresponding to the start of a discretized axis. If the axis
corresponds to time, this would usually be the start of the simulation.

x:sub:`n`
Position corresponding to the end of a discretized axis. If the axis
corresponds to time, this would usually be the end of the simulation.

.. _sinn-indexing:

Indexing
========

Indexing in `sinn` works a little differently than what you may expect (in fact, that is the reason why in most cases, it *looks* like normal indexing). The first thing to note is that `sinn` does not recognize negative indices (or at least not the way you may expect them to). This is because a core task of `sinn` is to do index arithmetic for us, and it is *way* too easy to accidentally calculate a negative index (for example by trying to retrieve the time point before the origin). After one-too-many times of shooting myself in the foot, I realized that the small convenience of negative indexing (``axis[-i]`` vs. ``axis[axis.xnidx - i]``) was simply more troubled then it was worth and disallowed it entirely.

Later I realized the reason for these issues: that I was using “index” to refer to two related, but conceptually different things: a “data index” and an “axis index”. A *data* index is tied to a structure in computer memory; index ``0`` is always the first element of this structure, and since it has fixed length ``L``, mapping ``-i`` to ``L-i`` is unambiguous. An *axis* index, on the other hand, is simply a position on an arbitrarily defined discretization of that axis. Index `0` is most naturally associated with the origin, *but this need not be the first discretized position*. Correspondingly, a negative index should then be associated with a position left of the origin.

The thing to remember is this: because the structures defined by `sinn` are mathematical structure, *so too is indexing done in axis space*. This includes the interpretation of negative indices as being “more to the left” than 0. The translation into data indices is done internally and should be transparent. As an added bonus, this avoids hard-to-optimize conditionals in the computational graph.

Throughout this library, a “position along a discretized axis” is referred to as a “stop”.

.. Note:: Always use axis indices when passing as arguments between functions or methods; this includes internal functions. Resolve to data indices only when indexing the underlying data. (Remember that axis indices are also monotone and 1-to-1 mapped to data indices, and so just as informational as data indices.)

Operations between indices
==========================

TODO

.. Note::
Operations with plain integers are allowed, but since it is impossible to know whether a plain integer is an absolute or a relative index, they are necessarily ambiguous. The convention in sinn is to treat integers as absolute when *indexing* and testing *equality*, but as relative (deltas) when performing *operations*. This allows ``hist[5]``, ``hist.cur_tidx == 5`` and ``hist[i+1]`` to work as generally expected.

If this convention is ill-suited, or to remove any ambiguity, use ``hist.Index`` or ``hist.Index.Delta`` to cast the value to the appropriate::

hist[hist.Index(5)]
hist.cur_tidx == hist.Index(5)
hist[i + hist.Index.Delta(1)]

Main classes
============

Axis types
----------

`Axis`
Base class for all axis, which represent a *physical* axis. Contains methods
for converting/checking physical units and converting to a transformed
space (if the axis was defined with a bijection).
Does **not** include any indexing functionality; this is a purely
“physical” concept.

`DiscretizedAxis` (Virtual)
Subclass of `Axis` which adds indexing. This is a virtual class because
it does not specify the structure used for indexing.

`MapAxis`
A `DiscretizedAxis` where the indexing is provided by `SequenceMapping`.

`RangeAxis`
A `DiscretizedAxis` where the indexing is provided by `RangeMapping`.

`ArrayAxis`
A `DiscretizedAxis` where the indexing is provided by `ArrayMapping`.

Mapping types
-------------

`SequenceMapping`
An object implementing an arbitrary mapping between indices and stops.
Is responsible for basic index arithmetic, including computing padding.
A `SequenceMapping` is iterable and supports indexed access.

`RangeMapping`
A `SequenceMapping` where the indexing is implemented by a memory-efficient
`range` object and the stops are spaced regularly. Stops are computed with
index arithmetic, in the same way as `range`.

`ArrayMapping`
A `SequenceMapping` where the stops are simply stored as an array.

.. Note:: Part of the goal of the *padding* constructs is to make padding as invisible as possible. Consequently, when indexing a `*Mapping` with square brackets, one uses the *unpadded* index. (With regards to the definitions above, brackets use the *index* rather than the *data index*.)

.. Warning:: Negative indices are explicitly disallowed on :py:class:`*Mapping` (or more specifically, they are just treated as "more left" than zero). With large amounts of index arithmetic, I found it too easy to accidentally obtain negative indices, and the hard-to-track bugs they introduce don't justify the minor convenience they bring. As a bonus, negative indices are sometimes useful to add padding stops without changing the position of '0'.

Index types
-----------

`AxisIndex` and `AxisIndexDelta` are very similar, but are treated
somewhat differently in operations. For
example, instances of `AxisIndex` cannot be added together, while instances
of `AxisIndexDelta` can. (c.f. operations table).

One should not normally create these classes directly.

`AxisIndex`
Absolute index. A new `AxisIndex` *class* is created dynamically every time
an `*Mapping` object is instantiated. This allows operations between indices
to check whether they refer to the same `Axis`.
The index reference is *x:sub:`0`*; to index into the data, use the
`.data_index` method to correct for padding.
Technically a subclass of `AxisIndexDelta`.

`AxisIndexDelta`
Relative index. Created alongside `AxisIndex` when a `*Mapping` object is
instantiated.
Loading

0 comments on commit 2a6c072

Please sign in to comment.