From c399ec84cb4c858a15af871a28c546a20a0ac3b1 Mon Sep 17 00:00:00 2001 From: smathot Date: Mon, 19 Dec 2022 14:10:15 +0100 Subject: [PATCH] Update docs --- doc-pelican/content/pages/basic.md | 601 ++++++++---------- doc-pelican/content/pages/index.md | 8 +- doc-pelican/content/pages/largedata.md | 1 + doc-pelican/sitemap.yaml | 1 + .../cogsci/templates/mega-menu-content.html | 1 + doc-pelican/themes/cogsci/templates/page.html | 7 + 6 files changed, 285 insertions(+), 334 deletions(-) create mode 100644 doc-pelican/content/pages/largedata.md diff --git a/doc-pelican/content/pages/basic.md b/doc-pelican/content/pages/basic.md index 3e9f608..a45e499 100644 --- a/doc-pelican/content/pages/basic.md +++ b/doc-pelican/content/pages/basic.md @@ -57,71 +57,91 @@ Slightly longer cheat sheet: [TOC] -## Basic operations - -### Creating a DataMatrix - -Create a new `DataMatrix` object, and add a column (named `col`). By default, the column is of the `MixedColumn` type, which can store numeric and string data. +## Creating a DataMatrix +Create a new `DataMatrix` object with a length (number of rows) of 2, and add a column (named `col`). By default, the column is of the `MixedColumn` type, which can store numeric, string, and `None` data. ```python import sys from datamatrix import DataMatrix, __version__ dm = DataMatrix(length=2) -dm.col = ':-)' -print( - 'Examples generated with DataMatrix v{} on Python {}\n'.format( - __version__, - sys.version - ) -) +dm.col = '☺' +print('DataMatrix v{} on Python {}\n'.format(__version__, sys.version)) print(dm) ``` -You can change the length of the `DataMatrix` later on. If you reduce the length, data will be lost. If you increase the length, empty cells will be added. - +You can change the length of the `DataMatrix` later on. If you reduce the length, data will be lost. If you increase the length, empty cells (by default containing empty strings) will be added. 
```python dm.length = 3 ``` -### Concatenating two DataMatrix objects +## Reading and writing files -You can concatenate two `DataMatrix` objects using the `<<` operator. Matching columns will be combined. (Note that row 2 is empty. This is because we have increased the length of `dm` in the previous step, causing an empty row to be added.) +You can read and write files with functions from the `datamatrix.io` module. The main supported file types are `csv` and `xlsx`. + +```python +from datamatrix import io + +dm = DataMatrix(length=3) +dm.col = 1, 2, 3 +# Write to disk +io.writetxt(dm, 'my_datamatrix.csv') +io.writexlsx(dm, 'my_datamatrix.xlsx') +# And read it back from disk! +dm = io.readtxt('my_datamatrix.csv') +dm = io.readxlsx('my_datamatrix.xlsx') +``` + +Multidimensional columns cannot be saved to `csv` or `xlsx` format but instead need to be saved to a custom binary format. + +``` +from datamatrix import MultiDimensionalColumn +dm.mdim_col = MultiDimensionalColumn(shape=2) +# Write to disk +io.writebin(dm, 'my_datamatrix.dm') +# And read it back from disk! +dm = io.readbin('my_datamatrix.dm') +``` + + +## Stacking (vertically concatenating) DataMatrix objects + +You can stack two `DataMatrix` objects using the `<<` operator. Matching columns will be combined. (Note that row 2 is empty. This is because we have increased the length of `dm` in the previous step, causing an empty row to be added.) ```python dm2 = DataMatrix(length=2) -dm2.col = ';-)' +dm2.col = '☺' dm2.col2 = 10, 20 dm3 = dm << dm2 print(dm3) ``` - -### Creating columns - -You can change all cells in column to a single value. This creates a new column if it doesn't exist yet. - +Pro-tip: To stack three or more `DataMatrix` objects, using [the `stack()` function from the `operations` module](%url:operations) is faster than iteratively using the `<<` operator. 
```python -dm.col = 'Another value' -print(dm) +from datamatrix import operations as ops +dm4 = ops.stack(dm, dm2, dm3) ``` -You can change all cells in a column based on a sequence. This creates a new column if it doesn't exist yet. This sequence must have the same length as the column (3 in this case). +## Working with columns +### Referring to columns + +You can refer to columns in two ways: as keys in a `dict` or as properties. The two notations are identical for most purposes. The main reason to use a `dict` style is when the name of the column is itself variable. Otherwise, the property style is recommended for clarity. ```python -dm.col = 1, 2, 3 -print(dm) +dm['col'] # dict style +dm.col # property style ``` -If you do not know the name of a column, for example because it is defined by a variable, you can also refer to columns as though they are items of a `dict`. However, this is *not* recommended, because it makes it less clear whether you are referring to column or a row. +### Creating columns +By assigning a value to a nonexistent column, a new column is created and initialized to this value. ```python -dm['col'] = 'X' +dm.col = 'Another value' print(dm) ``` @@ -145,53 +165,56 @@ del dm.col2 print(dm) ``` -### Slicing and assigning to column cells +### Column types + +There are five column types: -#### Assign to one cell +- `MixedColumn` is the default column type. This can contain numbers (`int` and `float`), strings (`str`), and `None` values. This column type is flexible but not very fast because it is (mostly) implemented in pure Python, rather than using `numpy`, which is the basis for the other columns. The default value for empty cells is an empty string. +- `FloatColumn` contains `float` numbers. The default value for empty cells is `NAN`. +- `IntColumn` contains `int` numbers. (This does not include `INF` and `NAN`, which are of type `float` in Python.) The default value for empty cells is 0. 
+- `MultiDimensionalColumn` contains higher-dimensional `float` arrays. This allows you to mix higher-dimensional data, such as time series or images, with regular one-dimensional data. The default value for empty cells is `NAN`. +- `SeriesColumn` is identical to a two-dimensional `MultiDimensionalColumn`. +When you create a `DataMatrix`, you can indicate a default column type. ```python -dm.col[1] = ':-)' -print(dm) +# Create IntColumns by default +dm = DataMatrix(length=2, default_col_type=int) +dm.i = 1, 2 # This is an IntColumn ``` -#### Assign to multiple cells - -This changes row 0 and 2. It is not a slice! - +You can also explicitly indicate the column type when creating a new column. ```python -dm.col[0,2] = ':P' -print(dm) +dm.f = float # This creates an empty (`NAN`-filled) FloatColumn +dm.i = int # This creates an empty (0-filled) IntColumn ``` -#### Assign to a slice of cells - +To create a `MultiDimensionalColumn` you need to import the column type and specify a shape: ```python -dm.col[1:] = ':D' +from datamatrix import MultiDimensionalColumn +dm.mdim_col = MultiDimensionalColumn(shape=(2, 3)) print(dm) ``` -#### Assign to cells that match a selection criterion - +You can also specify named dimensions. For example, `('x', 'y')` creates a dimension of size 2 where index 0 can be referred to as 'x' and index 1 can be referred to as 'y': ```python -dm.col[1:] = ':D' -dm.is_happy = 'no' -dm.is_happy[dm.col == ':D'] = 'yes' -print(dm) +dm.mdim_col = MultiDimensionalColumn(shape=(('x', 'y'), 3)) ``` + ### Column properties -Basic numeric properties, such as the mean, can be accessed directly. Only numeric values are taken into account. +Basic numerical properties, such as the mean, can be accessed directly. For this purpose, only numerical, non-`NAN` values are taken into account. ```python +dm = DataMatrix(length=3) dm.col = 1, 2, 'not a number' # Numeric descriptives -print('mean: %s' % dm.col.mean) +print('mean: %s' % dm.col.mean) # or dm.col[...] 
print('median: %s' % dm.col.median) print('standard deviation: %s' % dm.col.std) print('sum: %s' % dm.col.sum) @@ -203,54 +226,161 @@ print('number of unique values: %s' % dm.col.count) print('column name: %s' % dm.col.name) ``` -### Iterating over rows, columns, and cells +The `shape` property indicates the number and sizes of the dimensions of the column. For regular columns, the shape is a tuple containing only the length of the datamatrix (the number of rows). For multidimensional columns, the shape is a tuple containing the length of the datamatrix and the shape of cells as specified through the `shape` keyword. -By iterating directly over a `DataMatrix` object, you get successive `Row` objects. From a `Row` object, you can directly access cells. +```python +print(dm.col.shape) +dm.mdim_col = MultiDimensionalColumn(shape=(2, 4)) +print(dm.mdim_col.shape) +``` +The `loaded` property indicates whether a column is currently stored in memory, or whether it is offloaded to disk. This is mainly relevant for multidimensional columns, which are [automatically offloaded to disk when memory runs low](%link:largedata%). ```python -dm.col = 'a', 'b', 'c' -for row in dm: - print(row) - print(row.col) +print(dm.mdim_col.loaded) ``` -By iterating over `DataMatrix.columns`, you get successive `(column_name, column)` tuples. +## Assigning + +### Assigning by index, multiple indices, or slice + +You can assign a single value to one or more cells in various ways. 
```python -for colname, col in dm.columns: - print('%s = %s' % (colname, col)) +dm = DataMatrix(length=4) +# Create a new column +dm.col = '' +# By index: assign to a single cell (at row 1) +dm.col[1] = ':-)' +# By a tuple (or other iterable) of multiple indices: +# assign to cells at rows 0 and 2 +dm.col[0, 2] = ':P' +# By slice: assign from row 2 until the end +dm.col[2:] = ':D' +print(dm) ``` -By iterating over a column, you get successive cells: +You can also assign multiple values at once, provided that the to-be-assigned sequence is of the correct length. +```python +# Assign to the full column +dm.col = 1, 2, 3, 4 +# Assign to two cells +dm.col[0, 2] = 'a', 'b' +print(dm) +``` + + +### Assigning to cells that match a selection criterion + +As will be described in more detail later on, comparing a column to a value gives a new `DataMatrix` that contains only the matching rows. This subsetted `DataMatrix` can in turn be used to assign to the matching rows of the original `DataMatrix`. This sounds a bit abstract but is very easy in practice: ```python -for cell in dm.col: - print(cell) +dm.col[1:] = ':D' +dm.is_happy = 'no' +dm.is_happy[dm.col == ':D'] = 'yes' +print(dm) ``` -By iterating over a `Row` object, you get (`column_name, cell`) tuples: +### Assigning to multidimensional columns + +Assigning to multidimensional columns works much the same as assigning to regular columns. The main differences are that there are multiple dimensions, and that dimensions can be named. 
```python -row = dm[0] # Get the first row -for colname, cell in row: - print('%s = %s' % (colname, cell)) +dm = DataMatrix(length=2) +dm.mdim_col = MultiDimensionalColumn(shape=(('x', 'y'), 3)) +# Set all values to a single value +dm.mdim_col = 1 +# Set all last dimensions to a single array of shape 3 +dm.mdim_col = [ 1, 2, 3] +# Set all rows to a single array of shape (2, 3) +dm.mdim_col = [[ 1, 2, 3], + [ 4, 5, 6]] +# Set the column to an array of shape (2, 2, 3) +dm.mdim_col = [[[ 1, 2, 3], + [ 4, 5, 6]], + [[ 7, 8, 9], + [10, 11, 12]]] ``` -The `column_names` property gives a sorted list of all column names (without the corresponding column objects): +To assign to dimensions by name: + +```python +dm.mdim_col[:, 'x'] = 1, 2, 3 # identical to assigning to dm.mdim_col[:, 0] +dm.mdim_col[:, 'y'] = 4, 5, 6 # identical to assigning to dm.mdim_col[:, 1] +``` + +*Pro-tip:* When assigning an array-like object to a multidimensional column, the shape of the to-be-assigned array needs to match the final part of the shape of the column. This means that you can assign a (2, 3) array to a (2, 2, 3) column, in which case all rows (the first dimension) are set to the array. However, you *cannot* assign a (2, 2) array to a (2, 2, 3) column. +## Accessing + +### Accessing by index, multiple indices, or slice ```python -print(dm.column_names) +dm = DataMatrix(length=4) +# Create a new column +dm.col = 'a', 'b', 'c', 'd' +# By index: select a single cell (at row 1). +print(dm.col[1]) +# By a tuple (or other iterable) of multiple indices: +# select cells at rows 0 and 2. This gives a new column. +print(dm.col[0, 2]) +# By slice: select from row 2 until the end. This gives a new column. +print(dm.col[2:]) +``` + + +### Accessing and averaging (ellipsis averaging) multidimensional columns + +Accessing multidimensional columns works much the same as accessing regular columns. The main differences are that there are multiple dimensions, and that dimensions can be named. 
+ +```python +dm = DataMatrix(length=2) +dm.mdim_col = MultiDimensionalColumn(shape=(('x', 'y'), 3)) +dm.mdim_col = [[[ 1, 2, 3], + [ 4, 5, 6]], + [[ 7, 8, 9], + [10, 11, 12]]] +# From all rows, get index 1 (named 'y') from the second dimension and index 2 from the third dimension. +print(dm.mdim_col[:, 'y', 2]) +``` + +You can select the average of a column using the ellipsis (`...`) index. For regular columns, this is identical to accessing the `mean` property: + +```python +dm.col = 1, 2 +print(dm.col[...]) # identical to `dm.col.mean` +``` + +Ellipsis averaging (`...`) is especially useful when working with multidimensional data, in which case it allows you to average over specific dimensions. As long as you don't average over the first dimension, which corresponds to the rows of the `DataMatrix`, the result is a new column. + + +```python + +# Averaging over the third dimension gives a column of shape (2, 2) +dm.avg3 = dm.mdim_col[:, :, ...] +# Averaging over the second dimension gives a column of shape (2, 3) +dm.avg2 = dm.mdim_col[:, ...] +# Averaging over the second and third dimensions gives a `FloatColumn`. +dm.avg23 = dm.mdim_col[:, ..., ...] +print(dm) +``` + +When averaging over the first dimension, which corresponds to the rows of the `DataMatrix`, the result is either an array or (if all dimensions are averaged) a float: + +```python +# Averaging over the rows gives an array of shape (2, 3) +print(dm.mdim_col[...]) +# Averaging over all dimensions gives a float +print(dm.mdim_col[..., ..., ...]) ``` -### Selecting data +## Selecting -#### Comparing a column to a value +### Selecting by column values You can select by directly comparing columns to values. This returns a new `DataMatrix` object with only the selected rows. 
@@ -262,7 +392,7 @@ dm_subset = dm.col > 5 print(dm_subset) ``` -#### Selecting by multiple criteria with `|` (or), `&` (and), and `^` (xor) +### Selecting by multiple criteria with `|` (or), `&` (and), and `^` (xor) You can select by multiple criteria using the `|` (or), `&` (and), and `^` (xor) operators (but not the actual words 'and' and 'or'). Note the parentheses, which are necessary because `|`, `&`, and `^` have priority over other operators. @@ -278,7 +408,7 @@ dm_subset = (dm.col > 1) & (dm.col < 8) print(dm_subset) ``` -#### Selecting by multiple criteria by comparing to a set `{}` +### Selecting by multiple criteria by comparing to a set `{}` If you want to check whether column values are identical to, or different from, a set of test values, you can compare the column to a `set` object. (This is considerably faster than comparing the column values to each of the test values separately, and then merging the result using `&` or `|`.) @@ -288,7 +418,7 @@ dm_subset = dm.col == {1, 3, 5, 7} print(dm_subset) ``` -#### Selecting with a function or lambda expression +### Selecting (filtering) with a function or lambda expression You can also use a function or `lambda` expression to select column values. The function must take a single argument and its return value determines whether the column value is selected. This is analogous to the classic `filter()` function. @@ -298,7 +428,7 @@ dm_subset = dm.col == (lambda x: x % 2) print(dm_subset) ``` -#### Selecting values that match another column (or sequence) +### Selecting values that match another column (or sequence) You can also select by comparing a column to a sequence, in which case a row-by-row comparison is done. This requires that the sequence has the same length as the column, is not a `set` object (because `set` objects are treated as described above). 
@@ -310,7 +440,9 @@ dm_subset = dm.col == ['a', 'b', 'x', 'y'] print(dm_subset) ``` -When a column contains values of different types, you can also select values by type: (Note: On Python 2, all `str` values are automatically decoded to `unicode`, so you'd need to compare the column to `unicode` to extract `str` values.) +### Selecting values by type + +When a column contains values of different types, you can also select values by type: ```python @@ -320,40 +452,50 @@ dm_subset = dm.col == int print(dm_subset) ``` -#### Getting indices for rows that match selection criteria ('where') +### Getting indices for rows that match selection criteria ('where') You can get the indices for rows that match certain selection criteria by slicing a `DataMatrix` with a subset of itself. This is similar to the `numpy.where()` function. ```python dm = DataMatrix(length=4) dm.col = 1, 2, 3, 4 -print(dm[(dm.col > 1) & (dm.col < 4)]) +indices = dm[(dm.col > 1) & (dm.col < 4)] +print(indices) ``` -### Element-wise column operations +### Selecting a subset of columns -#### Multiplication, addition, etc. +You can select a subset of columns by passing the columns as an index to `dm[]`. Columns can be specified by name ('col3') or by object (`dm.col1`). -You can apply basic mathematical operations on all cells in a column simultaneously. Cells with non-numeric values are ignored, except by the `+` operator, which then results in concatenation. +```python +dm = DataMatrix(length=4) +dm.col1 = '☺' +dm.col2 = 'a' +dm.col3 = 1 +dm_subset = dm[dm.col1, 'col3'] +print(dm_subset) +``` + + +## Element-wise column operations +### Multiplication, addition, etc. + +You can apply basic mathematical operations on all cells in a column simultaneously. Cells with non-numeric values are ignored, except by the `+` operator, which then results in concatenation. 
```python dm = DataMatrix(length=3) dm.col = 0, 'a', 20 -dm.col2 = dm.col*.5 -dm.col3 = dm.col+10 -dm.col4 = dm.col-10 -dm.col5 = dm.col/50 +dm.col2 = dm.col * .5 +dm.col3 = dm.col + 10 +dm.col4 = dm.col - 10 +dm.col5 = dm.col / 50 print(dm) ``` -#### Applying a function or lambda expression - -
-The @ operator is only available in Python 3.5 and later. -
+### Applying (mapping) a function or lambda expression -You can apply a function or `lambda` expression to all cells in a column simultaneously with the `@` operator. +You can apply a function or `lambda` expression to all cells in a column simultaneously with the `@` operator. This is analogous to the classic `map()` function. ```python @@ -363,298 +505,95 @@ dm.col2 = dm.col @ (lambda x: x*2) print(dm) ``` -## Reading and writing files - -You can read and write files with functions from the `datamatrix.io` module. The main supported file types are `csv` and `xlsx`. - -```python -from datamatrix import io +## Iterating over rows, columns, and cells (for loops) -dm = DataMatrix(length=3) -dm.col = 1, 2, 3 -# Write to disk -io.writetxt(dm, 'my_datamatrix.csv') -io.writexlsx(dm, 'my_datamatrix.xlsx') -# And read it back from disk! -dm = io.readtxt('my_datamatrix.csv') -dm = io.readxlsx('my_datamatrix.xlsx') -``` - -## Column types +By iterating directly over a `DataMatrix` object, you get successive `Row` objects. From a `Row` object, you can directly access cells. -When you create a `DataMatrix`, you can indicate a default column. If you do not specify a default column type, a `MixedColumn` is used by default. ```python -dm = DataMatrix(length=2, default_col_type=int) -dm.i = 1, 2 # This is an IntColumn +dm.col = 'a', 'b', 'c' +for row in dm: + print(row) + print(row.col) ``` -You can also explicitly indicate the column type when creating a new column. +By iterating over `DataMatrix.columns`, you get successive `(column_name, column)` tuples. + ```python -dm.f = float # This creates a FloatColumn -``` - -### MixedColumn (default) - -A `MixedColumn` contains text (`unicode` in Python 2, `str` in Python 3), `int`, `float`, or `None`. 
- -Important notes: - -- `utf-8` encoding is assumed for byte strings -- String with numeric values, including `NAN` and `INF`, are automatically converted to the most appropriate type -- The string 'None' is *not* converted to the type `None` -- Trying to assign a non-supported type results in a `TypeError` - -```python -from datamatrix import DataMatrix, NAN, INF -dm = DataMatrix(length=12) -dm.datatype = ( - 'int', - 'int (converted)', - 'float', - 'float (converted)', - 'None', - 'str', - 'float', - 'float (converted)', - 'float', - 'float (converted)', - 'float', - 'float (converted)', -) -dm.value = ( - 1, - '1', - 1.2, - '1.2', - None, - 'None', - NAN, - 'nan', - INF, - 'inf', - -INF, - '-inf' -) -print(dm) +for colname, col in dm.columns: + print('%s = %s' % (colname, col)) ``` - -### IntColumn (requires numpy) - -The `IntColumn` contains only `int` values. As of 0.14, the easiest way to create a `IntColumn` column is to assign `int` to a new column name. - -Important notes: - -- Trying to assign a value that cannot be converted to an `int` results in a `TypeError` -- Float values will be rounded down (i.e. the decimals will be lost) -- `NAN` or `INF` values are not supported because these are `float` +By iterating over a column, you get successive cells: ```python -from datamatrix import DataMatrix -dm = DataMatrix(length=2) -dm.i = int -dm.i = 1, 2 -print(dm) +for cell in dm.col: + print(cell) ``` -If you insert non-`int` values, they are automatically converted to `int` if possible. Decimals are discarded (i.e. 
values are floored, not rounded): +By iterating over a `Row` object, you get (`column_name, cell`) tuples: ```python -dm.i = '3', 4.7 -print(dm) +row = dm[0] # Get the first row +for colname, cell in row: + print('%s = %s' % (colname, cell)) ``` -If you insert values that cannot converted to `int`, a `TypeError` is raised: +The `column_names` property gives a sorted list of all column names (without the corresponding column objects): ```python -try: - dm.i = 'x' -except TypeError as e: - print(repr(e)) +print(dm.column_names) ``` -### FloatColumn (requires numpy) - -The `FloatColumn` contains `float`, `nan`, and `inf` values. As of 0.14, the easiest way to create a `FloatColumn` column is to assign `float` to a new column name. +## Miscellaneous notes -Important notes: +### Type conversion and character encoding -- Values that are accepted by a `MixedColumn` but cannot be converted to a numeric value become `NAN`. Examples are non-numeric strings or `None`. -- Trying to assign a non-supported type results in a `TypeError` +For `MixedColumn`: +- The strings 'nan', 'inf', and '-inf' are converted to the corresponding `float` values (`NAN`, `INF`, and `-INF`). +- Byte-string values (`bytes`) are automatically converted to `str` assuming `utf-8` encoding. +- Trying to assign an unsupported type results in a `TypeError`. +- The string 'None' is *not* converted to the type `None`. -```python -import numpy as np -from datamatrix import DataMatrix, FloatColumn -dm = DataMatrix(length=3) -dm.f = float -dm.f = 1, np.nan, np.inf -print(dm) -``` +For `FloatColumn`: -If you insert other values, they are automatically converted if possible. - - -```python -dm.f = '3.3', 'inf', 'nan' -print(dm) -``` +- The strings 'nan', 'inf', and '-inf' are converted to the corresponding `float` values (`NAN`, `INF`, and `-INF`). +- Unsupported types are converted to `NAN`. A warning is shown. -If you insert values that cannot be converted to `float`, they become `nan`. 
+For `IntColumn`: -```python -dm.f = 'x' -print(dm) -``` +- Trying to assign non-`int` values results in a `TypeError`. -
-Note: Careful when working with nan data! -
+### NAN and INF values You have to take special care when working with `nan` data. In general, `nan` is not equal to anything else, not even to itself: `nan != nan`. You can see this behavior when selecting data from a `FloatColumn` with `nan` values in it. ```python -from datamatrix import DataMatrix, FloatColumn +from datamatrix import DataMatrix, FloatColumn, NAN dm = DataMatrix(length=3) dm.f = FloatColumn -dm.f = 0, np.nan, 1 -dm = dm.f == [0, np.nan, 1] +dm.f = 0, NAN, 1 +dm = dm.f == [0, NAN, 1] print(dm) ``` However, for convenience, you can select all `nan` values by comparing a `FloatColumn` to a single `nan` value: ```python -from datamatrix import DataMatrix, FloatColumn dm = DataMatrix(length=3) dm.f = FloatColumn -dm.f = 0, np.nan, 1 +dm.f = 0, NAN, 1 +print(dm.f == NAN) print('NaN values') -print(dm.f == np.nan) print('Non-NaN values') -print(dm.f != np.nan) -``` - - -## Working with continuous data (requires numpy) - -To work with continous (or time-series) data, datamatrix provides the `SeriesColumn` class. In a series column, each cell is itself a series of values. - -A more elaborate tutorial on working with time series can be found here: - -- - -### Mixing two- and three-dimensional data - -With column-based or tabular data, every cell is defined by two coordinates: the column name, and the row number; that is, column-based data is two dimensional. But for many kinds of data, two dimensions is not enough. - -To illustrate this, let's imagine that you want to store the population of cities over a period of three years. You could do this by simply adding a column for every year, `population2008`, `population2009`, `population2010`: - - -```python -from datamatrix import DataMatrix - -# Not very elegant! 
-dm = DataMatrix(length=2) -dm.city = 'Marseille', 'Lyon' -dm.population2010 = 850726, 484344 -dm.population2009 = 850602, 479803 -dm.population2008 = 851420, 474946 -print(dm) -``` - -In this example, this naive approach is still feasible, because there are only three years, so you need only three columns. But imagine that you want to store the year-by-year population over several centuries. You would then end up with hundreds of columns! Not impossible, but not very elegant either. - -It would be much more elegant if you could have a single column for the population, and then give this column a third dimension (a *depth*) so that it can store the population over time. And that's where the `SeriesColumn` comes in. - - -```python -from datamatrix import DataMatrix, SeriesColumn - -# Pretty elegant, right? -dm = DataMatrix(length=2) -dm.city = 'Marseille', 'Lyon' -dm.population = SeriesColumn(depth=3) -dm.population[0] = 850726, 850602, 851420 # Marseille -dm.population[1] = 484344, 479803, 474946 # Lyon -dm.year = SeriesColumn(depth=3) -dm.year = 2010, 2009, 2008 -print(dm) -``` - - -### Basic properties of series - -Series columns have the same properties as regular columns: `mean`, `median`, `std`, `sum`, `min`, and `max`. But where these properties are single values for regular columns, they are one-dimensional numpy arrays for series columns. - - -```python -print(dm.population.mean) -``` - - -### Indexing - -#### Accessing - -The first dimension of a series dimension refers to the row. So to get the population of Marseille (row 0) over time, you can do: - - -```python -print(dm.population[0]) -``` - -The second dimension refers to the depth. 
So to get the population of both Marseille and Lyon in 2009 (the full slice `:`), you can do: - - -```python -print(dm.population[:, 1]) -``` - -#### Assigning - -You can assign to a series columns as you would to a 2D numpy array: - - -```python -dm = DataMatrix(length=2) -dm.s = SeriesColumn(depth=3) -dm.s[0, 0] = 1 -dm.s[1:, 1:] = 2 -print(dm) -``` - -If you want to set all cells at once, you can directly assign a single value: - - -```python -dm.s = 10 -print(dm) -``` - -If you want to set all rows at once, you can directly assign a sequence with a length that is equal to the depth of the series: - - -```python -dm.s = 100, 200, 300 -# Equal to: dm.s[:,:] = 100, 200, 300 -print(dm) -``` - -If you want to set all columns at once, you can directly assing a sequence with a length that is equal to the length of the datamatrix: - - -```python -dm.s = 1000, 2000 -# Equal to: dm.s[:,:] = 1000, 2000 -print(dm) +print(dm.f != NAN) ``` diff --git a/doc-pelican/content/pages/index.md b/doc-pelican/content/pages/index.md index c263be0..386c4eb 100644 --- a/doc-pelican/content/pages/index.md +++ b/doc-pelican/content/pages/index.md @@ -20,12 +20,14 @@ title: DataMatrix ## Features - [An intuitive syntax](%link:basic%) that makes your code easy to read +- Mix tabular data with [time series](%link:series%) and [multidimensional data](%link:multidimensional) in a single data structure +- Support for [large data](%link:largedata) by intelligent (and automatic) offloading of data to disk when memory is running low +- Advanced [memoization (caching)](%link:memoization%) - Requires only the Python standard libraries (but you can use `numpy` to improve performance) -- Great support for [functional programming](%link:functional%), including advanced [memoization (caching)](%link:memoization%) -- Mix [two-dimensional](%link:series%) (series) and one-dimensional data in a single data structure -- Compatible with your favorite tools for numeric computation: +- Compatible with your 
favorite data-science libraries: - `seaborn` and `matplotlib` for [plotting](https://pythontutorials.eu/numerical/plotting) - `scipy`, `statsmodels`, and `pingouin` for [statistics](https://pythontutorials.eu/numerical/statistics) + - `mne` for analysis of electroencephalographic (EEG) and magnetoencephalographic (MEG) data - [Convert](%link:convert%) to and from `pandas.DataFrame` - Looks pretty inside a Jupyter Notebook diff --git a/doc-pelican/content/pages/largedata.md b/doc-pelican/content/pages/largedata.md new file mode 100644 index 0000000..d8f890a --- /dev/null +++ b/doc-pelican/content/pages/largedata.md @@ -0,0 +1 @@ +title: Dynamic loading diff --git a/doc-pelican/sitemap.yaml b/doc-pelican/sitemap.yaml index 79405a0..d9f90a0 100644 --- a/doc-pelican/sitemap.yaml +++ b/doc-pelican/sitemap.yaml @@ -5,6 +5,7 @@ Tutorials: Statistics: https://pythontutorials.eu/numerical/statistics Working with series: https://pythontutorials.eu/numerical/time-series Memoization (caching): memoization + Working with large data: largedata Analyzing eye-movement data: eyelinkparser Modules: datamatrix.convert: convert diff --git a/doc-pelican/themes/cogsci/templates/mega-menu-content.html b/doc-pelican/themes/cogsci/templates/mega-menu-content.html index fc98a0c..1409383 100644 --- a/doc-pelican/themes/cogsci/templates/mega-menu-content.html +++ b/doc-pelican/themes/cogsci/templates/mega-menu-content.html @@ -6,6 +6,7 @@
  • Statistics
  • Working with series
  • Memoization (caching)
  • +
  • Working with large data
  • Analyzing eye-movement data