ENH: unit of measurement / physical quantities #10349

mdk73 · 2015-06-13T16:24:37Z

quantities related
xref #2494
xref #1071

custom meta-data
xref #2485

It would be very convenient if unit support could be integrated into pandas.
Idea: pandas checks for the presence of a unit attribute on columns and, if present, uses it

- with 'print', to show the units, e.g. below the column names
- to calculate 'under the hood' with these units, similar to the example below

For my example I use the module pint and add an attribute 'unit' to columns (and a 'title'...).

Example:

from pandas import DataFrame as DF
from pint import UnitRegistry
units = UnitRegistry()

class ColumnDescription():
    '''Column description with additional attributes.

    The idea is to use this description to be able to add unit and title
    attributes to a column description in one step.

    A list of ColumnDescriptions is then used as argument to DataFrame()
    with unit support.
    '''

    def __init__(self, name, data, title = None, unit = None):
        '''
        Args:
            name (str): Name of the column.
            data (list): List of the column data.
            title (str): Title of the column. Defaults to None.
            unit (str): Unit of the column (see documentation of module pint).
                Defaults to None.

        '''

        self.data = data 
        '''(list): List of the column data.'''

        self.name = name
        '''(str): Name of the column, naming convention similar to python variables.

        Used to access the column with pandas syntax, e.g. df['column'] or df.column.
        '''

        self.title = title 
        '''(str): Title of the column. 

        More human readable than the 'name'. E.g.:
        Title: 'This is a column title'.
        name: 'column_title'.
        '''

        self.unit = unit
        '''Unit of the column (see module pint).

        Intended to be used in calculations involving different columns.
        '''

class DataFrame(DF):
    '''Data Frame with support for ColumnDescriptions (e.g. unit support).

    1. See documentation of pandas.DataFrame.
    2. When used with ColumnDescriptions supports additional column attributes
    like title and unit.
    '''

    def __init__(self, data, title = None):
        '''
        Args:
            data (list or dict):
                1. Dict, as in documentation of DataFrame
                2. List of the column data (of type ColumnDescription).
            title (str): Title of the data frame. Defaults to None.
        '''

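        if isinstance(data, list) and data and isinstance(data[0], ColumnDescription):
            # Build the frame from the raw column data first.
            d = {}

            for column in data:
                d[column.name] = column.data

            super(DataFrame, self).__init__(d)

            # Then attach the extra attributes to each column.
            # Note: this relies on self[name] returning the same cached
            # Series object on later access, which newer pandas versions
            # do not guarantee.
            for column in data:
                self[column.name].title = column.title
                self[column.name].unit = column.unit

            self.title = title

        else:
            # Any other input (dict, plain list, ...) goes straight to pandas.
            super(DataFrame, self).__init__(data)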

if __name__ == '__main__':

    data = [ ColumnDescription('length',
                               [1, 10],
                               title = 'Length in meter',
                               unit = 'meter'),
             ColumnDescription('time',
                               [10, 1],
                               title = 'Time in s',
                               unit = 's') ]

    d = {'length':[1, 10],
         'time': [10, 1]}
    df = DataFrame(d)
    print('standard df')
    print(df)

    df = DataFrame(data)
    print('\n' + 'new df')
    print(df)

    #### use of dimensions ####
    # pint works with numpy arrays.
    # df[name] does not currently work with pint directly; I think it
    # would be a real enhancement if it did.
    test = df[['length']].values * units(df['length'].unit) / \
           (df[['time']].values * units(df['time'].unit))
    print('\n' + 'unit test')
    print(test)
    print('\n' + 'magnitude')
    print(test.magnitude)
    print('\n' + 'dimensionality')
    print(test.dimensionality)

jreback · 2015-06-13T16:40:28Z

see #2485

The tricky thing with this is how to actually propagate this meta-data. I think this could work if it was attached to the index itself (as an optional additional array of meta-data). If this were achieved, then it should be straightforward to have operations work on it (though to be honest that is a bit out of scope for main pandas; perhaps a sub-class / other library would be better).

mdk73 · 2015-06-13T18:24:07Z

Thanks for your comment.
I am not sure what you mean by attaching metadata to the index, and why this is important.

Maybe the proposed way of adding a 'unit' attribute to the columns is not the best way, but hopefully units are significantly less difficult than arbitrary metadata.
Personally I do not think that a 'unit' attribute needs to support all kinds of data; 'str' could be enough.

I think pint (there are other modules, but I do not know them, sorry) is capable of taking care of the units itself (including raising errors when they are misused), so this would not be a pandas issue.

Here is a small snippet that demonstrates how a new unit could be created when two columns are multiplied:

# `units` is the pint UnitRegistry created in the first example
# prototype column1, omitting the name and index
value1 = [1]
unit1 = 'meter'
# column1: representation of value and unit
column1 = value1 * units(unit1)
# column2: representation of value and unit
column2 = [2] * units('meter')
# creating a new column: column1 * column2
column12 = column1 * column2
print('column12: {}'.format(column12))
# value could go to a new column of a DataFrame
print('value of column12: {}'.format(column12.magnitude))
# str(column12.units) could serve as the unit-attribute for the new column
print('unit of column12: {}'.format(column12.units))

output:

column12: [2] meter ** 2
value of column12: [2]
unit of column12: meter ** 2

jreback · 2015-06-14T12:35:26Z

@mdk73 as I said, this could be done, but there are lots and lots of test cases and behaviors that are needed, e.g.

x = DataFrame with some quantities
y = DataFrame with no quantities
z = DataFrame with different quantities

so what are

x * x
x * y
x * z

These may seem completely obvious, and for the most part they are, but you have to propagate things very carefully. As I said, this is a natural attachment for the Index: a new property that can be examined (kind of how .name works).

The way to investigate is to add the property and write a suite of tests that ensure correct propagation on the Index object, e.g. for things like .reindex, .union, .intersection, __init__, etc.
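
As an illustration only, the propagation question above can be mocked up without pandas at all. The sketch below assumes a hypothetical `UnitColumn` class (not a pandas or pint API) that stores a unit as a mapping of base-unit names to exponents and treats a missing unit as dimensionless, so that x * x, x * y and x * z each have a defined result.

```python
class UnitColumn:
    """Hypothetical column: a list of values plus a unit as {base: exponent}."""

    def __init__(self, values, unit=None):
        self.values = list(values)
        # A column without a unit is treated as dimensionless.
        self.unit = dict(unit or {})

    def __mul__(self, other):
        if len(self.values) != len(other.values):
            raise ValueError("columns must have the same length")
        values = [a * b for a, b in zip(self.values, other.values)]
        # Multiplying quantities adds the exponents of each base unit.
        unit = dict(self.unit)
        for base, exponent in other.unit.items():
            unit[base] = unit.get(base, 0) + exponent
        # Drop base units that cancelled out.
        unit = {base: exp for base, exp in unit.items() if exp != 0}
        return UnitColumn(values, unit)


x = UnitColumn([1, 10], {"meter": 1})    # some quantities
y = UnitColumn([2, 3])                   # no quantities
z = UnitColumn([10, 1], {"second": -1})  # different quantities

print((x * x).unit)  # {'meter': 2}
print((x * y).unit)  # {'meter': 1}
print((x * z).unit)  # {'meter': 1, 'second': -1}
```

Addition, comparison, reindexing and alignment would each need their own rule, which is where the bulk of the test suite mentioned above would go.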

shoyer · 2015-06-15T17:19:06Z

Unit-aware arrays would indeed be extremely valuable for some use cases, but it's difficult to see how they could be integrated into the core of pandas in a way that is agnostic about the particular implementation. We definitely do not want to duplicate the work of pint or other similar packages in pandas, nor even pick a preferred units library. Instead, we want to define a useful set of extension points, e.g., similar to __numpy_ufunc__. So, this won't be an easy change, and possibly is something best reserved for thinking about in the design of "pandas 2.0".

blalterman · 2015-11-12T05:00:00Z

What about having a user define a dictionary containing any units she or he uses via pd.set_option? Whenever pandas does a calculation, it checks all objects in the calculation. It then takes all units and combines them just as the function would (e.g. pass all units through the function?). If an object has no units, take its units as 1. At the end of the computation, you can then specify a new unit for the result and pandas will divide out the units accordingly. Alternatively, whenever pandas does a calculation, it can just multiply any values (perhaps excluding a user-defined flag value) by their conversion factors and then run the calculation, converting out at the end. This is how I run a lot of my calculations. Why not do something like the lines that use to_SI below?

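# Assumption: the underscore-prefixed helpers are NumPy functions imported
# under private aliases. `_pccv` is a separate constants module, not shown.
from numpy import argmin as _argmin, divide as _divide
from numpy import float64 as _float64, sqrt as _sqrt

def traditional(b_mag, rho_vals, fill=-9999):
    """Calculate the Alfven speed."""

    # I store all of my physical constants in `_pccv`.
    mu0 = _pccv.misc['mu0'] #.physical_constants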

    # Have pandas do this to every value before a computation.
    b_to_SI   = _pccv.to_SI['b']
    rho_to_SI = _pccv.to_SI['rho']
    v_to_SI   = _pccv.to_SI['v']    

    b = b_mag.copy() * b_to_SI
    rho = rho_vals.copy() * rho_to_SI

    if rho.ndim > 1: rho = rho.sum(axis=_argmin(rho.shape))

    Ca_denominator = _sqrt(mu0 * rho, dtype=_float64)
    Ca_calc = _divide(b, Ca_denominator, dtype=_float64)

    # At the end of your computation, specify the output unit and the 
    # following line would be run automatically.
    Ca_kms = Ca_calc / v_to_SI    

    return Ca_kms
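
The convert-in, compute, convert-out pattern described above can be sketched in a self-contained way. The conversion table `TO_SI` and the helper `alfven_speed_kms` below are hypothetical stand-ins for `_pccv.to_SI` and `traditional`; the choice of nT, kg/m^3 and km/s as working units is an assumption made for illustration.

```python
import math

MU0 = 4e-7 * math.pi  # vacuum permeability in SI units (H/m)

# Hypothetical factors that convert the working units to SI.
TO_SI = {
    "b": 1e-9,   # nT -> T
    "rho": 1.0,  # kg/m^3 is already SI in this sketch
    "v": 1e3,    # km/s -> m/s
}


def alfven_speed_kms(b_nt, rho_kg_m3):
    """Return the Alfven speed in km/s for B in nT and mass density in kg/m^3."""
    b = b_nt * TO_SI["b"]            # convert the inputs to SI
    rho = rho_kg_m3 * TO_SI["rho"]
    v_si = b / math.sqrt(MU0 * rho)  # compute entirely in SI
    return v_si / TO_SI["v"]         # convert the result out of SI


print(alfven_speed_kms(5.0, 1.0e-20))
```

This keeps the arithmetic simple, but nothing checks that the inputs really are in the assumed units, which is the gap a units library is meant to close.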

shoyer · 2015-11-12T06:30:58Z

There are several existing approaches to units in Python -- notably pint and astropy.units. We should definitely be careful before reinventing the wheel here.

den-run-ai · 2015-11-17T15:45:54Z

+1 on units, especially for plots with multiple axes:

http://matplotlib.org/examples/axes_grid/demo_parasite_axes2.html

mikofski · 2015-12-05T19:26:24Z

Similar to #2494

VelizarVESSELINOV · 2015-12-29T19:26:17Z

👍 Unit awareness of the column as a string is a good enough first step: it makes it possible to store the unit associated with the column. It would be nice if read_csv could capture the unit line and store it. Other metadata worth storing would be a description for each column, or even a history of the operations done on the column.

I think the community will not be able to align on unit naming conventions, so this should be managed outside pandas; the conversion factors can also be managed outside pandas.

Unit name challenge: there are a lot of unit aliases and, in some cases, conflicts.
There are a lot of unit spellings used for different purposes: is B bytes, bits or bels? https://en.wikipedia.org/wiki/Decibel#bel
Or is S seconds or siemens? https://en.wikipedia.org/wiki/Siemens_(unit)

In my domain, the UOM standard from Energistics (http://www.energistics.org/asset-data-management/unit-of-measure-standard) covers most of my needs, but I agree that for people who manage digital storage units or date-time units this may be out of scope.

jreback · 2015-12-29T19:32:25Z

I think a very straightforward way of doing this (though I will get a bit of flak from @shoyer, @wesm, @njsmith @teoliphant for not doing this in C :<) is to simply define an 'extension pandas dtype' along the lines of DatetimeTZDtype.

E.g. you would have a float64 like dtype with a unit parameter (which could be a value from one of the units libraries, so pandas is basically agnostic).

Some modification to the ops routines would then be needed to handle the interactions.
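
For readers unfamiliar with DatetimeTZDtype, the short sketch below shows the existing pattern the proposal would imitate: a dtype object that carries a parameter (here the time zone) alongside the underlying storage. A unit-carrying dtype is deliberately not shown, since its design is exactly what is under discussion.

```python
import pandas as pd

# A tz-aware Series uses a parametrized extension dtype.
s = pd.Series(pd.date_range("2015-06-13", periods=2, tz="UTC"))

dtype = s.dtype
print(type(dtype).__name__)  # DatetimeTZDtype
print(dtype.tz)              # the parameter stored on the dtype

# The parameter is part of the dtype's identity: the same instants in a
# different time zone have a different dtype.
other = s.dt.tz_convert("Europe/Berlin")
print(other.dtype == dtype)  # False
```

By analogy, a float64-like dtype with a unit parameter would make two columns with different units compare as different dtypes, which gives the ops routines something to dispatch on.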

tomchor · 2017-08-01T16:18:15Z

Just to make things more explicit, this same discussion is happening in a pint issue (which is actually referenced here).

I think there should be an exchange of information from both sides to make a robust solution and to avoid "reinventing the wheel", but IMHO the actual implementation should come from pint, with pandas only providing a good base for it (as some comments here have already said).

Bernhard10 · 2017-08-02T10:49:12Z

I tried to follow @jreback's idea of adding an additional dtype. My pull request is not ready to merge, but it is an outline of how it could work.

@tomchor I started to write this pull request yesterday, before you commented that you would prefer to implement this in pint instead, that's why I post it here.

mikofski · 2017-08-02T14:37:47Z

@Bernhard10 any reason you chose not to use Pint or Quantities or another established, mature, tested, robust, popular units package?

tomchor · 2017-08-02T15:23:46Z

@Bernhard10 I think the additional dtype can work. I'm happy someone's working on it.

About implementing it in Pint, unfortunately I'm not the man to do it (at least right now). I still have a lot to learn about Pint and I have some other urgent priorities to take care of.

@mikofski I guess Pint looks like a better candidate (at least for me) because it seems more intuitive and simpler. But I guess there would be no strong argument against using Quantities. I think the point of providing a general basis for the implementation in Pandas (such as the dtype idea) is that it can then be implemented by whatever units package independently. So people using Pint could easily develop support for Pandas, as could people using Quantities.

mikofski · 2017-08-02T15:37:19Z

@tomchor, I do think a backend approach that allows the units package to be swappable is the best approach. I also agree that Pint is easier and more popular IMHO than Quantities right now, although before Pint, Quantities was definitely the most popular, and it is still very good.

@Bernhard10, if you are implementing the dtype approach, maybe look at Quantities first and talk to their maintainers, because Quantities also uses dtype, so this may save you a lot of time and testing. Also please consider making your pandas units abstract, defaulting to your version but allowing any other suitable backend to be used as long as it implements the abstract API.

Bernhard10 · 2017-08-02T18:20:25Z

@mikofski I am currently testing with pint, but the idea of the dtype approach would be to make the units package swappable.

@andrewgsavage

684: Add pandas support r=hgrecco a=znicholls

This pull request adds pandas support to pint (hence is related to #645, #401 and pandas-dev/pandas#10349).

An example can be seen in `example-notebooks/basic-example.ipynb`.

It's a little bit hackish, feedback would be greatly appreciated by me and @andrewgsavage. One obvious example is that we have to run all the interface tests with `pytest` to fit with `pandas` test suite, which introduces a dependency for the CI and currently gives us this awkward testing setup (see the alterations we had to make to `testsuite`). This also means that our code coverage tests are fiddly too.

If you'd like us to squash the commits, that can be done.

If pint has a linter, it would be good to run that over this pull request too as we're a little bit all over the place re style.

Things to discuss:

- [x] general feedback and changes
- [x] test setup, especially need for pytest for pandas tests and hackish way to get around automatic discovery
- [x] squashing/rebasing
- [x] linting/other code style (related to #664 and #628: we're happy with whatever, I've found using an automatic linter e.g. black and/or flake8 has made things much simpler in other projects)
- [x] including notebooks in the repo (if we want to, I'm happy to put them under CI so we can make sure they run)
- [x] setting up the docs correctly

Co-authored-by: Zebedee Nicholls <[email protected]>
Co-authored-by: andrewgsavage <[email protected]>

UniqASL · 2020-04-16T07:53:17Z

I think the issue of dealing with units in pandas should really be at the top of the agenda. Most scientific calculations have to deal with units. I am currently using pint and its module pint-pandas, which indeed offers very nice possibilities. It is very practical to have an automated way to deal with unit conversions when making calculations with large dataframes. It also brings more safety to the calculations, as it avoids "multiplying apples and oranges".
In its current state, however, pint-pandas has a lot of issues and is quite a pain to deal with. Wouldn't it make sense to integrate all the nice work already done there directly into pandas, so that issues can be fixed more easily?

pandas can deal very well with time series and I am very thankful for that because it is a very useful feature. Dealing with units should be in my opinion another core feature of pandas (especially as part of NumFOCUS).

Thanks in advance for your feedback.

5igno · 2021-04-13T18:21:29Z

Hi there, it would indeed be great to be able to store the measurement unit for the columns of a DataFrame, or as some kind of metadata. Given the breadth of use of pandas, of which only a small subset cares deeply about units, I am of the opinion that unit conversion could be handled externally by packages like pint or astropy. However, one thing that would be needed for both approaches is that values are stored with a unit in the dataframe. As suggested before, this can also be a simple string metadata field.

Is there a suggested way to simply store measurement units in a Pandas DataFrame column?
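
One option that exists in pandas itself is the `DataFrame.attrs` dictionary, which pandas documents as experimental. The sketch below stores unit strings there under a `units` key; that key name and the layout of the mapping are assumptions for illustration, and pandas attaches no meaning to them.

```python
import pandas as pd

df = pd.DataFrame({"length": [1, 10], "time": [10, 1]})

# Store one unit string per column; pandas itself does not interpret these.
df.attrs["units"] = {"length": "meter", "time": "s"}

# Show the units next to the column names, e.g. for a table header.
header = [f"{name} [{df.attrs['units'][name]}]" for name in df.columns]
print(header)  # ['length [meter]', 'time [s]']
```

Whether `attrs` survives a given operation depends on the operation and the pandas version, so it is probably best treated as a label rather than something calculations can rely on.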

wkerzendorf · 2021-07-22T23:19:34Z

I think I have an idea of how marrying astropy quantities with pandas could work, using the extension part of pandas.

Extending a pandas DataFrame:

import pandas as pd

class QDataFrame(pd.DataFrame):
    # normal properties
    _metadata = ["units"]

    def __init__(self, *args, units=None, **kwargs):
        super(QDataFrame, self).__init__(*args, **kwargs)
        # Avoid sharing one mutable default list between instances.
        self.units = [] if units is None else units

    @property
    def _constructor(self):
        return QDataFrame

Then an accessor like this for the series.

@pd.api.extensions.register_series_accessor("quantity")
class QuantityAccessor():
    def __init__(self, pandas_obj):
        self._obj = pandas_obj
    
    def to(self, unit):
        return (self._obj.values * self._obj.unit).to(unit)

I'm now struggling with how to propagate the list of units to the single column via _constructor_sliced, as I do not seem to know which column it is accessing. Any ideas @jreback ?

znicholls · 2021-08-31T09:20:58Z

As an update on efforts with pint-pandas, we have hit the wall with #35131. The crux is how to handle this scalar/array ambiguity without potentially destroying performance elsewhere.

The first attempt at a fix was #35127, that was superseded by #39790 but that went stale.

The latest person to raise this (as far as I'm aware) is #43196.

It seems this is still an issue, but not one which is high enough priority to get over the tricky hurdles required to allow things to move forward.

demisjohn · 2022-01-15T03:24:33Z

Is there a way to make an extra "Dimension" of the Database (not just Row & Column)? So you could have "Unit" strings (or any other data), associated with each Column, in addition to the Column "name".
Or, is there a "relational DB" method that could accomplish the same, with pandas?
Whichever is syntactically and cognitively simpler would be best for most scientists.

I am thinking of a solution that may already exist within Pandas - and a documentation suggestion/example may show users a simple implementation.
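
One existing pandas feature that comes close to this is a column MultiIndex, where each column label is a tuple. The sketch below uses a second level named `unit` to hold unit strings; the level names are assumptions, and pandas treats the unit purely as part of the label and does no unit arithmetic.

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("length", "meter"), ("time", "s")],
    names=["name", "unit"],
)
df = pd.DataFrame([[1, 10], [10, 1]], columns=columns)

# The units are printed as a second header row under the column names.
print(df)

# Look up the unit of a column by its name.
units = dict(df.columns.to_list())
print(units["length"])  # meter

# Drop the unit level to get plain columns back for calculations.
plain = df.droplevel("unit", axis=1)
print((plain["length"] / plain["time"]).tolist())  # [0.1, 10.0]
```

The cost of this layout is that every column access needs either the full tuple or a `droplevel` step, and the unit of a computed column has to be written by hand.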

mroeschke · 2023-03-29T04:38:13Z

Looks like pint-pandas (https://github.com/hgrecco/pint-pandas) is a 3rd-party library that provides this support and is mentioned in our ecosystem docs, so closing.

jreback added the Ideas, Indexing, API Design, Difficulty Advanced and Needs Discussion labels Jun 13, 2015

jreback added this to the Someday milestone Jun 14, 2015

jreback mentioned this issue Nov 30, 2015

Aliases for column names #11723

Open

This was referenced Dec 5, 2015

DataFrame does not work well with quantities #1071

Closed

better quantities support #2494

Closed

markbandstra mentioned this issue Jan 31, 2017

Unit conventions lbl-anp/becquerel#12

Closed

jreback mentioned this issue Mar 16, 2017

request for implementation of astropy.units #15698

Closed

Bernhard10 mentioned this issue Aug 1, 2017

Consider support for pandas hgrecco/pint#401

Closed

Bernhard10 mentioned this issue Aug 2, 2017

WIP: prototype for unit support #10349 #17153

Closed

4 tasks

This was referenced Aug 29, 2018

Add pandas support hgrecco/pint#684

Merged

Update ecosystem.rst to include Pint #22582

Closed

This was referenced Sep 7, 2018

] ESIPFed/NUMfocusFallDev#1

Closed

Integrate unit support into pandas through integration with pint or other units packages ESIPFed/NUMfocusFallDev#2

Open

watsonjj mentioned this issue Nov 29, 2018

Decide on standardized I/O formats cta-observatory/ctapipe#843

Closed

jbrockmendel removed Difficulty Advanced labels Oct 21, 2019

jbrockmendel removed the Indexing label Feb 11, 2020

a-golda mentioned this issue Mar 1, 2020

Subclassing pd.DataFrame with the addition of units tardis-sn/tardis#1069

Closed

6 tasks

mroeschke added Enhancement and removed Ideas and Needs Discussion labels Apr 4, 2020

jbrockmendel mentioned this issue Apr 13, 2020

ENH: Canonic SI frequency #33524

Closed

znicholls mentioned this issue Jul 5, 2020

ENH: Identify numpy-like zero-dimensional arrays as non-iterable #35131

Closed

mroeschke mentioned this issue Feb 10, 2021

ENH: handling units. #39713

Closed

cmeyer mentioned this issue Feb 24, 2021

Add table items and functionality and UI nion-software/nionswift#433

Open

jreback mentioned this issue May 24, 2021

ENH: Add a unit conversion method for Pandas Series #41641

Closed

gerald-scharitzer mentioned this issue Nov 23, 2021

Decorate model functions to indicate their units ProjectDrawdown/solutions#361

Closed

jbrockmendel removed the API Design label Dec 24, 2021

wkerzendorf mentioned this issue Jan 24, 2022

QDataFrame #45602

Closed

4 tasks

Ivorforce mentioned this issue Jun 3, 2022

Add to_dataframe() to Record. MIT-LCP/wfdb-python#380

Merged

mroeschke removed this from the Someday milestone Oct 13, 2022

mroeschke closed this as completed Mar 29, 2023
