ENH: unit of measurement / physical quantities #10349

Closed
mdk73 opened this issue Jun 13, 2015 · 28 comments
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays.

Comments

@mdk73

mdk73 commented Jun 13, 2015

quantities related
xref #2494
xref #1071

custom meta-data
xref #2485

It would be very convenient if unit support could be integrated into pandas.
Idea: pandas checks for the presence of a unit-attribute of columns and - if present - uses it

  • with 'print' to show the units e.g. below the column names
  • to calculate 'under the hood' with these units similar to the example below

For my example I use the module pint and add an attribute 'unit' to columns (and a 'title'...).

Example:

from pandas import DataFrame as DF
from pint import UnitRegistry
units = UnitRegistry()

class ColumnDescription():
    '''Column description with additional attributes.

    The idea is to use this description to be able to add unit and title
    attributes to a column description in one step.

    A list of ColumnDescriptions is then used as the argument to DataFrame()
    with unit support.
    '''

    def __init__(self, name, data, title = None, unit = None):
        '''
        Args:
            name (str): Name of the column.
            data (list): List of the column data.
            title (str): Title of the column. Defaults to None.
            unit (str): Unit of the column (see documentation of module pint).
                Defaults to None.

        '''

        self.data = data 
        '''(list): List of the column data.'''

        self.name = name
        '''(str): Name of the column, naming convention similar to python variables.

        Used to access the column with pandas syntax, e.g. df['column'] or df.column.
        '''

        self.title = title 
        '''(str): Title of the column. 

        More human readable than the 'name'. E.g.:
        Title: 'This is a column title'.
        name: 'column_title'.
        '''

        self.unit = unit
        '''Unit of the column (see module pint).

        Intended to be used in calculations involving different columns.
        '''

class DataFrame(DF):
    '''Data Frame with support for ColumnDescriptions (e.g. unit support).

    1. See documentation of pandas.DataFrame.
    2. When used with ColumnDescriptions supports additional column attributes
    like title and unit.
    '''

    def __init__(self, data, title = None):
        '''
        Args:
            data (list or dict):
                1. Dict, as in documentation of DataFrame
                2. List of the column data (of type ColumnDescription).
            title (str): Title of the data frame. Defaults to None.
        '''

        # Only take the ColumnDescription path for a non-empty list of them;
        # every other input (dict, plain list, ...) goes to pandas unchanged.
        if isinstance(data, list) and data and isinstance(data[0], ColumnDescription):
            d = {}

            for column in data:
                d[column.name] = column.data

            super(DataFrame, self).__init__(d)

            for column in data:
                self[column.name].title = column.title
                self[column.name].unit = column.unit

            self.title = title

        else:
            super(DataFrame, self).__init__(data)

if __name__ == '__main__':

    data = [ ColumnDescription('length',
                               [1, 10],
                               title = 'Length in meter',
                               unit = 'meter'),
             ColumnDescription('time',
                               [10, 1],
                               title = 'Time in s',
                               unit = 's') ]

    d = {'length':[1, 10],
         'time': [10, 1]}
    df = DataFrame(d)
    print('standard df')
    print(df)

    df = DataFrame(data)
    print('\n' + 'new df')
    print(df)

    ####use of dimensions####
    # pint works with numpy arrays
    # df[name] does not currently work with pint directly; it would be a
    # real enhancement if it did.
    test = df[['length']].values * units(df['length'].unit) / \
           (df[['time']].values * units(df['time'].unit))
    print('\n' + 'unit test')
    print(test)
    print('\n' + 'magnitude')
    print(test.magnitude)
    print('\n' + 'dimensionality')
    print(test.dimensionality)
@jreback
Contributor

jreback commented Jun 13, 2015

see #2485

The tricky thing with this is how to actually propagate this metadata. I think this could work if it were attached to the index itself (as an optional additional array of metadata). If this were achieved, it should be straightforward to have operations work on it (though to be honest that is a bit out of scope for main pandas; perhaps a subclass or another library would be better).

@jreback jreback added Ideas Long-Term Enhancement Discussions Indexing Related to indexing on series/frames, not to indexes themselves API Design Difficulty Advanced Needs Discussion Requires discussion from core team before further action labels Jun 13, 2015
@mdk73
Author

mdk73 commented Jun 13, 2015

Thanks for your comment.
I am not sure what you mean by attaching metadata to the index, and why this is important.

Maybe adding an attribute 'unit' to the columns is not the best way, but hopefully units are significantly less difficult than arbitrary metadata.
Personally I do not think that an attribute 'unit' needs to support all kinds of data; 'str' could be enough.

I think pint (there are other modules, but I do not know them, sorry) is capable of taking care of the units itself (including raising errors when they are misused), so this would not be a pandas issue.

Here is a small snippet that demonstrates how a new unit could be created when two columns are multiplied:

#prototype column1, omitting the name and index
value1 = [1]
unit1 = 'meter'
# column1: representation of value and unit
column1 = value1 * units(unit1)
# column2: representation of value and unit
column2 = [2] * units('meter')
# creating a new column: column1 * column2
column12 = column1 * column2
print('column12: {}'.format(column12))
# value could go to a new column of a DataFrame
print('value of column12: {}'.format(column12.magnitude))
# str(column12.units) could serve as the unit-attribute for the new column
print('unit of column12: {}'.format(column12.units))

output:

column12: [2] meter ** 2
value of column12: [2]
unit of column12: meter ** 2

@jreback
Contributor

jreback commented Jun 14, 2015

@mdk73 as I said, this could be done, but a great many test cases and behaviors need to be covered, e.g.

x = DataFrame with some quantities
y = DataFrame with no quantities
z = DataFrame with different quantities

so what are

x * x
x * y
x * z

These may seem completely obvious, and for the most part they are, but you have to propagate things very carefully. As I said, this is a natural attachment for the Index: a new property that can be examined (much like .name works).

The way to investigate is to add the property and write a suite of tests that ensure correct propagation on the Index object, e.g. through .reindex, .union, .intersection, __init__, etc.
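
To make the propagation question concrete, here is a minimal pure-Python sketch of the three cases above. The `UnitColumn` class and its exponent-dict representation of units are hypothetical stand-ins, not pandas or pint API, and treating a column without units as dimensionless is only one possible convention.

```python
class UnitColumn:
    """Hypothetical column: a list of values plus a unit stored as {base: exponent}."""

    def __init__(self, values, unit=None):
        self.values = list(values)
        # No unit means dimensionless here (an assumed convention).
        self.unit = dict(unit or {})

    def __mul__(self, other):
        # Multiply values elementwise and add the unit exponents.
        values = [a * b for a, b in zip(self.values, other.values)]
        unit = dict(self.unit)
        for base, exponent in other.unit.items():
            unit[base] = unit.get(base, 0) + exponent
            if unit[base] == 0:
                del unit[base]
        return UnitColumn(values, unit)

    def __add__(self, other):
        # Addition only makes sense when both units match exactly.
        if self.unit != other.unit:
            raise ValueError("incompatible units: %r vs %r" % (self.unit, other.unit))
        values = [a + b for a, b in zip(self.values, other.values)]
        return UnitColumn(values, self.unit)


x = UnitColumn([1, 10], {"meter": 1})      # some quantities
y = UnitColumn([2, 3])                     # no quantities
z = UnitColumn([10, 1], {"second": -1})    # different quantities

xx = x * x   # unit: meter ** 2
xy = x * y   # unit: meter
xz = x * z   # unit: meter / second
```

Even in this toy version, every operator needs its own rule (multiplication combines units, addition must reject mismatches), which hints at how large the real test matrix would be.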

@jreback jreback added this to the Someday milestone Jun 14, 2015
@shoyer
Member

shoyer commented Jun 15, 2015

Unit-aware arrays are indeed extremely valuable for some use cases, but it's difficult to see how they could be integrated into the core of pandas in a way that is agnostic about the particular implementation. We definitely do not want to duplicate the work of pint or other similar packages in pandas, nor even pick a preferred units library. Instead, we want to define a useful set of extension points, e.g., similar to __numpy_ufunc__. So, this won't be an easy change, and possibly is something best reserved for thinking about in the design of "pandas 2.0".

@blalterman

What about having the user define a dictionary of the units she or he uses via pd.set_option? Whenever pandas does a calculation, it would check all objects involved, collect their units, and combine them the same way the function combines the values (e.g. by passing the units through the function). An object with no units would be treated as having units of 1. At the end of the computation, you could specify a new unit for the result and pandas would divide out the units accordingly. Alternatively, pandas could multiply every value by its conversion factor to SI before running the calculation (perhaps skipping a user-defined flag value) and convert back at the end. This is how I run a lot of my calculations. Why not do something like the to_SI lines below?

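# Assumed imports (not shown in the original snippet); `_pccv` is the
# author's own module of physical constants and conversion factors.
from numpy import argmin as _argmin, divide as _divide, float64 as _float64, sqrt as _sqrt

def traditional(b_mag, rho_vals, fill=-9999):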
    """Calculate the Alfven speed."""

    # I store all of my physical constants in `_pccv`.
    mu0 = _pccv.misc['mu0'] #.physical_constants

    # Have pandas do this to every value before a computation.
    b_to_SI   = _pccv.to_SI['b']
    rho_to_SI = _pccv.to_SI['rho']
    v_to_SI   = _pccv.to_SI['v']    

    b = b_mag.copy() * b_to_SI
    rho = rho_vals.copy() * rho_to_SI

    if rho.ndim > 1: rho = rho.sum(axis=_argmin(rho.shape))

    Ca_denominator = _sqrt(mu0 * rho, dtype=_float64)
    Ca_calc = _divide(b, Ca_denominator, dtype=_float64)

    # At the end of your computation, specify the output unit and the 
    # following line would be run automatically.
    Ca_kms = Ca_calc / v_to_SI    

    return Ca_kms
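
The convert-in, compute, convert-out pattern can be reduced to a self-contained sketch using only the standard library. The `TO_SI` table and the input units chosen here (nT, kg/m^3, km/s) are assumptions for illustration; they stand in for whatever `_pccv.to_SI` actually contains.

```python
import math

# Hypothetical conversion factors to SI, playing the role of `_pccv.to_SI`.
TO_SI = {
    "b": 1e-9,    # nT     -> T
    "rho": 1.0,   # kg/m^3 is already SI
    "v": 1e3,     # km/s   -> m/s
}
MU0 = 4e-7 * math.pi  # vacuum permeability in SI units (approximate)


def alfven_speed_kms(b_nT, rho_kg_m3):
    """Convert inputs to SI, compute B / sqrt(mu0 * rho), convert to km/s."""
    b = b_nT * TO_SI["b"]
    rho = rho_kg_m3 * TO_SI["rho"]
    v_si = b / math.sqrt(MU0 * rho)
    return v_si / TO_SI["v"]


# 5 nT field and 5 protons per cm^3 (proton mass ~1.6726e-27 kg).
rho = 5e6 * 1.6726e-27
speed = alfven_speed_kms(5.0, rho)
```

Note that nothing here checks that the caller really passed nT or kg/m^3; the factors are applied blindly, which is the main difference from a units library.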

@shoyer
Member

shoyer commented Nov 12, 2015

There are several existing approaches to units in Python -- notably pint and astropy.units. We should definitely be careful before reinventing the wheel here.

@den-run-ai

+1 on units, especially for plots with multiple axis:

http://matplotlib.org/examples/axes_grid/demo_parasite_axes2.html

@mikofski

mikofski commented Dec 5, 2015

Similar to #2494

@VelizarVESSELINOV

👍 Making each column unit-aware with a plain string is a good enough first step for storing the unit associated with the column. It would be nice if read_csv could capture the unit line and store it. Other useful metadata to store would be a description of each column, or even a history of the operations applied to it.

I do not think the community will be able to align on unit naming conventions, so this should be managed outside pandas; the conversion factors can also be managed outside pandas.

Unit name challenge: there are a lot of unit aliases and in some case conflicts.
There are many ambiguous unit spellings: is B bytes, bits, or bels (https://en.wikipedia.org/wiki/Decibel#bel)?
Is S seconds or siemens (https://en.wikipedia.org/wiki/Siemens_(unit))?

In my domain, UOM from Energistics (http://www.energistics.org/asset-data-management/unit-of-measure-standard) is covering most of my needs, but I agree for people that manage more digital storage units or date time units maybe this is out of scope.

@jreback
Contributor

jreback commented Dec 29, 2015

I think a very straightforward way of doing this (though it will get a bit of flak from @shoyer, @wesm, @njsmith @teoliphant for not doing this in C :<) is to simply define an 'extension pandas dtype' along the lines of DatetimeTZDtype.

E.g. you would have a float64-like dtype with a unit parameter (which could be a value from one of the units libraries, so pandas is basically agnostic).

This would then need some modifications to the ops routines to handle the interactions.
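
For reference, this is roughly what the existing parametrized dtype looks like from the user side. The sketch only shows `DatetimeTZDtype` carrying its `tz` parameter along with the data; a unit-carrying float dtype as proposed above is hypothetical and is not implemented here.

```python
import pandas as pd

# A timezone-aware Series: the dtype itself carries the "tz" parameter.
s = pd.Series(pd.to_datetime(["2015-06-13", "2015-06-14"]).tz_localize("UTC"))

dtype = s.dtype
is_parametrized = isinstance(dtype, pd.DatetimeTZDtype)
tz_name = str(dtype.tz)

# The parameter survives an operation that keeps the data timezone-aware.
shifted = s + pd.Timedelta(hours=1)
shifted_tz_name = str(shifted.dtype.tz)
```

By analogy, a units dtype would hold a unit where this one holds a timezone, and arithmetic would have to decide what the result's parameter is.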

@tomchor

tomchor commented Aug 1, 2017

Just to make things more explicit, this same discussion is happening in a pint issue (which is actually referenced here).

I think there should be an exchange of information between both sides to produce a robust solution and to avoid "reinventing the wheel", but IMHO the actual implementation should come from pint, with pandas only providing a good base for it (as some comments here have already said).

@Bernhard10
Contributor

I tried to follow @jreback's idea of adding an additional dtype. My pull request is not ready to merge, but it outlines how this could work.

@tomchor I started writing this pull request yesterday, before you commented that you would prefer to implement this in pint instead; that's why I am posting it here.

@mikofski

mikofski commented Aug 2, 2017

@Bernhard10 any reason you chose not to use Pint or Quantities or another established, mature, tested, robust, popular units package?

@tomchor

tomchor commented Aug 2, 2017

@Bernhard10 I think the additional dtype can work. I'm happy someone's working on it.

About implementing it in Pint, unfortunately I'm not the man to do it (at least right now). I still have a lot to learn about Pint and I have some other urgent priorities to take care of.

@mikofski I guess Pint looks like a better candidate (at least to me) because it seems more intuitive and simpler. But I guess there would be no strong argument against using Quantities. I think the point of providing a general basis for the implementation in pandas (such as the dtype idea) is that it can then be implemented by whatever units package independently. So people using Pint could easily develop support for pandas, as could people using Quantities.

@mikofski

mikofski commented Aug 2, 2017

@tomchor, I do think a backend approach that allows the units package to be swappable is the best approach. I also agree that Pint is easier and more popular IMHO than Quantities right now, although before Pint, Quantities was definitely the most popular, and it is still very good.

@Bernhard10, if you are implementing dtype approach, maybe look at Quantities first and talk to their maintainers because Quantities also uses dtype so this may save you a lot of time and testing. Also please consider making your pandas units abstract, defaulting to your version but allowing any other suitable backend to be used as long as it implements the abstract API

@Bernhard10
Contributor

@mikofski I am currently testing with pint, but the idea of the dtype approach would be to make the units package swappable.

bors bot added a commit to hgrecco/pint that referenced this issue Sep 6, 2018
684: Add pandas support r=hgrecco a=znicholls

This pull request adds pandas support to pint (hence is related to #645, #401 and pandas-dev/pandas#10349).

An example can be seen in `example-notebooks/basic-example.ipynb`.

It's a little bit hackish, feedback would be greatly appreciated by me and @andrewgsavage. One obvious example is that we have to run all the interface tests with `pytest` to fit with `pandas` test suite, which introduces a dependency for the CI and currently gives us this awkward testing setup (see the alterations we had to make to `testsuite`). This also means that our code coverage tests are fiddly too.

If you'd like us to squash the commits, that can be done.

If pint has a linter, it would be good to run that over this pull request too as we're a little bit all over the place re style.

Things to discuss:

- [x]  general feedback and changes
- [x] test setup, especially need for pytest for pandas tests and hackish way to get around automatic discovery
- [x] squashing/rebasing
- [x] linting/other code style (related to #664 and #628: we're happy with whatever, I've found using an automatic linter e.g. black and/or flake8 has made things much simpler in other projects)
- [x] including notebooks in the repo (if we want to, I'm happy to put them under CI so we can make sure they run)
- [x] setting up the docs correctly

Co-authored-by: Zebedee Nicholls <[email protected]>
Co-authored-by: andrewgsavage <[email protected]>
@jbrockmendel jbrockmendel removed the Indexing Related to indexing on series/frames, not to indexes themselves label Feb 11, 2020
@mroeschke mroeschke added Enhancement and removed Ideas Long-Term Enhancement Discussions Needs Discussion Requires discussion from core team before further action labels Apr 4, 2020
@UniqASL

UniqASL commented Apr 16, 2020

I think the issue of dealing with units in pandas should really be at the top of the agenda. Most scientific calculations have to deal with units. I am currently using pint and its module pint-pandas, which indeed offers very nice possibilities. It is very practical to have an automated way to deal with unit conversions when making calculations with large dataframes. It also makes the calculations safer, as it avoids "multiplying apples and oranges".
In its current state, however, pint-pandas has a lot of issues and is quite a pain to deal with. Wouldn't it make sense to integrate all the nice work already done there directly into pandas, so that issues can be fixed more easily?

pandas can deal very well with time series and I am very thankful for that because it is a very useful feature. Dealing with units should be in my opinion another core feature of pandas (especially as part of NumFOCUS).

Thanks in advance for your feedback.

@5igno

5igno commented Apr 13, 2021

Hi there, it would indeed be great to be able to store the measurement unit for the columns of a DataFrame, or as some kind of metadata. Given the breadth of use of pandas, of which only a small subset cares deeply about units, I am of the opinion that unit conversion could be handled externally by packages like pint or astropy. However, one thing that both approaches need is for values to be stored with a unit in the dataframe. As suggested before, this can also be a simple string metadata field.

Is there a suggested way to simply store measurement units in a Pandas DataFrame column?
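
One low-tech option, sketched here under the assumption that plain strings are enough: `DataFrame.attrs` is a dictionary for arbitrary metadata (marked experimental in the pandas documentation). The `"units"` key is a hypothetical naming convention, and pandas neither checks nor converts anything stored there.

```python
import pandas as pd

df = pd.DataFrame({"length": [1, 10], "time": [10, 1]})

# Store unit strings next to the data; pandas does not interpret them.
df.attrs["units"] = {"length": "meter", "time": "s"}

length_unit = df.attrs["units"]["length"]
time_unit = df.attrs["units"]["time"]
```

Whether `attrs` survives a given operation should be verified for the pandas version in use, so this is storage only, not unit propagation.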

@wkerzendorf

I think I have an idea of how marrying astropy quantities with pandas could work, using the extension part of pandas.

Extending a pandas DataFrame:

class QDataFrame(pd.DataFrame):
    # normal properties
    _metadata = ["units"]
    def __init__(self, *args, units=None, **kwargs):
        super(QDataFrame, self).__init__(*args, **kwargs)
        # Avoid a shared mutable default: each frame gets its own list.
        self.units = [] if units is None else units
    @property
    def _constructor(self):
        return QDataFrame

Then an accessor like this for the series.

@pd.api.extensions.register_series_accessor("quantity")
class QuantityAccessor():
    def __init__(self, pandas_obj):
        self._obj = pandas_obj
    
    def to(self, unit):
        return (self._obj.values * self._obj.unit).to(unit)

I'm now struggling with how to propagate the list of units to a single column via _constructor_sliced, since I do not seem to know which column is being accessed. Any ideas @jreback?

@znicholls
Contributor

As an update on efforts with pint-pandas, we have hit the wall with #35131. The crux is how to handle this scalar/array ambiguity without potentially destroying performance elsewhere.

The first attempt at a fix was #35127, that was superseded by #39790 but that went stale.

The latest person to raise this (as far as I'm aware) is #43196.

It seems this is still an issue, but not one which is high enough priority to get over the tricky hurdles required to allow things to move forward.

@demisjohn

demisjohn commented Jan 15, 2022

Is there a way to make an extra "Dimension" of the Database (not just Row & Column)? So you could have "Unit" strings (or any other data), associated with each Column, in addition to the Column "name".
Or, is there a "relational DB" method that could accomplish the same, with pandas?
Whichever is syntactically and cognitively simpler would be best for most scientists.

I am thinking of a solution that may already exist within Pandas - and a documentation suggestion/example may show users a simple implementation.
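
One existing mechanism that comes close is a two-level column `MultiIndex`, where each column label is a (name, unit) pair. This is only a sketch of that idea; the level names `"name"` and `"unit"` are arbitrary choices, and the unit strings are plain labels that pandas does not interpret.

```python
import pandas as pd

# Column labels become (name, unit) pairs on two named levels.
columns = pd.MultiIndex.from_tuples(
    [("length", "meter"), ("time", "s")], names=["name", "unit"]
)
df = pd.DataFrame([[1, 10], [10, 1]], columns=columns)

# Look up the unit string for one column.
unit_by_name = dict(df.columns.tolist())
length_unit = unit_by_name["length"]

# Drop the unit level again to get ordinary column names back.
plain = df.droplevel("unit", axis=1)
```

The units then show up under the column names when the frame is printed, at the cost of slightly more verbose column selection.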

@wkerzendorf wkerzendorf mentioned this issue Jan 24, 2022
4 tasks
@mroeschke mroeschke removed this from the Someday milestone Oct 13, 2022
@mroeschke
Member

Looks like pint-pandas (https://github.com/hgrecco/pint-pandas) is a third-party library that provides this support and is mentioned in our ecosystem docs, so closing.
