Skip to content
This repository has been archived by the owner on Mar 31, 2019. It is now read-only.

Broadcasting ? #24

Closed
imandr opened this issue Jun 17, 2018 · 17 comments
Closed

Broadcasting ? #24

imandr opened this issue Jun 17, 2018 · 17 comments

Comments

@imandr
Copy link

imandr commented Jun 17, 2018

Imagine I have a histogram like this:

Hist(bin("x", 100, 0.0, 1.0), groupby("condition"))

and I have an array of x collected under condition="test". So I want to put these data into the histogram. To do that, I need to do something like this:

hist.fill(x=xarray, condition=["test"]*len(xarray))

I think it would be useful to adopt a simple "broadcasting" rule to allow this:

hist.fill(x=xarray, condition="test")

@jpivarski
Copy link
Member

hist.fill(x=xarray, condition="'test'") (one more level of quoting) might already work. If not, it's related to the constants-handling bug (#23, I think).

@imandr
Copy link
Author

imandr commented Jun 17, 2018

No, that does not work and it generates this:

ValueError: array 'tt' has len 1 but other arrays have len 100

h = Hist(bin("a", 10, 0, 1), groupby("tt"))
h.fill(a=np.random.random((100,)), tt='"a"')
h.fill(a=np.random.random((100,)), tt='"b"')
h.stack("tt").area("a").to(canvas)

@imandr
Copy link
Author

imandr commented Jun 17, 2018

By the way, hist.fill(x=xarray, condition=["test"]*len(xarray)) seems to be very inefficient. It severely slows things down.

@jpivarski
Copy link
Member

The ValueError you got was correct. The suggestion I have you was wrong. I meant to say

hist = Hist(bin("x", 100, 0.0, 1.0), groupby("'test'"))
hist.fill(x=xarray)

I put the single string in quotations in the wrong place. It's supposed to be a constant expression. Also, because of #23, I might need to fix constant expressions.

["test"]*len(xarray) would be very slow, but that's because it's creating a large Python list and converting it to a string-valued Numpy array before it ever enters histbook.

@imandr
Copy link
Author

imandr commented Jun 17, 2018

No, that is not the use case I meant.

I meant this:

h = Hist(bin("data", 10, 0, 1), groupby("condition"))
h.fill(data=np.random.random((100,)), condition="a")
h.fill(data=np.random.random((100,)), condition="b")
...

Real life example: I want to plot some distribution for a number of datasets. So I call fill() some number of times for dataset "a", then for dataset "b" and so on, and then I want to see stacked histogram with all the datasets together.

So I am not grouping by a constant. It is constant only for a single call to fill(). Next call to fill() will use another value.

@jpivarski
Copy link
Member

I figured that out in the other issue. You want to use the Hist.group static method. (I'm trying to do all of this on a phone in a hotel room without wifi.)

@imandr
Copy link
Author

imandr commented Jun 17, 2018

I appreciate your help, Jim. Sorry you have to do this in such inhumane conditions :)

Maybe grouping would work too, but I got it figured out, implementing the "broadcasting" myself for now. Here is the fragment of real code which works:

h_all = Hist(hbin("NJets", 20, 0, 20), groupby("dataset"))

h_all_display = IPythonDisplay(
    h_all
        .stack("dataset")
        .area("NJets", width=400)
)

for dataset_name in Datasets[:5]:
    job = session.createJob(dataset_name, display=False)
    job.addHistogram(h_all, ["NJets", "dataset"], constants={"dataset":dataset_name})
    job.run()
    runtime = job.TFinish - job.TStart
    nevents = job.EventsProcessed
    nworkers = len(job.WorkerAddresses)
    h_all_display.update()
    print "%s: %d events processed using %d workers in %.1f seconds, %.2e events/second" % (
                    dataset_name, nevents, nworkers, runtime, float(nevents)/runtime)

@jpivarski
Copy link
Member

I don't see the fill in this. (Maybe that happens remotely?)

One of the complaints about Histogrammar was that everything that appeared in a final plot had to come from a single dataset, but it's common to combine data from different Monte Carlo and data samples. It seemed unnatural to have to create new fields in the data that were piecewise constant to emulate what the physicists had in mind: bringing together data from different places.

So histbook has methods for adding information (like "group") in addition to removing information (like "select", "rebin", and "project").

@imandr
Copy link
Author

imandr commented Jun 17, 2018

Yes, the actual filling happens remotely and once in a while the Histbook histogram gets picked and shipped back and then cleared.

Here is the implementation of the broadcasting at the remote:

       ...
       # by this time my_dict contains "regular" data in ndarray's
       if my_dict:
                nitems = len(my_dict[my_dict.keys()[0]])
                for dn, dv in self.Constants.items():
                        my_dict[dn] = [dv]*nitems
        self.H.fill(**my_dict)

jpivarski added a commit that referenced this issue Jun 18, 2018
…adcasted to the lengths of the other arrays
@jpivarski
Copy link
Member

>>> from histbook import *
>>> h = Hist(bin("x", 5, 0, 5), groupby("condition"))
>>> h.pandas()
Empty DataFrame
Columns: [count(), err(count())]
Index: []
>>> h.fill(x=[1, 2, 3], condition="one")
>>> h.pandas()
                       count()  err(count())
condition x                                 
one       [-inf, 0.0)      0.0           0.0
          [0.0, 1.0)       0.0           0.0
          [1.0, 2.0)       1.0           1.0
          [2.0, 3.0)       1.0           1.0
          [3.0, 4.0)       1.0           1.0
          [4.0, 5.0)       0.0           0.0
          [5.0, inf)       0.0           0.0
          {NaN}            0.0           0.0
>>> h.fill(x=[4, 5], condition="two")
>>> h.pandas()
                       count()  err(count())
condition x                                 
one       [-inf, 0.0)      0.0           0.0
          [0.0, 1.0)       0.0           0.0
          [1.0, 2.0)       1.0           1.0
          [2.0, 3.0)       1.0           1.0
          [3.0, 4.0)       1.0           1.0
          [4.0, 5.0)       0.0           0.0
          [5.0, inf)       0.0           0.0
          {NaN}            0.0           0.0
two       [-inf, 0.0)      0.0           0.0
          [0.0, 1.0)       0.0           0.0
          [1.0, 2.0)       0.0           0.0
          [2.0, 3.0)       0.0           0.0
          [3.0, 4.0)       0.0           0.0
          [4.0, 5.0)       1.0           1.0
          [5.0, inf)       1.0           1.0
          {NaN}            0.0           0.0

@jpivarski
Copy link
Member

Numerical values also get broadcasted:

>>> from histbook import *
>>> h = Hist(bin("x", 5, 0, 5), bin("y", 2, 0, 2, underflow=False, ove
rflow=False, nanflow=False))
>>> h.pandas()
                        count()  err(count())
x           y
[-inf, 0.0) [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[0.0, 1.0)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[1.0, 2.0)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[2.0, 3.0)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[3.0, 4.0)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[4.0, 5.0)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[5.0, inf)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
{NaN}       [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
>>> h.fill(x=[1, 2, 3], y=0)
>>> h.pandas()
                        count()  err(count())
x           y                                
[-inf, 0.0) [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[0.0, 1.0)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[1.0, 2.0)  [0.0, 1.0)      1.0           1.0
            [1.0, 2.0)      0.0           0.0
[2.0, 3.0)  [0.0, 1.0)      1.0           1.0
            [1.0, 2.0)      0.0           0.0
[3.0, 4.0)  [0.0, 1.0)      1.0           1.0
            [1.0, 2.0)      0.0           0.0
[4.0, 5.0)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[5.0, inf)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
{NaN}       [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
>>> h.fill(x=[4, 5], y=1)
>>> h.pandas()
                        count()  err(count())
x           y                                
[-inf, 0.0) [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[0.0, 1.0)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[1.0, 2.0)  [0.0, 1.0)      1.0           1.0
            [1.0, 2.0)      0.0           0.0
[2.0, 3.0)  [0.0, 1.0)      1.0           1.0
            [1.0, 2.0)      0.0           0.0
[3.0, 4.0)  [0.0, 1.0)      1.0           1.0
            [1.0, 2.0)      0.0           0.0
[4.0, 5.0)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      1.0           1.0
[5.0, inf)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      1.0           1.0
{NaN}       [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0

This was referenced Jun 18, 2018
@imandr
Copy link
Author

imandr commented Jun 22, 2018

I am trying to use broadcasting with groupby() and a string as the groupby axis value, and it is not working:

import histbook
from histbook import Hist, bin, groupby
import numpy as np

h = Hist(bin("NJets", 10, 0, 10), groupby("dataset"))
h.fill(dataset="dataset1", NJets=np.arange(100))

this fragment gives me a warning and an error:

/home/ivm/anaconda2/lib/python2.7/site-packages/numpy/core/numeric.py:301: FutureWarning: in the future, full(100, array('dataset1', 
      dtype='|S8')) will return an array of dtype('S8')
  format(shape, fill_value, array(fill_value).dtype), FutureWarning)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-6c35a986016c> in <module>()
      4 
      5 h = Hist(bin("NJets", 10, 0, 10), groupby("dataset"))
----> 6 h.fill(dataset="dataset1", NJets=np.arange(100))
      7 
      8 h1 = Hist(bin("NJets", 10, 0, 10), bin("dataset", 100, 0, 100))

/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.1.0-py2.7.egg/histbook/hist.pyc in fill(self, arrays, **more)
    574 
    575             self._prefill()
--> 576             length = self._fill(arrays)
    577             self._postfill(arrays, length)
    578 

/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.1.0-py2.7.egg/histbook/hist.pyc in _fill(self, arrays)
    151                     array = numpy.array(array)
    152                 if array.shape == ():
--> 153                     array = numpy.full(length, array)
    154 
    155                 if length != array.shape[0]:

/home/ivm/anaconda2/lib/python2.7/site-packages/numpy/core/numeric.pyc in full(shape, fill_value, dtype, order)
    300             "in the future, full({0}, {1!r}) will return an array of {2!r}".
    301             format(shape, fill_value, array(fill_value).dtype), FutureWarning)
--> 302     multiarray.copyto(a, fill_value, casting='unsafe')
    303     return a
    304 

ValueError: could not convert string to float: dataset1

this works fine:

h1 = Hist(bin("NJets", 10, 0, 10), bin("dataset", 100, 0, 100))
h1.fill(dataset=13, NJets=np.arange(100))

my version is 1.2.0

@jpivarski
Copy link
Member

jpivarski commented Jun 22, 2018

Hmmm.

>>> from histbook import *
>>> import numpy as np
>>> h = Hist(bin("NJets", 10, 0, 10), groupby("dataset"))
>>> h.fill(dataset="dataset1", NJets=np.arange(100))
>>> h.pandas()
                      count()  err(count())
dataset  NJets                             
dataset1 [-inf, 0.0)      0.0      0.000000
         [0.0, 1.0)       1.0      1.000000
         [1.0, 2.0)       1.0      1.000000
         [2.0, 3.0)       1.0      1.000000
         [3.0, 4.0)       1.0      1.000000
         [4.0, 5.0)       1.0      1.000000
         [5.0, 6.0)       1.0      1.000000
         [6.0, 7.0)       1.0      1.000000
         [7.0, 8.0)       1.0      1.000000
         [8.0, 9.0)       1.0      1.000000
         [9.0, 10.0)      1.0      1.000000
         [10.0, inf)     90.0      9.486833
         {NaN}            0.0      0.000000

My version is 1.2.0 also. What's your Numpy version?

>>> histbook.__version__
'1.2.0'
>>> np.__version__
'1.14.4'

This might imply a higher minimum Numpy version: right now I say install_requires = ["numpy>=1.8.0"].

@imandr
Copy link
Author

imandr commented Jun 22, 2018

import numpy
numpy.version
'1.11.3'

@jpivarski
Copy link
Member

I found Numpy 1.11.3 on lxplus and tried it there. Indeed:

>>> numpy.full(10, "hello")
/afs/cern.ch/user/p/pivarski/miniconda3/lib/python3.5/site-packages/numpy/core/numeric.py:301: FutureWarning: in the future, full(10, 'hello') will return an array of dtype('<U5')
  format(shape, fill_value, array(fill_value).dtype), FutureWarning)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/afs/cern.ch/user/p/pivarski/miniconda3/lib/python3.5/site-packages/numpy/core/numeric.py", line 302, in full
    multiarray.copyto(a, fill_value, casting='unsafe')
ValueError: could not convert string to float: 'hello'

I'll keep the Numpy requirement minimum where it is and put a work-around in this bit of code. I think this just affects everywhere numpy.full is called with a string argument.

@imandr
Copy link
Author

imandr commented Jun 22, 2018

I upgraded numpy to 1.14.5 and now the error is different:

import histbook
from histbook import Hist, bin, groupby
import numpy as np

print np.__version__

h = Hist(bin("NJets", 10, 0, 10), groupby("dataset"))
h.fill("dataset1", NJets=np.arange(100))
1.14.5
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-84a0ed9de926> in <module>()
      6 
      7 h = Hist(bin("NJets", 10, 0, 10), groupby("dataset"))
----> 8 h.fill("dataset1", NJets=np.arange(100))
      9 

/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.2.0-py2.7.egg/histbook/hist.pyc in fill(self, arrays, **more)
    374 
    375             self._prefill()
--> 376             length = self._fill(arrays)
    377             self._postfill(arrays, length)
    378 

/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.2.0-py2.7.egg/histbook/fill.pyc in _fill(self, arrays)
    118                 else:
    119                     try:
--> 120                         array = arrays[instruction.extern.value]
    121                     except KeyError:
    122                         if instruction.extern.value in histbook.expr.Expr.maybeconstants:

/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.2.0-py2.7.egg/histbook/util/__init__.pyc in __getitem__(self, n)
     38             return self._two[n]    # and it has precedence
     39         else:
---> 40             return self._one[n]    # self._one might only have __getitem__

TypeError: string indices must be integers, not str

Sorry. my bad. Call to h.fill() is wrong.

It works fine with numpy 1.14.5

@jpivarski
Copy link
Member

I just patched master so that it works as shown in my comment in Numpy 1.14.4 and 1.11.3.

But your problem is that you're missing a keyword argument:

h.fill("dataset1", NJets=np.arange(100))

instead of

h.fill(dataset="dataset1", NJets=np.arange(100))

It's trying to interpret the positional (non-keyword) argument as a dict of arrays, a Pandas DataFrame, or a Spark DataFrame. I wanted to be loose about it so that anything with df["column"] → array would work in that slot (i.e. duck typing). But if the "df" is a string, it applies square-bracket-string-close-square-bracket to what is actually a string, hence "string indices must be integers, not str."

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants