New branch focussing on using in-place memory operations where possible? #3

JobLeonard opened this issue May 10, 2016 · 8 comments

JobLeonard commented May 10, 2016

Hey Gio/Sten,

Thanks for writing this; it explains a lot to me that the four lines of maths-code in the SPIN paper did not.

However, I noticed that many, if not all, of the matrix operations in the code produce new matrices that are only used once. That is wasteful, especially when the operations are simple enough to be memory bound (multiplication, addition), and the effect grows with the size of the matrix, which I guess will matter for the new dataset.

For example, the code is full of lines such as:

    tmpmat = dist_matrix[indexes,:] 
    tmpmat = tmpmat[:,indexes]

This is in lines 130-131, and inside an inner loop. Per the documentation:

Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).

Warning

The definition of advanced indexing means that x[(1,2,3),] is fundamentally different than x[(1,2,3)]. The latter is equivalent to x[1,2,3] which will trigger basic selection while the former will trigger advanced indexing. Be sure to understand why this occurs.

Also recognize that x[[1,2,3]] will trigger advanced indexing, whereas x[[1,2,slice(None)]] will trigger basic slicing.

http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#advanced-indexing

Since I'm not entirely sure what those lines do yet, I don't know how to rewrite them, but I am confident they can be rewritten to make better use of numpy views and cut down on memory allocation/garbage collection.
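One thing that should at least halve that copying without restructuring anything: numpy.ix_ lets both index arrays be applied in a single advanced-indexing step, so only one copy is made instead of two. A minimal sketch reusing the names from the snippet above (it is still a copy, not a view):

import numpy as np

# select rows and columns in one advanced-indexing operation; this still copies,
# but it skips the intermediate dist_matrix[indexes, :] temporary
tmpmat = dist_matrix[np.ix_(indexes, indexes)]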

Another example is on line 97:

mismatch_score = dot(dist_matrix, weights_mat)

Per the numpy docs this produces a new matrix each time - if we pre-allocate mismatch_score and reuse it (assuming the dimensions are constant, and they appear to be) we can do:

dot(dist_matrix, weights_mat, mismatch_score)

This would save the repeated memory allocation (and eventual garbage collection); since this function is also called inside an inner loop and the comment above it complains that it gets quite slow, this might have quite a big effect.
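As an illustration, here is a minimal sketch of that pre-allocation with made-up sizes (the real shapes come from dist_matrix and weights_mat); the output array has to have the matching shape and dtype:

import numpy as np

n = 1000                                  # made-up size standing in for mat_size
dist_matrix = np.random.rand(n, n)
weights_mat = np.random.rand(n, n)

# allocate the result buffer once, outside the inner loop
mismatch_score = np.empty((n, n))

for _ in range(10):                       # stand-in for the real inner loop
    # dot() writes the product into the pre-allocated buffer instead of
    # allocating a fresh n-by-n array on every iteration
    np.dot(dist_matrix, weights_mat, out=mismatch_score)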

One last example from lines 62-66:

sqd = (arange(1,mat_size+1)[newaxis,:] - arange(1,mat_size+1)[:,newaxis])**2
#make the distance relative to the mat_size
norm_sqd = sqd/wid
#evaluate a normal pdf
weights_mat = exp(-norm_sqd/mat_size)

I haven't tested this, but based on the documentation I think it can be rewritten to:

# broadcasting the subtraction unavoidably allocates the mat_size x mat_size matrix,
# but everything after it can operate on that one array in place
sqd = arange(1,mat_size+1, dtype=float)[:,newaxis] - arange(1,mat_size+1)[newaxis,:]
# square in place
sqd **= 2
# multiplication is much faster than division, at the cost of negligible rounding error;
# the unary minus from exp(-...) is folded into the factor
sqd *= -1.0/(wid*mat_size)
# make the distance relative to the mat_size and evaluate a normal pdf,
# writing the result back into sqd (the second argument of exp is the output array)
weights_mat = exp(sqd, sqd)

This would cut the number of allocated full-size matrices from about six down to one (only the broadcast subtraction still allocates). The biggest downside is that the new code is less readable - so I'd keep comments that spell out the original, unoptimised operations.
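A quick way to sanity-check the rewrite before merging anything (mat_size and wid are just hypothetical values here):

import numpy as np
from numpy import arange, newaxis, exp

mat_size, wid = 200, 5.0   # hypothetical values

# original formulation
sqd0 = (arange(1,mat_size+1)[newaxis,:] - arange(1,mat_size+1)[:,newaxis])**2
expected = exp(-(sqd0/wid)/mat_size)

# in-place formulation from above
sqd = arange(1,mat_size+1, dtype=float)[:,newaxis] - arange(1,mat_size+1)[newaxis,:]
sqd **= 2
sqd *= -1.0/(wid*mat_size)
got = exp(sqd, sqd)

# only rounding differences are expected
assert np.allclose(expected, got)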

If you think it's a good use of my time I would like to make an "in-place operations" branch where I'll try to optimise all of the low-hanging fruit here - I suspect it could speed up the code quite a bit, and even if it turns out not to make too much of a difference it will help me get a feeling for the BackSPIN algorithm.

@gioelelm
Contributor

Thank you for the tip, sure go ahead!

@JobLeonard
Author

Note that this is a long-term thing, to be picked up once Sten, Peter and I have sorted out some of the more pressing Loom issues ;)

Here's an article with optimisation tips for Numba and Numpy, might contain some useful stuff:

https://www.ibm.com/developerworks/community/blogs/jfp/entry/Python_Meets_Julia_Micro_Performance?lang=en

@JobLeonard self-assigned this May 13, 2016
@JobLeonard
Author

So I have been educating myself on the various profiler tools for Python, to see where the bottlenecks of the current BackSPIN implementation might be.

First, pycallgraph produces pretty pictures that make it easier to follow how the calls flow, which is also very useful for a high-level overview. Below is the result of running BackSPIN through this tool:

[image: pycallgraph call graph of the full BackSPIN run]

That's a very wide graph with a lot of irrelevant function calls, so I'll lift out the main bottlenecks:

[images: line-profile graphs for main, feature selection, CEF-tools and SPIN]

Feature-selection and CEF-tools eat up a good chunk of the time - perhaps something worth looking into as well.

What's interesting is that backSPIN makes one direct call to SPIN - the first one - and _divide_to_2and_resort then makes 23. The latter only account for 20 of the 50 seconds spent in SPIN in total, which tells us that the first SPIN call takes up about 30 seconds. In the original SPIN paper the authors claim that their algorithm "converges somewhere between O(n²) and O(n³)", so that makes sense.

@JobLeonard
Author

Now that we know the big bottlenecks, let's look at those functions in detail. This is a good introductory article on profiling a Python program line by line, although I couldn't get meaningful results out of the memory profiler.
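As an aside, the same line-by-line numbers can also be gathered programmatically with line_profiler (the library behind kernprof); a rough sketch, where the module/function names and signatures are assumptions and dist_matrix/wid would have to come from a real run:

from line_profiler import LineProfiler
from backSPIN import _sort_neighbourhood, _calc_weights_matrix  # names assumed

lp = LineProfiler()
lp.add_function(_calc_weights_matrix)     # also record this callee line by line
profiled_sort = lp(_sort_neighbourhood)   # wrap the entry point like a decorator

profiled_sort(dist_matrix, wid)           # assumed signature; run on real data
lp.print_stats()                          # dump the per-line timings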

This text file has the full results of running the kernprof CPU profiler on my laptop[0]. Below I'll copy fit_CV, readCEF, _calc_weights_matrix and _sort_neighbourhood, since they are the big bottlenecks:

Function: fit_CV at line 541
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   568         1          315    315.0      0.0      log2_m = log2(mu)
   569         1          271    271.0      0.0      log2_cv = log2(cv)
   570
   571         1            5      5.0      0.0      if len(mu)>1000 and 'bin' in fit_method:
   572                                                   #histogram with 30 bins
   573                                                   n,xi = histogram(log2_m,30)
   574                                                   med_n = percentile(n,50)
   575                                                   for i in range(0,len(n)):
   576                                                       # index of genes within the ith bin
   577                                                       ind = where( (log2_m >= xi[i]) & (log2_m < xi[i+1]) )[0]
   578                                                       if len(ind)>med_n:
   579                                                           #Downsample if count is more than median
   580                                                           ind = ind[random.permutation(len(ind))]
   581                                                           ind = ind[:len(ind)-med_n]
   582                                                           mask = ones(len(log2_m), dtype=bool)
   583                                                           mask[ind] = False
   584                                                           log2_m = log2_m[mask]
   585                                                           log2_cv = log2_cv[mask]
   586                                                       elif (around(med_n/len(ind))>1) and (len(ind)>5):
   587                                                           #Duplicate if count is less than median
   588                                                           log2_m = r_[ log2_m, tile(log2_m[ind], around(med_n/len(ind))-1) ]
   589                                                           log2_cv = r_[ log2_cv, tile(log2_cv[ind], around(med_n/len(ind))-1) ]
   590                                               else:
   591         1            1      1.0      0.0          if 'bin' in fit_method:
   592                                                       print 'More than 1000 input feature needed for bin correction.'
   593                                                   pass
   594
   595         1            2      2.0      0.0      if 'SVR' in fit_method:
   596         1            2      2.0      0.0          try:
   597         1       397519 397519.0      1.2              from sklearn.svm import SVR
   598         1            4      4.0      0.0              if svr_gamma == 'auto':
   599                                                           svr_gamma = 1000./len(mu)
   600                                                       #Fit the Support Vector Regression
   601         1           28     28.0      0.0              clf = SVR(gamma=svr_gamma)
   602         1     27747287 27747287.0     83.8              clf.fit(log2_m[:,newaxis], log2_cv)
   603         1            3      3.0      0.0              fitted_fun = clf.predict
   604         1      4942606 4942606.0     14.9              score = log2(cv) - fitted_fun(log2(mu)[:,newaxis])
   605         1            4      4.0      0.0              params = None
   606                                                       #The coordinates of the fitted curve
   607         1         2451   2451.0      0.0              mu_linspace = linspace(min(log2_m),max(log2_m))
   608         1        16059  16059.0      0.0              cv_fit = fitted_fun(mu_linspace[:,newaxis])
   609         1            4      4.0      0.0              return score, mu_linspace, cv_fit , params

File: Cef_tools.py
Function: readCEF at line 90
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    90                                              @profile
    91                                              def readCEF(self, filepath, matrix_dtype = 'auto'):
    92                                                  #Delete all the stored information
    93         1           12     12.0      0.0         self.__init__()
    94                                                  #Start parsing
    95         1           16     16.0      0.0         with open(filepath, 'rbU') as fin:
    96                                                      # Read cef file first line
    97                                                      self.header, self.row_attr, self.col_attr, self.rows,\
    98         1           35     35.0      0.0             self.cols, self.flags = fin.readline().rstrip('\n').split('\t')[1:7]
    99         1            4      4.0      0.0             self.header = int(self.header)
   100         1            2      2.0      0.0             self.row_attr = int( self.row_attr )
   101         1            2      2.0      0.0             self.col_attr = int(self.col_attr)
   102         1            2      2.0      0.0             self.rows = int(self.rows)
   103         1            2      2.0      0.0             self.cols = int(self.cols)
   104         1            2      2.0      0.0             self.flags = int(self.flags)
   105         4            8      2.0      0.0             self.row_attr_values = [[] for _ in xrange(self.row_attr)]
   106                                                      # Read header
   107         3            4      1.3      0.0             for i in range(self.header):
   108         2           44     22.0      0.0                 name, value = fin.readline().rstrip('\n').split('\t')[:2]
   109         2            2      1.0      0.0                 self.header_names.append(name)
   110         2            2      1.0      0.0                 self.header_values.append(value)
   111                                                      # Read col attr
   112        11           13      1.2      0.0             for i in range(self.col_attr):
   113        10          641     64.1      0.0                 line_col_attr = fin.readline().rstrip('\n').split('\t')[self.row_attr:]
   114        10           15      1.5      0.0                 self.col_attr_names.append( line_col_attr[0] )
   115        10           34      3.4      0.0                 self.col_attr_values.append( line_col_attr[1:] )
   116                                                      #Read row attr and matrix
   117         1           26     26.0      0.0             self.row_attr_names += fin.readline().rstrip('\n').split('\t')[:self.row_attr]
   118     19973        19748      1.0      0.1             for _ in xrange(self.rows):
   119     19972       561393     28.1      2.4                 linelist = fin.readline().rstrip('\n').split('\t')
   120     79888       304938      3.8      1.3                 for n, entry in enumerate( linelist[:self.row_attr] ):
   121     59916        71511      1.2      0.3                     self.row_attr_values[n].append( entry )
   122     19972        26042      1.3      0.1                 if matrix_dtype == 'auto':
   123         1          371    371.0      0.0                     if sum(('.' in i) or ('e' in i) for i in linelist[self.row_attr+1:]) != 0:
   124                                                                  matrix_dtype = float
   125                                                              else:
   126         1            1      1.0      0.0                         matrix_dtype = int
   127     19972        18848      0.9      0.1                 try:
   128  16397012     22452690      1.4     95.7                     self.matrix.append( [matrix_dtype(el) for el in linelist[self.row_attr+1:] ])
   129                                                          except ValueError:
   130                                                              print repr(el), ' is invalid'


Function: _sort_neighbourhood at line 77
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    92      2820         3444      1.2      0.0      assert wid > 0, 'Parameter wid < 0 is not allowed'
    93      2820         5937      2.1      0.0      mat_size = dist_matrix.shape[0]
    94                                               #assert mat_size>2, 'Matrix is too small to be sorted'
    95      2820     20713204   7345.1     67.6      weights_mat = _calc_weights_matrix(mat_size, wid)
    96                                               #Calculate the dot product (can be very slow for big mat_size)
    97      2820      9005557   3193.5     29.4      mismatch_score = dot(dist_matrix, weights_mat)
    98      2820       732586    259.8      2.4      energy, target_permutation = mismatch_score.min(1), mismatch_score.argmin(1)
    99      2820        90289     32.0      0.3      max_energy = max(energy)
   100                                               #Avoid points that have the same target_permutation value
   101      2820        66102     23.4      0.2      sort_score = target_permutation - 0.1 * sign( (mat_size/2 - target_permutation) ) * energy/max_energy
   102                                               #sort_score = target_permutation - 0.1 * sign( 1-2*(int(1000*energy/max_energy) % 2) ) * energy/max_energy # Alternative
   103                                               # Sorting the matrix
   104      2820        30351     10.8      0.1      sorted_ind = sort_score.argsort(0)[::-1]
   105      2820         2039      0.7      0.0      return sorted_ind

Function: _calc_weights_matrix at line 46
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    61                                               #calculate square distance from the diagonal
    62      2820      1353453    479.9      6.6      sqd = (arange(1,mat_size+1)[newaxis,:] - arange(1,mat_size+1)[:,newaxis])**2
    63                                               #make the distance relative to the mat_size
    64      2820      3373825   1196.4     16.3      norm_sqd = sqd/wid
    65                                               #evaluate a normal pdf
    66      2820      9885674   3505.6     47.8      weights_mat = exp(-norm_sqd/mat_size)
    67                                               #avoid useless precision that would slow down the matrix multiplication
    68      2820       340370    120.7      1.6      weights_mat -= 1e-6
    69      2820       445687    158.0      2.2      weights_mat[weights_mat<0] = 0
    70                                               #normalize row and column sum
    71      2820      1517614    538.2      7.3      weights_mat /= sum(weights_mat,0)[newaxis,:]
    72      2820      1505751    534.0      7.3      weights_mat /= sum(weights_mat,1)[:, newaxis]
    73                                               #fix asimmetries
    74      2820      2235932    792.9     10.8      weights_mat = (weights_mat + weights_mat.T) / 2.
    75      2820         3367      1.2      0.0      return weights_mat

From this we learn the following:

  • in fit_CV, 83% of the time is spent in clf.fit(log2_m[:,newaxis], log2_cv). Optimise that and we'll get a huge boost
  • in read_CEF, 95% (!) of the time is spent on the line self.matrix.append( [matrix_dtype(el) for el in linelist[self.row_attr+1:] ]). Optimise that, boost!
  • in _sort_neighbourhood, despite the comment about the dot-product eating up a lot of time, mismatch_score = dot(dist_matrix, weights_mat) is only 1/3 of the bottleneck; the call to _calc_weights_matrix(mat_size, wid) is much more significant
  • in _calc_weights_matrix(mat_size, wid), the calls are fairly evenly divided, except for weights_mat = exp(-norm_sqd/mat_size) which takes up half of the time in the function.

So now we know where to start looking for faster alternatives!

@JobLeonard
Author

First attempts:

  • re-order and replace maths operations in _calc_weights with faster alternatives (will give slightly different rounding results)
  • try out place(weights_mat, weights_mat < 0, 0), which is supposedly faster than boolean masking
  • inline _sort_neighbourhood, and reuse mismatch_score
Function: _calc_weights_matrix at line 60
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    75                                               #calculate square distance from the diagonal
    76      2668       731609    274.2      2.7      sqd = arange(1,mat_size+1)[newaxis,:] - arange(1,mat_size+1)[:,newaxis]
    77      2668       435643    163.3      1.6      sqd **= 2
    78                                           
    79                                               ##make the distance relative to the mat_size
    80                                               #norm_sqd = sqd/wid
    81                                               ##evaluate a normal pdf
    82                                               #weights_mat = exp(-norm_sqd/mat_size)
    83      2668     12017917   4504.5     45.0      weights_mat = exp(sqd / (-wid * mat_size))
    84                                           
    85                                               #avoid useless precision that would slow down the matrix multiplication
    86      2668       427798    160.3      1.6      weights_mat -= 1e-6
    87                                               #weights_mat[weights_mat<0] = 0
    88      2668      8695843   3259.3     32.6      place(weights_mat, weights_mat < 0, 0)
    89                                           
    90                                               #normalize row and column sum
    91                                               # weights_mat /= sum(weights_mat,0)[newaxis,:]
    92                                               # weights_mat /= sum(weights_mat,1)[:, newaxis]
    93      2668      2611073    978.7      9.8      weights_mat /= sum(weights_mat,0)[newaxis,:] * sum(weights_mat,1)[:, newaxis]
    94                                           
    95                                               #fix asymmetries
    96                                               #weights_mat = (weights_mat + weights_mat.T) / 2.
    97      2668      1438891    539.3      5.4      weights_mat += weights_mat.T
    98      2668       348823    130.7      1.3      weights_mat *= 0.5
    99      2668         3432      1.3      0.0      return weights_mat

Function: sort_mat_by_neighborhood at line 131
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   161                                               # manual inlining of _sort_neighbourhood
   162                                               # to maximise array re-use
   163                                           
   164                                               # original indexes
   165       324          843      2.6      0.0      assert wid > 0, 'Parameter wid < 0 is not allowed'
   166       324         3786     11.7      0.0      indexes = arange(dist_matrix.shape[0])
   167       324          344      1.1      0.0      mismatch_score = None
   168      2992         5342      1.8      0.0      for i in range(times):
   169                                                   #sort the sitance matrix according the previous iteration
   170      2668       590859    221.5      1.4          tmpmat = dist_matrix[indexes,:]
   171      2668      1866980    699.8      4.6          tmpmat = tmpmat[:,indexes]
   172                                           
   173      2668        10232      3.8      0.0          mat_size = tmpmat.shape[0]
   174                                                   #assert mat_size>2, 'Matrix is too small to be sorted'
   175      2668     26782993  10038.6     65.6          weights_mat = _calc_weights_matrix(mat_size, wid)
   176                                                   #Calculate the dot product (can be very slow for big mat_size)
   177                                                   # hackish trick: first loop mismatch_score will be None, so dot()
   178                                                   # creates a new ndarray. After that it will reuse this array.
   179      2668     10463764   3922.0     25.6          mismatch_score = dot(tmpmat, weights_mat, mismatch_score)
   180                                           
   181      2668       509511    191.0      1.2          target_permutation = mismatch_score.argmin(1)
   182      2668       368806    138.2      0.9          energy = mismatch_score.min(1)
   183      2668       103991     39.0      0.3          max_energy = max(energy)
   184                                           
   185                                                   # #Avoid points that have the same target_permutation value
   186                                                   # sort_score = target_permutation - 0.1 * sign( (mat_size/2 - target_permutation) ) * energy/max_energy
   187                                                   # #sort_score = target_permutation - 0.1 * sign( 1-2*(int(1000*energy/max_energy) % 2) ) * energy/max_energy # Alternative
   188                                                   # Sorting the matrix
   189                                                   # sorted_ind = sort_score.argsort(0)[::-1]
   190                                           
   191      2668         4345      1.6      0.0          mat_size /= 2
   192      2668        40461     15.2      0.1          t = sign( (mat_size - target_permutation) )
   193      2668        23923      9.0      0.1          t *= energy/max_energy
   194      2668         8434      3.2      0.0          t *= 0.1
   195      2668        10900      4.1      0.0          sort_score = target_permutation - t
   196      2668        49367     18.5      0.1          sorted_ind = sort_score.argsort(0)[::-1]
   197                                                   #resort the original indexes
   198      2668         8394      3.1      0.0          indexes = indexes[sorted_ind]
   199       324          308      1.0      0.0      return indexes

Results:

  • re-ordering may have helped a tiny bit, but the exp() call is still the big bottleneck
  • inlining did not produce faster results!
  • place() seems slower!

@JobLeonard
Author

JobLeonard commented May 25, 2016

TLDR: Python implementation of BackSPIN algorithm is 40% faster!

(not counting feature selection and reading in the CEF files)

Important: the results will be different after this rewrite!

The reason is that I use a basic floating-point optimisation trick: instead of dividing a matrix by a scalar value x, I first compute y = 1/x and then multiply by y. Because floating-point multiplication is much faster than division, this speeds things up, but it rounds slightly differently. However, given the error margins used here (that whole cut-off business at 1e-6) I expect it won't negatively affect the results. Please double-check, though!

_calc_weights is profiled as taking 15s instead of 20s, making backSPIN as a whole run in about 30s down from 50s. This is probably as good as this gets for now, unless we can really speed up the call to exp in _calc_weights, or that dot-product line in _sort_neighbourhood:

Total time: 15.0316 s
Function: _calc_weights_matrix at line 60
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    75                                               #calculate square distance from the diagonal
    76      2668       713584    267.5      4.7      sqd = arange(1,mat_size+1)[newaxis,:] - arange(1,mat_size+1)[:,newaxis]
    77      2668       437468    164.0      2.9      sqd **= 2
    78                                               ##make the distance relative to the mat_size
    79      2668         6295      2.4      0.0      divisor = 1.0 /  (-wid * mat_size)
    80      2668      8976078   3364.3     59.7      weights_mat = exp(sqd * divisor)
    81                                               #avoid useless precision that would slow down the matrix multiplication
    82      2668       391093    146.6      2.6      weights_mat -= 1e-6
    83      2668       471917    176.9      3.1      weights_mat[weights_mat<0] = 0
    84                                               #normalize row and column sum
    85      2668      2476751    928.3     16.5      weights_mat /= sum(weights_mat,0)[newaxis,:] * sum(weights_mat,1)[:, newaxis]
    86                                               #fix asymmetries
    87      2668      1199511    449.6      8.0      weights_mat += weights_mat.T
    88      2668       355904    133.4      2.4      weights_mat *= 0.5
    89      2668         2982      1.1      0.0      return weights_mat

Total time: 25.528 s
Function: _sort_neighbourhood at line 91
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   106      2668         4830      1.8      0.0      assert wid > 0, 'Parameter wid < 0 is not allowed'
   107      2668         6573      2.5      0.0      mat_size = dist_matrix.shape[0]
   108                                               #assert mat_size>2, 'Matrix is too small to be sorted'
   109      2668     15086467   5654.6     59.1      weights_mat = _calc_weights_matrix(mat_size, wid)
   110                                               #Calculate the dot product (can be very slow for big mat_size)
   111      2668      9399555   3523.1     36.8      mismatch_score = dot(dist_matrix, weights_mat)
   112      2668       821260    307.8      3.2      energy, target_permutation = mismatch_score.min(1), mismatch_score.argmin(1)
   113      2668        91982     34.5      0.4      max_energy = max(energy)
   114                                               #Avoid points that have the same target_permutation value
   115      2668        70708     26.5      0.3      sort_score = target_permutation - 0.1 * sign( (mat_size/2 - target_permutation) ) * energy/max_energy
   116                                               #sort_score = target_permutation - 0.1 * sign( 1-2*(int(1000*energy/max_energy) % 2) ) * energy/max_energy # Alternative
   117                                               # Sorting the matrix
   118      2668        44466     16.7      0.2      sorted_ind = sort_score.argsort(0)[::-1]
   119      2668         2130      0.8      0.0      return sorted_ind

[image: call-graph profile of the optimised SPIN]

Next up, I'll look at read_CEF

@JobLeonard
Author

JobLeonard commented May 25, 2016

This one magic function in Numpy might fix the readCEF troubles:

http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.loadtxt.html#numpy.loadtxt

EDIT: a bit of digging shows that we want to use pandas.read_csv instead, as it's around ten times faster and uses a third of the memory. An example of using it can be found here.
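For the record, a rough sketch of what that could look like for the numeric block of a CEF file; the header and attribute rows would still be parsed by hand, and n_skip_rows / row_attr are placeholders for values obtained from that parsing:

import pandas as pd

# parse only the big numeric block with the fast C parser; n_skip_rows covers the
# CEF first line, the header rows, the column-attribute rows and the attribute-name row
df = pd.read_csv(filepath, sep='\t', skiprows=n_skip_rows, header=None)

# the first row_attr columns (plus the empty separator column) hold row attributes;
# everything after them is the expression matrix
matrix = df.iloc[:, row_attr + 1:].values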

@JobLeonard
Author

@gioelelm, I just had a look at this again because of that image unshredding; could you quickly test whether my version still produces sensible results and, if so, merge it? I already pulled in your changes to use centre of mass instead of means, so there should be no conflicts.

In my tests it's 20% faster so probably worth the five minutes required to check and merge it.
