
fp.to_dataframe is slow #7

Closed
cbouy opened this issue Jan 12, 2021 · 1 comment
Assignees
Labels
enhancement New feature or request

Comments

@cbouy
Member

cbouy commented Jan 12, 2021

Converting the fingerprint to a dataframe is slow, especially when there are a lot of columns.

Example code:

import MDAnalysis as mda
import prolif as plf

u = mda.Universe(plf.datafiles.TOP, plf.datafiles.TRAJ)
tm3 = u.select_atoms("resid 119:152")
prot = u.select_atoms("protein and not group tm3", tm3=tm3)
fp = plf.Fingerprint()
fp.run(u.trajectory[::10], tm3, prot)

%load_ext line_profiler
%lprun -f plf.to_dataframe plf.to_dataframe(fp.ifp, fp.interactions.keys())

Here is the line-by-line execution time:

Timer unit: 1e-06 s

Total time: 20.484 s
File: /home/cedric/projects/ProLIF/prolif/utils.py
Function: to_dataframe at line 131

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   131                                           def to_dataframe(ifp, interactions, index_col="Frame", dtype=None,
   132                                                            drop_empty=True):
...
   174         1          4.0      4.0      0.0      n_interactions = len(interactions)
   175         1     197456.0 197456.0      1.0      data = pd.DataFrame(ifp)
   176         1       5372.0   5372.0      0.0      data.set_index(index_col, inplace=True)
   177                                               # sort columns by ResidueIds and interaction
   178         1      20973.0  20973.0      0.1      data.sort_index(axis=1, inplace=True)
   179         1       5711.0   5711.0      0.0      data.columns = pd.MultiIndex.from_tuples(data.columns)
   180                                               # check if dealing with single values or atom indices
   181         1         42.0     42.0      0.0      value = data.values[0, 0][0]
   182         1        120.0    120.0      0.0      is_iterable = isinstance(value, Iterable)
   183                                               # replace NaNs with appropriate values
   184         1          1.0      1.0      0.0      empty_value = dtype(False) if dtype else False
   185         1          1.0      1.0      0.0      fill_value = [None, None] if is_iterable else empty_value
   186         1     190775.0 190775.0      0.9      data = data.applymap(lambda x: [fill_value] * n_interactions
   187                                                                    if x is np.nan else x)
   188                                               # split each bitvector in separate columns for each interaction
   189         1        519.0    519.0      0.0      df = pd.DataFrame()
   190       624       2653.0      4.3      0.0      for l, p in data.columns:
   191       623       8270.0     13.3      0.0          cols = [(str(l), str(p), i) for i in interactions]
   192       623   20005669.0  32111.8     97.7          df[cols] = data[(l, p)].apply(pd.Series)
   193         2       2545.0   1272.5      0.0      df.columns = pd.MultiIndex.from_tuples(
   194         1          1.0      1.0      0.0          df.columns, names=["ligand", "protein", "interaction"])
   195         1          1.0      1.0      0.0      if dtype:
   196                                                   df = df.astype(dtype)
   197         1          1.0      1.0      0.0      if drop_empty:
   198         1          1.0      1.0      0.0          if is_iterable:
   199                                                       mask = df.apply(lambda s:
   200                                                                       ~(s.map(tuple).isin([(None, None)]).all()), axis=0)
   201                                                   else:
   202         1      39960.0  39960.0      0.2              mask = (df != empty_value).any(axis=0)
   203         1       3973.0   3973.0      0.0          df = df.loc[:, mask]
   204         1          2.0      2.0      0.0      return df

The problem comes from `data[(l, p)].apply(pd.Series)`, which is known to be slow: it builds a new `Series` object for every row of every column.

@cbouy
Member Author

cbouy commented Jan 12, 2021

Using `pd.DataFrame(s.tolist(), index=s.index)` instead of `s.apply(pd.Series)` is slightly faster, but still slow...
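A minimal, self-contained comparison of the two expansion strategies on synthetic data (stand-in shapes, not ProLIF's actual fingerprint):

```python
import pandas as pd

# Stand-in for one (ligand, protein) column: a Series whose entries are
# fixed-length lists with one boolean per interaction type.
n_frames, n_interactions = 1000, 10
s = pd.Series([[False] * n_interactions for _ in range(n_frames)])

# Slow: apply(pd.Series) constructs a new Series object per row.
wide_slow = s.apply(pd.Series)

# Faster: build the expanded frame from the list of lists in one call.
wide_fast = pd.DataFrame(s.tolist(), index=s.index)

# Both produce an (n_frames, n_interactions) frame with identical values.
assert (wide_slow.values == wide_fast.values).all()
```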

@cbouy cbouy self-assigned this Jan 12, 2021
@cbouy cbouy added the enhancement New feature or request label Jan 12, 2021
cbouy added a commit that referenced this issue Jan 13, 2021
- Updated the repr method of `ResidueId` so that it isn't confused with a string anymore
- Improved the speed of the `to_dataframe` function, from 8.5s to 150ms (TM3-GPCR PPI example)
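The commit doesn't show the implementation above, but one way to get this kind of speedup is to drop the per-column Python loop entirely and expand every cell at once through NumPy (an illustrative sketch with made-up shapes, not ProLIF's actual code):

```python
import numpy as np
import pandas as pd

# Stand-in for `data`: each cell holds a list with one boolean per
# interaction type (shapes are illustrative).
n_frames, n_pairs, n_interactions = 1000, 50, 10
data = pd.DataFrame(
    [[[False] * n_interactions for _ in range(n_pairs)]
     for _ in range(n_frames)]
)

# Flatten all cells into a (n_frames, n_pairs * n_interactions) array
# in one vectorized step instead of 600+ apply() calls.
values = np.array(data.values.tolist()).reshape(n_frames, -1)
columns = pd.MultiIndex.from_tuples(
    [(str(col), i) for col in data.columns for i in range(n_interactions)]
)
df = pd.DataFrame(values, index=data.index, columns=columns)
```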
@cbouy cbouy closed this as completed Jan 13, 2021