
fp.to_dataframe is slow #7

Closed
cbouy opened this issue Jan 12, 2021 · 1 comment
Assignees
Labels
enhancement New feature or request

Comments

@cbouy
Member

cbouy commented Jan 12, 2021

Converting the fingerprint to a dataframe is slow, especially when there are a lot of columns.

Example code:

import MDAnalysis as mda
import prolif as plf

u = mda.Universe(plf.datafiles.TOP, plf.datafiles.TRAJ)
tm3 = u.select_atoms("resid 119:152")
prot = u.select_atoms("protein and not group tm3", tm3=tm3)
fp = plf.Fingerprint()
fp.run(u.trajectory[::10], tm3, prot)

%load_ext line_profiler
%lprun -f plf.to_dataframe plf.to_dataframe(fp.ifp, fp.interactions.keys())

Here is the line-by-line execution time:

Timer unit: 1e-06 s

Total time: 20.484 s
File: /home/cedric/projects/ProLIF/prolif/utils.py
Function: to_dataframe at line 131

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   131                                           def to_dataframe(ifp, interactions, index_col="Frame", dtype=None,
   132                                                            drop_empty=True):
...
   174         1          4.0      4.0      0.0      n_interactions = len(interactions)
   175         1     197456.0 197456.0      1.0      data = pd.DataFrame(ifp)
   176         1       5372.0   5372.0      0.0      data.set_index(index_col, inplace=True)
   177                                               # sort columns by ResidueIds and interaction
   178         1      20973.0  20973.0      0.1      data.sort_index(axis=1, inplace=True)
   179         1       5711.0   5711.0      0.0      data.columns = pd.MultiIndex.from_tuples(data.columns)
   180                                               # check if dealing with single values or atom indices
   181         1         42.0     42.0      0.0      value = data.values[0, 0][0]
   182         1        120.0    120.0      0.0      is_iterable = isinstance(value, Iterable)
   183                                               # replace NaNs with appropriate values
   184         1          1.0      1.0      0.0      empty_value = dtype(False) if dtype else False
   185         1          1.0      1.0      0.0      fill_value = [None, None] if is_iterable else empty_value
   186         1     190775.0 190775.0      0.9      data = data.applymap(lambda x: [fill_value] * n_interactions
   187                                                                    if x is np.nan else x)
   188                                               # split each bitvector in separate columns for each interaction
   189         1        519.0    519.0      0.0      df = pd.DataFrame()
   190       624       2653.0      4.3      0.0      for l, p in data.columns:
   191       623       8270.0     13.3      0.0          cols = [(str(l), str(p), i) for i in interactions]
   192       623   20005669.0  32111.8     97.7          df[cols] = data[(l, p)].apply(pd.Series)
   193         2       2545.0   1272.5      0.0      df.columns = pd.MultiIndex.from_tuples(
   194         1          1.0      1.0      0.0          df.columns, names=["ligand", "protein", "interaction"])
   195         1          1.0      1.0      0.0      if dtype:
   196                                                   df = df.astype(dtype)
   197         1          1.0      1.0      0.0      if drop_empty:
   198         1          1.0      1.0      0.0          if is_iterable:
   199                                                       mask = df.apply(lambda s:
   200                                                                       ~(s.map(tuple).isin([(None, None)]).all()), axis=0)
   201                                                   else:
   202         1      39960.0  39960.0      0.2              mask = (df != empty_value).any(axis=0)
   203         1       3973.0   3973.0      0.0          df = df.loc[:, mask]
   204         1          2.0      2.0      0.0      return df

The problem comes from `data[(l, p)].apply(pd.Series)`, which is known to be slow: it builds a new `Series` object for every row of every column.

@cbouy
Member Author

cbouy commented Jan 12, 2021

Using `pd.DataFrame(s.tolist(), index=s.index)` instead of `s.apply(pd.Series)` is slightly faster, but still slow...
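A minimal, self-contained comparison of the two expansion strategies on synthetic data (stand-in shapes, not ProLIF's actual fingerprint):

```python
import pandas as pd

# Stand-in for one (ligand, protein) column: a Series whose entries are
# fixed-length lists with one boolean per interaction type.
n_frames, n_interactions = 1000, 10
s = pd.Series([[False] * n_interactions for _ in range(n_frames)])

# Slow: apply(pd.Series) constructs a new Series object per row.
wide_slow = s.apply(pd.Series)

# Faster: build the expanded frame from the list of lists in one call.
wide_fast = pd.DataFrame(s.tolist(), index=s.index)

# Both produce an (n_frames, n_interactions) frame with identical values.
assert (wide_slow.values == wide_fast.values).all()
```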

@cbouy cbouy self-assigned this Jan 12, 2021
@cbouy cbouy added the enhancement New feature or request label Jan 12, 2021
cbouy added a commit that referenced this issue Jan 13, 2021
- Updated the repr method of `ResidueId` so that it isn't confused with a string anymore
- Improved the speed of the `to_dataframe` function, from 8.5s to 150ms (TM3-GPCR PPI example)
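The commit doesn't show the implementation above, but one way to get this kind of speedup is to drop the per-column Python loop entirely and expand every cell at once through NumPy (an illustrative sketch with made-up shapes, not ProLIF's actual code):

```python
import numpy as np
import pandas as pd

# Stand-in for `data`: each cell holds a list with one boolean per
# interaction type (shapes are illustrative).
n_frames, n_pairs, n_interactions = 1000, 50, 10
data = pd.DataFrame(
    [[[False] * n_interactions for _ in range(n_pairs)]
     for _ in range(n_frames)]
)

# Flatten all cells into a (n_frames, n_pairs * n_interactions) array
# in one vectorized step instead of 600+ apply() calls.
values = np.array(data.values.tolist()).reshape(n_frames, -1)
columns = pd.MultiIndex.from_tuples(
    [(str(col), i) for col in data.columns for i in range(n_interactions)]
)
df = pd.DataFrame(values, index=data.index, columns=columns)
```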
@cbouy cbouy closed this as completed Jan 13, 2021