
ENH: Speed up indexing by bulk committing to database #1013

Merged
merged 11 commits into master from enh/profile_index
Aug 7, 2023

Conversation

@adelavega (Collaborator)

I did some line profiling on BIDSLayoutIndexer.__init__ and found the following:

In the single dataset I tested, around 80% of the time is spent indexing meta-data, and 20% indexing files.

In both cases, there were some easy-to-fix and slightly embarrassing issues, which led to a 2x speed-up (12s to 6s):

  • Bulk saving of SQL objects, instead of adding and committing them one at a time
  • Using a predefined dict instead of eval to convert dtype from string to type (see the sketch below)
  • The big one: we were reading JSON sidecars, and then, to test whether they were truly JSON later in the pipeline, we were serializing them to string and deserializing back to Python objects. I made the assumption that if a list or dict makes it that far in the process, it's "JSON" meta-data.
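
As a minimal sketch of the second fix (the mapping name here is illustrative, not the actual pybids identifier):

# Predefined lookup instead of eval("int"), eval("float"), ...
_DTYPE_MAP = {"int": int, "float": float, "str": str, "bool": bool}

def str_to_dtype(name):
    # was (roughly): return eval(name), which is slower and unsafe
    return _DTYPE_MAP[name]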

At this point, I would say that ~35-50% of meta-data indexing time is SQLAlchemy overhead (creating all the Tag objects and adding them to the db).

A good percentage of the remaining time is spent applying the inheritance principle, which could probably be sped up substantially.

We first ingest all files into the db (prior to meta-data ingestion); then, on a second pass over all files, we build a mapping of suffixes/extensions to their corresponding JSON files; then, on a third pass, we use that mapping to find all candidate sidecars for each file and merge them (see the sketch below).

No single line is particularly slow, but given that I/O on the JSON files accounts for only around 6% of the time, this could be improved.
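
Roughly, passes two and three amount to something like the following. This is a hedged sketch of the mapping idea only; the real code also matches on entities, and all names here are made up:

import json
from collections import defaultdict
from pathlib import Path

def build_sidecar_map(files):
    # Pass 2: group JSON sidecars by their BIDS suffix (e.g. "bold")
    mapping = defaultdict(list)
    for f in map(Path, files):
        if f.suffix == ".json":
            mapping[f.stem.split("_")[-1]].append(f)
    return mapping

def merge_metadata(data_file, mapping):
    # Pass 3: merge all candidate sidecars for one file; sort shallow
    # to deep so more specific (deeper) values win on key collisions
    suffix = Path(data_file).name.split(".")[0].split("_")[-1]
    candidates = sorted(mapping.get(suffix, []), key=lambda p: len(p.parts))
    meta = {}
    for sidecar in candidates:
        meta.update(json.loads(sidecar.read_text()))
    return meta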

@adelavega requested a review from effigies on August 3, 2023 01:10
@adelavega (Collaborator Author) commented Aug 3, 2023

CC: @clane9 thanks for the inspiration. Would you mind running your benchmarks on this branch?

@codecov bot commented Aug 3, 2023

Codecov Report

Patch coverage: 91.89% and project coverage change: +0.05% 🎉

Comparison is base (65c78cd) 83.42% compared to head (b859764) 83.48%.

❗ Current head b859764 differs from pull request most recent head d000552. Consider uploading reports for the commit d000552 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1013      +/-   ##
==========================================
+ Coverage   83.42%   83.48%   +0.05%     
==========================================
  Files          38       38              
  Lines        4289     4310      +21     
  Branches     1099     1098       -1     
==========================================
+ Hits         3578     3598      +20     
+ Misses        515      514       -1     
- Partials      196      198       +2     
Files Changed Coverage Δ
bids/layout/models.py 93.08% <87.09%> (+0.14%) ⬆️
bids/layout/index.py 86.98% <95.34%> (+0.32%) ⬆️


@effigies (Collaborator) left a review comment

Okay, so the problem that the json.dumps()/json.loads() cycle was solving was that we coerce to str. Why are we coercing to str?

Will test locally before committing this one...

@effigies (Collaborator) commented Aug 3, 2023

Nope. The issue is that _value needs to be a str because that's what's stored in sqlite, and we need to either use json.dumps()/json.loads() or str()/eval() to get Python objects back.
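
Concretely, the two round-trips being weighed (a sketch):

import json

value = {"RepetitionTime": 2.0, "SliceTiming": [0.0, 0.5]}

# json round-trip: safe, and handles dicts/lists/numbers uniformly
stored = json.dumps(value)      # the str that lands in the sqlite column
restored = json.loads(stored)

# str()/eval() round-trip: works for Python literals, but eval'ing
# stored strings is risky (ast.literal_eval is the safer variant)
stored_alt = str(value)
restored_alt = eval(stored_alt)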

@adelavega (Collaborator Author) commented Aug 3, 2023

Ah. Here's the problem, though: on index, we're creating those Tag objects, which triggers the full loads/dumps/loads cycle, when the last loads is unnecessary, since we're only creating those objects to store them in the db, not to present the value to the user.

So at the minimum, we could probably save ourselves the final loads cycle if we can figure out how to skip it upon ingestion (and only do it when querying files and creating the Tag objects).

In SQLAlchemy this would be an easy fix, because there's a JSONB-type column that can take JSON directly.
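
Something along these lines (a sketch, not the actual pybids Tag schema; note that JSONB itself is PostgreSQL-specific, and SQLAlchemy's generic JSON type is the cross-backend equivalent):

from sqlalchemy import JSON, Column, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TagValue(Base):
    # hypothetical alternative schema, for illustration only
    __tablename__ = "tag_value"
    id = Column(Integer, primary_key=True)
    # SQLAlchemy (de)serializes transparently, so application code
    # never runs the manual dumps()/loads() cycle itself
    value = Column(JSON)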

@adelavega (Collaborator Author)

Great, your fix @effigies still keeps the performance gains.

@effigies (Collaborator) commented Aug 3, 2023

Cool. You may want to do a logic check. I think the tests would have caught failures, but I was a little rushed and may have unnecessarily duplicated work.

@effigies (Collaborator) left a review comment

Some suggestions on the bulk save aggregation. Not sure if they're all good.

Comment on lines 229 to 231
dir_fo = self._index_dir(d, config, force=force)
if dir_fo:
all_file_objs += dir_fo
@effigies (Collaborator)

This is accounting for L188 returning None. Now that it returns a list, have it return an empty list there.

Suggested change
-    dir_fo = self._index_dir(d, config, force=force)
-    if dir_fo:
-        all_file_objs += dir_fo
+    all_file_objs.extend(self._index_dir(d, config, force=force))


-        return bf
+        return bf, file_objs
@effigies (Collaborator)

Should you not be adding bf to file_objs, since we were calling self.session.add(bf) on L238? Then you can just return file_objs here.

Comment on lines 216 to 217
bf, file_objs = self._index_file(abs_fn, config_entities)
all_file_objs += file_objs
@effigies (Collaborator)

Then this would become:

Suggested change
-    bf, file_objs = self._index_file(abs_fn, config_entities)
-    all_file_objs += file_objs
+    all_file_objs.extend(self._index_file(abs_fn, config_entities))

Comment on lines 251 to 254
if match_vals:
for _, (ent, val) in match_vals.items():
tag = Tag(bf, ent, str(val), ent._dtype)
self.session.add(tag)
file_objs.append(tag)
@effigies (Collaborator) commented Aug 3, 2023

This whole thing would be:

file_objs.extend(
    Tag(bf, ent, str(val), ent._dtype)
    for ent, val in match_vals.values()
)

@adelavega (Collaborator Author) commented Aug 3, 2023

I tried two more things, but I'm not sure if they're going to speed things up:

Using ujson instead of json for the serialize/deserialize dance, and plain Python str/eval instead of json.dumps. It saves maybe 10% of meta-data indexing time at most, so I'm not sure it's worth it, because it's less "safe" and ujson is an external library.

The remaining overhead is mostly tied to creating a SQLAlchemy object, with attributes, for every single tag.

@effigies (Collaborator) commented Aug 3, 2023

What are you using to benchmark?

@adelavega (Collaborator Author) commented Aug 3, 2023

I'm using this random dataset: ds002837, and I'm profiling with line_profiler.
I should use %%timeit for more robust stats.
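
For reference, the IPython incantations would look something like this (a sketch; "ds002837" stands in for the dataset's local path, and the profiled method is the one named in the opening comment):

from bids import BIDSLayout
from bids.layout.index import BIDSLayoutIndexer

%load_ext line_profiler
# per-line timings for that method, gathered while indexing runs
%lprun -f BIDSLayoutIndexer.__init__ BIDSLayout("ds002837")

# and, in a separate cell, wall-clock stats:
%%timeit
BIDSLayout("ds002837")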

But I found another major speed-up: instead of creating objects for every Tag, we can just create dictionaries, and SQLAlchemy can do a raw SQL insert against Tag's schema using those dictionaries (sketched below). This cuts a ton of overhead and speeds things up by about 50%.

Waiting on tests to pass for that one, but I don't see why it wouldn't work.
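
A sketch of the dict-based approach, using the variables from the review snippets above; the column keys here are assumptions, not the actual Tag schema:

# instead of one ORM object per tag:
#     session.add(Tag(bf, ent, str(val), ent._dtype))
# build plain dicts keyed by column name...
tag_dicts = [
    {"file_path": bf.path, "entity_name": ent.name,
     "_value": str(val), "_dtype": ent._dtype}
    for ent, val in match_vals.values()
]
# ...and let the Session emit an executemany INSERT, skipping ORM
# object construction and unit-of-work bookkeeping
session.bulk_insert_mappings(Tag, tag_dicts)
session.commit()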

@adelavega (Collaborator Author) commented Aug 3, 2023

Actually, it's even faster to just use SQLAlchemy Core for bulk inserts: 1.9s with Core vs 2.3s with bulk_insert_mappings, vs 6s on master, on ds002837.

I think the only con is that you don't have sessions, which is only really a problem when there are concurrent connections.
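
For comparison, the Core version is a plain executemany against the mapped table (a sketch, reusing tag_dicts from the previous sketch):

from sqlalchemy import create_engine

engine = create_engine("sqlite://")   # in-memory SQLite, as in pybids' default
with engine.begin() as conn:          # transaction commits on exit; no Session
    conn.execute(Tag.__table__.insert(), tag_dicts)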

@effigies (Collaborator) commented Aug 3, 2023

I think the assumption of most tools with a shared database is already that, if found, it's complete. So this isn't a regression.

@adelavega (Collaborator Author)

For some reason tests fail in all but one configuration. Need to dig into this more; it's possible that some of the guardrails were doing something.

@adelavega (Collaborator Author) commented Aug 3, 2023

Guess what does require Session management? Our tests.

@effigies (Collaborator) commented Aug 4, 2023

Small patch for your consideration. These feel like cleanups to me, but if they're distracting, ignore this comment.

diff --git a/bids/layout/index.py b/bids/layout/index.py
index 741e0bfd..62cbec26 100644
--- a/bids/layout/index.py
+++ b/bids/layout/index.py
@@ -186,7 +186,7 @@ class BIDSLayoutIndexer:
 
         # Derivative directories must always be added separately
         if self._layout._root.joinpath('derivatives') in abs_path.parents:
-            return None, None
+            return [], []
 
         config = list(config)  # Shallow copy
 
@@ -230,9 +230,8 @@ class BIDSLayoutIndexer:
             )
             if force is not False:
                 dir_bfs, dir_tag_dicts = self._index_dir(d, config, force=force)
-                if dir_bfs:
-                    all_bfs += dir_bfs
-                    all_tag_dicts += dir_tag_dicts
+                all_bfs += dir_bfs
+                all_tag_dicts += dir_tag_dicts
 
         return all_bfs, all_tag_dicts
 
@@ -250,11 +249,10 @@ class BIDSLayoutIndexer:
                 match_vals[e.name] = (e, m)
 
         # Create Entity <=> BIDSFile mappings
-        tag_dicts = []
-        if match_vals:
-            for _, (ent, val) in match_vals.items():
-                tag = _create_tag_dict(bf, ent, str(val), ent._dtype)
-                tag_dicts.append(tag)
+        tag_dicts = [
+            _create_tag_dict(bf, ent, val, ent._dtype)
+            for ent, val in match_vals.values()
+        ]
 
         return bf, tag_dicts
 
@@ -488,4 +486,4 @@ class BIDSLayoutIndexer:
                 
         self.session.bulk_save_objects(all_objs)
         self.session.bulk_insert_mappings(Tag, all_tag_dicts)
-        self.session.commit()
\ No newline at end of file
+        self.session.commit()

Otherwise, LGTM!

@effigies changed the title from "2x indexing speed up" to "ENH: Speed up indexing by bulk committing to database" on Aug 4, 2023
@adelavega (Collaborator Author)

Can you push that onto this branch? Looks fine to me

@effigies (Collaborator) commented Aug 4, 2023

Done. They went in above my comment, though...

@effigies merged commit ae1bb00 into master on Aug 7, 2023
24 checks passed
@effigies deleted the enh/profile_index branch on August 7, 2023 15:41