ENH: Speed up indexing by bulk committing to database #1013
Conversation
CC: @clane9 thanks for the inspiration. Would you mind running your benchmarks on this branch?
Codecov Report

Patch coverage:

```
@@            Coverage Diff             @@
##           master    #1013      +/-   ##
==========================================
+ Coverage   83.42%   83.48%   +0.05%
==========================================
  Files          38       38
  Lines        4289     4310      +21
  Branches     1099     1098       -1
==========================================
+ Hits         3578     3598      +20
+ Misses        515      514       -1
- Partials      196      198       +2
```
Okay, so the problem that the `json.dumps()`/`json.loads()` cycle was solving was that we coerce to str. Why are we coercing to str?

Will test locally before committing this one...

Nope. The issue is that …

Ah. Here's the problem though: on index it's creating those … So at the minimum we could probably save ourselves the final … In SQLAlchemy this would be an easy fix because there's a …
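For context, a minimal illustration of the two coercion strategies under discussion (standalone sketch, not pybids code):

```python
import json

value = 1.5
# The dumps()/loads() cycle referred to above: serialize, then parse back
round_tripped = json.loads(json.dumps(value))  # -> 1.5 (a float again)
# If the goal is just a string to store in the database, str() is cheaper
stored = str(value)                            # -> '1.5'
```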
Great, your fix @effigies still keeps the performance gains.
Cool. You may want to do a logic check. I think the tests would have caught failures, but I was a little rushed and may have unnecessarily duplicated work.
Some suggestions on the bulk save aggregation. Not sure if they're all good.
bids/layout/index.py
Outdated
```python
dir_fo = self._index_dir(d, config, force=force)
if dir_fo:
    all_file_objs += dir_fo
```
This is accounting for L188 returning `None`. Now that it returns a list, have it return an empty list there.

```diff
-            dir_fo = self._index_dir(d, config, force=force)
-            if dir_fo:
-                all_file_objs += dir_fo
+            all_file_objs.extend(self._index_dir(d, config, force=force))
```
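The reviewer's point generalizes: returning an empty list instead of `None` for the skipped case lets callers drop the guard entirely. A minimal sketch with hypothetical names (not the actual pybids signatures):

```python
def index_dir(path, skip=False):
    """Toy stand-in for _index_dir; returns [] rather than None when skipping."""
    if skip:
        return []  # empty list: callers can extend() unconditionally
    return [f"{path}/a.nii.gz", f"{path}/a.json"]

all_file_objs = []
for d, skip in [("sub-01", False), ("derivatives", True)]:
    # No `if result:` guard needed; extending by [] is a no-op
    all_file_objs.extend(index_dir(d, skip=skip))

# all_file_objs now holds only the files from the non-skipped directory
```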
bids/layout/index.py
Outdated
```diff
-        return bf
+        return bf, file_objs
```
Should you not be adding `bf` to `file_objs`, since we were calling `self.session.add(bf)` on L238? Then you can just return `file_objs` here.
bids/layout/index.py
Outdated
```python
bf, file_objs = self._index_file(abs_fn, config_entities)
all_file_objs += file_objs
```
Then this would become:
```diff
-                bf, file_objs = self._index_file(abs_fn, config_entities)
-                all_file_objs += file_objs
+                all_file_objs.extend(self._index_file(abs_fn, config_entities))
```
bids/layout/index.py
Outdated
```python
if match_vals:
    for _, (ent, val) in match_vals.items():
        tag = Tag(bf, ent, str(val), ent._dtype)
        self.session.add(tag)
        file_objs.append(tag)
```
This whole thing would be:
```python
file_objs.extend(
    Tag(bf, ent, str(val), ent._dtype)
    for ent, val in match_vals.values()
)
```
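One detail worth noting about the suggested comprehension: iterating `.values()` of an empty dict yields nothing, so the old `if match_vals:` guard is unnecessary. A standalone sketch, with plain tuples standing in for `Tag` objects:

```python
match_vals = {
    "subject": ("subject_entity", "01"),
    "task": ("task_entity", "rest"),
}

file_objs = []
# The keys are unused, so .values() replaces .items() with a throwaway `_`
file_objs.extend((ent, str(val)) for ent, val in match_vals.values())

# Empty dict: extend() receives an exhausted generator and adds nothing
file_objs.extend((ent, str(val)) for ent, val in {}.values())
```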
I tried two more things, but I'm not sure if they're going to speed things up: using `ujson` instead of `json` for the serialize/deserialize dance, or just plain Python … The remaining overhead is more tied to making a …
What are you using to benchmark?
I'm using this random dataset: …

But I found another major speed up: instead of creating objects for every …

Waiting on tests to pass for that one, but I don't see why it wouldn't work.
Actually, it's even faster to just use SQLAlchemy Core for bulk inserts: 1.9s vs 2.3s with … I think the only con is you don't have sessions, which is only really a problem when there are concurrent connections.
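For a sense of why the Core path wins: a bulk insert collapses into a single executemany round-trip instead of constructing and tracking one ORM object per row. A stdlib `sqlite3` sketch of the same pattern (hypothetical table, not the pybids schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (entity TEXT, value TEXT)")

rows = [(f"ent{i}", str(i)) for i in range(1000)]

# One executemany call instead of 1000 ORM objects + session.add() calls
conn.executemany("INSERT INTO tags (entity, value) VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM tags").fetchone()[0]
```

SQLAlchemy Core's `insert()` executed with a list of parameter dicts compiles down to essentially this, which is why it avoids most of the per-`Tag` overhead discussed above.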
I think the assumption of most tools with a shared database is already that, if found, it's complete. So this isn't a regression.
For some reason tests fail on all but 1 configuration. Need to dig into this more; it's possible that some of the guardrails were doing something.
Guess what does require Session management? Our tests.
Small patch for your consideration. They feel like cleanups to me, but if they're distracting, ignore this comment.

```diff
diff --git a/bids/layout/index.py b/bids/layout/index.py
index 741e0bfd..62cbec26 100644
--- a/bids/layout/index.py
+++ b/bids/layout/index.py
@@ -186,7 +186,7 @@ class BIDSLayoutIndexer:
         # Derivative directories must always be added separately
         if self._layout._root.joinpath('derivatives') in abs_path.parents:
-            return None, None
+            return [], []

         config = list(config)  # Shallow copy
@@ -230,9 +230,8 @@ class BIDSLayoutIndexer:
             )
             if force is not False:
                 dir_bfs, dir_tag_dicts = self._index_dir(d, config, force=force)
-                if dir_bfs:
-                    all_bfs += dir_bfs
-                    all_tag_dicts += dir_tag_dicts
+                all_bfs += dir_bfs
+                all_tag_dicts += dir_tag_dicts

         return all_bfs, all_tag_dicts
@@ -250,11 +249,10 @@ class BIDSLayoutIndexer:
                 match_vals[e.name] = (e, m)

         # Create Entity <=> BIDSFile mappings
-        tag_dicts = []
-        if match_vals:
-            for _, (ent, val) in match_vals.items():
-                tag = _create_tag_dict(bf, ent, str(val), ent._dtype)
-                tag_dicts.append(tag)
+        tag_dicts = [
+            _create_tag_dict(bf, ent, val, ent._dtype)
+            for ent, val in match_vals.values()
+        ]

         return bf, tag_dicts
@@ -488,4 +486,4 @@ class BIDSLayoutIndexer:
         self.session.bulk_save_objects(all_objs)
         self.session.bulk_insert_mappings(Tag, all_tag_dicts)
-        self.session.commit()
\ No newline at end of file
+        self.session.commit()
```

Otherwise, LGTM!
Can you push that onto this branch? Looks fine to me.
Done. They went in above my comment, though...
I did some line profiling on `BIDSLayoutIndexer.__init__` and found the following. In the single dataset I tested, around 80% of time is spent indexing meta-data, and 20% is indexing files.

In both cases, there were some easy-to-fix and slightly embarrassing issues, which led to a 2x speed up (12s to 6s):

- using `eval` to convert dtype from string to type; instead use a predefined dict

At this point, I would say that ~35-50% of meta-data indexing time is SQLAlchemy overhead (creating all the `Tag` objects and adding to the db).

A good percentage of the remaining time is applying the inheritance principle, and this could probably be sped up substantially. We first ingest all files into the db (prior to meta-data ingestion), then on a second pass of all files build a mapping of suffixes/extensions to their corresponding JSON files, then on a third pass of all files use that mapping to find all candidate sidecars for a file and merge them. No single line is particularly slow, but given that I/O of JSON files is only around 6% of the time, it could be improved.
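The `eval`-to-dict fix mentioned above can be sketched as follows (hypothetical helper names, not the actual pybids code):

```python
# Before: eval("int") on every tag read -- slow and unsafe.
# After: a predefined lookup table, built once.
_DTYPE_MAP = {"str": str, "int": int, "float": float, "bool": bool}

def coerce(stored_value: str, dtype_name: str):
    """Convert a string-stored tag value back to its declared dtype."""
    # NB: bool("False") is True, so a real implementation would need
    # a special case for the "bool" entry in this table.
    return _DTYPE_MAP[dtype_name](stored_value)

coerce("3", "int")      # -> 3
coerce("2.5", "float")  # -> 2.5
```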