
ENH: Speed up indexing by bulk committing to database #1013

Merged
merged 11 commits into master from enh/profile_index
Aug 7, 2023

Conversation

@adelavega (Collaborator)

I did some line profiling on BIDSLayoutIndexer.__init__ and found the following:

In the single dataset I tested, around 80% of the time is spent indexing meta-data, and 20% indexing files.

In both cases, there were some easy-to-fix and slightly embarrassing issues, which led to a 2x speed-up (12s to 6s):

  • Bulk saving of SQL objects, instead of adding and committing them one at a time
  • Using a predefined dict instead of eval to convert dtype from string to type (see the sketch below)
  • The big one: we were reading JSON sidecars, and then, to test whether they were truly JSON later in the pipeline, we were serializing them to string and deserializing back to Python objects. I made the assumption that if a list or dict makes it that far in the process, it's "JSON" meta-data.
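
As a minimal sketch of the second fix (the mapping name here is illustrative, not the actual pybids identifier):

# Predefined lookup instead of eval("int"), eval("float"), ...
_DTYPE_MAP = {"int": int, "float": float, "str": str, "bool": bool}

def str_to_dtype(name):
    # was (roughly): return eval(name), which is slower and unsafe
    return _DTYPE_MAP[name]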

At this point, I would say that ~35-50% of meta-data indexing time is SQLAlchemy overhead (creating all the Tag objects and adding them to the db).

A good percentage of the remaining time is spent applying the inheritance principle, which could probably be sped up substantially.

We first ingest all files into the db (prior to meta-data ingestion); then, on a second pass over all files, we build a mapping of suffixes/extensions to their corresponding JSON files; then, on a third pass, we use that mapping to find all candidate sidecars for each file and merge them (see the sketch below).

No single line is particularly slow, but given that I/O on the JSON files accounts for only around 6% of the time, this could be improved.
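
Roughly, passes two and three amount to something like the following. This is a hedged sketch of the mapping idea only; the real code also matches on entities, and all names here are made up:

import json
from collections import defaultdict
from pathlib import Path

def build_sidecar_map(files):
    # Pass 2: group JSON sidecars by their BIDS suffix (e.g. "bold")
    mapping = defaultdict(list)
    for f in map(Path, files):
        if f.suffix == ".json":
            mapping[f.stem.split("_")[-1]].append(f)
    return mapping

def merge_metadata(data_file, mapping):
    # Pass 3: merge all candidate sidecars for one file; sort shallow
    # to deep so more specific (deeper) values win on key collisions
    suffix = Path(data_file).name.split(".")[0].split("_")[-1]
    candidates = sorted(mapping.get(suffix, []), key=lambda p: len(p.parts))
    meta = {}
    for sidecar in candidates:
        meta.update(json.loads(sidecar.read_text()))
    return meta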

@adelavega requested a review from effigies on August 3, 2023 01:10
@adelavega (Collaborator Author) commented Aug 3, 2023

CC: @clane9 thanks for the inspiration. Would you mind running your benchmarks on this branch?

@codecov bot commented Aug 3, 2023

Codecov Report

Patch coverage: 91.89% and project coverage change: +0.05% 🎉

Comparison is base (65c78cd) 83.42% compared to head (b859764) 83.48%.

❗ Current head b859764 differs from pull request most recent head d000552. Consider uploading reports for the commit d000552 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1013      +/-   ##
==========================================
+ Coverage   83.42%   83.48%   +0.05%     
==========================================
  Files          38       38              
  Lines        4289     4310      +21     
  Branches     1099     1098       -1     
==========================================
+ Hits         3578     3598      +20     
+ Misses        515      514       -1     
- Partials      196      198       +2     
Files Changed Coverage Δ
bids/layout/models.py 93.08% <87.09%> (+0.14%) ⬆️
bids/layout/index.py 86.98% <95.34%> (+0.32%) ⬆️


@effigies (Collaborator) left a review comment

Okay, so the problem that the json.dumps()/json.loads() cycle was solving was that we coerce to str. Why are we coercing to str?

Will test locally before committing this one...

@effigies (Collaborator) commented Aug 3, 2023

Nope. The issue is that _value needs to be a str because that's what's stored in sqlite, and we need to either use json.dumps()/json.loads() or str()/eval() to get Python objects back.
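
Concretely, the two round-trips being weighed (a sketch):

import json

value = {"RepetitionTime": 2.0, "SliceTiming": [0.0, 0.5]}

# json round-trip: safe, and handles dicts/lists/numbers uniformly
stored = json.dumps(value)      # the str that lands in the sqlite column
restored = json.loads(stored)

# str()/eval() round-trip: works for Python literals, but eval'ing
# stored strings is risky (ast.literal_eval is the safer variant)
stored_alt = str(value)
restored_alt = eval(stored_alt)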

@adelavega (Collaborator Author) commented Aug 3, 2023

Ah. Here's the problem, though: on index, we're creating those Tag objects, which triggers the full loads/dumps/loads cycle, when the last loads is unnecessary, since we're only creating those objects to store them in the db, not to present the value to the user.

So at the minimum, we could probably save ourselves the final loads cycle if we can figure out how to skip it upon ingestion (and only do it when querying files and creating the Tag objects).

In SQLAlchemy this would be an easy fix, because there's a JSONB-type column that can take JSON directly.
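
Something along these lines (a sketch, not the actual pybids Tag schema; note that JSONB itself is PostgreSQL-specific, and SQLAlchemy's generic JSON type is the cross-backend equivalent):

from sqlalchemy import JSON, Column, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TagValue(Base):
    # hypothetical alternative schema, for illustration only
    __tablename__ = "tag_value"
    id = Column(Integer, primary_key=True)
    # SQLAlchemy (de)serializes transparently, so application code
    # never runs the manual dumps()/loads() cycle itself
    value = Column(JSON)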

@adelavega (Collaborator Author)

Great, your fix @effigies still keeps the performance gains.

@effigies (Collaborator) commented Aug 3, 2023

Cool. You may want to do a logic check. I think the tests would have caught failures, but I was a little rushed and may have unnecessarily duplicated work.

@effigies (Collaborator) left a review comment

Some suggestions on the bulk save aggregation. Not sure if they're all good.

Comment on lines 229 to 231
dir_fo = self._index_dir(d, config, force=force)
if dir_fo:
all_file_objs += dir_fo
@effigies (Collaborator)

This is accounting for L188 returning None. Now that it returns a list, have it return an empty list there.

Suggested change
-    dir_fo = self._index_dir(d, config, force=force)
-    if dir_fo:
-        all_file_objs += dir_fo
+    all_file_objs.extend(self._index_dir(d, config, force=force))


-        return bf
+        return bf, file_objs
@effigies (Collaborator)

Should you not be adding bf to file_objs, since we were calling self.session.add(bf) on L238? Then you can just return file_objs here.

Comment on lines 216 to 217
bf, file_objs = self._index_file(abs_fn, config_entities)
all_file_objs += file_objs
@effigies (Collaborator)

Then this would become:

Suggested change
-    bf, file_objs = self._index_file(abs_fn, config_entities)
-    all_file_objs += file_objs
+    all_file_objs.extend(self._index_file(abs_fn, config_entities))

Comment on lines 251 to 254
if match_vals:
for _, (ent, val) in match_vals.items():
tag = Tag(bf, ent, str(val), ent._dtype)
self.session.add(tag)
file_objs.append(tag)
@effigies (Collaborator) commented Aug 3, 2023

This whole thing would be:

file_objs.extend(
    Tag(bf, ent, str(val), ent._dtype)
    for ent, val in match_vals.values()
)

@adelavega (Collaborator Author) commented Aug 3, 2023

I tried two more things, but I'm not sure if they're going to speed things up:

Using ujson instead of json for the serialize/deserialize dance, and plain Python str/eval instead of json.dumps. It saves maybe 10% of meta-data indexing time at most, so I'm not sure it's worth it, because it's less "safe" and ujson is an external library.

The remaining overhead is mostly tied to creating a SQLAlchemy object, with attributes, for every single tag.

@effigies (Collaborator) commented Aug 3, 2023

What are you using to benchmark?

@adelavega (Collaborator Author) commented Aug 3, 2023

I'm using this random dataset: ds002837, and I'm profiling with line_profiler.
I should use %%timeit for more robust stats.
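
For reference, the IPython incantations would look something like this (a sketch; "ds002837" stands in for the dataset's local path, and the profiled method is the one named in the opening comment):

from bids import BIDSLayout
from bids.layout.index import BIDSLayoutIndexer

%load_ext line_profiler
# per-line timings for that method, gathered while indexing runs
%lprun -f BIDSLayoutIndexer.__init__ BIDSLayout("ds002837")

# and, in a separate cell, wall-clock stats:
%%timeit
BIDSLayout("ds002837")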

But I found another major speed-up: instead of creating objects for every Tag, we can just create dictionaries, and SQLAlchemy can do a raw SQL insert against Tag's schema using those dictionaries (sketched below). This cuts a ton of overhead and speeds things up by about 50%.

Waiting on tests to pass for that one, but I don't see why it wouldn't work.
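
A sketch of the dict-based approach, using the variables from the review snippets above; the column keys here are assumptions, not the actual Tag schema:

# instead of one ORM object per tag:
#     session.add(Tag(bf, ent, str(val), ent._dtype))
# build plain dicts keyed by column name...
tag_dicts = [
    {"file_path": bf.path, "entity_name": ent.name,
     "_value": str(val), "_dtype": ent._dtype}
    for ent, val in match_vals.values()
]
# ...and let the Session emit an executemany INSERT, skipping ORM
# object construction and unit-of-work bookkeeping
session.bulk_insert_mappings(Tag, tag_dicts)
session.commit()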

@adelavega (Collaborator Author) commented Aug 3, 2023

Actually, it's even faster to just use SQLAlchemy Core for bulk inserts: 1.9s with Core vs 2.3s with bulk_insert_mappings, vs 6s on master, on ds002837.

I think the only con is that you don't have sessions, which is only really a problem when there are concurrent connections.
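
For comparison, the Core version is a plain executemany against the mapped table (a sketch, reusing tag_dicts from the previous sketch):

from sqlalchemy import create_engine

engine = create_engine("sqlite://")   # in-memory SQLite, as in pybids' default
with engine.begin() as conn:          # transaction commits on exit; no Session
    conn.execute(Tag.__table__.insert(), tag_dicts)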

@effigies (Collaborator) commented Aug 3, 2023

I think the assumption of most tools with a shared database is already that, if found, it's complete. So this isn't a regression.

@adelavega (Collaborator Author)

For some reason tests fail in all but one configuration. Need to dig into this more; it's possible that some of the guardrails were doing something.

@adelavega (Collaborator Author) commented Aug 3, 2023

Guess what does require Session management? Our tests.

@effigies (Collaborator) commented Aug 4, 2023

Small patch for your consideration. These feel like cleanups to me, but if they're distracting, ignore this comment.

diff --git a/bids/layout/index.py b/bids/layout/index.py
index 741e0bfd..62cbec26 100644
--- a/bids/layout/index.py
+++ b/bids/layout/index.py
@@ -186,7 +186,7 @@ class BIDSLayoutIndexer:
 
         # Derivative directories must always be added separately
         if self._layout._root.joinpath('derivatives') in abs_path.parents:
-            return None, None
+            return [], []
 
         config = list(config)  # Shallow copy
 
@@ -230,9 +230,8 @@ class BIDSLayoutIndexer:
             )
             if force is not False:
                 dir_bfs, dir_tag_dicts = self._index_dir(d, config, force=force)
-                if dir_bfs:
-                    all_bfs += dir_bfs
-                    all_tag_dicts += dir_tag_dicts
+                all_bfs += dir_bfs
+                all_tag_dicts += dir_tag_dicts
 
         return all_bfs, all_tag_dicts
 
@@ -250,11 +249,10 @@ class BIDSLayoutIndexer:
                 match_vals[e.name] = (e, m)
 
         # Create Entity <=> BIDSFile mappings
-        tag_dicts = []
-        if match_vals:
-            for _, (ent, val) in match_vals.items():
-                tag = _create_tag_dict(bf, ent, str(val), ent._dtype)
-                tag_dicts.append(tag)
+        tag_dicts = [
+            _create_tag_dict(bf, ent, val, ent._dtype)
+            for ent, val in match_vals.values()
+        ]
 
         return bf, tag_dicts
 
@@ -488,4 +486,4 @@ class BIDSLayoutIndexer:
                 
         self.session.bulk_save_objects(all_objs)
         self.session.bulk_insert_mappings(Tag, all_tag_dicts)
-        self.session.commit()
\ No newline at end of file
+        self.session.commit()

Otherwise, LGTM!

@effigies changed the title from "2x indexing speed up" to "ENH: Speed up indexing by bulk committing to database" on Aug 4, 2023
@adelavega (Collaborator Author)

Can you push that onto this branch? Looks fine to me

@effigies (Collaborator) commented Aug 4, 2023

Done. They went in above my comment, though...

@effigies merged commit ae1bb00 into master on Aug 7, 2023
24 checks passed
@effigies deleted the enh/profile_index branch on August 7, 2023 15:41