fix performance issue when exporting many groups #3681
Is there a benchmark comparing the time it takes to get the result? It depends on what exactly `group.nodes` is doing, but in the end I guess the performance bottleneck is loading the nodes, not the query. Just out of interest (since the title is "fix performance issue").
# Could project on ['id', 'uuid', 'node_type'] for further performance enhancement
group_qb.append(orm.Node, with_group='groups', project=['*'])

for row in group_qb.all():
Could you use `group_qb.iterall()` here? It might lead to further improvement, because the results would then be fetched in batches rather than all at once, as is done now. If the subsequent code also takes time, this could reduce the time-to-solution.
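To illustrate the difference between fetching everything at once and fetching in batches, here is a minimal sketch using only the standard-library `sqlite3` module. It is not AiiDA code: the `node` table and the `iter_in_batches` helper are hypothetical stand-ins for the eager and batched fetch styles discussed in this comment.

```python
import sqlite3


def iter_in_batches(cursor, batch_size=100):
    """Yield rows one at a time while pulling them from the cursor in batches."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield from batch


# Hypothetical table standing in for the node table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE node (id INTEGER PRIMARY KEY, uuid TEXT)')
conn.executemany(
    'INSERT INTO node (uuid) VALUES (?)',
    [(f'uuid-{i}',) for i in range(1000)],
)

# Eager: every row is materialised before any processing can start.
all_rows = conn.execute('SELECT id, uuid FROM node ORDER BY id').fetchall()

# Batched: processing can start as soon as the first batch arrives.
# The rows are collected into a list here only so both results can be compared.
cursor = conn.execute('SELECT id, uuid FROM node ORDER BY id')
batched_rows = list(iter_in_batches(cursor, batch_size=100))
```

Both approaches return the same rows; the batched variant mainly helps when per-row processing is slow or when holding all rows in memory is undesirable.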
Here come some actual numbers. Test set: export of 67k groups containing 5 nodes each.
My guess was that the bottleneck is the N queries you make for N groups, and I think the benchmark results back this up. However, as you say and as I point out in the comment, there might still be a significant speedup, as well as memory savings, from not constructing the ORM objects at all. @sphuber: Do you happen to know whether there is an easy way to replace these checks aiida-core/aiida/tools/importexport/dbexport/__init__.py Lines 232 to 235 in 9172c60
by checks on the `node_type`?
Yes, you can select those sets of nodes directly on the
and for
In those particular query builder definitions you can simply make two builders querying on the specific node class
88e406d
to
4a001ef
Compare
Thanks @sphuber - the latest implementation now takes 11s on the test set (down from 204s on
data_results = orm.QueryBuilder(**qh_groups).append(orm.Data, project=['id', 'uuid'], with_group='groups').all()

from builtins import zip  # pylint: disable=redefined-builtin
Without this I get an error that `zip` is a module and not callable (we have a module called `zip` in the same folder).
If you want me to rename the module instead, let me know.
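The name clash can be reproduced in isolation. The sketch below is an assumption-laden illustration, not the actual package layout: it binds the name `zip` to an empty module object (standing in for the sibling `zip` module) and then restores the built-in function with the same explicit import used in the diff.

```python
import types

# Simulate the clash: the name `zip` now refers to a module object,
# as it would if a sibling module called `zip` shadowed the built-in.
zip = types.ModuleType('zip')  # pylint: disable=redefined-builtin

try:
    zip(['id'], ['uuid'])
except TypeError as exc:
    error_message = str(exc)

# The explicit import rebinds `zip` to the built-in function again.
from builtins import zip  # pylint: disable=redefined-builtin

pairs = list(zip(['id'], ['uuid']))
```

Renaming the module removes the need for the explicit import altogether, which is the alternative offered in the comment.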
4a001ef
to
b3cdfa5
Compare
When providing groups to `verdi export`, it was looping over all groups and using the `Group.nodes` iterator to retrieve the nodes contained in the groups. This results in (at least) one query per group, and is therefore very inefficient for large numbers of groups. The new implementation replaces this by two queries, one for Data nodes and one for Process nodes. It also no longer constructs the ORM objects since they are unnecessary.
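The change in query pattern can be sketched without AiiDA. The example below uses the standard-library `sqlite3` module with a hypothetical schema (`grp`, `node`, `grp_node`) and made-up `node_type` prefixes; it only illustrates replacing one query per group with one query per node kind, and is not the code from this PR.

```python
import sqlite3

# Hypothetical schema: groups, nodes, and a many-to-many link table.
conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE grp (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE node (id INTEGER PRIMARY KEY, uuid TEXT, node_type TEXT);
    CREATE TABLE grp_node (grp_id INTEGER, node_id INTEGER);
""")

n_groups, nodes_per_group = 200, 5
node_id = 0
for group_id in range(1, n_groups + 1):
    conn.execute('INSERT INTO grp VALUES (?, ?)', (group_id, f'group-{group_id}'))
    for k in range(nodes_per_group):
        node_id += 1
        node_type = 'data.' if k % 2 == 0 else 'process.'
        conn.execute('INSERT INTO node VALUES (?, ?, ?)', (node_id, f'uuid-{node_id}', node_type))
        conn.execute('INSERT INTO grp_node VALUES (?, ?)', (group_id, node_id))

group_ids = [row[0] for row in conn.execute('SELECT id FROM grp')]

# Old pattern: one query per group, so N groups cost N round trips.
per_group = set()
for group_id in group_ids:
    rows = conn.execute(
        'SELECT n.id, n.uuid FROM node n '
        'JOIN grp_node gn ON gn.node_id = n.id WHERE gn.grp_id = ?',
        (group_id,),
    )
    per_group.update(rows)


def nodes_of_type(prefix):
    """Fetch (id, uuid) for all nodes of one kind in the selected groups, in one query."""
    placeholders = ','.join('?' for _ in group_ids)
    sql = (
        'SELECT DISTINCT n.id, n.uuid FROM node n '
        'JOIN grp_node gn ON gn.node_id = n.id '
        f'WHERE gn.grp_id IN ({placeholders}) AND n.node_type LIKE ?'
    )
    return set(conn.execute(sql, (*group_ids, prefix + '%')))


# New pattern: two queries in total, projecting only the columns that are needed.
data_nodes = nodes_of_type('data.')
process_nodes = nodes_of_type('process.')
```

Projecting only `id` and `uuid` mirrors the point made in the commit message about not constructing full ORM objects when plain identifiers suffice.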
b3cdfa5
to
f6c05d4
Compare
@sphuber This is ready for review
Good stuff, thanks a lot @ltalirz !
Do you want me to write the commit message?
No, I was just leaving the honors to yourself, commit message looks great :)
}, tag='groups'
).queryhelp

# Delete this import once the dbexport.zip module has been renamed
For the future: This should be changed in #3242
When providing groups to `verdi export`, it was looping over all groups and using the `Group.nodes` iterator to retrieve the nodes contained in the groups. This results in (at least) one query per group, and is therefore very inefficient for large numbers of groups. The new implementation replaces this by a single query.