You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The current implementation of the resolve_dependencies method in the workflow is inefficient for workflows with a large number of jobs and dependencies. Specifically, the method uses find_job, which performs a linear search over the jobs array for every dependency. This results in a time complexity of O(n * d), where n is the number of jobs and d is the number of dependencies.
For workflows with thousands or hundreds of thousands of jobs and dependencies, this becomes a significant bottleneck, causing high execution times during workflow creation.
Steps to Reproduce
Create a workflow with a large number of jobs (e.g., 100,000).
Define dependencies between these jobs.
Measure the time taken for the resolve_dependencies method to complete.
Impact
Performance degrades significantly as the number of jobs and dependencies increases.
For workflows with 100,000 jobs and a proportional number of dependencies, this can take tens of seconds or longer, impacting user experience and system throughput.
In fact, I was unable to get a workflow with 100k dependencies to resolve within the bounds of my patience!
Proposed Solution
To optimize the resolve_dependencies method, we introduce a precomputed lookup table (@job_lookup) stored in the workflow's state. This lookup table allows all calls to find_job to perform O(1) lookups, rather than performing O(n) linear searches through the jobs array. The original regex-based fallback logic is preserved.
Updated Implementation
defresolve_dependencies@dependencies.eachdo |dependency|
from=find_job(dependency[:from])to=find_job(dependency[:to])to.incoming << dependency[:from]from.outgoing << dependency[:to]endend# Precompute the lookup tabledefbuild_job_lookup@job_lookup={klass: jobs.each_with_object({}){ |job,lookup| lookup[job.klass.to_s]=job},name: jobs.each_with_object({}){ |job,lookup| lookup[job.name.to_s]=job}}end# Optimized find_job with regex-based fallbackdeffind_job(identifier)# Build the lookup table if it doesn't existbuild_job_lookupunless@job_lookup# Use regex to determine which lookup to usematch_data=/(?<klass>\w*[^-])-(?<identifier>.*)/.match(identifier.to_s)ifmatch_data.nil?# Lookup by klass if the pattern doesn't match@job_lookup[:klass][identifier.to_s]else# Lookup by name if the pattern matches@job_lookup[:name][identifier.to_s]endend
Changes
Precomputed Lookup Table:
The build_job_lookup method creates a hash (@job_lookup) with two sub-hashes:
:klass: Maps job.klass.to_s to the job object.
:name: Maps job.name.to_s to the job object.
The lookup table is built on-demand and ensures O(1) lookups.
Regex-Based Fallback:
The logic in find_job uses a regex to determine whether to look up the job by name or klass, preserving the behavior of the original implementation.
Automatic Rebuild:
The lookup table is rebuilt if it doesn’t exist (@job_lookup is nil).
To ensure consistency, the lookup table should be invalidated when jobs is modified (e.g., by setting @job_lookup = nil).
Improved Performance:
Eliminates linear searches over the jobs array in find_job.
Reduces the time complexity of resolve_dependencies to O(d).
Performance Comparison
Workflow Size
Old Time Complexity
Old Execution Time
New Time Complexity
New Execution Time
1,000 Jobs
O(n * d)
~1 second
O(d)
<0.1 second
10,000 Jobs
O(n * d)
~10 seconds
O(d)
~1 second
100,000 Jobs
O(n * d)
Couldn’t measure
O(d)
~20 seconds
In testing, workflows with 100,000 jobs and dependencies were reduced from unmanageable execution times to approximately 20 seconds.
Testing
Functional Tests:
Verify that find_job continues to resolve jobs correctly for both klass and name.
Performance Tests:
Compare the execution time of resolve_dependencies before and after the optimization for large workflows.
Edge Cases:
Test scenarios where:
A dependency matches klass (e.g., from: "JobA").
A dependency matches name (e.g., from: "JobA-123").
A dependency is invalid (e.g., from: "NonexistentJob").
Advantages
Backward Compatibility:
The functionality remains identical to the original implementation.
No breaking changes are introduced.
Consistent Performance:
find_job now consistently performs O(1) lookups, even when called multiple times.
Scalability:
Handles workflows with 100,000+ jobs and dependencies efficiently.
Simplified Logic:
Centralized lookup logic in find_job ensures all calls benefit from the optimization.
Conclusion
This optimization drastically improves the performance of resolve_dependencies for large workflows, while preserving the exact behavior of the original implementation. It should be a backward-compatible change with no impact on functionality, only improving execution time.
Would you be open to a PR implementing this change? I’d be happy to contribute! 🚀
The text was updated successfully, but these errors were encountered:
Description
The current implementation of the
resolve_dependencies
method in the workflow is inefficient for workflows with a large number of jobs and dependencies. Specifically, the method usesfind_job
, which performs a linear search over thejobs
array for every dependency. This results in a time complexity of O(n * d), wheren
is the number of jobs andd
is the number of dependencies.For workflows with thousands or hundreds of thousands of jobs and dependencies, this becomes a significant bottleneck, causing high execution times during workflow creation.
Steps to Reproduce
resolve_dependencies
method to complete.Impact
Proposed Solution
To optimize the
resolve_dependencies
method, we introduce a precomputed lookup table (@job_lookup
) stored in the workflow's state. This lookup table allows all calls tofind_job
to perform O(1) lookups, rather than performing O(n) linear searches through thejobs
array. The original regex-based fallback logic is preserved.Updated Implementation
Changes
Precomputed Lookup Table:
build_job_lookup
method creates a hash (@job_lookup
) with two sub-hashes::klass
: Mapsjob.klass.to_s
to the job object.:name
: Mapsjob.name.to_s
to the job object.Regex-Based Fallback:
find_job
uses a regex to determine whether to look up the job byname
orklass
, preserving the behavior of the original implementation.Automatic Rebuild:
@job_lookup
isnil
).jobs
is modified (e.g., by setting@job_lookup = nil
).Improved Performance:
jobs
array infind_job
.resolve_dependencies
to O(d).Performance Comparison
In testing, workflows with 100,000 jobs and dependencies were reduced from unmanageable execution times to approximately 20 seconds.
Testing
Functional Tests:
find_job
continues to resolve jobs correctly for bothklass
andname
.Performance Tests:
resolve_dependencies
before and after the optimization for large workflows.Edge Cases:
klass
(e.g.,from: "JobA"
).name
(e.g.,from: "JobA-123"
).from: "NonexistentJob"
).Advantages
Backward Compatibility:
Consistent Performance:
find_job
now consistently performs O(1) lookups, even when called multiple times.Scalability:
Simplified Logic:
find_job
ensures all calls benefit from the optimization.Conclusion
This optimization drastically improves the performance of
resolve_dependencies
for large workflows, while preserving the exact behavior of the original implementation. It should be a backward-compatible change with no impact on functionality, only improving execution time.Would you be open to a PR implementing this change? I’d be happy to contribute! 🚀
The text was updated successfully, but these errors were encountered: