Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Count number of connected components more efficiently than length(connected_components(g)) #407

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

thchr
Copy link
Contributor

@thchr thchr commented Nov 14, 2024

This adds a new function count_connected_components, which returns the same value as length(connected_components(g)) but substantially faster by avoiding unnecessary allocations. In particular, connected_components materializes component vectors that are not actually necessary for determining the number of components.
Similar reasoning also lets one optimize is_connected a bit: did that also.

While I was there, I also improved connected_components! slightly: previously, it was allocating a new queue for every new "starting vertex" in the search; but the queue is always empty when it's time to add a new vertex at that point, so there's no point in instantiating a new vector.
To enable users who might want to call connected_components! many times in a row to reduce allocations further (I am one such user), I also made it possible to pass this queue as an optimization.

Finally, connected_components! is very useful and would make sense to export; so I've done that here.

Cc @gdalle, if you have time to review.

@thchr
Copy link
Contributor Author

thchr commented Nov 14, 2024

For the doctest example of g = Graph(Edge.([1=>2, 2=>3, 3=>1, 4=>5, 5=>6, 6=>4, 7=>8])), count_connected_components is about twice as fast as length∘connected_components (179 ns vs. 290 ns). Using the buffers, it is faster still (105 ns).

Copy link

codecov bot commented Nov 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.31%. Comparing base (24539fd) to head (d5e0e37).
Report is 4 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #407   +/-   ##
=======================================
  Coverage   97.30%   97.31%           
=======================================
  Files         117      117           
  Lines        6948     6963   +15     
=======================================
+ Hits         6761     6776   +15     
  Misses        187      187           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -1,26 +1,32 @@
# Parts of this code were taken / derived from Graphs.jl. See LICENSE for
# licensing details.
"""
connected_components!(label, g)
connected_components!(label, g, [search_queue])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am all for performance improvements. But I am a bit skeptical if it is worth making the interface more complicated.

Almost all graph algorithms need some kind of of work buffer, so we could have something like in al algorithms but in the end it should be the job for Julia's allocator to verify if there is some suitable piece of memory lying around. We can help it by using sizehint! with a suitable heuristic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this will usually not be relevant; in my case it is though, and is the main reason I made the changes. I also agree that there is a trade off between performance improvements and complications of the API. On the other hand, I think passing such work buffers as optional arguments is a good solution to such trade-offs: for most users, the complication can be safely ignored and shouldn't complicate their lives much.

As you say, there are potentially many algorithms in Graphs.jl that could take a work buffer; in light of that, maybe this could be more palatable if we settle on a unified name for these kinds of optional buffers, so that it lowers the complications by standardizing across methods.
Maybe just work_buffer (and, if there are multiple, work_buffer1, work_buffer2, etc?)

Copy link
Member

@gdalle gdalle Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do this then all functions should take exactly one work_buffer (possibly a tuple) and have an appropriate function to initialize the buffer. I think it is a major change which should be discussed separately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think if this is really important for your use case you can either

  • Create a version that uses a buffer in the Experimental submodule. Currently we don't guarantee semantic versioning there - this allows use to remove things in the future without breaking the API.
  • Or as this code is very simple you might just copy it to your own repository.

But just to clarify - your problem is not that you are building graphs by adding edges until they are connected? Because if that is the issue, there is a much better algorithm.

Copy link
Contributor Author

@thchr thchr Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create a version that uses a buffer in the Experimental submodule. Currently we don't guarantee semantic versioning there - this allows use to remove things in the future without breaking the API.

I won't be able to find the time for factoring this out into the Experimental submodule, unfortunately.
I'm happy to e.g., add an admonition to the docstring, indicating that the work buffer arguments are unstable API which are subject to breakage though. Factoring this into a submodule, piecing it back together, and adding multiple doc-strings across modules, and eventually loading this behind a Graphs.Experimental call is more fiddling than I'm up for.

Or as this code is very simple you might just copy it to your own repository.

Indeed, I have and will just continue to do that, yep.

But just to clarify - your problem is not that you are building graphs by adding edges until they are connected? Because if that is the issue, there is a much better algorithm.

No, I'm not doing that; appreciate the check though.


As a general side note - and please know that I appreciate these reviews very much (!) and your efforts on what is no doubt spare time (!!) - but I wonder if the level of scrutiny and optimization that many PRs here go through is optimal: I understand the intent and the aim of making stable API and of getting good, maintainable code. But I think there's a risk of trading off too much towards these goals, at the cost of vibrancy and community engagement. From my experience here, there's room for leaning more towards "is this better than what we previously had" over "could this be even better".

PRs like this usually happen on time "stolen away" from our day-jobs, and the odds of returning to PRs for edits, however small, go down very quickly with time; similarly, the expectation that a PR will be a multi-iteration process reduces the likelihood it will be made.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the feedback, and I lean in the same direction. The JuliaGraphs community calls have pretty much dried up recently and we don't have much of a community to start with, so it's in our best interest to make engagement rewarding instead of tiresome.
I'm expecting an answer this month on funding I applied to which could serve to revitalize this ecosystem, hopefully I'll have good news to share soon.

src/connectivity.jl Outdated Show resolved Hide resolved
src/connectivity.jl Outdated Show resolved Hide resolved
@simonschoelly
Copy link
Member

For the doctest example of g = Graph(Edge.([1=>2, 2=>3, 3=>1, 4=>5, 5=>6, 6=>4, 7=>8])), count_connected_components is about twice as fast as length∘connected_components (179 ns vs. 290 ns). Using the buffers, it is faster still (105 ns).

We should not do benchmarks on such small graphs unless the algorithm has a huge complexity and is slow even on very small graphs. Otherwise the benchmark is way too noisy and also does not really reflect the situations where this library is used.

@thchr
Copy link
Contributor Author

thchr commented Nov 21, 2024

We should not do benchmarks on such small graphs unless the algorithm has a huge complexity and is slow even on very small graphs. Otherwise the benchmark is way too noisy and also does not really reflect the situations where this library is used.

What are some good go-to defaults for testing? This is a thing I'm running up against frequently, I feel: I am not sure which graphs to test against, and anything beyond small toy examples are not easily accessible via convenience constructors in Graphs.

As context, in my situation the graphs are rarely larger than 50-100 vertices; my challenge is that I need to consider a huge number of permutations of such graphs, so performance in the small-graph case is relevant to me.

@gdalle
Copy link
Member

gdalle commented Nov 21, 2024

What are some good go-to defaults for testing? This is a thing I'm running up against frequently, I feel: I am not sure which graphs to test against, and anything beyond small toy examples are not easily accessible via convenience constructors in Graphs.

I have opened this issue to discuss further:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants