Switch from leiden to leidenbase #6792

Merged 16 commits on Dec 19, 2024

alanocallaghan
Contributor

@alanocallaghan alanocallaghan commented Dec 16, 2022

The Leiden implementation provided by the leiden package is absurdly slow, far slower than I'd expect from the overhead of calling reticulate alone. This switches to the (mostly) equivalent leidenbase. Tests pass locally for me; see also the demo below comparing three Leiden implementations.

See #6754

library("leidenbase")
library("leidenAlg")
library("leiden")
library("igraph")
library("microbenchmark")

fpath <- system.file('testdata', 'igraph_n1500_edgelist.txt.gz', package='leidenbase')
zfp <- gzfile(fpath)
igraph <- read_graph(file = zfp, format='edgelist')

microbenchmark(
    leidenAlg::leiden.community(igraph),
    leidenbase::leiden_find_partition(igraph),
    leiden::leiden(igraph),
    times = 10
)
# Unit: milliseconds
#                           expr         min          lq        mean      median
#       leiden.community(igraph)    52.98505    55.16530    63.16430    55.95267
#  leiden_find_partition(igraph)    29.22238    30.54163    39.15174    37.57693
#                 leiden(igraph) 17140.52418 20671.12398 21172.11517 21128.94100
#           uq         max neval cld
#     75.38524    80.10432    10  a 
#     40.30238    56.91094    10  a 
#  22268.67866 24597.89747    10   b


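# Run leidenbase and leiden with the same partition type, resolution, and seed,
# then cross-tabulate the resulting cluster labels.
r1 <- leiden_find_partition(igraph,
    partition_type = "RBConfigurationVertexPartition",
    seed = 1234,
    resolution_parameter = 1
)
r2 <- leiden(igraph,
    partition_type = "RBConfigurationVertexPartition",
    resolution_parameter = 1,
    seed = 1234
)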
table(r1$membership, r2)
#       1   2   3   4   5   6   7   8   9
#   1 295   0   0   0   0   0   0   0   0
#   2   0 209   0   0   0   0   0   0   0
#   3   0   0 201   0   0   0   0   0   0
#   4   0   0   0 191   0   0   0   0   0
#   5   0   0   0   0 175   0   0   0   0
#   6   0   0   0   0   0 170   0   0   0
#   7   0   0   0   0   0   0  93   0   0
#   8   0   0   0   0   0   0   0  85   0
#   9   0   0   0   0   0   0   0   0  81
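
A cross-tabulation like the one above shows agreement visually. For an automated equivalence check, a minimal sketch (not part of this PR) could compare two membership vectors with `igraph::compare()`, which does not care how the clusters are numbered. The vectors `a` and `b` below are made-up toy labels, not output from leidenbase or leiden.

```r
library(igraph)

# Toy membership vectors (hypothetical): same grouping, different label numbering.
a <- c(1, 1, 2, 2, 3, 3)
b <- c(2, 2, 3, 3, 1, 1)

# Adjusted Rand index: 1 means the partitions are identical up to relabeling.
ari <- compare(a, b, method = "adjusted.rand")

# Contingency table: a one-to-one match has exactly one non-zero cell
# in every row and every column.
tab <- table(a, b)
one_to_one <- all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
```

In a real check, `a` and `b` would be replaced by something like `r1$membership` and `r2` from the demo above.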

@szhorvat

szhorvat commented Dec 20, 2022

If you are looking to reduce overhead to a minimum, verify whether you are relying on any feature not already available in igraph::cluster_leiden(), and if not, just use that function. The various "partition type" settings are reproducible by setting the vertex_weights values appropriately. See the general form of the objective function in the documentation, where $n$ represents the vertex weights, and compare with the objective functions of the various "partition types". For example, setting $n_i = k_i / \sqrt{2m}$, where $k_i$ is the vertex degree (or strength) and $m$ is the edge count (or total edge weight), yields the modularity objective function (i.e. RBConfigurationVertexPartition).
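
As a rough illustration of the suggestion above, the following sketch sets the vertex weights to $n_i = k_i / \sqrt{2m}$ and passes them to `igraph::cluster_leiden()`. It assumes the `vertex_weights` argument and a default resolution of 1, which may differ between igraph versions; the Zachary karate club graph is used only as a small stand-in for a real nearest-neighbour graph.

```r
library(igraph)

g <- make_graph("Zachary")   # small built-in example graph (unweighted)

k <- strength(g)             # vertex strength; equals degree without edge weights
m <- sum(k) / 2              # edge count (total edge weight for weighted graphs)
n <- k / sqrt(2 * m)         # vertex weights n_i = k_i / sqrt(2m)

set.seed(1234)
# Generalized objective with the vertex weights above; the resolution is
# left at its default (assumed to be 1), which corresponds to modularity.
cl <- cluster_leiden(g, objective_function = "CPM", vertex_weights = n)

# Modularity of the resulting partition, for inspection.
q <- modularity(g, membership(cl))
```

Recent igraph versions also accept `objective_function = "modularity"` directly, which should amount to the same objective without computing the weights by hand.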

@alanocallaghan
Contributor Author

To be honest, it's a bit out of my wheelhouse to go through and reimplement the different objective functions. I just thought this was a useful contribution that doesn't increase the dependency count.

@alanocallaghan
Contributor Author

Pinging @saketkc @samuel-marsh as people who seem to be merging code:

Can you have a look at this? Feel free to say you're not interested and close it, or that you want more info/more tests showing equivalence. Either is fine

@samuel-marsh
Collaborator

I'll second @alanocallaghan's ping; it would be great to have this included!

I’m not a member of the dev team, so my actions are limited here, but I'm adding @Gesmira from the Seurat team.

Best,
Sam

@dcollins15 dcollins15 self-requested a review March 1, 2024 20:32
@alanocallaghan
Contributor Author

Just think of the aggregate wasted CPU cycles across every Seurat analysis for no gain

@alanocallaghan
Contributor Author

Again bump to say

> Feel free to say you're not interested and close it, or that you want more info/more tests showing equivalence. Either is fine

@samuel-marsh
Collaborator

Hi Seurat Team,

I just thought I would bump this to potentially get new eyes on this in the new year.

It would be really great to have this incorporated, as the Python pass-through is still very slow compared to R-native implementations, and having a faster Leiden implementation would help larger workflows significantly.

@dcollins15 hope you don’t mind me tagging you here as you are one I’ve seen doing most of merging/code updates lately.

Thanks as always!
Sam

@dcollins15
Contributor

Thanks for the poke @samuel-marsh!

> @dcollins15 hope you don’t mind me tagging you here as you are one I’ve seen doing most of merging/code updates lately.

Please feel free to continue tagging me on things you think are important 👌

As you can probably tell from the recent flurry of activity, I'm planning to submit a CRAN release for v5.2.0 in the next couple of days. Given that this change shouldn't affect any results, I think this is a no-brainer to include. ATM I'm working on merging in #8271—once it lands this change will be next to go 🚀

I'll try to do a quick sanity check of the clustering results before I start rebasing/pushing up documentation updates. The main ToDos are:

  1. Rebase alanocallaghan:master onto satijalab:develop and resolve conflicts.
  2. Update the docstring for RunLeiden to drop references to leidenalg and reticulate.
  3. Regenerate roxygen2 documentation.
  4. Update changelog.
  5. Bump version.

@alanocallaghan I'm happy to take care of these updates since you've given us the necessary permissions 🙌

I'll do my best to rebase carefully, but in the future it would save us from having to do any Git-jitsu if you could avoid using branch names that conflict with the ones in the main repository (i.e. master, develop); see #8294.

@dcollins15
Contributor

@alanocallaghan this turned out to be quite a bit more involved to get running than I initially realized but I think everything is working as expected now 🚀

In addition to the items I laid out above, I also ended up:

  • Dropping the check for the leidenalg Python package from RunLeiden.
  • Deprecating the method parameter for RunLeiden and FindClusters.
  • Fixing up the docstrings for RunLeiden and FindClusters.
  • Adding a smoke test for FindClusters.

If @samuel-marsh could give this a quick sanity check that would be fantastic—specifically the way I'm deprecating the method parameter. The smoke test isn't particularly comprehensive but I think it's enough to make us reasonably sure this won't introduce any real regressions—any extra checks you can quickly run are always appreciated 🙏
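
For readers unfamiliar with the pattern, here is a minimal sketch of deprecating an argument while keeping old calls working, followed by a smoke test in the same spirit. It is not Seurat's actual code: `run_leiden_sketch` is a hypothetical wrapper, `igraph::cluster_leiden()` stands in for the leidenbase call, and the value passed to `method` is purely illustrative.

```r
library(igraph)

# Hypothetical wrapper (not Seurat's RunLeiden): `method` is still accepted
# so existing calls keep working, but it is ignored and triggers a warning.
run_leiden_sketch <- function(graph, method = NULL) {
  if (!is.null(method)) {
    warning("`method` is deprecated and has no effect.", call. = FALSE)
  }
  # Stand-in clustering call (assumption): the merged PR uses leidenbase here.
  cl <- cluster_leiden(graph, objective_function = "modularity")
  as.integer(membership(cl))
}

# Smoke test: an old-style call still runs, warns, and returns one label per vertex.
g <- make_graph("Zachary")
warned <- FALSE
labels <- withCallingHandlers(
  run_leiden_sketch(g, method = "igraph"),
  warning = function(w) {
    warned <<- TRUE                  # record that the deprecation warning fired
    invokeRestart("muffleWarning")   # keep the test output quiet
  }
)
```

A smoke test like this does not check clustering quality; it only guards against the function erroring out or returning output of the wrong shape.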

Contributor

@dcollins15 dcollins15 left a comment

Going to go ahead and merge this now so I can continue on with the release 🚀 Please do let me know if you spot any issues

@dcollins15 dcollins15 merged commit 6b1c25a into satijalab:develop Dec 19, 2024
2 checks passed
@alanocallaghan
Contributor Author

Awesome thanks! Sorry it was more work than expected, I'd of course have been happy to do the small extras but I usually default to changelog and documentation changes being done by authors for consistency.

Hope the next release goes smoothly for yous!
