Clustering algorithm based on HashingTF LSH method #25
base: develop
Conversation
Hi @agarwalpratikkumar,
thank you for your contribution! I just did a bit of a review; feel free to respond to my comments.
Please, provide some basic unit-tests for your functionality before we consider merging the PR.
Best regards,
Hi @agarwalpratikkumar,
thanks a lot for the change.
By the way, I haven't seen any unit test which covers the functionality provided here. Please, could you add some basic unit tests so that we can go further with the PR?
Many thanks in advance.
Best regards,
…ependency of graphframes and test file for unit-testing
Hi @agarwalpratikkumar , thanks a lot for the update. I can see that Travis CI is still complaining about the graph-frames dependency: https://travis-ci.org/SANSA-Stack/SANSA-ML/builds/652965675#L5810 . Could you try to fix it before we merge the PR? Best regards,
Hi @GezimSejdiu
<dependency>
  <groupId>graphframes</groupId>
  <artifactId>graphframes</artifactId>
  <version>0.7.0-spark2.4-s_2.11</version>
</dependency>
Great! Thanks for adding the graphframes dependency. The project seems to build now. Consider using the latest version, i.e. 0.8.0, and also format it :) -- i.e. align it with the other dependency lists.
You have to add a Maven repo; nobody wants to add local Jars to the project classpath manually nowadays.
It is also mentioned on the GraphFrames Maven artifact page:
Note: this artifact is located at the SparkPackages repository (https://dl.bintray.com/spark-packages/maven/)
That means we have to add:
<repository>
  <id>SparkPackages</id>
  <name>Repo for Spark packages</name>
  <url>https://dl.bintray.com/spark-packages/maven/</url>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
</repository>
/*
 *
 * Clustering
Give a bit of a better description, i.e. the name of the clustering approach and what it does.
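For illustration, a header along these lines might do (the wording is only a suggestion, based on what the PR appears to implement):

/*
 * LocalitySensitiveHashing: clusters RDF resources by turning their textual features into
 * hashed term-frequency vectors (HashingTF), finding approximately similar resources with
 * MinHash LSH, and grouping the resulting similarity pairs via connected components.
 */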
def getOnlyPredicates(parsedTriples: RDD[(String, String, Object)]): RDD[(String, String)] = {
  return parsedTriples.map(f => {
    val key = f._1 + "" // Subject is the key
Why are you adding "" at the end? For converting it to a string? Can you just use toString() if needed?
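For example, a minimal sketch of the suggested change (the pairing of subject and predicate is an assumption based on the method's signature):

import org.apache.spark.rdd.RDD

def getOnlyPredicates(parsedTriples: RDD[(String, String, Object)]): RDD[(String, String)] = {
  parsedTriples.map { f =>
    val key = f._1.toString // Subject is the key; explicit conversion instead of appending ""
    (key, f._2)             // assumed: keep the predicate as the value
  }
}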
def minHashLSH(featuredData_Df: DataFrame): (MinHashLSHModel, DataFrame) = {
  val mh = new MinHashLSH().setNumHashTables(3).setInputCol("features").setOutputCol("HashedValues")
Is this number 3 fixed when we setNumHashTables, or can it also be adjusted based on the use case? We shouldn't use any hard-coded values.
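A sketch of how the value could be passed in instead of hard-coded; the fit/transform body is an assumption that merely matches the declared return type:

import org.apache.spark.ml.feature.{MinHashLSH, MinHashLSHModel}
import org.apache.spark.sql.DataFrame

def minHashLSH(featuredData_Df: DataFrame, numHashTables: Int = 3): (MinHashLSHModel, DataFrame) = {
  val mh = new MinHashLSH()
    .setNumHashTables(numHashTables) // taken from a config file or CLI argument instead of a literal
    .setInputCol("features")
    .setOutputCol("HashedValues")
  val model = mh.fit(featuredData_Df)
  (model, model.transform(featuredData_Df))
}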
def approxSimilarityJoin(model: MinHashLSHModel, transformedData_Df: DataFrame): Dataset[_] = {
  val threshold = 0.40
How, and by whom, is this threshold defined to be 0.40? I consider that all these values could be provided via a configuration file, or even as arguments when running the algorithm.
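For instance, a sketch with the threshold supplied by the caller (the distance-column name here is an assumption):

import org.apache.spark.ml.feature.MinHashLSHModel
import org.apache.spark.sql.{DataFrame, Dataset}

def approxSimilarityJoin(model: MinHashLSHModel, transformedData_Df: DataFrame,
                         threshold: Double): Dataset[_] = {
  // self-join on the hashed features, keeping only pairs below the configured Jaccard distance
  model.approxSimilarityJoin(transformedData_Df, transformedData_Df, threshold, "JaccardDistance")
}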
val connected_components = g.connectedComponents.run()

// Removing the graphframes checkpoint directory
val file_path = Paths.get(dir_path)
Is this going to work on the cluster, or is the dir_path of the checkpoint always assumed to be held on the driver? We should go through the proper file system configuration as soon as the path needs to be accessible across the cluster.
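A sketch of a filesystem-aware cleanup, assuming a SparkSession named spark is in scope and dir_path is the same variable as in the PR:

import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve the checkpoint path through Hadoop's FileSystem so the delete also works when the
// directory lives on HDFS rather than on the driver's local disk.
val checkpointPath = new Path(dir_path)
val fs = checkpointPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.delete(checkpointPath, true) // recursive delete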
Why would somebody change the checkpoint dir locally? Also, the whole parameter isn't documented in the constructor. In my opinion, that should be configured during Spark setup, or, if you really need it, during spark-submit, but not in the code.
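That is, something along these lines at session creation (the path is only an example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LSH clustering")
  .getOrCreate()
// Configure the checkpoint directory once, at setup time, instead of inside the algorithm.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/graphframes-checkpoints") // example path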
import net.sansa_stack.rdf.spark.io._

val path = getClass.getResource("/Cluster/testSample.nt").getPath
val graphframeCheckpointPath = "/sansa-ml-spark_2.11/src/main/scala/net/sansa_stack/ml/spark/clustering/algorithms/graphframeCheckpoints"
You could also use the getResource(..) method for the checkpoint path, or do you explicitly need to write it at the class level? That is a bit strange to me. Shall we consider moving it into the resources, or generating it only when it is needed?
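One alternative sketch for the test: generate a throwaway directory instead of pointing into src/main inside the repository (this is only an illustration, not necessarily the resources-based approach mentioned above):

import java.nio.file.Files

// Hypothetical alternative: a temporary checkpoint directory that can simply be deleted after the test.
val graphframeCheckpointPath = Files.createTempDirectory("graphframe-checkpoints").toString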
val triples = spark.rdf(Lang.NTRIPLES)(path)

val cluster_ = new LocalitySensitiveHashing(spark, triples, graphframeCheckpointPath)
cluster_.run()
Can we get the centroids and do some validations instead of just calling the run method? Performing assert(true) will always pass, right ;) . Check here for inspiration: https://github.com/apache/spark/tree/master/mllib/src/test/scala/org/apache/spark/ml/clustering
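As a sketch of a more meaningful check inside the existing test suite, assuming run() (or a variant of it) were to return the cluster assignments as a DataFrame with a hypothetical "cluster" column:

test("LSH clustering assigns every resource to a cluster") {
  val clusters = new LocalitySensitiveHashing(spark, triples, graphframeCheckpointPath).run()
  assert(clusters.count() > 0)                               // some assignments were produced
  assert(clusters.select("cluster").distinct().count() >= 1) // at least one non-empty cluster
}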
The run() method doesn't return anything, so I'm wondering how somebody would use this code. I mean, it computes clusters and also their quality, but nothing is returned nor written to disk ... that can't be useful in its current state.
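A sketch of what returning and persisting the result could look like; everything here is illustrative, and clusters stands in for the DataFrame produced by the connected-components step:

import org.apache.spark.sql.DataFrame

def runAndReturn(clusters: DataFrame, outputPath: String): DataFrame = {
  clusters.write.mode("overwrite").parquet(outputPath) // persist the assignments for later use
  clusters                                             // and hand them back to the caller
}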
That code is too far away from being merged to develop.
I'll keep it short, as Gezim already pointed to particular parts in the code.
1. This algorithm will only work on data whose URIs have some kind of human-readable local name. Yes, it works for some datasets, but it will clearly be weird for e.g. Wikidata with all its QXX URIs.
2. Related to point 1: during the workflow only the local names and literal values are forwarded, so how could anybody make use of the final result? I mean, how do we get back from the clusters to the original RDF resources, i.e. URIs and literals? (See the sketch below.)
3. Also, the algorithm does some computations but never returns any result nor writes anything to disk -- or I might have missed that part in the code.
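Regarding point 2, a sketch of the idea: carry the full URI through the pipeline next to the extracted local name, so the final clusters can be mapped back to the original resources (the local-name extraction here is only illustrative):

import org.apache.spark.rdd.RDD

def withLocalName(subjects: RDD[String]): RDD[(String, String)] =
  subjects.map { uri =>
    // keep (full URI, local name) so cluster members remain resolvable RDF resources
    val cut = math.max(uri.lastIndexOf('/'), uri.lastIndexOf('#')) + 1
    (uri, uri.substring(cut))
  }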
By the way, is there any document regarding the workflow? I'm referring to the research idea behind the code.