Clustering algorithm based on HashingTF LSH method #25
base: develop
Conversation
Hi @agarwalpratikkumar,
thank you for your contribution! I just did a bit of a review; feel free to respond to my comments.
Please, provide some basic unit-tests for your functionality before we consider merging the PR.
Best regards,
Hi @agarwalpratikkumar,
thanks a lot for the change.
By the way, I haven't seen any unit test which covers the functionality provided here. Please, could you add some basic unit tests so that we can go further with the PR?
Many thanks in advance.
Best regards,
…ependency of graphframes and test file for unit-testing
Hi @agarwalpratikkumar , thanks a lot for the update. I can see that Travis CI is still complaining about the graph-frames dependency: https://travis-ci.org/SANSA-Stack/SANSA-ML/builds/652965675#L5810 . Could you try to fix it before we merge the PR? Best regards,
Hi @GezimSejdiu
<dependency>
  <groupId>graphframes</groupId>
  <artifactId>graphframes</artifactId>
  <version>0.7.0-spark2.4-s_2.11</version>
</dependency>
Great! Thanks for adding the graphframes dependency. The project seems to build now. Consider using the latest version, i.e. 0.8.0, and also format it :) -- i.e. align it with the other dependency lists.
You have to add a Maven repo; nobody wants to add local Jars to the project classpath manually nowadays.
It is also mentioned on the GraphFrames Maven artifact page:
Note: this artifact is located at the SparkPackages repository (https://dl.bintray.com/spark-packages/maven/)
That means we have to add:
<repository>
  <id>SparkPackages</id>
  <name>Repo for Spark packages</name>
  <url>https://dl.bintray.com/spark-packages/maven/</url>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
</repository>
/*
 *
 * Clustering
Give a bit of a better description, i.e. the name of the clustering approach and what it does.
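For illustration, a header along these lines might do (the wording is only a suggestion, based on what the PR appears to implement):

/*
 * LocalitySensitiveHashing: clusters RDF resources by turning their textual features into
 * hashed term-frequency vectors (HashingTF), finding approximately similar resources with
 * MinHash LSH, and grouping the resulting similarity pairs via connected components.
 */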
def getOnlyPredicates(parsedTriples: RDD[(String, String, Object)]): RDD[(String, String)] = {
  return parsedTriples.map(f => {
    val key = f._1 + "" // Subject is the key
Why are you adding "" at the end? For converting it to a string? Can you just use toString() if needed?
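For example, a minimal sketch of the suggested change (the pairing of subject and predicate is an assumption based on the method's signature):

import org.apache.spark.rdd.RDD

def getOnlyPredicates(parsedTriples: RDD[(String, String, Object)]): RDD[(String, String)] = {
  parsedTriples.map { f =>
    val key = f._1.toString // Subject is the key; explicit conversion instead of appending ""
    (key, f._2)             // assumed: keep the predicate as the value
  }
}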
def minHashLSH(featuredData_Df: DataFrame): (MinHashLSHModel, DataFrame) = {
  val mh = new MinHashLSH().setNumHashTables(3).setInputCol("features").setOutputCol("HashedValues")
Is this number 3 fixed when we setNumHashTables, or can it also be adjusted based on the use case? We shouldn't use any hard-coded values.
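A sketch of how the value could be passed in instead of hard-coded; the fit/transform body is an assumption that merely matches the declared return type:

import org.apache.spark.ml.feature.{MinHashLSH, MinHashLSHModel}
import org.apache.spark.sql.DataFrame

def minHashLSH(featuredData_Df: DataFrame, numHashTables: Int = 3): (MinHashLSHModel, DataFrame) = {
  val mh = new MinHashLSH()
    .setNumHashTables(numHashTables) // taken from a config file or CLI argument instead of a literal
    .setInputCol("features")
    .setOutputCol("HashedValues")
  val model = mh.fit(featuredData_Df)
  (model, model.transform(featuredData_Df))
}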
def approxSimilarityJoin(model: MinHashLSHModel, transformedData_Df: DataFrame): Dataset[_] = {
  val threshold = 0.40
How, and by whom, is this threshold defined to be 0.40? I consider that all these values could be provided via a configuration file, or even as arguments when running the algorithm.
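For instance, a sketch with the threshold supplied by the caller (the distance-column name here is an assumption):

import org.apache.spark.ml.feature.MinHashLSHModel
import org.apache.spark.sql.{DataFrame, Dataset}

def approxSimilarityJoin(model: MinHashLSHModel, transformedData_Df: DataFrame,
                         threshold: Double): Dataset[_] = {
  // self-join on the hashed features, keeping only pairs below the configured Jaccard distance
  model.approxSimilarityJoin(transformedData_Df, transformedData_Df, threshold, "JaccardDistance")
}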
val connected_components = g.connectedComponents.run()

// Removing the graphframes checkpoint directory
val file_path = Paths.get(dir_path)
Is this going to work on the cluster, or is the dir_path of the checkpoint always assumed to be held on the driver? We should go through the proper file system configuration as soon as the path needs to be accessible across the cluster.
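A sketch of a filesystem-aware cleanup, assuming a SparkSession named spark is in scope and dir_path is the same variable as in the PR:

import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve the checkpoint path through Hadoop's FileSystem so the delete also works when the
// directory lives on HDFS rather than on the driver's local disk.
val checkpointPath = new Path(dir_path)
val fs = checkpointPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.delete(checkpointPath, true) // recursive delete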
Why would somebody change the checkpoint dir locally? Also, the whole parameter isn't documented in the constructor. In my opinion, that should be configured during Spark setup, or, if you really need it, during spark-submit, but not in the code.
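That is, something along these lines at session creation (the path is only an example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LSH clustering")
  .getOrCreate()
// Configure the checkpoint directory once, at setup time, instead of inside the algorithm.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/graphframes-checkpoints") // example path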
import net.sansa_stack.rdf.spark.io._

val path = getClass.getResource("/Cluster/testSample.nt").getPath
val graphframeCheckpointPath = "/sansa-ml-spark_2.11/src/main/scala/net/sansa_stack/ml/spark/clustering/algorithms/graphframeCheckpoints"
You could also use the getResource(..) method for the checkpoint path, or do you explicitly need to write it at the class level? That is a bit strange to me. Shall we consider moving it into the resources, or generating it only when it is needed?
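One alternative sketch for the test: generate a throwaway directory instead of pointing into src/main inside the repository (this is only an illustration, not necessarily the resources-based approach mentioned above):

import java.nio.file.Files

// Hypothetical alternative: a temporary checkpoint directory that can simply be deleted after the test.
val graphframeCheckpointPath = Files.createTempDirectory("graphframe-checkpoints").toString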
val triples = spark.rdf(Lang.NTRIPLES)(path)

val cluster_ = new LocalitySensitiveHashing(spark, triples, graphframeCheckpointPath)
cluster_.run()
Can we get the centroids and do some validations instead of just calling the run method? Performing assert(true) will always pass, right ;) . Check here for inspiration: https://github.com/apache/spark/tree/master/mllib/src/test/scala/org/apache/spark/ml/clustering
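As a sketch of a more meaningful check inside the existing test suite, assuming run() (or a variant of it) were to return the cluster assignments as a DataFrame with a hypothetical "cluster" column:

test("LSH clustering assigns every resource to a cluster") {
  val clusters = new LocalitySensitiveHashing(spark, triples, graphframeCheckpointPath).run()
  assert(clusters.count() > 0)                               // some assignments were produced
  assert(clusters.select("cluster").distinct().count() >= 1) // at least one non-empty cluster
}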
The run() method doesn't return anything, so I'm wondering how somebody would use this code. I mean, it computes clusters and also their quality, but nothing is returned nor written to disk ... that can't be useful in its current state.
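A sketch of what returning and persisting the result could look like; everything here is illustrative, and clusters stands in for the DataFrame produced by the connected-components step:

import org.apache.spark.sql.DataFrame

def runAndReturn(clusters: DataFrame, outputPath: String): DataFrame = {
  clusters.write.mode("overwrite").parquet(outputPath) // persist the assignments for later use
  clusters                                             // and hand them back to the caller
}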
That code is too far away from being merged to develop.
I'll keep it short, as Gezim already pointed to particular parts in the code.
1. This algorithm will only work on data whose URIs have some kind of human-readable local name. Yes, it works for some datasets, but it will clearly be weird for e.g. Wikidata with all its QXX URIs.
2. Related to point 1: during the workflow only the local names and literal values are forwarded, so how could anybody make use of the final result? I mean, how do we get back from the clusters to the original RDF resources, i.e. URIs and literals? (See the sketch below.)
3. Also, the algorithm does some computations but never returns any result nor writes anything to disk -- or I might have missed that part in the code.
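Regarding point 2, a sketch of the idea: carry the full URI through the pipeline next to the extracted local name, so the final clusters can be mapped back to the original resources (the local-name extraction here is only illustrative):

import org.apache.spark.rdd.RDD

def withLocalName(subjects: RDD[String]): RDD[(String, String)] =
  subjects.map { uri =>
    // keep (full URI, local name) so cluster members remain resolvable RDF resources
    val cut = math.max(uri.lastIndexOf('/'), uri.lastIndexOf('#')) + 1
    (uri, uri.substring(cut))
  }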
By the way, is there any document regarding the workflow? I'm referring to the research idea behind the code.