This repository has been archived by the owner on Oct 8, 2020. It is now read-only.

Clustering algorithm based on HashingTF LSH method #25

Open
wants to merge 7 commits into
base: develop

Conversation

agarwalpratikkumar

No description provided.

@GezimSejdiu GezimSejdiu self-requested a review November 22, 2019 20:29
Member

@GezimSejdiu left a comment


Hi @agarwalpratikkumar ,

thank you for your contribution! I've done a bit of a review; feel free to respond to my comments.
Please provide some basic unit tests for your functionality before we consider merging the PR.

Best regards,

Member

@GezimSejdiu left a comment


Hi @agarwalpratikkumar ,

thanks a lot for the change.

By the way, I haven't seen any unit tests covering the functionality provided here. Could you please add some basic unit tests so that we can move forward with the PR?

Many thanks in advance.

Best regards,

@GezimSejdiu GezimSejdiu added this to the 0.8 milestone Jan 15, 2020
@GezimSejdiu GezimSejdiu added the ML label Feb 19, 2020
…ependency of graphframes and test file for unit-testing
@GezimSejdiu
Member

Hi @agarwalpratikkumar ,

thanks a lot for the update. I can see that Travis CI is still complaining about the graph-frames dependency: https://travis-ci.org/SANSA-Stack/SANSA-ML/builds/652965675#L5810 . Would it be possible for you to fix it before we merge the PR?

Best regards,

@agarwalpratikkumar
Author

Hi @GezimSejdiu
I added graph-frames as an external JAR in Eclipse, but I was unable to push it to Git. Can you please tell me how I can add this JAR on Git?

<dependency>
    <groupId>graphframes</groupId>
    <artifactId>graphframes</artifactId>
    <version>0.7.0-spark2.4-s_2.11</version>
</dependency>
Member


Great! Thanks for adding the graphframes dependency. The project seems to build now. Consider using the latest version, i.e. 0.8.0, and also format it :) -- i.e. align it with the other dependency lists.

Member

@LorenzBuehmann May 19, 2020


You have to add a Maven repo; nobody wants to add local JARs to the project classpath manually nowadays.

It is also mentioned on the GraphFrames Maven artifact page:

Note: this artifact is located at the SparkPackages repository (https://dl.bintray.com/spark-packages/maven/)

That means we have to add:

<repository>
    <id>SparkPackages</id>
    <name>Repo for Spark packages</name>
    <url>https://dl.bintray.com/spark-packages/maven/</url>
    <snapshots>
        <enabled>false</enabled>
    </snapshots>
</repository>


/*
*
* Clustering
Member


Give a bit better description here, i.e. the name of the clustering method and what it does.


def getOnlyPredicates(parsedTriples: RDD[(String, String, Object)]): RDD[(String, String)] = {
return parsedTriples.map(f => {
val key = f._1 + "" // Subject is the key
Member


Why are you adding "" at the end? To convert it to a string? Can you just use toString() if needed?
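
A minimal sketch of the suggested change, assuming the truncated body only pairs each subject with its predicate; since f._1 is already typed as String, no conversion should be needed at all:

import org.apache.spark.rdd.RDD

def getOnlyPredicates(parsedTriples: RDD[(String, String, Object)]): RDD[(String, String)] = {
  parsedTriples.map { f =>
    val key = f._1 // subject is the key; it is already a String, so neither + "" nor toString() is required
    (key, f._2)    // pair the subject with its predicate
  }
}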

}

def minHashLSH(featuredData_Df: DataFrame): (MinHashLSHModel, DataFrame) = {
val mh = new MinHashLSH().setNumHashTables(3).setInputCol("features").setOutputCol("HashedValues")
Member


Is this number 3 fixed when we setNumHashTables, or can it also be adjusted based on the use case? We shouldn't use any hard-coded values.
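
A minimal sketch of the suggested parameterization, assuming the method returns the fitted model together with the hashed DataFrame (the parameter name numHashTables and its default are hypothetical):

import org.apache.spark.ml.feature.{MinHashLSH, MinHashLSHModel}
import org.apache.spark.sql.DataFrame

def minHashLSH(featuredData_Df: DataFrame, numHashTables: Int = 3): (MinHashLSHModel, DataFrame) = {
  val mh = new MinHashLSH()
    .setNumHashTables(numHashTables) // configurable instead of hard-coded 3
    .setInputCol("features")
    .setOutputCol("HashedValues")
  val model = mh.fit(featuredData_Df)
  (model, model.transform(featuredData_Df))
}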

}

def approxSimilarityJoin(model: MinHashLSHModel, transformedData_Df: DataFrame): Dataset[_] = {
val threshold = 0.40
Member


How and by whom is this threshold defined, i.e. why 0.40? I think all of these values could be provided via a configuration file or even as arguments when running the algorithm.
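
A minimal sketch of making the threshold a parameter, assuming the method wraps Spark's approxSimilarityJoin (the default value and the distance column name are hypothetical):

import org.apache.spark.ml.feature.MinHashLSHModel
import org.apache.spark.sql.{DataFrame, Dataset}

def approxSimilarityJoin(model: MinHashLSHModel, transformedData_Df: DataFrame,
                         threshold: Double = 0.40): Dataset[_] = {
  // threshold comes from a config file or a CLI argument instead of being hard-coded
  model.approxSimilarityJoin(transformedData_Df, transformedData_Df, threshold, "JaccardDistance")
}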

val connected_components = g.connectedComponents.run()

//Removing the graphframes checkpoint directory
val file_path = Paths.get(dir_path)
Member


Is this going to work on a cluster, or is the dir_path of the checkpoint always assumed to be held on the driver? We should use the file-system configuration as soon as it needs to be distributed across the cluster.
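
A minimal sketch of deleting the checkpoint directory through the Hadoop FileSystem API, so it also works when the path lives on HDFS rather than on the driver's local disk (spark and dir_path refer to the session and path used in the code above):

import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve the file system from the Hadoop configuration and delete recursively.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.delete(new Path(dir_path), true)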

Member


Why would somebody change the checkpoint dir locally? Also, the whole parameter isn't documented in the constructor. In my opinion, that should be configured during Spark setup, or, if you really need it, during spark-submit, but not in the code.
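
A minimal sketch of configuring the checkpoint directory during Spark setup instead of inside the algorithm (the application name and HDFS path are only examples):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LSH clustering")
  .getOrCreate()

// GraphFrames' connectedComponents expects a checkpoint directory;
// set it once on the SparkContext rather than passing it into the algorithm.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/graphframes-checkpoints")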

import net.sansa_stack.rdf.spark.io._

val path = getClass.getResource("/Cluster/testSample.nt").getPath
val graphframeCheckpointPath = "/sansa-ml-spark_2.11/src/main/scala/net/sansa_stack/ml/spark/clustering/algorithms/graphframeCheckpoints"
Member


You could also use the getResource(..) method for the checkpoints, or do you explicitly need to write it at the class level? That seems a bit strange to me. Should we consider moving it to a resource, or generating it only when it is needed?
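
A minimal sketch of using a temporary directory for the checkpoints in the test instead of a hard-coded path inside the source tree:

import java.nio.file.Files

// Create a throw-away checkpoint directory; it is generated on demand and
// can be cleaned up after the test.
val graphframeCheckpointPath = Files.createTempDirectory("graphframe-checkpoints").toString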

val triples = spark.rdf(Lang.NTRIPLES)(path)

val cluster_ = new LocalitySensitiveHashing(spark, triples, graphframeCheckpointPath)
cluster_.run()
Member


Can we get the centroids and do some validation instead of just calling the run method? Performing assert(true) will always pass, right ;) Check here for inspiration: https://github.com/apache/spark/tree/master/mllib/src/test/scala/org/apache/spark/ml/clustering
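
A minimal sketch of a stronger assertion, assuming run() were changed to return the cluster assignments as a DataFrame with a "component" column (as produced by GraphFrames' connectedComponents):

val clusters = cluster_.run() // hypothetical: run() returns the assignment DataFrame
val numClusters = clusters.select("component").distinct().count()
assert(numClusters > 0 && numClusters <= clusters.count())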

Member


The run() method doesn't return anything, so I'm wondering how somebody would use this code. It computes clusters and also their quality, but nothing is returned or written to disk anywhere ... that can't be useful in its current state.
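
A minimal sketch of making the result usable, assuming run() returned the connected-components DataFrame and optionally persisted it (outputPath is a hypothetical parameter; g is the GraphFrame built above):

import org.apache.spark.sql.DataFrame

def run(outputPath: Option[String] = None): DataFrame = {
  // one row per vertex with its cluster id in the "component" column
  val components = g.connectedComponents.run()
  outputPath.foreach(p => components.write.mode("overwrite").parquet(p))
  components
}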

Member

@LorenzBuehmann left a comment


This code is too far from being ready to be merged into develop.

I'll keep it short as Gezim already pointed to particular parts in the code.

  1. This algorithm will only work on data whose URIs carry some kind of human-readable data as the local name. Yes, it works for some datasets, but it will clearly behave oddly for e.g. Wikidata with all its QXX URIs.
  2. Related to point 1: during the workflow only the local names and literal values are forwarded, so how could anybody make use of the final result? I mean, how do you get back the clusters with the original RDF resources, i.e. URIs and literals?
  3. Also, the algorithm does some computations but never returns any result, nor does it write anything to disk - or I might have missed that part in the code.

By the way, is there any document regarding the workflow? I'm referring to the research idea behind the code.

@JensLehmann JensLehmann modified the milestones: 0.8, 0.9 Jul 10, 2020