[SPARK-4130][MLlib] Fixing libSVM parser bug with extra whitespace #2996

jegonzal · 2014-10-29T07:36:24Z

This simple patch filters out extra whitespace entries.

SparkQA · 2014-10-29T07:39:51Z

Test build #22442 has started for PR 2996 at commit e028e84.

This patch merges cleanly.

SparkQA · 2014-10-29T08:33:32Z

Test build #22442 has finished for PR 2996 at commit e028e84.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-29T08:33:35Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22442/
Test FAILed.

mengxr · 2014-10-29T17:17:43Z

mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala

@@ -76,7 +76,7 @@ object MLUtils {
      .map { line =>
        val items = line.split(' ')
        val label = items.head.toDouble
-        val (indices, values) = items.tail.map { item =>
+        val (indices, values) = items.tail.filter( pair => !pair.isEmpty ).map { item =>


minor (but since we need to run Jenkins again): filter(_.nonEmpty) is more readable
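
For context, a minimal sketch of the failure mode and the filter-based fix, run on a single line outside Spark. The object name `LibSVMLineDemo`, the helper `parseLine`, and the sample input are hypothetical and not part of MLUtils; only the body of the map mirrors the diff above.

```scala
object LibSVMLineDemo {
  // Hypothetical helper (not part of MLUtils): parses one libSVM-style line
  // into (label, zero-based indices, values), skipping empty tokens.
  def parseLine(line: String): (Double, Array[Int], Array[Double]) = {
    val items = line.trim.split(' ')
    val label = items.head.toDouble
    // Without the filter, an empty token reaches toInt below and throws
    // a NumberFormatException.
    val (indices, values) = items.tail.filter(_.nonEmpty).map { item =>
      val indexAndValue = item.split(':')
      val index = indexAndValue(0).toInt - 1 // Convert 1-based indices to 0-based.
      val value = indexAndValue(1).toDouble
      (index, value)
    }.unzip
    (label, indices, values)
  }

  def main(args: Array[String]): Unit = {
    // Assumed sample input: note the two consecutive spaces before "3:4.0".
    val line = "1 1:2.0  3:4.0"
    // Splitting on a single space leaves an empty token between the two spaces.
    println(line.split(' ').map(t => "[" + t + "]").mkString(" "))
    val (label, indices, values) = parseLine(line)
    println(s"label=$label indices=${indices.mkString(",")} values=${values.mkString(",")}")
  }
}
```

The empty token appears because `split(' ')` treats every space as a separator, so two adjacent spaces delimit a zero-length string.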

jegonzal · 2014-10-30T00:00:11Z

Not sure why it failed the test. Is this an issue with the testing framework?

jegonzal · 2014-10-30T00:03:37Z

The following implementation seems a bit more efficient but is needlessly complicated.

  val parsed = sc.textFile(path, minPartitions)
      .map(_.trim)
      .filter(line => !(line.isEmpty || line.startsWith("#")))
      .map { line =>
        val items = line.split(' ')
        val label = items.head.toDouble
        // Count the number of empty values
        var i = 1
        var emptyValues = 0
        while (i < items.size) {
          if (items(i).isEmpty) emptyValues += 1
          i += 1
        }
        // Determine the number of non-zero entries
        val nnzs = items.size - 1 - emptyValues
        // Compute the indices
        val indices = new Array[Int](nnzs)
        val values = new Array[Double](nnzs)
        i = 1
        var j = 0
        while (i < items.size) {
          if (!items(i).isEmpty) {
            val indexAndValue = items(i).split(':')
            indices(j) = indexAndValue(0).toInt - 1 // Convert 1-based indices to 0-based.
            values(j) = indexAndValue(1).toDouble
            j += 1
          }
          i += 1
        }
        // assert(j == nnzs)
        // val (indices, values) = items.tail.filter( pair => !pair.isEmpty ).map { item =>
        //   val indexAndValue = item.split(':')
        //   val index = indexAndValue(0).toInt - 1 // Convert 1-based indices to 0-based.
        //   val value = indexAndValue(1).toDouble
        //   (index, value)
        // }.unzip
        (label, indices, values)
      }
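
To make the loop-based variant easier to try in isolation, here is a sketch of the same two-pass idea as a plain function over one line, with no SparkContext. The names `TwoPassParseDemo` and `parseLineTwoPass` and the sample input are hypothetical; the logic follows the snippet above but counts non-empty tokens directly.

```scala
object TwoPassParseDemo {
  // Hypothetical standalone version of the loop-based parser for one line.
  def parseLineTwoPass(line: String): (Double, Array[Int], Array[Double]) = {
    val items = line.trim.split(' ')
    val label = items.head.toDouble
    // First pass: count the non-empty feature tokens so the arrays can be
    // allocated at their final size.
    var nnz = 0
    var i = 1
    while (i < items.length) {
      if (items(i).nonEmpty) nnz += 1
      i += 1
    }
    val indices = new Array[Int](nnz)
    val values = new Array[Double](nnz)
    // Second pass: fill the arrays, skipping empty tokens.
    i = 1
    var j = 0
    while (i < items.length) {
      if (items(i).nonEmpty) {
        val indexAndValue = items(i).split(':')
        indices(j) = indexAndValue(0).toInt - 1 // Convert 1-based indices to 0-based.
        values(j) = indexAndValue(1).toDouble
        j += 1
      }
      i += 1
    }
    (label, indices, values)
  }

  def main(args: Array[String]): Unit = {
    // Assumed sample input with repeated spaces between features.
    val (label, indices, values) = parseLineTwoPass("0 2:1.5   5:3.0")
    println(s"label=$label indices=${indices.mkString(",")} values=${values.mkString(",")}")
  }
}
```

The trade-off is the one noted above: this avoids the intermediate collections created by filter, map, and unzip, at the cost of noticeably more code. No measurement is given here, so any speedup is an assumption.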

SparkQA · 2014-10-30T00:05:17Z

Test build #22494 has started for PR 2996 at commit e0227ab.

This patch merges cleanly.

SparkQA · 2014-10-30T01:17:52Z

Test build #22494 has finished for PR 2996 at commit e0227ab.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-30T01:17:56Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22494/
Test PASSed.

mengxr · 2014-10-30T07:12:58Z

LGTM. Merged into master. If the performance gain is worth the extra code complexity, we can switch to the new implementation. Thanks!

fixing whitespace bug in loadLibSVMFile when parsing libSVM files

e028e84

mengxr reviewed Oct 29, 2014

improving readability

e0227ab

asfgit closed this in c7ad085 Oct 30, 2014
