Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4130][MLlib] Fixing libSVM parser bug with extra whitespace #2996

Closed
wants to merge 2 commits into from

Conversation

jegonzal
Copy link
Contributor

This simple patch filters out extra whitespace entries.

@SparkQA
Copy link

SparkQA commented Oct 29, 2014

Test build #22442 has started for PR 2996 at commit e028e84.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 29, 2014

Test build #22442 has finished for PR 2996 at commit e028e84.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22442/
Test FAILed.

@@ -76,7 +76,7 @@ object MLUtils {
.map { line =>
val items = line.split(' ')
val label = items.head.toDouble
val (indices, values) = items.tail.map { item =>
val (indices, values) = items.tail.filter( pair => !pair.isEmpty ).map { item =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor (but since we need to run Jenkins again): filter(_.nonEmpty) is more readable

@jegonzal
Copy link
Contributor Author

Not sure why it failed the test. Is this an issue with the testing framework?

@jegonzal
Copy link
Contributor Author

The following implementation seems a bit more efficient but is needlessly complicated.

  val parsed = sc.textFile(path, minPartitions)
      .map(_.trim)
      .filter(line => !(line.isEmpty || line.startsWith("#")))
      .map { line =>
        val items = line.split(' ')
        val label = items.head.toDouble
        // Count the number of empty values                                                                                                                                                                     
        var i = 1
        var emptyValues = 0
        while (i < items.size) {
          if (items(i).isEmpty) emptyValues += 1
          i += 1
        }
        // Determine the number of non-zero entries                                                                                                                                                             
        val nnzs = items.size - 1 - emptyValues
        // Compute the indices                                                                                                                                                                                  
        val indices = new Array[Int](nnzs)
        val values = new Array[Double](nnzs)
        i = 1
        var j = 0
        while (i < items.size) {
          if (!items(i).isEmpty) {
            val indexAndValue = items(i).split(':')
            indices(j) = indexAndValue(0).toInt - 1 // Convert 1-based indices to 0-based.                                                                                                                      
            values(j) = indexAndValue(1).toDouble
            j += 1
          }
          i += 1
        }
        // assert(j == nnzs)                                                                                                                                                                                    
        // val (indices, values) = items.tail.filter( pair => !pair.isEmpty ).map { item =>                                                                                                                     
        //   val indexAndValue = item.split(':')                                                                                                                                                                
        //   val index = indexAndValue(0).toInt - 1 // Convert 1-based indices to 0-based.                                                                                                                      
        //   val value = indexAndValue(1).toDouble                                                                                                                                                              
        //   (index, value)                                                                                                                                                                                     
        // }.unzip                                                                                                                                                                                              
        (label, indices, values)
      }

@SparkQA
Copy link

SparkQA commented Oct 30, 2014

Test build #22494 has started for PR 2996 at commit e0227ab.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 30, 2014

Test build #22494 has finished for PR 2996 at commit e0227ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22494/
Test PASSed.

@asfgit asfgit closed this in c7ad085 Oct 30, 2014
@mengxr
Copy link
Contributor

mengxr commented Oct 30, 2014

LGTM. Merged into master. If the performance gain is worth the extra code complexity, we can switch to the new implementation. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants