-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4130][MLlib] Fixing libSVM parser bug with extra whitespace #2996
Conversation
Test build #22442 has started for PR 2996 at commit
|
Test build #22442 has finished for PR 2996 at commit
|
Test FAILed. |
@@ -76,7 +76,7 @@ object MLUtils { | |||
.map { line => | |||
val items = line.split(' ') | |||
val label = items.head.toDouble | |||
val (indices, values) = items.tail.map { item => | |||
val (indices, values) = items.tail.filter( pair => !pair.isEmpty ).map { item => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor (but since we need to run Jenkins again): filter(_.nonEmpty)
is more readable
Not sure why it failed the test. Is this an issue with the testing framework? |
The following implementation seems a bit more efficient but is needlessly complicated. val parsed = sc.textFile(path, minPartitions)
.map(_.trim)
.filter(line => !(line.isEmpty || line.startsWith("#")))
.map { line =>
val items = line.split(' ')
val label = items.head.toDouble
// Count the number of empty values
var i = 1
var emptyValues = 0
while (i < items.size) {
if (items(i).isEmpty) emptyValues += 1
i += 1
}
// Determine the number of non-zero entries
val nnzs = items.size - 1 - emptyValues
// Compute the indices
val indices = new Array[Int](nnzs)
val values = new Array[Double](nnzs)
i = 1
var j = 0
while (i < items.size) {
if (!items(i).isEmpty) {
val indexAndValue = items(i).split(':')
indices(j) = indexAndValue(0).toInt - 1 // Convert 1-based indices to 0-based.
values(j) = indexAndValue(1).toDouble
j += 1
}
i += 1
}
// assert(j == nnzs)
// val (indices, values) = items.tail.filter( pair => !pair.isEmpty ).map { item =>
// val indexAndValue = item.split(':')
// val index = indexAndValue(0).toInt - 1 // Convert 1-based indices to 0-based.
// val value = indexAndValue(1).toDouble
// (index, value)
// }.unzip
(label, indices, values)
} |
Test build #22494 has started for PR 2996 at commit
|
Test build #22494 has finished for PR 2996 at commit
|
Test PASSed. |
LGTM. Merged into master. If the performance gain is worth the extra code complexity, we can switch to the new implementation. Thanks! |
This simple patch filters out extra whitespace entries.