Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spark-7422][MLLIB] Add argmax to Vector, SparseVector #6112

Closed
wants to merge 21 commits into from
Closed
Show file tree
Hide file tree
Changes from 7 commits
Commits
Show all changes
21 commits
Select commit Hold shift + click to select a range
04677af
initial work on adding argmax to Vector and SparseVector
May 11, 2015
3cffed4
Adding unit tests for argmax functions for Dense and Sparse vectors
May 12, 2015
df9538a
Added argmax to sparse vector and added unit test
May 12, 2015
4526acc
Merge branch 'master' of github.com:apache/spark into SPARK-7422
May 13, 2015
eeda560
Fixing SparseVector argmax function to ignore zero values while doing…
May 15, 2015
af17981
Initial work fixing bug that was made clear in pr
dittmarg May 22, 2015
f21dcce
commit
GeorgeDittmar May 25, 2015
b1f059f
Added comment before we start arg max calculation. Updated unit tests…
GeorgeDittmar May 29, 2015
3ee8711
Fixing corner case issue with zeros in the active values of the spars…
GeorgeDittmar Jun 1, 2015
ee1a85a
Cleaning up unit tests a bit and modifying a few cases
GeorgeDittmar Jun 1, 2015
d5b5423
Fixing code style and updating if logic on when to check for zero values
GeorgeDittmar Jun 9, 2015
ac53c55
changing dense vector argmax unit test to be one line call vs 2
GeorgeDittmar Jun 9, 2015
aa330e3
Fixing some last if else spacing issues
GeorgeDittmar Jun 9, 2015
f2eba2f
Cleaning up unit tests to be fewer lines
GeorgeDittmar Jun 9, 2015
b22af46
Fixing spaces between commas in unit test
GeorgeDittmar Jun 10, 2015
42341fb
refactoring arg max check to better handle zero values
GeorgeDittmar Jul 9, 2015
5fd9380
fixing style check error
GeorgeDittmar Jul 9, 2015
98058f4
Merge branch 'master' of github.com:apache/spark into SPARK-7422
GeorgeDittmar Jul 15, 2015
2ea6a55
Added MimaExcludes for Vectors.argmax
GeorgeDittmar Jul 15, 2015
127dec5
update argmax impl
mengxr Jul 17, 2015
3e0a939
Merge pull request #1 from mengxr/SPARK-7422
GeorgeDittmar Jul 18, 2015
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
55 changes: 50 additions & 5 deletions mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
Original file line number Diff line number Diff line change
Expand Up @@ -150,6 +150,12 @@ sealed trait Vector extends Serializable {
toDense
}
}

/**
* Find the index of a maximal element. Returns the first maximal element in case of a tie.
* Returns -1 if vector has length 0.
*/
def argmax: Int
}

/**
Expand Down Expand Up @@ -588,11 +594,7 @@ class DenseVector(val values: Array[Double]) extends Vector {
new SparseVector(size, ii, vv)
}

/**
* Find the index of a maximal element. Returns the first maximal element in case of a tie.
* Returns -1 if vector has length 0.
*/
private[spark] def argmax: Int = {
override def argmax: Int = {
if (size == 0) {
-1
} else {
Expand Down Expand Up @@ -717,6 +719,49 @@ class SparseVector(
new SparseVector(size, ii, vv)
}
}

override def argmax: Int = {
if (size == 0) {
-1
} else {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove blank line.

var maxIdx = 0
var maxValue = if(indices(0) != 0) 0.0 else values(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a comment on why this initialization is done?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.


foreachActive { (i, v) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct if all nonzeros are negative and there exist inactive (zero) entries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"there exist inactive (zero) entries" are you meaning we have values in the sparse vector that are zero but are set to not active? I thought that the sparse vector had you define what indices have active elements and that indices.size must equal the size of the values in that vector so I am not sure how you get into the state I think you are describing. Do you have an example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait are you getting at say I have a sparse vector with zero in it and an assigned index? Like indices = (1,2,3) values = (-1,0,-.5)? If thats the case then yes I now see what you are saying.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug still exists. Take a sparse vector val sv = SparseVector(5, Array(1, 3), Array(-1, -5)) for example, sv.argmax should return one from {0, 2, 4} because 0 is the max value in this vector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added unit tests to cover this and they appear to be passing so maybe I am totally misunderstanding the bug you are discussing? Or more likely I am misunderstanding the usage of the sparse vector implementation in regards to the inactive vs active elements?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't explain this clearly. All inactive values in a sparse vector are zeros. The edge case here is that zero could be the max value of the entries. For example, [-1.0, 0.0, -3.0].argmax == 1 but 0.0 doesn't appear in foreachActive if the sparse vector is SparseVector(3, Array(0, 2), Array(-1.0, -3.0)). If we only look at the active values, the argmax would be 0 as -1.0 is the max among active values. So we need to cover the following cases if all active values are negative:

  1. if the number of active entries are the same as vector size (i.e., no inactive entries), use the current max and its index,
  2. if there are inactive entries, find one and output its index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok yeah it sounds like I misunderstood the inactive vs active node concept. I'll get a fix in for that. I was assuming inactive meant we wanted to just ignore it and do the argmax over the active portions.

if (v > maxValue) {
maxIdx = i
maxValue = v
}
}

// look for inactive values incase all active node values are negative
if(size != values.size && maxValue < 0){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fail for this terrible edge case.

val a = new SparseVector(5, Array(0, 3), Array(-1.0, 0.0))
a.argmax
Int = 3

where actually it is 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically its not a failure but it does break the convention of returning the first instance of a max value so good find. I'll see how best to handle it. Should be simple fix.

maxIdx = calcInactiveIdx(indices(0))
maxValue = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this line can be removed, since only maxIdx is used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I more kept it in there for clarity incase anyone is debugging through the code and can more easily understand what the associated idx and val are. But i can remove if its just too much clutter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed

}
maxIdx
}
}

/**
* Calculates the first instance of an inactive node in a sparse vector and returns the Idx
* of the element.
* @param idx starting index of computation
* @return index of first inactive node or -1 if it cannot find one
*/
private[SparseVector] def calcInactiveIdx(idx: Int): Int ={
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, it might be better just to write this a while loop above.

maxIdx = {
    var j = 0
    while j < size {
        if (!indices.contains(j)) j
        j += 1
   }
}
maxIdx

or something similar, since it has lesser lines of code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the way I had it previously is a little bit cleaner from a scala perspective, though I am still getting used to the language. Trying with a loop and then if checks like above is forcing me to keep track of some extra over head variables. I cant seem just do the below without having the rest of the else statement return something as well. Otherwise I get a return type exception of Unit instead of Int.

// cant seem to do this
if (isAwesome){
  return 1
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. sorry about that. How about something similar?

val maxIdx = {
  var indexPresent = true
  var j = 0 
  while (indexPresent) {
    if (!indices.contains(j)) indexPresent = false
    j += 1
  }
  j - 1
}

I have no opposition to the present function, except for calcInactiveIdx should be calcFirstInactiveIdx and I find it slightly odd that a function that calculates the first inactive index should have a parameter ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure about the best way. cc @mengxr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was going back and forth on if I wanted to pass in a idx param or not. It would be nice in case we want to say find an inactive value after a given index but thats probably coding for the future which tends to be messy. I'll remove it for now and if anyone else has any other opinions we can go from there.

I dunno I think I am just partial to recursive functions but I can give yours a try still. Really up for whatever best fits the spark code style etc.

if(idx < size){
if(!indices.contains(idx)){
idx
}else{
calcInactiveIdx(idx+1)
}
}else{
-1
}
}

}

object SparseVector {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -63,11 +63,53 @@ class VectorsSuite extends FunSuite {
assert(vec.toArray.eq(arr))
}

test("dense argmax"){
val vec = Vectors.dense(Array.empty[Double]).asInstanceOf[DenseVector]
val noMax = vec.argmax
assert(noMax === -1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: I think this can be combined into a single line, here and elsewhere

assert(vec.max === -1)


val vec2 = Vectors.dense(arr).asInstanceOf[DenseVector]
val max = vec2.argmax
assert(max === 3)

val vec3 = Vectors.dense(Array(-1.0, 0.0, -2.0, 1.0)).asInstanceOf[DenseVector]
val max2 = vec3.argmax
assert(max === 3)
}

test("sparse to array") {
val vec = Vectors.sparse(n, indices, values).asInstanceOf[SparseVector]
assert(vec.toArray === arr)
}

test("sparse argmax"){
val vec = Vectors.sparse(0,Array.empty[Int],Array.empty[Double]).asInstanceOf[SparseVector]
val noMax = vec.argmax
assert(noMax === -1)

val vec2 = Vectors.sparse(n,indices,values).asInstanceOf[SparseVector]
val max = vec2.argmax
assert(max === 3)

val vec3 = Vectors.sparse(5,Array(2, 4),Array(1.0,-.7))
val max2 = vec3.argmax
assert(max2 === 2)

// check for case that sparse vector is created with only negative vaues {0.0, 0.0,-1.0, -0.7, 0.0}
val vec4 = Vectors.sparse(5,Array(2, 3),Array(-1.0,-.7))
val max3 = vec4.argmax
assert(max3 === 0)

// check for case that sparse vector is created with only negative vaues {-1.0, 0.0, -0.7, 0.0, 0.0}
val vec5 = Vectors.sparse(5,Array(0, 3),Array(-1.0,-.7))
val max4 = vec5.argmax
assert(max4 === 1)

val vec6 = Vectors.sparse(5,Array(0, 1, 2),Array(-1.0, -.025, -.7))
val max5 = vec6.argmax
assert(max5 === 3)
}

test("vector equals") {
val dv1 = Vectors.dense(arr.clone())
val dv2 = Vectors.dense(arr.clone())
Expand Down