-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spark-7422][MLLIB] Add argmax to Vector, SparseVector #6112
Conversation
Associated JIRA https://issues.apache.org/jira/browse/SPARK-7422 |
var maxIdx = indices(0) | ||
var maxValue = values(0) | ||
|
||
foreachActive { (i, v) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not correct if all nonzeros are negative and there exist inactive (zero) entries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"there exist inactive (zero) entries" are you meaning we have values in the sparse vector that are zero but are set to not active? I thought that the sparse vector had you define what indices have active elements and that indices.size must equal the size of the values in that vector so I am not sure how you get into the state I think you are describing. Do you have an example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait are you getting at say I have a sparse vector with zero in it and an assigned index? Like indices = (1,2,3) values = (-1,0,-.5)? If thats the case then yes I now see what you are saying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bug still exists. Take a sparse vector val sv = SparseVector(5, Array(1, 3), Array(-1, -5))
for example, sv.argmax
should return one from {0, 2, 4}
because 0
is the max value in this vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added unit tests to cover this and they appear to be passing so maybe I am totally misunderstanding the bug you are discussing? Or more likely I am misunderstanding the usage of the sparse vector implementation in regards to the inactive vs active elements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I didn't explain this clearly. All inactive values in a sparse vector are zeros. The edge case here is that zero could be the max value of the entries. For example, [-1.0, 0.0, -3.0].argmax == 1
but 0.0
doesn't appear in foreachActive
if the sparse vector is SparseVector(3, Array(0, 2), Array(-1.0, -3.0))
. If we only look at the active values, the argmax would be 0 as -1.0
is the max among active values. So we need to cover the following cases if all active values are negative:
- if the number of active entries are the same as vector size (i.e., no inactive entries), use the current max and its index,
- if there are inactive entries, find one and output its index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok yeah it sounds like I misunderstood the inactive vs active node concept. I'll get a fix in for that. I was assuming inactive meant we wanted to just ignore it and do the argmax over the active portions.
… the calculation.
New commit should handle the case @mengxr was talking about. |
Latest commit does not fix this yet just initial work towards the solution. |
@mengxr The latest commit should handle all cases now. Not sure why the commit doesnt have my message attached to it though... |
// look for inactive values incase all active node values are negative | ||
if(size != values.size && maxValue < 0){ | ||
maxIdx = calcInactiveIdx(indices(0)) | ||
maxValue = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this line can be removed, since only maxIdx is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I more kept it in there for clarity incase anyone is debugging through the code and can more easily understand what the associated idx and val are. But i can remove if its just too much clutter.
… to cover corner cases
-1 | ||
} else { | ||
|
||
//grab first active index and value by default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry but looks like this can be removed.
I think these last changes SHOULD catch all the corner cases lol. Hopefully I thought of all the cases to check for in the unit tests though wouldnt surprise me if I missed one still. |
@MechCoder @mengxr Any thoughts on this latest commit? |
} | ||
|
||
// look for inactive values in case all active node values are negative | ||
if(size != values.size && maxValue <= 0){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space after if
After the minor comments LGTM. cc @mengxr |
Test build #36704 has finished for PR 6112 at commit
|
Just saw it failed the style check. I'll look into that. |
} | ||
|
||
// look for inactive values in case all active node values are negative | ||
if (size != values.size && maxValue <= 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can implement this way:
var k = 0
while (k < indices.length && indices(k) == k) {
k += 1
}
maxIdx = if (maxValue == 0.0 && k > maxIdx) maxIdx else k
@GeorgeDittmar You might also need to update MimaExcludes. See 34d448d for example. |
@mengxr Ok i'll look over all that this evening. |
@mengxr is the MimaExcludes used for keeping builds clean between versions? |
Test build #36900 has finished for PR 6112 at commit
|
Test build #36902 has finished for PR 6112 at commit
|
I need to look into these MiMa tests it looks like. |
@mengxr I added what I think was needed to the MimaExcludes but not sure if I did it correctly or not. |
Test build #37324 has finished for PR 6112 at commit
|
Not sure why unit tests are failing. when I run them locally for the code in VectorSuite it runs all green. I'll keep trying locally. |
@GeorgeDittmar I will try to fix it and send you a PR. |
@mengxr Sounds good. I'll keep trying to sort out wahts going on. looks like when I run the test script locally it is now failing out so hoping I can get a more detailed log there. |
@GeorgeDittmar I sent you a PR at GeorgeDittmar#1. |
update argmax impl
Test build #37694 has finished for PR 6112 at commit
|
test this please |
Test build #37722 has finished for PR 6112 at commit
|
@mengxr Why did it not pass the first time after I merged your pr but now passes? |
LGTM. Merged into master. Thanks! It didn't pass the test because of some other flaky test. My updates didn't affect the correctness, just did some optimization. |
Modifying Vector, DenseVector, and SparseVector to implement argmax functionality. This work is to set the stage for changes to be done in Spark-7423.