
[SPARK-7422][MLLIB] Add argmax to Vector, SparseVector #6112

Closed

Conversation

GeorgeDittmar
Contributor

Modifies Vector, DenseVector, and SparseVector to implement argmax functionality. This work sets the stage for the changes to be done in SPARK-7423.
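For context, a minimal usage sketch of the intended API once this lands (assuming the method is exposed as argmax on Vector, returning the index of the maximum element):

import org.apache.spark.mllib.linalg.Vectors

// Dense case: 3.0 is the largest value, at index 1.
val dv = Vectors.dense(1.0, 3.0, 2.0)
assert(dv.argmax == 1)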

@GeorgeDittmar
Contributor Author

Associated JIRA: https://issues.apache.org/jira/browse/SPARK-7422

var maxIdx = indices(0)
var maxValue = values(0)

foreachActive { (i, v) =>
Contributor

This is not correct if all nonzeros are negative and there exist inactive (zero) entries.

Contributor Author

"There exist inactive (zero) entries": do you mean we have values in the sparse vector that are zero but are marked as not active? I thought the sparse vector has you define which indices hold active elements, and that indices.size must equal values.size, so I am not sure how you get into the state you are describing. Do you have an example?

Contributor Author

Wait, are you getting at the case where I have a sparse vector with a zero in it at an assigned index? Like indices = (1, 2, 3), values = (-1, 0, -0.5)? If that's the case, then yes, I now see what you are saying.

Contributor

The bug still exists. Take the sparse vector val sv = SparseVector(5, Array(1, 3), Array(-1, -5)), for example: sv.argmax should return one of {0, 2, 4}, because 0 is the max value in this vector.

Contributor Author

I added unit tests to cover this and they appear to be passing, so maybe I am totally misunderstanding the bug you are describing? Or, more likely, I am misunderstanding how the sparse vector implementation treats inactive vs. active elements?

Contributor

Sorry, I didn't explain this clearly. All inactive values in a sparse vector are zeros. The edge case here is that zero could be the max value of the entries. For example, [-1.0, 0.0, -3.0].argmax == 1, but 0.0 doesn't appear in foreachActive if the sparse vector is SparseVector(3, Array(0, 2), Array(-1.0, -3.0)). If we only look at the active values, the argmax would be 0, as -1.0 is the max among active values. So we need to cover the following cases if all active values are negative:

  1. if the number of active entries is the same as the vector size (i.e., no inactive entries), use the current max and its index;
  2. if there are inactive entries, find one and output its index (sketched below).
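A minimal sketch of these two cases, written as a standalone helper over a sparse vector's size, indices, and values (hypothetical illustration code, not the exact patch in this PR; indices is assumed sorted in increasing order):

def sparseArgmax(size: Int, indices: Array[Int], values: Array[Double]): Int = {
  if (values.isEmpty) {
    0 // every entry is an inactive zero, so index 0 holds the max
  } else {
    // Scan the active entries for the max (case 1).
    var maxIdx = indices(0)
    var maxValue = values(0)
    var i = 1
    while (i < values.length) {
      if (values(i) > maxValue) {
        maxValue = values(i)
        maxIdx = indices(i)
      }
      i += 1
    }
    // Case 2: the active max is negative and inactive entries exist, so
    // the first inactive index wins, because its implicit 0.0 is larger.
    if (maxValue < 0 && values.length < size) {
      var k = 0
      while (k < indices.length && indices(k) == k) k += 1
      maxIdx = k
    }
    maxIdx
  }
}

For the example above, sparseArgmax(3, Array(0, 2), Array(-1.0, -3.0)) returns 1, the first inactive index. A tie between an active 0.0 and an inactive zero is a further corner case that comes up later in the thread.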

Contributor Author

OK, yeah, it sounds like I misunderstood the inactive vs. active entry concept. I'll get a fix in for that. I was assuming inactive meant we wanted to just ignore those entries and do the argmax over the active portion.

@GeorgeDittmar
Contributor Author

The new commit should handle the case @mengxr was talking about.

@GeorgeDittmar
Contributor Author

The latest commit does not fix this yet; it is just initial work towards the solution.

@GeorgeDittmar
Contributor Author

@mengxr The latest commit should handle all cases now. Not sure why the commit doesn't have my message attached to it, though...

// look for inactive values incase all active node values are negative
if(size != values.size && maxValue < 0){
maxIdx = calcInactiveIdx(indices(0))
maxValue = 0
Contributor

I think this line can be removed, since only maxIdx is used.

Contributor Author

I mostly kept it in there for clarity, in case anyone debugging through the code wants to easily see what the associated idx and val are. But I can remove it if it's just too much clutter.

-1
} else {

//grab first active index and value by default
Contributor

I'm sorry, but it looks like this can be removed.

@GeorgeDittmar
Contributor Author

I think these last changes SHOULD catch all the corner cases, lol. Hopefully I thought of all the cases to check for in the unit tests, though it wouldn't surprise me if I missed one still.

@GeorgeDittmar
Contributor Author

@MechCoder @mengxr Any thoughts on this latest commit?

}

// look for inactive values in case all active node values are negative
if(size != values.size && maxValue <= 0){
Contributor

Space after if

@MechCoder
Contributor

After the minor comments, LGTM.

cc @mengxr

@SparkQA

SparkQA commented Jul 7, 2015

Test build #36704 has finished for PR 6112 at commit b22af46.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@GeorgeDittmar
Contributor Author

Just saw it failed the style check. I'll look into that.

}

// look for inactive values in case all active node values are negative
if (size != values.size && maxValue <= 0) {
Contributor

We can implement it this way:

var k = 0
while (k < indices.length && indices(k) == k) {
  k += 1
}
maxIdx = if (maxValue == 0.0 && k > maxIdx) maxIdx else k
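Here, since indices is sorted in strictly increasing order, the loop exits at the smallest k that is not an active index, i.e., the first inactive (zero) entry. The last line then keeps the active maxIdx only when the active max is itself 0.0 and comes before that first inactive zero, so ties at zero resolve to the smallest index.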

@mengxr
Contributor

mengxr commented Jul 7, 2015

@GeorgeDittmar You might also need to update MimaExcludes. See 34d448d for example.
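For a method newly added to a public trait like Vector, the exclusion usually follows this pattern in project/MimaExcludes.scala (a sketch of the typical entry; the exact filter needed for this PR may differ):

// Exclude the binary-incompatibility warning for the new method:
ProblemFilters.exclude[MissingMethodProblem](
  "org.apache.spark.mllib.linalg.Vector.argmax")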

@GeorgeDittmar
Contributor Author

@mengxr OK, I'll look over all of that this evening.

@GeorgeDittmar
Contributor Author

@mengxr Is MimaExcludes used for keeping builds clean between versions?

@SparkQA

SparkQA commented Jul 9, 2015

Test build #36900 has finished for PR 6112 at commit 42341fb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 9, 2015

Test build #36902 has finished for PR 6112 at commit 5fd9380.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@GeorgeDittmar
Contributor Author

It looks like I need to look into these MiMa tests.

@GeorgeDittmar
Contributor Author

@mengxr I added what I think was needed to MimaExcludes, but I'm not sure whether I did it correctly.

@SparkQA

SparkQA commented Jul 15, 2015

Test build #37324 has finished for PR 6112 at commit 2ea6a55.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@GeorgeDittmar
Contributor Author

Not sure why the unit tests are failing; when I run them locally for the code in VectorSuite, they all pass. I'll keep trying locally.

@mengxr
Contributor

mengxr commented Jul 16, 2015

@GeorgeDittmar I will try to fix it and send you a PR.

@GeorgeDittmar
Contributor Author

@mengxr Sounds good. I'll keep trying to sort out what's going on. It looks like the test script now fails when I run it locally, so I'm hoping I can get a more detailed log there.

@mengxr
Contributor

mengxr commented Jul 17, 2015

@GeorgeDittmar I sent you a PR at GeorgeDittmar#1.

@SparkQA

SparkQA commented Jul 18, 2015

Test build #37694 has finished for PR 6112 at commit 3e0a939.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Jul 18, 2015

test this please

@SparkQA

SparkQA commented Jul 18, 2015

Test build #37722 has finished for PR 6112 at commit 3e0a939.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@GeorgeDittmar
Contributor Author

@mengxr Why did it not pass the first time after I merged your PR, but passes now?

@asfgit closed this in 3f7de7d on Jul 20, 2015
@mengxr
Contributor

mengxr commented Jul 20, 2015

LGTM. Merged into master. Thanks! It didn't pass the test before because of some other flaky test. My updates didn't affect correctness; I just did some optimization.
