[SPARK-19066][SPARKR][Backport-2.1] LDA doesn't set optimizer correctly #16623
Conversation
Could you change the PR title to
Changed. Thanks!
@yanboliang ported it to 2.1. Thanks!
Test build #71527 has finished for PR 16623 at commit
Test build #71542 has finished for PR 16623 at commit
LGTM
merged to 2.1. thanks!
## What changes were proposed in this pull request?

Back port the fix to SPARK-19066 to 2.1 branch.

## How was this patch tested?

Unit tests

Author: [email protected] <[email protected]>

Closes #16623 from wangmiao1981/bugport.
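The fix being backported is a single line. As an illustration only (this is a hypothetical Python sketch, not Spark source code), the failure mode behind SPARK-19066 can be pictured as a front-end wrapper that silently drops a user-supplied `optimizer` setting instead of forwarding it to the backing LDA implementation:

```python
# Hypothetical sketch of the SPARK-19066 failure mode; names are illustrative,
# not the actual SparkR or MLlib APIs.

class BackendLDA:
    """Stand-in for the backend LDA estimator."""
    def __init__(self):
        self.optimizer = "online"  # backend default

    def set_optimizer(self, value):
        if value not in ("online", "em"):
            raise ValueError("optimizer must be 'online' or 'em'")
        self.optimizer = value


def fit_lda_buggy(optimizer="online"):
    model = BackendLDA()
    # Bug: the user's choice is never forwarded, so the backend
    # always trains with its default ("online") optimizer.
    return model


def fit_lda_fixed(optimizer="online"):
    model = BackendLDA()
    model.set_optimizer(optimizer)  # the one-line fix: forward the setting
    return model


print(fit_lda_buggy(optimizer="em").optimizer)  # "online" - the bug
print(fit_lda_fixed(optimizer="em").optimizer)  # "em", as requested
```

This also shows why the later discussion treats the change as both a bug fix and a behavior change: workloads that passed `optimizer = "em"` were silently trained with the online optimizer before the fix, and switch to EM after it.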
@wangmiao1981 could you close this PR - backport doesn't get closed automatically.
Close this PR as it has been merged to 2.1. Thanks!
@felixcheung Just saw this was backported to 2.1.1. Since this is a fairly significant behavioral change, I recommend we revert this backport. I could imagine workloads working with EM but failing with online LDA, and we should be careful about having such failures in patch versions (2.1.0 -> 2.1.1). Let's wait until the next major release (2.2.0) instead.

I've also seen other PRs adding public APIs to SparkR for 2.1.1. I understand that SparkR is still fairly experimental, but I recommend we start treating it like other parts of Spark in terms of API stability. We can talk about exceptions when they are really necessary.

Can you please revert this patch? Thanks!
Surely. This is a small enough change (one line) that we could revert easily.
But just so I understand, this isn't just a behavior change per se - this is the API not doing what is documented. Isn't this something we should fix in a patch release?
Hm, true, this is a weird case, where it is somewhere between a behavior change and a bug fix. You're right; let's not revert this patch. I do worry about other patches in SparkR, though; I'll try to be more vigilant about API stability for SparkR in future patches. Thanks!
@jkbradley Sure - I understand your perspective on the API state, and we share the same view. To be clear, we should avoid breaking API changes, and new APIs should go into minor releases, not patch releases. But there need to be exceptions to the rule - SparkR is nowhere close to ready or stable, and there are many gaps we need to fill as soon as possible. Or we might as well abandon it.

Here's the full list of every single commit in R in branch-2.1 since its release:
- Blocker for R package release
- Blocker, API doesn't work
- Minor changes, usability on running SparkR as a package
- Gap - no way to manage partitions, no API to do that
- Gap - usability, very un-R like
- "New API" - operators are already there, we are extending them
- Yes, somewhat questionable, a new parameter but fixing an issue
Thanks for cc'ing me on this - I think @jkbradley has a good point that we should be a bit more explicit in discussing when / why we backport changes in SparkR. While we have not declared it a stable interface, I think the number of users who are using it is large enough now that there is some expectation of stability.

I think we can easily separate the fixes we backport into a couple of buckets. The first group are changes like 7763b0b, which are internal changes that don't affect the API, or bug fixes correcting the behavior of an existing API. My reading of our versioning policy is that these are expected in a feature release. The second group are changes that either add or modify an API (we should never backport something that removes an API). For the modifications, let's explicitly discuss in the JIRA / GitHub PR why the change is necessary / a good idea, and also whether it's source compatible (e.g., 06e77e0 is source compatible).

The other thing we could do is add JIRA labels like regression / bug-fix / api-modify to each JIRA that gets backported. Does that sound like something that would be useful @jkbradley @felixcheung ?
Thanks for the comments. I definitely agree with many of your combined statements:
If I'm a user of SparkR, I agree I'd expect it to be less stable than the rest of Spark. But apparent instability from changing APIs can also scare users away from putting SparkR into actual use. It's a balance, but I'd prefer to err on the side of stability. I do like the idea of tagging backports, but it's OK with me not to, as long as there is a proper explanation of the reason within the JIRA.