
Correct conversion of Spark model stages into MLeap local models #261

Merged
merged 12 commits into from
Apr 6, 2019

Conversation

tovbinm
Collaborator

@tovbinm tovbinm commented Apr 3, 2019

Related issues

  1. Previously, the conversion of Spark model stages into MLeap local models was done without providing all the necessary metadata, which resulted in exceptions. For example, when converting a Spark StringIndexerModel into an MLeap model, it expected ml_attr metadata to be present in the transformed DataFrame.
  2. Reflecting the apply method on MLeap models did not work correctly, since many models have more than one apply method.
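The reflection problem in point 2 can be reproduced in a few lines: when a class overloads apply, looking the method up reflectively by name yields multiple candidates, so there is no single method to invoke. A minimal sketch using a hypothetical ToyModel (not real MLeap code):

```scala
// Hypothetical model with overloaded apply methods, mirroring the ambiguity
// that MLeap local models exhibit. Not real MLeap code.
class ToyModel {
  def apply(x: Double): Double = x * 2
  def apply(xs: Seq[String]): Int = xs.length
}

object ReflectionAmbiguity extends App {
  // Reflection returns every public method named "apply" - more than one here,
  // so picking "the" apply method by name alone is not well defined.
  val applies = classOf[ToyModel].getMethods.filter(_.getName == "apply")
  println(s"apply overloads found: ${applies.length}")
}
```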

Describe the proposed solution

  1. Provide a transformed DataFrame with all the necessary metadata to allow creation of MLeap local models.
  2. Get rid of reflection on the apply method of MLeap models and explicitly convert MLeap models into scoring functions.
  3. Add tests.
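Point 2 amounts to replacing reflective dispatch with an explicit pattern match from model to scoring function (Array[Any] => Any). A minimal sketch with hypothetical stand-in models (not the real MLeap classes):

```scala
// Sketch of converting models to scoring functions explicitly, instead of
// reflecting on their apply methods. The model classes here are hypothetical
// stand-ins for MLeap local models.
case class StringMapModel(labels: Map[String, Double]) {
  def apply(s: String): Double = labels(s)
}
case class WordLengthFilterModel(minLength: Int) {
  def apply(words: Seq[String]): Seq[String] = words.filter(_.length >= minLength)
}

object ModelToFunction {
  // Each supported model maps to a concrete scoring function; unknown models fail fast.
  def modelToFunction(model: Any): Array[Any] => Any = model match {
    case m: StringMapModel        => x => m(x(0).asInstanceOf[String])
    case m: WordLengthFilterModel => x => m(x(0).asInstanceOf[Seq[String]])
    case m => throw new RuntimeException(s"Unsupported model: ${m.getClass.getName}")
  }
}
```

The trade-off is that every supported model must appear in the match, but in exchange the input casts are explicit and failures surface at conversion time rather than deep inside reflection.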

Describe alternatives you've considered
N/A

@codecov

codecov bot commented Apr 3, 2019

Codecov Report

Merging #261 into master will decrease coverage by 0.3%.
The diff coverage is 29.62%.


@@            Coverage Diff             @@
##           master     #261      +/-   ##
==========================================
- Coverage   86.67%   86.36%   -0.31%     
==========================================
  Files         317      318       +1     
  Lines       10403    10447      +44     
  Branches      322      552     +230     
==========================================
+ Hits         9017     9023       +6     
- Misses       1386     1424      +38
Impacted Files Coverage Δ
...om/salesforce/op/local/OpWorkflowRunnerLocal.scala 100% <ø> (ø) ⬆️
.../op/features/types/FeatureTypeSparkConverter.scala 99.11% <100%> (+0.01%) ⬆️
.../com/salesforce/op/local/MLeapModelConverter.scala 5.12% <5.12%> (ø)
...com/salesforce/op/local/OpWorkflowModelLocal.scala 97.36% <92.3%> (-2.64%) ⬇️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 394b4bd...1a3d9fa. Read the comment docs.

@tovbinm tovbinm requested a review from wsuchy April 3, 2019 18:13
@leahmcguire
Collaborator

So now we have to spool up Spark to use this sparkless scoring?? Could you get the necessary info another way? E.g. serialize the DataFrame schema along with the model?

case m: VectorSlicerModel => x => m.apply(x(0).asInstanceOf[Vector])
case m: WordLengthFilterModel => x => m.apply(x(0).asInstanceOf[Seq[String]])
case m: WordToVectorModel => x => m.apply(x(0).asInstanceOf[Seq[String]])
case m => throw new RuntimeException(s"Unsupported MLeap model: ${m.getClass.getName}")
Collaborator

so every wrapped Spark stage has to be in this list? We should add that to the docs on wrapping...

Collaborator Author


I currently added all the stages from the features package. We can also add models from the classification, regression and recommendation packages, but we already have the first two covered as our own OpTransformer stages, so I did not see much of a point in adding them.

Collaborator Author


So, to your question: for right now I think we have everything covered, except recommenders, which I am planning to add once we are ready.

Collaborator


Can you add a todo with the classification and regression models? I don't know that this will be much use without them...

Collaborator Author


Adding those is very easy. The thing is, we already have classification and regression models as OpTransformers, so MLeap won't be used to run them.

Collaborator


good point :-)

@tovbinm
Collaborator Author

tovbinm commented Apr 3, 2019

@leahmcguire we used to do that before when loading Spark stages. Now I just explicitly exposed the ability for users to control the Spark session lifecycle.

The only way to avoid the Spark session entirely is to export our models into MLeap format, which is indeed a possibility and I am open to discussing it.

As of right now, local scoring assumes the model format as we have it now (i.e. JSON + Parquet files).
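The lifecycle point can be illustrated with a dependency-free loan-style sketch: the caller owns the session, uses it only while loading the model, and stops it before scoring. Session, loadScorer, and the scoring logic below are hypothetical stand-ins, not the real SparkSession or OpWorkflowRunnerLocal API.

```scala
// Dependency-free sketch of the session lifecycle: the caller controls when
// the (hypothetical) Session starts and stops; scoring needs no session.
final class Session {
  private var open = true
  def isOpen: Boolean = open
  def stop(): Unit = open = false
}

object LocalScoring {
  // Loading requires a live session; the returned scorer does not capture it,
  // so the session can be stopped before any rows are scored.
  def loadScorer(session: Session): Map[String, Any] => Double = {
    require(session.isOpen, "session must be open to load the model")
    row => row.size.toDouble // stand-in for real scoring logic
  }
}
```

Usage would look like: create the session, load, stop the session, then score as many rows as needed without it.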

@tovbinm tovbinm merged commit 3a1a2d8 into master Apr 6, 2019
@tovbinm tovbinm deleted the mt/mleap-models branch April 6, 2019 01:27
@tovbinm tovbinm mentioned this pull request Apr 10, 2019