XGBoost MLeap bundle speed #833

Open
drei34 opened this issue Nov 16, 2022 · 5 comments

@drei34

drei34 commented Nov 16, 2022

Hi, I originally asked the question below in another repo that I think is no longer active, so I'm pasting it here. Basically, I don't understand why the MLeap XGBoost bundle seems to run faster with bigger batch sizes. I assume it is threading, but I'm not sure. Could you confirm that, and tell me whether I can turn off such optimizations? I am trying to compare against something that is currently unoptimized.

image

@jsleight
Contributor

This depends on which xgboost runtime you are using. We have two xgboost runtimes:

  • [default] dmlc xgboost, which is C++ behind the scenes. DMLC xgboost does have threading (search xgboost's docs to figure out how to control that). In addition to the threading, dmlc xgboost's prediction performance is better for batch operations because a) we can streamline the JNI transfer and b) xgboost's C++ uses AVX instructions to literally batch computations.
  • XGBoost-Predictor, which is a pure-Java re-implementation of xgboost's inference engine. This typically has much better single-row prediction times (because there are no JNI transfers), but is less performant for batch prediction since it doesn't do threading or AVX vectorization.

https://github.com/combust/mleap/tree/master/mleap-xgboost-runtime has details on how to swap between them.

P.S. I'm guessing your chart is showing the stats per row and not the aggregate for the whole batch? I.e., the mean time for batch_size=20 is 0.625*20 in aggregate. It would be pretty surprising to me if predict(50_rows) completed faster than predict(1_row).
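
The per-row versus aggregate distinction can be made concrete with a small calculation. This is only a sketch: the class and method names are hypothetical, and the 0.625 ms figure is the one quoted above, read as a per-row mean purely for illustration (as the P.S. guesses).

```java
final class BatchTiming {
    // Aggregate time of one predict() call, if meanPerRowMs is a per-row mean.
    public static double aggregateMs(double meanPerRowMs, int batchSize) {
        return meanPerRowMs * batchSize;
    }

    // Per-row time, if totalMs is the time of one predict() call on the whole batch.
    public static double perRowMs(double totalMs, int batchSize) {
        return totalMs / batchSize;
    }

    public static void main(String[] args) {
        // Reading 0.625 ms as a per-row mean at batch_size=20:
        System.out.println(aggregateMs(0.625, 20)); // 12.5
        // Reading 0.625 ms as the time of the whole 20-row call instead:
        System.out.println(perRowMs(0.625, 20));    // 0.03125
    }
}
```

The two readings differ by a factor of the batch size squared, so it matters which one a chart is reporting.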

@drei34
Author

drei34 commented Nov 16, 2022

Thanks! I ran 1000 iterations for each fixed batch size. For example, 1000 iterations at batch size 1 took 1.05 * 1000 ms; at batch size 20 it was 0.625 * 1000 ms, and at batch size 50 it was 0.468 * 1000 ms. So yes, I'm showing predict(50_rows) < predict(1_rows), which is the curious part. Is this not expected? Do you have a Slack channel, by the way?

@drei34
Author

drei34 commented Nov 16, 2022

To be clear, I am building a Transformer in Java from the MLeap bundle and then measuring the prediction time for data frames of a fixed size that I generate. That is what gives me this counterintuitive result.

image

image

@jsleight
Contributor

I definitely would not expect predict(50_rows) < predict(1_rows).

predict(50_rows) / 50 < predict(1_rows) would obviously make sense.
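
One way to see why the per-row inequality is expected while the aggregate one is not is a simple cost model. The sketch below assumes each predict call pays a fixed overhead (for example, a JNI transfer) plus a constant per-row cost; the class name and both numbers are made up for illustration and are not measurements of MLeap or xgboost.

```java
final class BatchCostModel {
    // Hypothetical model: total = fixed per-call overhead + rows * per-row cost.
    public static double totalMs(double overheadMs, double perRowMs, int rows) {
        return overheadMs + perRowMs * rows;
    }

    public static void main(String[] args) {
        double overheadMs = 1.0; // assumed, not measured
        double perRowMs = 0.25;  // assumed, not measured

        double one = totalMs(overheadMs, perRowMs, 1);
        double fifty = totalMs(overheadMs, perRowMs, 50);

        System.out.println("predict(1_rows)       = " + one + " ms");
        System.out.println("predict(50_rows)      = " + fifty + " ms");
        System.out.println("predict(50_rows) / 50 = " + (fifty / 50) + " ms per row");
    }
}
```

Under this model the overhead is amortized across the batch, so the per-row time falls as the batch grows, but the total time of a call can never fall. A measured predict(50_rows) < predict(1_rows) therefore points at something outside the model.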

The only ideas I have are some weirdness in the benchmarking setup, like cache warming, startup, etc. If you're not already using it, JMH is usually helpful for eliminating that kind of noise in benchmarks.
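
JMH is the better tool for this, but the warm-up effect it guards against can be illustrated without it. The following is a plain-Java sketch, not a JMH benchmark: `fakePredict` is a hypothetical stand-in for the real transformer call, and the iteration counts are arbitrary. It runs untimed warm-up calls before measuring, so that JIT compilation and cold caches are less likely to land in the measured window.

```java
import java.util.function.IntToDoubleFunction;

final class WarmupTimer {
    // Mean wall-clock milliseconds per call of task.applyAsDouble(batchSize),
    // measured only after `warmup` untimed calls have run.
    public static double meanMsPerCall(IntToDoubleFunction task, int batchSize,
                                       int warmup, int measured) {
        double sink = 0.0; // accumulate results so the calls are not dead code
        for (int i = 0; i < warmup; i++) {
            sink += task.applyAsDouble(batchSize);
        }
        long start = System.nanoTime();
        for (int i = 0; i < measured; i++) {
            sink += task.applyAsDouble(batchSize);
        }
        long elapsedNs = System.nanoTime() - start;
        if (Double.isNaN(sink)) {
            System.out.println("unexpected NaN"); // keeps `sink` observable
        }
        return elapsedNs / 1e6 / measured;
    }

    // Hypothetical stand-in for predict(batch): work grows with the row count.
    public static double fakePredict(int rows) {
        double acc = 0.0;
        for (int r = 0; r < rows; r++) {
            for (int k = 1; k <= 1000; k++) {
                acc += Math.sqrt(r + k);
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        for (int batch : new int[] {1, 20, 50}) {
            double ms = meanMsPerCall(WarmupTimer::fakePredict, batch, 2000, 1000);
            System.out.printf("batch=%d mean=%.4f ms per call%n", batch, ms);
        }
    }
}
```

If the batch sizes are benchmarked in a fixed order without warm-up, the first one measured tends to absorb the startup cost, which is one way a 1-row call could appear slower than a 50-row call.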

@drei34
Author

drei34 commented Nov 17, 2022

Right, I'm also a bit weirded out by this. But in production, too, the worst latencies I saw seem to come from requests with a single feature row; requests with more feature rows seem to do better (a large batch has better latency than 1 row, predict(50_rows) < predict(1_row)). So the benchmark confirms what I see in production, but it does not make sense, and I'm trying to understand it.

Is it possible that when batches are small, a large number of threads spin up and then "wait" to wind down, and that this causes some sort of inefficiency? I have not used JMH yet, but I'm also loading another model, and for that model the latency grows as the number of rows grows, which makes sense (predict(50_rows) > predict(1_rows)). The only explanation I can come up with currently is that the threading inside the bundle has some optimization specific to larger batches that is detrimental to smaller ones. I can try JMH and come back, or maybe we could do a quick Zoom?
