Performance degradation on tunneled connection #115

Closed
bAndie91 opened this issue Jul 30, 2018 · 17 comments

@bAndie91

I've experienced a performance degradation after upgrading from 2.2.2 to 2.3.7.
See the measurements in the attachment, which were made by a Spark application calling spark.catalog.listTables().
The newer WD is 3 times slower, impacting the SSH-tunneled connections (see highlighted rows) the most.
[image: attached measurement table]

How much of it can be eliminated?
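
For anyone wanting to reproduce this kind of measurement, a minimal timing sketch in Python follows. The helper name `time_runs` and the run count are assumptions for illustration and not part of the original test. The commented usage assumes an existing PySpark `SparkSession` named `spark` that is configured to reach the Waggle Dance endpoint.

```python
import time
from typing import Callable, List


def time_runs(action: Callable[[], object], runs: int = 2) -> List[float]:
    """Call `action` `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        action()  # e.g. a metastore-bound catalog call
        durations.append(time.perf_counter() - start)
    return durations


# Hypothetical usage, assuming a SparkSession called `spark` already exists:
#   durations = time_runs(lambda: spark.catalog.listTables(), runs=2)
#   print([f"{d:.1f} s" for d in durations])
```

Running the same helper against each WD version, once over a direct connection and once over the tunnel, gives numbers that can be compared side by side.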

@patduin
Contributor

patduin commented Jul 30, 2018

Any chance you can narrow down the version range? That would really help.

@rambrus
Member

rambrus commented Aug 1, 2018

@bAndie91 , @patduin: I did some investigation and found this:

Test Case 1 - listTables on non-tunneled connection

| WD version | Run 1 | Run 2 |
|------------|-------|-------|
| 2.3.0      | 20 s  | 19 s  |
| 2.3.1      | 20 s  | 23 s  |
| 2.3.2      | 25 s  | 24 s  |
| 2.3.3      | 36 s  | 32 s  |
| 2.3.4      | 28 s  | 25 s  |
| 2.3.5      | 44 s  | 56 s  |
| 2.3.6      | 48 s  | 46 s  |

[image: chart of Test Case 1 durations by WD version]

Test Case 2 - listTables on tunneled connection

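| WD version | Run 1     | Run 2     |
|------------|-----------|-----------|
| 2.3.0      | 5 m 37 s  | 4 m 52 s  |
| 2.3.1      | 4 m 59 s  | 5 m 04 s  |
| 2.3.2      | 5 m 15 s  | 5 m 11 s  |
| 2.3.3      | 7 m 27 s  | 7 m 12 s  |
| 2.3.4      | 5 m 14 s  | 5 m 15 s  |
| 2.3.5      | 13 m 02 s | 13 m 27 s |
| 2.3.6      | 12 m 33 s | 13 m 07 s |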

[image: chart of Test Case 2 durations by WD version]

Summary
I did 2 runs for each test case. The test durations are fairly consistent between runs, and we can observe a ~150% performance degradation between the 2.3.4 and 2.3.5 releases.

IMPORTANT: The performance degradation does not seem to be specific to tunneled connections; the same trend can be observed in both cases.
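
The size of the jump between 2.3.4 and 2.3.5 can be checked directly from the two tables above. The sketch below averages the two runs per version and computes the relative increase; the helper names are made up for illustration, and only the durations come from the tables.

```python
from typing import Sequence


def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'M m S s' duration from the tables into seconds."""
    return minutes * 60 + seconds


def percent_increase(before: Sequence[float], after: Sequence[float]) -> float:
    """Relative increase of the mean of `after` over the mean of `before`, in percent."""
    mean_before = sum(before) / len(before)
    mean_after = sum(after) / len(after)
    return (mean_after - mean_before) / mean_before * 100


# Test Case 1 (non-tunneled): 2.3.4 vs 2.3.5, both runs
non_tunneled = percent_increase([28, 25], [44, 56])

# Test Case 2 (tunneled): 2.3.4 vs 2.3.5, both runs
tunneled = percent_increase(
    [to_seconds(5, 14), to_seconds(5, 15)],
    [to_seconds(13, 2), to_seconds(13, 27)],
)

print(f"non-tunneled: +{non_tunneled:.0f}%")  # prints "non-tunneled: +89%"
print(f"tunneled: +{tunneled:.0f}%")          # prints "tunneled: +153%"
```

Averaged over the two runs, the ~150% figure matches the tunneled case most closely, while the non-tunneled case comes out at roughly 89%.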

@patduin
Contributor

patduin commented Aug 1, 2018

Excellent work!
Got one more question: in your WD configuration, are all metastores reachable and responding, or is one of them down?

@rambrus
Member

rambrus commented Aug 1, 2018

@bAndie91 can you please answer the question above?

I also checked the Spark logs for any unusual errors.
I have seen this error many times in the logs starting from v2.3.3:
[image: screenshot of the logged error]

This might also be interesting.

@patduin
Contributor

patduin commented Aug 1, 2018

I think I know what is going on: a change I made related to #73 does an extra call to verify the connection is open.
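
Why one extra call per request can add up so quickly can be illustrated with a rough latency model. Everything in this sketch is an assumption for illustration: the function name, the request count, and the round-trip times are invented and were not measured from Waggle Dance. It only shows that doubling the round trips per request doubles the network wait on any connection, while the absolute cost is far larger on a high-latency link such as an SSH tunnel.

```python
def total_seconds(requests: int, round_trips_per_request: int, rtt_ms: float) -> float:
    """Estimated time spent waiting on the network, ignoring server-side work."""
    return requests * round_trips_per_request * rtt_ms / 1000.0


REQUESTS = 1000  # hypothetical number of metastore requests in one job

# Hypothetical round-trip times for a direct and a tunneled connection
for label, rtt_ms in [("direct", 10.0), ("tunneled", 150.0)]:
    before = total_seconds(REQUESTS, 1, rtt_ms)  # one round trip per request
    after = total_seconds(REQUESTS, 2, rtt_ms)   # plus one extra verification call
    print(f"{label}: {before:.0f} s -> {after:.0f} s (+{after - before:.0f} s)")
```

Under these made-up numbers the relative slowdown is the same for both connection types, which is in line with the same trend showing up in both test cases.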

@patduin
Contributor

patduin commented Aug 1, 2018

Can I ask you to try to build and run this branch: https://github.com/HotelsDotCom/waggle-dance/tree/issue-115
I suspect the changed line is what is causing the issue, or at least I want to rule this out.
I'll also make an internal ticket for us to set up performance tests, as we should be catching these issues. Apologies for that.

@rambrus
Member

rambrus commented Aug 1, 2018

@patduin Sure, will check that branch and get back to you with the results.

@bAndie91
Author

bAndie91 commented Aug 1, 2018

@patduin all the metastore connections were AVAILABLE during the test runs.

@rambrus
Member

rambrus commented Aug 2, 2018

@bAndie91 , @patduin

I re-ran the test cases on the version built from the issue-115 branch and added the results to the charts.

Test Case 1 - listTables on non-tunneled connection

[image: chart of Test Case 1 durations, including the issue-115 build]

Test Case 2 - listTables on tunneled connection

[image: chart of Test Case 2 durations, including the issue-115 build]

I can confirm that the fix resolves the performance degradation issue.

Also, the "RetryingMetaStoreClient:184 - MetaStoreClient lost connection. Attempting to reconnect." warning disappeared from the logs.

@patduin
Contributor

patduin commented Aug 2, 2018

ok thanks really helpful! We'll need to find some other way to fix #73 without introducing the performance hit. Not sure yet how but at least we know what is going on :)

@rambrus
Member

rambrus commented Aug 2, 2018

cheers, let us know when the fix is available, we are happy to take a quick look at the performance.

@patduin
Contributor

patduin commented Aug 2, 2018

Will do and thanks!

@patduin
Contributor

patduin commented Aug 8, 2018

@rambrus I've updated the branch. I've managed to avoid the issue for normal connections, but you'll still see the degradation in tunneled connections. I haven't found a way to work around this without sacrificing functionality. It would be great if you could test this. We could at least release this and, if the performance is a big issue, focus on that in some future PR.

@rambrus
Member

rambrus commented Aug 9, 2018

@patduin sure, will take a look and get back to you with results.

@rambrus
Member

rambrus commented Aug 10, 2018

@patduin: I executed the tests on 4791baf and added the results to the charts.

[image: updated results chart]

[image: updated results chart]

I can see some performance degradation in both cases, but it's not as critical as in v2.3.5.

@patduin
Contributor

patduin commented Aug 13, 2018

Yeah, I can't really account for that. We merged the PR with the changes and will try to make a release this week.

@teabot teabot changed the title performance degradation on tunneled connection Performance degradation on tunneled connection Aug 15, 2018
@patduin
Contributor

patduin commented Aug 21, 2018

This is addressed in the 2.4.2 release. If the performance is still an issue, please reopen or open a new ticket. Closing this.

@patduin patduin closed this as completed Aug 21, 2018