
StackExchange.Redis: Version 2.x Planning #528

Closed
11 of 12 tasks
NickCraver opened this issue Nov 20, 2016 · 15 comments
NickCraver (Collaborator) commented Nov 20, 2016

Marc and I have been considering a few things long-term which are breaking changes and need to come in the form of a major version change. I'm kicking up this issue to track all of that in one spot. This will get a lot of love and rapid changes as we go through the issue list and jot down 2.x plans. I'm posting it here for public discussion so everyone can follow along and pitch ideas.

Major changes:

API/Misc TODOs:

Maybes:

Why?

  • The .NET 4.0 drop is because the backwards-compat libraries for async in .NET 4.0 are falling further and further behind the proper BCL. Things like async/await on semaphores for more efficient locking aren't there. It's also just a lot of #ifdef code to maintain.
  • While Pipelines will bring some performance enhancements in our case (and every allocation saved helps), what we're really after is stability and better handling of many network connectivity issues. You can preview some of what it looks like here.

If you have ideas or issues I've missed, please post. We can't keep up with all of the projects we have, and the netstandard, project.json, and .csproj moves are extra demanding at the moment. We appreciate all the help the community provides. Thank you!

paveliak (Contributor) commented Dec 7, 2016

It would be nice to have support for continuous tracing, so one can plug in an ETW tracing provider (or similar) and see StackExchange.Redis events in PerfView or a similar tool.

MichaCo (Contributor) commented Dec 23, 2016

How about a configurable retry mechanism for operations? I currently have to implement that myself, and you usually know best which exceptions should be treated as transient and which should always bubble up ;)
A circuit breaker would be the next thing that could help. Maybe it is too much, but take a look at libraries like https://github.com/App-vNext/Polly, which provide a lot of functionality around this. To be honest, I'm not using that library (I hardcoded retries wherever needed), but the way it works looks pretty nice.
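An app-level retry wrapper along these lines could look like the following. This is a hypothetical, language-agnostic sketch in Python purely for illustration (StackExchange.Redis itself is a .NET library); the `with_retry` helper and its parameters are invented, and it only retries operations the caller explicitly declares idempotent, for the reasons discussed below in this thread.

```python
import time


def with_retry(op, *, idempotent, attempts=3, delay=0.05):
    """Invoke op(), retrying transient failures only when the caller
    has declared the operation idempotent (safe to repeat)."""
    for attempt in range(attempts):
        try:
            return op()
        except (TimeoutError, ConnectionError):
            # Non-idempotent ops are never retried; final attempt re-raises.
            if not idempotent or attempt == attempts - 1:
                raise
            time.sleep(delay)  # simple fixed backoff for the sketch


# Usage: a flaky read that fails twice, then succeeds.
calls = {"n": 0}

def flaky_get():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "value"

result = with_retry(flaky_get, idempotent=True)  # succeeds on attempt 3
```

The key design point is that the idempotency decision lives with the caller, not the library, which matches the maintainers' position in the replies below.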

NickCraver (Collaborator, Author)

@MichaCo the problem is that we can't even imagine how that would work. It varies so much, not by operation type but by that specific operation. Some things are fine to repeat; others are totally breaking to do so.

Almost no global approach works; it has to be an app-level concern. The only thing I can possibly see here would be to have some retry bits on the specific call, via the command flags perhaps? Even then, the stage at which it failed isn't clear.

As an example: maybe your integer increment did reach redis, but the response was dropped. Or maybe it didn't get there at all! Is the retry the first or the second actual increment? We can't know in many cases... so the retry is really a per-caller concern.

We're all ears on thoughts here, but have no ideas with traction, which would be a precursor for any code plans.

MichaCo (Contributor) commented Dec 23, 2016

@NickCraver thanks for the response, makes sense to me. I totally get that it might be pretty hard to do globally; you are probably right that this should stay an app-level concern.

I'm mostly concerned about actual exceptions happening during a call to the server.
Currently there are three different exceptions (RedisServerException, RedisConnectionException, and TimeoutException), all of which can occur when the server dies, a master<->slave change happens, or nodes in a cluster fail and fall back; timeouts also occur if there is a lot of load on the server/network.

Just curious, but is

> As an example: maybe your integer increment did reach redis, but the response was dropped

really a thing? Hmm...

I'm primarily using Lua scripts, where what happens is pretty predictable: either the script ran, or it didn't ;)

NickCraver (Collaborator, Author)

@MichaCo unfortunately, that's not true. There are five network-level steps in the process for any command:

  1. We send to the socket
  2. Network trip to redis
  3. redis performs the op
  4. Network trip back
  5. We read the result

The failure could happen at any one of these steps. In some cases, all we know is that we didn't get a result back. So even in the Lua case, we can't be sure whether it ran. We know there was a failure somewhere between steps 1 and 5: if someone unplugged a network cable, was it on the way there, or on the way back?

Hopefully that serves as a better illustration of the problem - if not let me know!
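The ambiguity described above can be simulated in a few lines. This is a toy sketch in Python (not StackExchange.Redis code; the `FlakyCounter` class is invented for illustration): the increment really happens server-side at step 3, the response is lost on the way back, and a blind retry double-counts.

```python
class FlakyCounter:
    """Toy server whose increment succeeds, but whose response can be
    dropped on the way back (a failure between steps 3 and 5)."""

    def __init__(self):
        self.value = 0

    def incr(self, drop_response=False):
        self.value += 1  # step 3: the operation really happens
        if drop_response:
            # step 4/5: the reply never reaches the caller
            raise TimeoutError("response lost")
        return self.value


counter = FlakyCounter()
try:
    counter.incr(drop_response=True)  # caller only sees a timeout...
except TimeoutError:
    counter.incr()                    # ...so a blind retry increments again

# counter.value is now 2, although the caller intended a single increment.
```

From the caller's side, both runs look identical (a timeout), which is why the library cannot decide on its own whether a retry is safe.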

Crozin commented Mar 3, 2017

Sentinel support would be a great improvement, any chance for this feature?

grahamehorner
@Crozin Sentinel support would be cool, and could provide some great additional features under Microsoft.AspNet[Core].HealthChecks😎

NickCraver (Collaborator, Author)

Adding #203 here.

mgravell (Collaborator)

cross-referencing: #871

NickCraver (Collaborator, Author)

@JonCole Azure question: even with pipelines, I think the default sync timeout of 1000 ms may be a little low for most wide-area networks (e.g. any cloud), since time spent in TCP resends and such isn't really a factor of even the fastest library. Would you agree, or do you think it's a non-issue from experience thus far? I ask because we have talked about upping the default in v2 if that makes life better for everyone. I'd say we raise it to 5 seconds if we do. Thoughts?

JonCole (Contributor) commented Jul 2, 2018

@NickCraver Yes, for applications accessing Redis in cross-region scenarios, a larger timeout is probably good. A subset of users will certainly want smaller timeout values, but a larger default probably gives a better experience for customers who aren't as time-sensitive.

JonCole (Contributor) commented Jul 2, 2018

You may want to consider increasing the connectTimeout as well. Even though most connections are established quickly, some users run into trouble under unexpected conditions like high client or server CPU load. Those conditions cause connection attempts to fail or time out, which causes connection retries, which adds to CPU load. Rinse and repeat: you create a self-feeding negative cycle.

I would suggest at least 15 or 20 seconds as the default.
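For illustration, a StackExchange.Redis connection string using the larger values discussed here might look like the following (the host name is a placeholder; syncTimeout and connectTimeout are configuration-string options, with values in milliseconds):

```
myserver:6379,syncTimeout=5000,connectTimeout=15000
```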

dukecheng

@NickCraver do we have a release date for the 2.0 version? The current version has a performance issue, and I am expecting the new version to resolve it.

mgravell (Collaborator)

@dukecheng we've fixed the last remaining blocker and have been dogfooding internally. My plan is to do more dogfooding today, and I'd guess an RTM release either next week or the start of the week after (I'm going to be AFK a lot next week).

NickCraver (Collaborator, Author)

For everyone following along here: 2.0.495 (RTM) is now available on NuGet. We've been running the bits in production at Stack Overflow for about a week now and have seen far fewer stability issues (the main goal of v2).

Release notes are up at: https://stackexchange.github.io/StackExchange.Redis/ReleaseNotes
NuGet up at: https://www.nuget.org/packages/StackExchange.Redis

Closing this out to clean up, but please file an issue if you run into one!

8 participants