This repository has been archived by the owner on Mar 17, 2023. It is now read-only.

High Availability deployment #2037

Closed
engelgabriel opened this issue Aug 9, 2016 · 28 comments
Comments

@engelgabriel
Member

See

RocketChat/Rocket.Chat#520
RocketChat/Rocket.Chat#847
RocketChat/Rocket.Chat#1867
RocketChat/Rocket.Chat#2964

@engelgabriel
Member Author

@engelgabriel changed the title from "Create documentation about High Availability deployment" to "High Availability deployment" on Aug 18, 2016
@geekgonecrazy
Contributor

RocketChat/Rocket.Chat#3540 (comment). We also need to make sure we mention keeping time in sync between instances.

@srihas619

In order to run two instances of Rocket.Chat on two VMs (web1 and web2) with a load balancer in front of them (haproxy1), do I need to set up shared storage for the session info? I have a separate MongoDB replica set on another three VMs, so if the session info is stored in Mongo, I suppose I don't need any shared storage. Please suggest how I should proceed from this point.

@geekgonecrazy
Contributor

If you have Mongo set up in replica set mode, you do not need anything else for shared session storage.
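
For example, each Rocket.Chat instance just gets the same connection strings pointing at the replica set, and session/state coordination happens through Mongo. The hostnames and replica set name below are placeholders; substitute your own:

MONGO_URL=mongodb://mongo1:27017,mongo2:27017,mongo3:27017/rocketchat?replicaSet=rs0
MONGO_OPLOG_URL=mongodb://mongo1:27017,mongo2:27017,mongo3:27017/local?replicaSet=rs0
ROOT_URL=https://chat.example.com
PORT=3000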

@srihas619

@geekgonecrazy thanks for your response.
So, do I need to enable the oplog URL? What exactly does the oplog do? Sorry, I am quite new to MongoDB; I just want to set up a clustered Rocket.Chat.

@richardwlu

richardwlu commented Aug 10, 2017

@geekgonecrazy And to add on to @srihas619's question, does the MONGO_OPLOG_URL need to point to the location of the Primary?

I currently have 4 instances of RC running on one server and 3 Mongo replicas and my
MONGO_OPLOG_URL is mongodb://localhost:27017/local
and
MONGO_URL is mongodb://localhost:27017,mongochat02:27017,mongochat03:27017/rocketchat?replicaSet=001-rs&readPreference=primaryPreferred&w=majority

However, I will have 8 instances of RC running across 2 servers while adding 2 additional replica set members on the new server(s) (to be mongochat04 and mongochat05), and I am trying to determine if I need to change MONGO_OPLOG_URL and MONGO_URL on either server.

@srihas619

srihas619 commented Aug 10, 2017

@richardwlu I configured MONGO_OPLOG_URL as mongodb://mongo1:27017,mongo2:27017,mongo3:27017/local?rs0 (rs0 is my replica set) and it worked, but I still don't know its exact purpose. When the oplog is enabled, though, the Rocket.Chat web instances recognize on their own that they are clustered (identified by observation).

@richardwlu

@srihas619 Thanks for the tip. What is your value of MONGO_URL?

And the OpLog, according to the docs:

The oplog (operations log) is a special capped collection that keeps a rolling record of all operations that modify the data stored in your databases. MongoDB applies database operations on the primary and then records the operations on the primary’s oplog. The secondary members then copy and apply these operations in an asynchronous process. All replica set members contain a copy of the oplog, in the local.oplog.rs collection, which allows them to maintain the current state of the database.

https://docs.mongodb.com/manual/core/replica-set-oplog/
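
If you want to poke at it yourself, any replica set member will show the oplog from the mongo shell. The host below is just one of the members from our MONGO_URL; use one of yours:

mongo --host mongochat02:27017
> rs.printReplicationInfo()                                      // oplog size and the time window it covers
> use local
> db.oplog.rs.find().sort({ $natural: -1 }).limit(1).pretty()    // most recent recorded operation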

@srihas619

srihas619 commented Aug 10, 2017

@richardwlu thanks for the link to the docs. It explains it clearly. My MONGO_URL is configured as mongodb://mongo1:27017,mongo2:27017,mongo3:27017/rocketchat?rs0. Have I configured things correctly? I would be glad if you could review :)

@richardwlu

@srihas619 I think we will need to bring in @geekgonecrazy to help verify the questions above for us :)

@geekgonecrazy
Contributor

Just like with the MONGO_URL, you will want to add all nodes to the MONGO_OPLOG_URL in case a primary election happens. But it will only actually be tailing the oplog on the primary node.
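
If you ever want to confirm which member is currently primary after an election, you can ask any member from the mongo shell (the hostname here is just an example); the driver works this out on its own from the seed list:

mongo --host mongo1:27017 --eval 'print(rs.isMaster().primary)'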

@geekgonecrazy
Contributor

geekgonecrazy commented Aug 10, 2017

Also, as far as the naming used in the connection string: you will want to make sure to use the same names that the nodes identify themselves by in the replica set.

localhost, for example, is something I would avoid using when you have a multi-node Mongo replica set.

If you attempt to connect and the primary is listed as localhost, the driver will attempt to connect to localhost and address it as such. If internally it's referencing itself as something else, you will have issues.

Also, the other Mongo nodes will try to look up localhost when trying to talk to this peer, which will of course always resolve to themselves, so it will cause all kinds of issues.

tl;dr It's good practice to always use a hostname that's reachable by all other nodes in the replica set.
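
For example, when initiating the set, something along these lines (hostnames are examples; use names that every Mongo node and every Rocket.Chat server can resolve):

mongo --host mongo1:27017
> rs.initiate({
    _id: "rs0",
    members: [
      { _id: 0, host: "mongo1:27017" },
      { _id: 1, host: "mongo2:27017" },
      { _id: 2, host: "mongo3:27017" }
    ]
  })

If a set was already initiated with localhost, rs.reconfig() with corrected member hostnames is the usual way to fix it.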

@richardwlu

@geekgonecrazy Thank you for clearing this up!

@richardwlu

@geekgonecrazy Just to clarify, is it necessary to specify the replica set name by adding replicaSet=001-rs to the end of the MONGO_URL and MONGO_OPLOG_URL?

What I intend to have is:
MONGO_URL = "mongodb://den02vmchat01:27017,mongochat02:27017,mongochat03:27017,ash01vmchat01:27017,ash01vmchat02:27017/rocketchat?replicaSet=001-rs&readPreference=primaryPreferred&w=majority"

MONGO_OPLOG_URL = "mongodb://den02vmchat01:27017,mongochat02:27017,mongochat03:27017,ash01vmchat01:27017,ash01vmchat02:27017/local?replicaSet=001-rs"

@geekgonecrazy
Contributor

geekgonecrazy commented Aug 15, 2017

@richardwlu I don't know that it's absolutely necessary, but I typically always do this.

Looks like what I would use 👍

@richardwlu

@geekgonecrazy After editing the following values for each instance (I have 8 total instances running on 2 servers, 4 on each), we have noticed an issue where users are not receiving desktop notifications and alerts consistently (more failures than successes). Would you happen to know why this would occur and whether it is related to the MongoDB config?

We are on version 0.55.0 (older, I know), but everything was running fine before I added the 4 instances and edited the MongoDB config. The version has stayed the same.

MONGO_URL = "mongodb://den02vmchat01:27017,mongochat02:27017,mongochat03:27017,ash01vmchat01:27017,ash01vmchat02:27017/rocketchat?replicaSet=001-rs&readPreference=primaryPreferred&w=majority"

MONGO_OPLOG_URL = "mongodb://den02vmchat01:27017,mongochat02:27017,mongochat03:27017,ash01vmchat01:27017,ash01vmchat02:27017/local?replicaSet=001-rs"

@geekgonecrazy
Contributor

@richardwlu sorry, got a bit behind on GitHub notifications 😁

If you are seeing issues like that, it's typically because the instances cannot talk to each other. Usually it's something with the instance IP or firewall rules. They need to be able to talk; otherwise, when an event fires, only users on the same instance as the one that fired the event will receive it.
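
For example, two instances on the same box would each be started with their own port and an advertised address that every other instance can reach (the IP, ports, and URL below are only examples, and the ports need to be open between the app servers):

# instance 1
PORT=3000 INSTANCE_IP=10.0.1.21 ROOT_URL=https://chat.example.com node main.js
# instance 2
PORT=3001 INSTANCE_IP=10.0.1.21 ROOT_URL=https://chat.example.com node main.js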

@richardwlu

@geekgonecrazy No worries, found out it was the INSTANCE_IP

@AFrangopoulos

@geekgonecrazy @Sing-Li @georgios Are you aware of any RC use cases with very high concurrency (say, 100k+ concurrent users)? Most things I have read were much smaller in scale than this.
I have read the doc page on scaling by adding RC instances (using a Mongo replica set and a reverse proxy). In your research, have you found scaling like this to be linear, or is there a large drop-off at a certain point? Most of what I have read was formula based, i.e. "after doing X, we were able to handle 4X the number of users". I have not been able to find a ballpark estimate of the number of users per instance given some generic setup. Thank you for your time.

@geekgonecrazy
Contributor

With ulimit increased, the Node heap increased, and good hardware, you can do some amazing stuff. :) At that size of deployment I'd say reach out and talk to us.
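
Roughly what I mean by that, per instance (the numbers are only illustrative and need tuning for your hardware):

ulimit -n 65535                          # raise the open file / socket limit for the Rocket.Chat process
node --max-old-space-size=4096 main.js   # allow the Node heap to grow to ~4 GB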

@AFrangopoulos

AFrangopoulos commented Mar 30, 2018

@geekgonecrazy I assume you mean to contact you via support? I think I shot an email out to sales and support a day or so ago. If there is a more fluid way of contacting RC that would be great -- maybe meet on your chat server for a brief discussion?

Load Testing: We started doing baseline load testing with no subscriptions, just sending messages, and we hit 100% usage at around 60 rocketchat_messages per second. Writes/CPU usage seems to be the largest bottleneck -- the number of users doesn't seem to have much effect. We tested the same throughput with varying numbers of users and it had a negligible effect on the system. Our setup involves oplog tailing, and the hardware as of now is greater than what RC documented as "minimal specs". If we can have a brief discussion as stated above, we can get more specific if you have other questions.

Oplog Tailing vs RedisOplog: From what I have read, oplog tailing is a very expensive task when there are a lot of writes, and the redis-oplog solution should relieve a lot of this CPU stress. Has Rocket.Chat looked into this solution, and if so, do you have any data on the results? I am currently trying to set up redis-oplog with oplog tailing disabled on another machine but running into snags. Hopefully I get that running soon with Theodor's help.
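
For reference, what I'm attempting is roughly the setup from the cultofcoders:redis-oplog README: add the package to the Rocket.Chat build, point it at Redis through METEOR_SETTINGS, and leave MONGO_OPLOG_URL unset so classic oplog tailing stays off. The Redis host and port below are placeholders for our environment:

export METEOR_SETTINGS='{ "redisOplog": { "redis": { "host": "redis1.internal", "port": 6379 } } }'
# no MONGO_OPLOG_URL exported
node main.js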

Lastly, when you mentioned "node heap increased", did you mean the process memory limit, or perhaps giving more RAM to Mongo?

Update: The tests described above were just for a single instance. We are now testing our server (12 cores) with 10 instances. Very early testing showed 150 messages (writes) per second taking 65% CPU. It peaked at over 75% from bursts. Is it normal for the application to make 6-7x more disk reads/writes than Mongo does? Or is this a possible misconfiguration on my end?

Thanks,
Aleko

@sr258

sr258 commented May 2, 2018

Have you had any conclusive results on how RC scales at your level of users @AFrangopoulos ? I'm evaluating RC for a use case with more than 1 million registered users and hundreds of thousands of concurrent users. Is there any documentation or experience on scaling RC to this level?

@AFrangopoulos

Without getting redis-oplog integrated, it can't scale to large numbers of concurrent users (we did not have the time to spend on adding this solution to RC, so we abandoned it). The more you scale horizontally, the more your gains diminish due to oplog tailing (and I imagine some other things). Also, from what I could tell, the sheer number of packages is starving the app of resources.

I suspect that if RC were to integrate the redis-oplog feature successfully, it could scale to a very large number of concurrent users (unless there are other bottlenecks that appear after oplog tailing is removed from the equation). Lastly, if you do choose to use RC and try to scale it, it will not be cost-effective, and eventually you will hit a point where scaling stops and you will likely need to find a different solution.

This is all based on our load/performance testing. It is not fact per se, but I haven't seen anything out there that contradicts our conclusions. Hope this helps! I urge the RC devs to try to integrate the redis-oplog package to see this application's full potential. GL

@geekgonecrazy
Contributor

I know we talked a bit in the support channel, but I'm curious what specifically points to the oplog as the limiting factor here?

We are definitely experimenting with redis-oplog and a few other approaches to try to increase overall performance.

@sr258

sr258 commented May 3, 2018

@AFrangopoulos Thanks for your answer, even if it is not what I'd hoped for. Are you willing / allowed to share the load tests?
@geekgonecrazy Do you think we could talk about your assessment on how RC scales and what would need to be done for RC to work at the number of users we have?

@sr258

sr258 commented Jun 14, 2018

@AFrangopoulos Can you provide some more detail on the load testing you've done (what techniques you've used etc.)? My institution is really interested in setting up their own load testing.

@AFrangopoulos

@sr258 I used both meteor-down and meteor-load-testing by allaning. I found the latter to be more consistent in our tests. Our main concern was writes and having 20k+ concurrent users. Scaling horizontally just wasn't cutting it -- diminishing returns the more you scaled. It was not cost-effective. Also, it seems RC has a LOT of packages and there is a continuous fight for CPU.

Our goal was to reach 1k writes/second through scaling, and it couldn't get there within reason ($ / number of boxes needed). I think somewhere on these forums pertaining to RC/Meteor I have a detailed writeup of the several setups we tried to get the best performance.

In the end, we concluded this product couldn't scale to what we needed. It seems redis-oplog would be a great candidate to fix the scaling issues. GL to you.

@Faria1212 transferred this issue from RocketChat/docs-old on Apr 20, 2021
@Faria1212 transferred this issue from RocketChat/developer-docs on Apr 26, 2022
@Rodriq closed this as completed on Jun 24, 2022