This repository has been archived by the owner on Jul 27, 2023. It is now read-only.

Split ELK role into standalone Elasticsearch and Kibana roles #1481

Merged
merged 5 commits into master from feature/split-elk-role
Jun 7, 2016

Conversation

ryane
Contributor

@ryane ryane commented May 24, 2016

  • Installs cleanly on a fresh build of the most recent master branch
  • Upgrades cleanly from the most recent release
  • Updates documentation relevant to the changes

This separates the ELK role into standalone Elasticsearch and Kibana roles. The ELK role is now just a meta role that includes the Elasticsearch, Kibana, and Logstash roles. This allows users to more easily deploy a standalone Elasticsearch cluster for purposes other than the standard Mantl log collection with the ELK stack. This can be used for the System Assurance Elasticsearch cluster.


Testing

Testing with the default configuration will require at least 4 worker nodes, each with at least 1 full CPU and 1 GB of memory available to Mesos. In addition, each worker node will need at least 5 GB of free disk space.

Install an Elasticsearch cluster

ansible-playbook -e @security.yml addons/elasticsearch.yml

After several minutes, you should see:

  1. A healthy mantl/elasticsearch app in marathon

  2. A healthy mantl/elasticsearch-client app in marathon

  3. An elasticsearch.mantl task running in Mesos. This is the Elasticsearch Mesos framework.

  4. 3 elasticsearch-executor-mantl tasks running in Mesos. These are the 3 Elasticsearch nodes in your cluster.

  5. An elasticsearch-client.mantl task running in Mesos. This is an Elasticsearch client node that acts as a smart load balancer for the Elasticsearch cluster. It will listen on the well-known Elasticsearch ports 9200 (http) and 9300 (transport). You can verify the health of the Elasticsearch cluster by running a command like:

    $ curl -s elasticsearch-client-mantl.service.consul:9200/_cluster/health | jq .
    {
      "cluster_name": "mantl",
      "status": "green",
      "timed_out": false,
      "number_of_nodes": 4,
      "number_of_data_nodes": 3,
      "active_primary_shards": 5,
      "active_shards": 15,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 0,
      "delayed_unassigned_shards": 0,
      "number_of_pending_tasks": 0,
      "number_of_in_flight_fetch": 0,
      "task_max_waiting_in_queue_millis": 0,
      "active_shards_percent_as_number": 100
    }
  6. The following healthy services are registered in Consul:

    • elasticsearch-mantl (the Elasticsearch Mesos framework)
    • elasticsearch-executor-mantl (the Elasticsearch nodes launched by the Mesos framework)
      • each service will also have a client_port and a transport_port tag that can be used to discover the corresponding ports
    • elasticsearch-client-mantl (the Elasticsearch client node)

    Consul can be used to discover the IPs and ports of the different services if needed (see the tag lookup example after this list). Otherwise, elasticsearch-client-mantl.service.consul:9200 is available as a convenient entry point into the cluster.

  7. The Elasticsearch Mesos framework UI is available via Mantl UI (requires browser refresh).
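
If you need the individual Elasticsearch executor ports, you can also query Consul by tag. This is a hedged sketch: it assumes the Consul HTTP API is reachable over plain HTTP on port 8500 (use https and your certificates if Consul TLS is enabled), and that the tag names match the forms shown above (they may be registered uppercased by mesos-consul):

    curl -s 'http://consul.service.consul:8500/v1/health/service/elasticsearch-executor-mantl?tag=transport_port&passing' \
      | jq '.[].Service | {Address, Port}'

Each entry maps an executor's IP to its transport port; swap transport_port for client_port to get the HTTP ports.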

Install Kibana

ansible-playbook -e @security.yml addons/kibana.yml

After several minutes, you should see:

  1. A healthy mantl/kibana app in marathon
  2. A kibana.mantl task running in Mesos. This is the Kibana Mesos framework.
  3. A kibana-mantl.task running in Mesos. This is the actual Kibana application running in Mesos.
  4. The following healthy services are registered in Consul:
    • kibana-mantl (the Kibana Mesos framework)
    • kibana-mantl-task (the Kibana application)
  5. The Kibana UI is available via Mantl UI (requires browser refresh). By default, Kibana connects to an Elasticsearch client node identified by the Consul service named elasticsearch-client-mantl. You may see an error in the Kibana UI since the Elasticsearch cluster does not contain any indexes yet (see the example after this list).
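
The "no indexes" message goes away once Elasticsearch has at least one index for Kibana to work with. A hedged sketch of creating a throwaway index through the client node (the index and type names here are arbitrary and not part of the role):

    curl -XPOST elasticsearch-client-mantl.service.consul:9200/kibana-smoke-test/doc \
      -d '{"message": "hello from mantl", "@timestamp": "2016-06-01T00:00:00Z"}'

After refreshing Kibana and pointing it at a kibana-smoke-test index pattern, the error should clear. The index can be removed afterwards with curl -XDELETE against elasticsearch-client-mantl.service.consul:9200/kibana-smoke-test.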

Uninstall Kibana

ansible-playbook -e @security.yml -e 'kibana_uninstall=true' addons/kibana.yml

After a few minutes, you should see that:

  1. The mantl/kibana app is no longer running in marathon.
  2. The kibana.mantl and kibana.mantl.task tasks should no longer be running in Mesos.
  3. The kibana-mantl and kibana-mantl-task services should no longer be registered in Consul.
  4. The Kibana UI should no longer be visible in Mantl UI (requires browser refresh).

Uninstall Elasticsearch

ansible-playbook -e @security.yml -e 'elasticsearch_uninstall=true elasticsearch_remove_data=true' addons/elasticsearch.yml

After a few minutes, you should see that:

  1. The mantl/elasticsearch app is no longer running in marathon.

  2. The mantl/elasticsearch-client app is no longer running in marathon.

  3. The elasticsearch.mantl, elasticsearch-executor-mantl, and elasticsearch-client.mantl tasks should no longer be running in Mesos.

  4. The elasticsearch-mantl, elasticsearch-executor-mantl, and elasticsearch-client-mantl services should no longer be registered in Consul (see the check after this list).

  5. The Elasticsearch Mesos framework UI should no longer be visible in Mantl UI (requires browser refresh).

  6. This example includes elasticsearch_remove_data=true which will also remove the Elasticsearch data from every node. You can verify that the directory is removed with the following command:

    ansible all -s -m shell -a 'ls -al /var/lib/mesos/slave/elasticsearch/mantl'

    You should get No such file or directory for every node. You can also test without elasticsearch_remove_data set (or set to false) and those directories should still exist on a few of your worker nodes after the uninstall is complete.
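
In addition to the data directory check, you can confirm that the Consul services from step 4 are gone. A hedged sketch, assuming the Consul HTTP API is reachable over plain HTTP on port 8500 (adjust for TLS if security is enabled):

    curl -s http://consul.service.consul:8500/v1/catalog/service/elasticsearch-executor-mantl

An empty JSON array ([]) means the executor services have been deregistered; the same check works for elasticsearch-mantl and elasticsearch-client-mantl.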

Install the full ELK stack

ansible-playbook -e @security.yml addons/elk.yml

This is a meta role that installs the Elasticsearch, Kibana, and Logstash roles at one time. After several minutes, you should see that:

  1. An Elasticsearch cluster is installed. See "Install an Elasticsearch cluster" for the Elasticsearch verification steps.
  2. Kibana is installed. See "Install Kibana" for the Kibana verification steps.
  3. Logstash should be running on every node (verify with systemctl status logstash locally on each node, or with Ansible; see the example after this list)
  4. When you visit the Kibana UI, you should see that it is receiving logs from each node.
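
One way to check Logstash on every node at once is an Ansible ad-hoc command in the same style as the data directory check above. A sketch, assuming the same inventory and sudo setup as the other ansible commands in this PR:

    ansible all -s -m shell -a 'systemctl is-active logstash'

Every node should report active; anything else points at a node where the Logstash role did not converge.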

Uninstall the full ELK stack

ansible-playbook -e @security.yml -e 'elk_uninstall=true elasticsearch_remove_data=true' addons/elk.yml

After a few minutes, you should see:

  1. That everything included in the "Uninstall Kibana" and "Uninstall Elasticsearch" sections was completed.

Install a custom Elasticsearch cluster

ansible-playbook -e @security.yml -e 'elasticsearch_nodes=4' addons/elasticsearch.yml

In this example, we are launching 4 Elasticsearch data nodes via the Mesos framework. You can verify everything in the "Install an Elasticsearch cluster" section. The only difference is that there should be 4 elasticsearch-executor-mantl tasks running in Mesos and visible in the Elasticsearch Mesos framework UI. View the Elasticsearch role documentation for all of the configuration variables. You can uninstall this cluster by running:

ansible-playbook -e @security.yml -e 'elasticsearch_uninstall=true elasticsearch_remove_data=true' addons/elasticsearch.yml

@tpolekhin
Contributor

@ryane hello. Tried to install elasticsearch on a fresh mantl cluster. The scheduler came up and the executors spawned fine, but elasticsearch-client-mantl fails every 3 minutes and the proxy doesn't work.
stderr:

+ echo 'attempt: 1'
+ sleep 30
+ wait_for_service
+ /usr/local/bin/consul-template -config /consul-template/config.d -log-level warn -wait 30s:60s -once -consul consul.service.consul:8500 -ssl -ssl-verify=false
2016/05/25 10:47:06 [WARN] (runner) disabling consul SSL verification
2016/05/25 10:47:06 [ERR] (view) "service(transport_port.elasticsearch-executor-mantl [any])" health services: error fetching: Get https://consul.service.consul:8500/v1/health/service/elasticsearch-executor-mantl?stale=&tag=transport_port&wait=60000ms: http: server gave HTTP response to HTTPS client
2016/05/25 10:47:06 [ERR] (runner) watcher reported error: health services: error fetching: Get https://consul.service.consul:8500/v1/health/service/elasticsearch-executor-mantl?stale=&tag=transport_port&wait=60000ms: http: server gave HTTP response to HTTPS client
Consul Template returned errors:
health services: error fetching: Get https://consul.service.consul:8500/v1/health/service/elasticsearch-executor-mantl?stale=&tag=transport_port&wait=60000ms: http: server gave HTTP response to HTTPS client+ grep discovery.zen.ping.unicast.hosts /usr/share/elasticsearch/config/elasticsearch.yml
+ '[' 2 -eq 5 ']'
+ echo 'waiting for transport_port.elasticsearch-executor-mantl service...'
+ echo 'attempt: 2'

stdout:

--container="mesos-25aa81dc-6108-4478-8421-4ef831b7e24d-S2.3d2dde0b-6fbb-4d22-a7ad-97681ecdc447" --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" --quiet="false" --sandbox_directory="/var/lib/mesos/slaves/25aa81dc-6108-4478-8421-4ef831b7e24d-S2/frameworks/25aa81dc-6108-4478-8421-4ef831b7e24d-0000/executors/mantl_elasticsearch-client.e1de1ca5-2265-11e6-b612-0242ef758ce7/runs/3d2dde0b-6fbb-4d22-a7ad-97681ecdc447" --stop_timeout="0ns"
--container="mesos-25aa81dc-6108-4478-8421-4ef831b7e24d-S2.3d2dde0b-6fbb-4d22-a7ad-97681ecdc447" --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" --quiet="false" --sandbox_directory="/var/lib/mesos/slaves/25aa81dc-6108-4478-8421-4ef831b7e24d-S2/frameworks/25aa81dc-6108-4478-8421-4ef831b7e24d-0000/executors/mantl_elasticsearch-client.e1de1ca5-2265-11e6-b612-0242ef758ce7/runs/3d2dde0b-6fbb-4d22-a7ad-97681ecdc447" --stop_timeout="0ns"
Registered docker executor on mantl-worker-002
Starting task mantl_elasticsearch-client.e1de1ca5-2265-11e6-b612-0242ef758ce7
waiting for transport_port.elasticsearch-executor-mantl service...
attempt: 0
waiting for transport_port.elasticsearch-executor-mantl service...
attempt: 1
waiting for transport_port.elasticsearch-executor-mantl service...
attempt: 2
waiting for transport_port.elasticsearch-executor-mantl service...
attempt: 3
waiting for transport_port.elasticsearch-executor-mantl service...
attempt: 4
transport_port.elasticsearch-executor-mantl not found.

mantl-control-01:

[cloud-user@mantl-control-01 ~]$ curl 'http://consul.service.consul:8500/v1/catalog/service/elasticsearch-executor-mantl'
[{"Node":"mantl-worker-001","Address":"10.10.10.68","ServiceID":"mesos-consul:10.10.10.68:elasticsearch-executor-mantl:4000","ServiceName":"elasticsearch-executor-mantl","ServiceTags":["CLIENT_PORT"],"ServiceAddress":"10.10.10.68","ServicePort":4000,"ServiceEnableTagOverride":false,"CreateIndex":1972,"ModifyIndex":1972},{"Node":"mantl-worker-001","Address":"10.10.10.68","ServiceID":"mesos-consul:10.10.10.68:elasticsearch-executor-mantl:4001","ServiceName":"elasticsearch-executor-mantl","ServiceTags":["TRANSPORT_PORT"],"ServiceAddress":"10.10.10.68","ServicePort":4001,"ServiceEnableTagOverride":false,"CreateIndex":1973,"ModifyIndex":1973},{"Node":"mantl-worker-004","Address":"10.10.10.67","ServiceID":"mesos-consul:10.10.10.67:elasticsearch-executor-mantl:4000","ServiceName":"elasticsearch-executor-mantl","ServiceTags":["CLIENT_PORT"],"ServiceAddress":"10.10.10.67","ServicePort":4000,"ServiceEnableTagOverride":false,"CreateIndex":1969,"ModifyIndex":1969},{"Node":"mantl-worker-004","Address":"10.10.10.67","ServiceID":"mesos-consul:10.10.10.67:elasticsearch-executor-mantl:4001","ServiceName":"elasticsearch-executor-mantl","ServiceTags":["TRANSPORT_PORT"],"ServiceAddress":"10.10.10.67","ServicePort":4001,"ServiceEnableTagOverride":false,"CreateIndex":1970,"ModifyIndex":1970},{"Node":"mantl-worker-005","Address":"10.10.10.66","ServiceID":"mesos-consul:10.10.10.66:elasticsearch-executor-mantl:4000","ServiceName":"elasticsearch-executor-mantl","ServiceTags":["CLIENT_PORT"],"ServiceAddress":"10.10.10.66","ServicePort":4000,"ServiceEnableTagOverride":false,"CreateIndex":1978,"ModifyIndex":1978},{"Node":"mantl-worker-005","Address":"10.10.10.66","ServiceID":"mesos-consul:10.10.10.66:elasticsearch-executor-mantl:4001","ServiceName":"elasticsearch-executor-mantl","ServiceTags":["TRANSPORT_PORT"],"ServiceAddress":"10.10.10.66","ServicePort":4001,"ServiceEnableTagOverride":false,"CreateIndex":1979,"ModifyIndex":1979}]

@ryane
Contributor Author

ryane commented May 25, 2016

2016/05/25 10:47:06 [ERR] (view) "service(transport_port.elasticsearch-executor-mantl [any])" health services: error fetching: Get https://consul.service.consul:8500/v1/health/service/elasticsearch-executor-mantl?stale=&tag=transport_port&wait=60000ms: http: server gave HTTP response to HTTPS client
2016/05/25 10:47:06 [ERR] (runner) watcher reported error: health services: error fetching: Get https://consul.service.consul:8500/v1/health/service/elasticsearch-executor-mantl?stale=&tag=transport_port&wait=60000ms: http: server gave HTTP response to HTTPS client

It is failing to talk to Consul to discover the Elasticsearch cluster due to SSL issues. Are you using the default Mantl security settings?
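
A quick way to confirm which protocol the Consul HTTP API is actually serving (a hedged sketch; the port and the consul.service.consul name are taken from the logs above, and -k only skips certificate verification for this check):

    curl -sk https://consul.service.consul:8500/v1/status/leader   # should answer only if Consul TLS is enabled
    curl -s  http://consul.service.consul:8500/v1/status/leader    # should answer only if it is disabled

Whichever of the two returns the leader address tells you how the consul-template flags inside the container need to be set.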

@tpolekhin
Contributor

@ryane no, I set up Mantl with ./security-setup --enable=false

@ryane
Contributor Author

ryane commented May 25, 2016

OK, looks like you found an issue with the elasticsearch-client app when Consul SSL is turned off. I created CiscoCloud/mantl-universe#32 with a fix. If you want to test it, you can update the Consul KV entry at mantl-install/repository/0/repo/packages/E/elasticsearch-client/0/marathon.json with the contents of https://raw.githubusercontent.com/CiscoCloud/mantl-universe/c733e688509f2f0f7b7847c2dd40d90c9e3b09d1/repo/packages/E/elasticsearch-client/1/marathon.json, delete the mantl/elasticsearch-client app from Marathon, and then re-run the addons/elasticsearch.yml playbook.
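
For reference, a minimal sketch of that KV update using Consul's HTTP API, run from a control node. It assumes the API is reachable on localhost:8500 without TLS; with security enabled you would need the https endpoint and your certificates instead:

    curl -s https://raw.githubusercontent.com/CiscoCloud/mantl-universe/c733e688509f2f0f7b7847c2dd40d90c9e3b09d1/repo/packages/E/elasticsearch-client/1/marathon.json \
      | curl -s -X PUT --data-binary @- \
          http://localhost:8500/v1/kv/mantl-install/repository/0/repo/packages/E/elasticsearch-client/0/marathon.json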

Meanwhile, I'll set up a clean test on my end.

@tpolekhin
Contributor

@ryane Is that by design: -Xms256m -Xmx1g for a 512 MB container? :)

@ryane
Contributor Author

ryane commented May 25, 2016

No, just the default. We should expose it in the configuration. What would you suggest as a better default?

@tpolekhin
Contributor

@ryane as long as this node will be an actual search node for the ES cluster, I suggest it have at least 1 CPU and 1-2 GB RAM

@tpolekhin
Contributor

@ryane seems like the Kibana addon needs this SSL fix too

+ echo 'attempt: 0'
+ sleep 10
Unable to launch health process: Only command health check is supported now.
+ wait_for_config
+ /usr/local/bin/consul-template -config /consul-template/config.d/kibana.cfg -log-level warn -wait 2s:10s -once -consul consul.service.consul:8500 -ssl -ssl-verify=false
2016/05/25 14:45:07 [WARN] (runner) disabling consul SSL verification
2016/05/25 14:45:07 [ERR] (view) "service(elasticsearch-client-mantl [any])" health services: error fetching: Get https://consul.service.consul:8500/v1/health/service/elasticsearch-client-mantl?stale=&wait=60000ms: http: server gave HTTP response to HTTPS client
2016/05/25 14:45:07 [ERR] (runner) watcher reported error: health services: error fetching: Get https://consul.service.consul:8500/v1/health/service/elasticsearch-client-mantl?stale=&wait=60000ms: http: server gave HTTP response to HTTPS client
Consul Template returned errors:
health services: error fetching: Get https://consul.service.consul:8500/v1/health/service/elasticsearch-client-mantl?stale=&wait=60000ms: http: server gave HTTP response to HTTPS client+ grep elasticsearch.url /opt/kibana/config/kibana.yml
+ '[' 1 -eq 6 ']'
+ echo 'waiting for Kibana configuration...'
+ cat /opt/kibana/config/kibana.yml
+ echo 'attempt: 1'

@ryane
Contributor Author

ryane commented May 25, 2016

ok, working on these changes...

- can set JAVA_OPTS for elasticsearch and elasticsearch-client
- elasticsearch-client defaults to 1 cpu + 1 gb mem
- fixes issue where install tasks might not run on correct node
@ryane
Contributor Author

ryane commented May 26, 2016

Kibana should work when SSL is disabled now, and java_opts are configurable. If you want to test on an existing cluster, you should be able to resync the repository (run from a control node):

consul-cli kv-delete --recurse mantl-install
curl -XPOST http://localhost:18080/v2/apps/mantl-api/restart

let me know if you see anything else that needs tweaking.
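
For anyone testing the new java_opts settings, a sketch of what the override might look like; the variable name elasticsearch_client_java_opts is a guess here, so check the role defaults for the actual names:

    ansible-playbook -e @security.yml \
      -e 'elasticsearch_client_java_opts="-Xms1g -Xmx1g"' \
      addons/elasticsearch.yml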

@tpolekhin
Contributor

@ryane looks good. The only "issue" I see is that Kibana tries to display the logstash-* index from Elasticsearch on load by default, but there isn't one because I didn't install the logstash role. I think we should update the Kibana role so that it doesn't load this by default, and move this pre-defined view into the logstash role. Is this possible? Thanks

@ryane
Contributor Author

ryane commented May 26, 2016

Yep, good call, it should be possible. will post a new commit when ready

it is turned on when you install the elk addon but it is off if you install
the standalone kibana addon
@ryane
Contributor Author

ryane commented May 27, 2016

This is now controlled by the kibana_logstash_config variable. By default, the logstash config is not applied if you install the standalone kibana role, but it is applied if you install the full elk addon.
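
To opt in to the logstash config with a standalone Kibana install, pass the variable explicitly (a sketch, assuming it is a simple boolean that can be set on the command line like the other flags in this PR):

    ansible-playbook -e @security.yml -e 'kibana_logstash_config=true' addons/kibana.yml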

@SergeyNosko

@ryane, it seems there is a feature/bug in the kibana role. If you apply the role and then uninstall Kibana, the indexes added by Kibana to Elasticsearch are not removed. Stumbled upon this during the upgrade to the last commit ("kibana logstash config is optional"). Might be a good idea to add an option to the uninstall that deletes the Kibana indexes from Elasticsearch.

@kbroughton
Contributor

I've been playing with logstash.conf to support logging to S3 and more inputs. I noticed we are running logstash 1.5.3, which is pretty old.

It would also be nice to get x-pack as an optional install. It is still alpha, but it has most of the features required for production. https://www.elastic.co/v5

Could we make this update forward looking to include easy version bumping and x-pack support?

Also, I'm still getting kibana.mantl failures every hour or so on a 3-week-old mantl deployment. I only have 3 workers, so that might be a problem (4 are recommended?). And I believe destroying kibana in marathon may trigger the docker hangs I've experienced. Perhaps we could test that as well with the new refactor.

@ryane
Contributor Author

ryane commented Jun 2, 2016

@SergeyNosko I'd want to be careful about deleting indexes or any other data on uninstall. Perhaps we could make it optional and not the default. Or, at the very least, document the process. Do you mind opening a separate issue for it?

@kbroughton we are going to be replacing the logstash agent on all nodes with filebeat. Another team is working on creating a logstash role (possibly a mesos framework) that can be used as a central place for processing before sending to Elasticsearch (see #1203). So, it's likely that you'll be able to do something like filebeat -> elasticsearch or filebeat -> logstash -> elasticsearch depending on what you want, your scale, etc.

x-pack looks nice but it also looks like it might not be compatible with the version of elasticsearch and kibana we are currently deploying. can you create another issue for that?

Finally, I wonder if your kibana instance is running out of memory. Are you using the default settings? Can you try increasing the amount of memory assigned to the application in marathon and see if you have better results?
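
If you want to try that without reinstalling, Marathon's REST API accepts partial updates, so the memory can be bumped in place. A hedged sketch: the mantl/kibana app id comes from the verification steps above, but the plain-HTTP endpoint is an assumption; on a security-enabled cluster you would use the https endpoint with credentials:

    curl -X PUT -H 'Content-Type: application/json' \
      -d '{"mem": 1024}' \
      http://marathon.service.consul:8080/v2/apps/mantl/kibana

Only the mem field changes and Marathon redeploys the app with the new allocation.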

@langston-barrett
Contributor

langston-barrett commented Jun 7, 2016

Test results

Tested on AWS with security enabled, four m3.xlarge worker nodes, three m3.large control nodes, one m3.medium edge node, and one m3.large kubeworker.

Install an Elasticsearch cluster

  • A healthy mantl/elasticsearch app in marathon
  • A healthy mantl/elasticsearch-client app in marathon
  • An elasticsearch.mantl task running in Mesos. This is the Elasticsearch Mesos framework.
  • 3 elasticsearch-executor-mantl tasks running in Mesos. These are the 3 Elasticsearch nodes in your cluster.
  • An elasticsearch-client.mantl task running in Mesos. This is an Elasticsearch client node that acts as a smart load balancer for the Elasticsearch cluster. It listens on the well-known Elasticsearch ports 9200 (http) and 9300 (transport).
  • The following healthy services are registered in Consul:
    • elasticsearch-mantl (the Elasticsearch Mesos framework)
    • elasticsearch-executor-mantl (the Elasticsearch nodes launched by the Mesos framework)
      • each service will also have a client_port and a transport_port tag that can be used to discover the corresponding ports
    • elasticsearch-client-mantl (the Elasticsearch client node)
  • The Elasticsearch Mesos framework UI is available via Mantl UI (requires browser refresh).

Kibana

  • A healthy mantl/kibana app in marathon
  • A kibana.mantl task running in Mesos. This is the Kibana Mesos framework.
  • A kibana-mantl.task running in Mesos. This is the actual Kibana application running in Mesos.
  • The following healthy services are registered in Consul:
    • kibana-mantl (the Kibana Mesos framework)
    • kibana-mantl-task (the Kibana application)
  • The Kibana UI is available via Mantl UI (requires browser refresh). By default, Kibana connects to an Elasticsearch client node identified by the Consul service named elasticsearch-client-mantl. You may see an error in the Kibana UI since the Elasticsearch cluster does not contain any indexes.

Uninstall Kibana

  • The mantl/kibana app is no longer running in marathon.
  • The kibana.mantl and kibana.mantl.task tasks should no longer be running in Mesos.
  • The kibana-mantl and kibana-mantl-task services should no longer be registered in Consul.
  • The Kibana UI should no longer be visible in Mantl UI (requires browser refresh).

Uninstall Elasticsearch

  • The mantl/elasticsearch app is no longer running in marathon.
  • The mantl/elasticsearch-client app is no longer running in marathon.
  • The elasticsearch.mantl, elasticsearch-executor-mantl, and elasticsearch-client.mantl tasks should no longer be running in Mesos.
  • The elasticsearch-mantl, elasticsearch-executor-mantl, and elasticsearch-client-mantl services should no longer be registered in Consul.
  • The Elasticsearch Mesos framework UI should no longer be visible in Mantl UI (requires browser refresh).
  • This example includes elasticsearch_remove_data=true which will also remove the Elasticsearch data from every node.

ELK stack

  • An Elasticsearch cluster is installed. See "Install an Elasticsearch cluster" for the Elasticsearch verification steps.
  • Kibana is installed. See "Install Kibana" for the Kibana verification steps.
  • Logstash should be running on every node (verify with systemctl status logstash locally on each node or with ansible)
  • When you visit the Kibana UI, you should see that it is receiving logs from each node.

Uninstall the full ELK stack

  • That everything included in the "Uninstall Kibana" and "Uninstall Elasticsearch" sections was completed.

Install a custom Elasticsearch cluster

ansible-playbook -e @security.yml -e 'elasticsearch_nodes=4' addons/elasticsearch.yml

In this example, we are launching 4 Elasticsearch data nodes via the Mesos framework. You can verify everything in the "Install an Elasticsearch cluster" section. The only difference is that there should be 4 elasticsearch-executor-mantl tasks running in Mesos and visible in the Elasticsearch Mesos framework UI. View the Elasticsearch role documentation for all of the configuration variables. You can uninstall this cluster by running:

ansible-playbook -e @security.yml -e 'elasticsearch_uninstall=true elasticsearch_remove_data=true' addons/elasticsearch.yml

@langston-barrett
Contributor

langston-barrett commented Jun 7, 2016

While almost everything worked, there is an intermittent problem with removing frameworks. The mantl-api logs look like this:

time="2016-06-07T14:09:40Z" level=debug msg="DELETE /1/install" 
time="2016-06-07T14:09:40Z" level=debug msg="GET https://marathon.service.consul:8080/v2/apps/" 
time="2016-06-07T14:09:40Z" level=debug msg="DELETE https://marathon.service.consul:8080/v2/apps/mantl/elasticsearch-client" 
time="2016-06-07T14:09:40Z" level=debug msg="DELETE /1/install" 
time="2016-06-07T14:09:40Z" level=debug msg="GET https://marathon.service.consul:8080/v2/apps/" 
time="2016-06-07T14:09:40Z" level=debug msg="DELETE https://marathon.service.consul:8080/v2/apps/mantl/elasticsearch" 
time="2016-06-07T14:09:40Z" level=debug msg="Looking for mantl/elasticsearch framework" 
time="2016-06-07T14:09:40Z" level=debug msg="GET http://lb0-control-02:15050/master/state.json" 
time="2016-06-07T14:09:40Z" level=debug msg="Framework mantl/elasticsearch not active" 

See CiscoCloud/mantl-api#46.

I'm not sure if we want to consider that a blocker for this PR; it seems like an issue in mantl-api that would affect current deployments just as often.
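
As a stopgap when mantl-api leaves a framework registered, the Mesos master teardown endpoint can be hit directly. A hedged sketch: the master address and port are taken from the log above, <framework-id> is whatever id the Mesos UI or state.json reports, and a secured cluster may require credentials:

    curl -X POST http://lb0-control-02:15050/master/teardown \
      -d 'frameworkId=<framework-id>'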

@langston-barrett langston-barrett merged commit dd6267d into master Jun 7, 2016
@langston-barrett langston-barrett deleted the feature/split-elk-role branch June 7, 2016 15:29
ryane added a commit that referenced this pull request Jun 9, 2016
@crumley
Contributor

crumley commented Jun 14, 2016

fyi #1539

@tpolekhin
Contributor

tpolekhin commented Jun 15, 2016

Kibana fails to launch on mantl master

+ exec java -Xms32m -Xmx128m -jar /tmp/mesosframework.jar --spring.application.name=kibana-mantl --mesos.framework.name=kibana-mantl --mesos.master=zk://sa20-control-01:2181,sa20-control-02:2181,sa20-control-03:2181,sa20-control-04:2181,sa20-control-05:2181/mesos --mesos.zookeeper.server=sa20-control-01:2181,sa20-control-02:2181,sa20-control-03:2181,sa20-control-04:2181,sa20-control-05:2181 --mesos.resources.cpus=0.50 --mesos.resources.mem=512 --mesos.resources.count=1 --mesos.resources.ports.UI_5601.host=ANY --mesos.resources.ports.UI_5601.container=5601 --mesos.docker.image=ciscocloud/mantl-kibana:4.3.2.1 --mesos.docker.network=BRIDGE '--mesos.command=export ELASTICSEARCH_SERVICE=elasticsearch-client-mantl; export KIBANA_SERVICE=kibana-mantl-task; export KIBANA_LOGSTASH_CONFIG=false; tini -s -- /launch.sh' --logging.level.com.containersolutions.mesos=WARN --elasticsearch.http=http://elasticsearch-executor.service.consul:4000 --server.port=31100

Not sure why it is --elasticsearch.http=http://elasticsearch-executor.service.consul:4000
Shouldn't it be something like --elasticsearch.http=http://elasticsearch-client-mantl.service.consul:9200 ?

@ryane
Contributor Author

ryane commented Jun 15, 2016

See #1550. What error are you getting?

@tpolekhin
Contributor

@ryane no, I have a different one. It looked like the framework was dying due to a timeout.
I tried increasing resources on both the scheduler and the executor and it helped!
So I suggest reviewing the default resources for the Kibana framework.

Right now it works for me with 1 CPU and 1024 MB RAM, but I'm sure we can crank it down a bit.

@tpolekhin
Contributor

Also, @ryane, I just saw that you're using docker container mode for the elasticsearch framework.
I'm strongly suggesting switching to non-docker mode for the elasticsearch framework.
I've been using the ES framework for over half a year now, and it's not stable at all in docker mode.
BTW, I noticed this just because 2 of my nodes died after only 10 hours of uptime, with a 72 msg/sec load.

Again, please consider switching to non-docker framework mode.
2 of my SA clusters run the non-docker ES framework with more than 2 months of uptime.

@tpolekhin
Contributor

FYI, another 2 instances of ES just failed.
So we have a total of 4 instances out of 5 failed in 13 hours :)
[screenshot: 2016-06-16 3:19 PM]

@ryane
Contributor Author

ryane commented Jun 17, 2016

What do you see in the logs when the nodes fail? I have seen the same thing, but only due to resource issues: a node tries to use more memory than we allotted for it in the framework and Mesos stops it. Also, can you open a new issue for this so that we can track it better?

@ryane ryane mentioned this pull request Jun 17, 2016
@tpolekhin
Contributor

@ryane I've seen this so many times on the SA cluster: no matter how many resources you specify for the node, it will eventually die to the OOM killer. It's just a matter of time. I had a 9-node ES cluster running with 4 CPUs and 16 GB RAM each, and I got a failure every day or two on average.

Non-docker mode, on the other hand, has been stable for several months now.

@SergeyNosko

@ryane What is the endpoint on the Kibana side? The one registered in Marathon looks like https://mantl-worker-003:31100
1st - it seems it's not httpS, it's http
2nd - it gives me a 404

$ curl http://mantl-worker-003:31100
{"timestamp":1466412718912,"status":404,"error":"Not Found","message":"No message available","path":"/"}

If I hit Kibana from the Mantl GUI (https://mantl-control-01/kibana) it works like a charm and proxies to Kibana. I'm working on adding quicklinks to pangea and would like to understand where I can find the endpoint for the Kibana GUI.
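
For reference, the registered endpoint can also be pulled out of Consul rather than Marathon. A hedged sketch: kibana-mantl-task is the Consul service name from the verification steps earlier in this PR, and the DNS server and port assume the default Consul DNS setup:

    dig +short @consul.service.consul -p 8600 kibana-mantl-task.service.consul SRV

The SRV answer gives the worker host and the dynamically assigned port (31100 in the example above); note that it serves plain HTTP, so the https scheme Marathon shows can be misleading.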

@langston-barrett
Contributor

@tymofii-polekhin @SergeyNosko Please make separate issues, rather than posting on this (closed) PR.

ryane added a commit that referenced this pull request Jun 20, 2016
langston-barrett pushed a commit that referenced this pull request Jun 22, 2016