Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Surface ES startup errors to the Rally CLI console #1476

Merged
merged 5 commits into from
Jul 27, 2022

Conversation

DJRickyB
Copy link
Contributor

@DJRickyB DJRickyB commented Apr 13, 2022

It can be frustrating to troubleshoot why a Rally-managed Elasticsearch instance is not starting. Currently, we only log in the console the sub process's return code, with no further information. If you are intimately familiar with where installations go, and how to preserve them, you can troubleshoot this manually, but it would seem to be better to log the issue the first time rather than requiring a round of reproduction and a process of manually trying it yourself.

A straightforward reproduction of this issue is the following invocation, with a too small heap size:

% esrally race --track geonames --distribution-version 8.2.0 --runtime-jdk bundled --kill-running-processes --test-mode --car-params heap_size:1k     

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Race id is [8da62001-efd4-4456-a757-0bf89dffa51b]
[INFO] Preparing for race ...
[ERROR] Daemon startup failed with exit code [1]. STDERR:

	Exception in thread "main" java.lang.RuntimeException: starting java failed with [1]
	output:
	Error occurred during initialization of VM
	Too small maximum heap
	error:

		at org.elasticsearch.tools.launchers.JvmOption.flagsFinal(JvmOption.java:114)
		at org.elasticsearch.tools.launchers.JvmOption.findFinalOptions(JvmOption.java:79)
		at org.elasticsearch.tools.launchers.MachineDependentHeap.determineHeapSettings(MachineDependentHeap.java:61)
		at org.elasticsearch.tools.launchers.JvmOptionsParser.jvmOptions(JvmOptionsParser.java:135)
		at org.elasticsearch.tools.launchers.JvmOptionsParser.main(JvmOptionsParser.java:87)


[ERROR] Cannot race. Command '['./bin/elasticsearch', '-d', '-p', './pid']' returned non-zero exit status 1.
	Traceback (most recent call last):
  File "/Users/rick.boyd/dev/rally/esrally/mechanic/mechanic.py", line 590, in receiveMsg_StartNodes
    self.mechanic.start_engine()
  File "/Users/rick.boyd/dev/rally/esrally/mechanic/mechanic.py", line 707, in start_engine
    self.nodes = self.launcher.start(self.node_configs)
  File "/Users/rick.boyd/dev/rally/esrally/mechanic/launcher.py", line 134, in start
    return [self._start_node(node_configuration, node_count_on_host) for node_configuration in node_configurations]
  File "/Users/rick.boyd/dev/rally/esrally/mechanic/launcher.py", line 134, in <listcomp>
    return [self._start_node(node_configuration, node_count_on_host) for node_configuration in node_configurations]
  File "/Users/rick.boyd/dev/rally/esrally/mechanic/launcher.py", line 164, in _start_node
    node_pid = self._start_process(binary_path, env)
  File "/Users/rick.boyd/dev/rally/esrally/mechanic/launcher.py", line 221, in _start_process
    raise e
  File "/Users/rick.boyd/dev/rally/esrally/mechanic/launcher.py", line 216, in _start_process
    ProcessLauncher._run_subprocess(command_line=" ".join(cmd), env=env)
  File "/Users/rick.boyd/dev/rally/esrally/mechanic/launcher.py", line 203, in _run_subprocess
    subprocess.run(
  File "/Users/rick.boyd/.pyenv/versions/3.8.10/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['./bin/elasticsearch', '-d', '-p', './pid']' returned non-zero exit status 1.


Getting further help:
*********************
* Check the log files in /Users/rick.boyd/.rally/logs for errors.
* Read the documentation at https://esrally.readthedocs.io/en/latest/.
* Ask a question on the forum at https://discuss.elastic.co/tags/c/elastic-stack/elasticsearch/rally.
* Raise an issue at https://github.com/elastic/rally/issues and include the log files in /Users/rick.boyd/.rally/logs.

--------------------------------
[INFO] FAILURE (took 12 seconds)
--------------------------------

This change uses the subprocess.run function introduced in Python 3.5 to check the subprocess and print its output to the console on failure. There is some background to why we do not do this currently in #879, but I believe the output is worthwhile, and at the time I think 3.5 was very recently our minimum Python version so this option may have been overlooked... or there may be a more technical reason I am missing.

@DJRickyB DJRickyB added the enhancement Improves the status quo label Apr 13, 2022
@DJRickyB DJRickyB added this to the 2.5.0 milestone Apr 13, 2022
@DJRickyB DJRickyB self-assigned this Apr 13, 2022
@DJRickyB
Copy link
Contributor Author

@pquentin
Copy link
Member

I believe the build was interrupted after 80 minutes, which explains the failure. @dliappis Do you know how we can bump the timeout?

@DJRickyB
Copy link
Contributor Author

I believe the build was interrupted after 80 minutes, which explains the failure. @dliappis Do you know how we can bump the timeout?

I've been looking at this the last couple hours and very consistently getting the timeout on this branch, I think the regression is legit

@dliappis
Copy link
Contributor

I believe the build was interrupted after 80 minutes, which explains the failure. @dliappis Do you know how we can bump the timeout?

As we don't have .ci definitions stored within this repo (see #1395) unfortunately this is controlled in a different private repo.

@pquentin
Copy link
Member

I've been looking at this the last couple hours and very consistently getting the timeout on this branch, I think the regression is legit

Would you still be interested in a review? I've been waiting for the investigation of that regression.

@pquentin pquentin removed their request for review April 20, 2022 07:29
@pquentin pquentin modified the milestones: 2.5.0, 2.5.1 Jun 22, 2022
@pquentin pquentin modified the milestones: 2.5.1, 2.x Jul 4, 2022
@DJRickyB DJRickyB changed the title Surface ES startup errors to the Rally log Surface ES startup errors to the Rally CLI console Jul 13, 2022
@DJRickyB DJRickyB requested review from b-deam and inqueue and removed request for danielmitterdorfer July 13, 2022 18:24
@DJRickyB DJRickyB modified the milestones: 2.x, 2.6.1 Jul 13, 2022
Copy link
Member

@inqueue inqueue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, this is nice!

The first time I checked out the PR branch and executed a test, the command seem to hang just after displaying STDERR.

(.venv) grape:rally (pr/1476) $ esrally race --track geonames --distribution-version 8.3.0 --runtime-jdk bundled --kill-running-processes --test-mode --car-params heap_size:1b

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Race id is [1b31ab29-cff1-431a-992e-d52eea732ee0]
[INFO] Preparing for race ...
[ERROR] Daemon startup failed with exit code [1]. STDERR:

	Exception in thread "main" java.lang.RuntimeException: starting java failed with [1]
	output:

	error:
	Invalid initial heap size: -Xms1b
	Error: Could not create the Java Virtual Machine.
	Error: A fatal exception has occurred. Program will exit.
		at org.elasticsearch.server.cli.JvmOption.flagsFinal(JvmOption.java:113)
		at org.elasticsearch.server.cli.JvmOption.findFinalOptions(JvmOption.java:80)
		at org.elasticsearch.server.cli.MachineDependentHeap.determineHeapSettings(MachineDependentHeap.java:59)
		at org.elasticsearch.server.cli.JvmOptionsParser.jvmOptions(JvmOptionsParser.java:132)
		at org.elasticsearch.server.cli.JvmOptionsParser.determineJvmOptions(JvmOptionsParser.java:90)
		at org.elasticsearch.server.cli.ServerProcess.createProcess(ServerProcess.java:211)
		at org.elasticsearch.server.cli.ServerProcess.start(ServerProcess.java:106)
		at org.elasticsearch.server.cli.ServerProcess.start(ServerProcess.java:89)
		at org.elasticsearch.server.cli.ServerCli.startServer(ServerCli.java:213)
		at org.elasticsearch.server.cli.ServerCli.execute(ServerCli.java:90)
		at org.elasticsearch.common.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:54)
		at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:85)
		at org.elasticsearch.cli.Command.main(Command.java:50)
		at org.elasticsearch.launcher.CliToolLauncher.main(CliToolLauncher.java:64)

-First execution hung here for about a minute before exiting manually

[ERROR] Cannot race. Command '['./bin/elasticsearch', '-d', '-p', './pid']' returned non-zero exit status 1.
	Traceback (most recent call last):
  File "/Users/jbryan/dev/src/rally/esrally/mechanic/mechanic.py", line 590, in receiveMsg_StartNodes
    self.mechanic.start_engine()
  File "/Users/jbryan/dev/src/rally/esrally/mechanic/mechanic.py", line 707, in start_engine
    self.nodes = self.launcher.start(self.node_configs)
  File "/Users/jbryan/dev/src/rally/esrally/mechanic/launcher.py", line 134, in start
    return [self._start_node(node_configuration, node_count_on_host) for node_configuration in node_configurations]
  File "/Users/jbryan/dev/src/rally/esrally/mechanic/launcher.py", line 134, in <listcomp>
    return [self._start_node(node_configuration, node_count_on_host) for node_configuration in node_configurations]
  File "/Users/jbryan/dev/src/rally/esrally/mechanic/launcher.py", line 164, in _start_node
    node_pid = self._start_process(binary_path, env)
  File "/Users/jbryan/dev/src/rally/esrally/mechanic/launcher.py", line 219, in _start_process
    raise e
  File "/Users/jbryan/dev/src/rally/esrally/mechanic/launcher.py", line 216, in _start_process
    ProcessLauncher._run_subprocess(command_line=" ".join(cmd), env=env)
  File "/Users/jbryan/dev/src/rally/esrally/mechanic/launcher.py", line 203, in _run_subprocess
    subprocess.run(
  File "/Users/jbryan/.pyenv/versions/3.10.4/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['./bin/elasticsearch', '-d', '-p', './pid']' returned non-zero exit status 1.


Getting further help:
*********************
* Check the log files in /Users/jbryan/.rally/logs for errors.
* Read the documentation at https://esrally.readthedocs.io/en/latest/.
* Ask a question on the forum at https://discuss.elastic.co/tags/c/elastic-stack/elasticsearch/rally.
* Raise an issue at https://github.com/elastic/rally/issues and include the log files in /Users/jbryan/.rally/logs.

--------------------------------
[INFO] FAILURE (took 12 seconds)
--------------------------------

Subsequent executions exited as expected.

Copy link
Member

@b-deam b-deam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement, LGTM!

@DJRickyB DJRickyB merged commit d4c49f2 into elastic:master Jul 27, 2022
@DJRickyB DJRickyB deleted the log_es_err branch July 27, 2022 15:10
b-deam added a commit to b-deam/rally that referenced this pull request Jul 28, 2022
DJRickyB pushed a commit that referenced this pull request Jul 28, 2022
…1550)

it definitely appears that any capturing of stdout or stderr to a pipe is causing the deadlock.
@pquentin pquentin removed this from the 2.7.0 milestone Nov 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improves the status quo
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants