Tuning for large code bases
If you landed on this page, chances are that you have hit a problem with scaling, latency, or performance. The first step should be to set up monitoring so that the problem can be seen more clearly.
For both the indexer and the web application, check how the operating system deals with situations where allocated memory exceeds system resources. On Linux in particular, make sure to avoid or tame the OOM killer to prevent nasty surprises such as abrupt termination of the indexer process.
In general it is recommended to run both the indexer and the web application with
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/some/sensible/place/to/store/jvm/dumps
so that a JVM heap dump is captured when an out-of-memory error occurs. The dump can then be analyzed with tools such as jhat (part of the JDK up to version 8) or Eclipse Memory Analyzer (http://www.eclipse.org/mat/).
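A minimal sketch of how these flags might be wired in, assuming a POSIX shell. The dump directory /var/tmp/opengrok-dumps and the variable names are placeholders, not OpenGrok defaults. For the web application the flags go into JAVA_OPTS; for the indexer run through the Python wrapper, each JVM flag is passed with -J=.

```shell
# Hypothetical dump location; choose a file system with room for a full heap dump.
DUMP_DIR="${DUMP_DIR:-/var/tmp/opengrok-dumps}"

# The two JVM flags recommended above, kept in one variable for reuse.
HEAP_DUMP_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$DUMP_DIR"

# Web application: append the flags to JAVA_OPTS.
JAVA_OPTS="${JAVA_OPTS:-} $HEAP_DUMP_OPTS"
export JAVA_OPTS

# Indexer via the Python wrapper: prefix every JVM flag with -J=.
INDEXER_JVM_ARGS=""
for opt in $HEAP_DUMP_OPTS; do
    INDEXER_JVM_ARGS="$INDEXER_JVM_ARGS -J=$opt"
done

# Print the resulting command line instead of running it.
echo "opengrok-indexer$INDEXER_JVM_ARGS --jar opengrok.jar -- ..."
```

A heap dump is roughly as large as the heap in use, so the dump directory should have at least as much free space as the configured maximum heap.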
If you run the indexer via the opengrok-indexer script, keep in mind that by default it does not set the Java heap size, so the JVM default will be used.
This might not be enough, especially for large projects such as AOSP or when indexing lots of mid-sized projects.
Make sure there is enough space in the temporary directory. Universal ctags creates temporary files there during indexing, and the higher the parallelism, the more files need to be created. These files are sometimes hundreds of megabytes each.
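A quick way to check the free space before indexing, as a sketch. It assumes the temporary directory is the one named by TMPDIR, falling back to /tmp; verify which directory your ctags build actually uses. The function name is made up for illustration.

```shell
# Print the free space (in KB) of the temporary directory.
tmp_free_kb() {
    dir="${TMPDIR:-/tmp}"
    # -P forces the POSIX layout (one line per file system); column 4 is "Available".
    df -Pk "$dir" | awk 'NR == 2 { print $4 }'
}

echo "free space in ${TMPDIR:-/tmp}: $(tmp_free_kb) KB"
```

Given that individual files can reach hundreds of megabytes, the available space should be compared against that size multiplied by the indexer parallelism.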
The usually recommended JVM heap size for the indexer is 8 GB. The heap usage can be monitored, see https://github.com/oracle/opengrok/wiki/Monitoring#indexer
Heap usage depends on the level of parallelism used by the indexer. Generally, the higher the parallelism, the higher the heap usage.
Also, creating the history cache for a repository with a large number of changesets can consume a significant part of the heap. Because of this, the history of some repositories (Mercurial, Git) is handled in chunks to limit the memory requirements. The size of the chunks (i.e. the number of changesets to process in one go) can be tuned. The larger the chunk size, the higher the memory usage, although the indexer may finish sooner.
See https://github.com/oracle/opengrok/wiki/Indexer-configuration#indexer-tunables for details.
It is possible to disable handling of merge commits in Git via global or per-project configuration. If you have a repository with rich history, this might help. Also see the history chunk size above.
Lucene 4.x sets indexer defaults:
DEFAULT_RAM_PER_THREAD_HARD_LIMIT_MB = 1945;
DEFAULT_MAX_THREAD_STATES = 8;
DEFAULT_RAM_BUFFER_SIZE_MB = 16.0;
- which might grow as big as 16 GB (though DEFAULT_RAM_BUFFER_SIZE_MB shouldn't really allow it, but keep it around 1-2 GB)
- the Lucene RAM_BUFFER_SIZE_MB can now be tuned using the indexer parameter -m, so an 8 GB 64-bit server JDK indexer with tuned document flushing can be run like this (assuming the indexer is run from the Python wrapper; otherwise pass the indexer options directly):
$ opengrok-indexer -J=-Xmx8g -J=-server --jar opengrok.jar -- \
    -m 256 -s /source -d /data ...
On Solaris you might also want to use -J=-d64
The initial index creation process is resource intensive, and the error
java.io.IOException: error=24, Too many open files
often appears in the logs. To avoid this, increase the ulimit value for open files.
With Java modularization, the indexer process by itself (not performing any indexing) will easily have 250 or so open files because all the Java jmod files are open.
A hard and soft limit of 10240 open files has been observed to work for mid-sized repositories, so the recommendation is to start with that value. Also, the higher the parallelism of the indexer, the higher the limit has to be. See the parallelism related tunables in https://github.com/oracle/opengrok/wiki/Indexer-configuration , in particular the indexingParallelism tunable. The parallelism can also be set using the indexer command line options.
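The limit can be inspected and raised in the shell that launches the indexer. The following is a sketch for bash; the function name is hypothetical, and 10240 is the starting value suggested above. It only raises the soft limit. Raising the hard limit usually requires system configuration, which is outside the scope of this snippet.

```shell
# Raise the soft limit for open files to at least the given value.
# Never lowers the limit; if the hard limit is lower than the target,
# the soft limit is raised only as far as the hard limit.
raise_nofile_soft_limit() {
    target=$1
    soft=$(ulimit -Sn)
    hard=$(ulimit -Hn)
    # Nothing to do if the soft limit is already high enough.
    if [ "$soft" = "unlimited" ] || [ "$soft" -ge "$target" ]; then
        return 0
    fi
    # The soft limit cannot exceed the hard limit without extra privileges.
    if [ "$hard" != "unlimited" ] && [ "$hard" -lt "$target" ]; then
        echo "hard limit $hard is below $target; using $hard instead" >&2
        target=$hard
    fi
    ulimit -Sn "$target"
}

raise_nofile_soft_limit 10240 || echo "could not raise the open files limit" >&2
echo "open files: soft=$(ulimit -Sn) hard=$(ulimit -Hn)"
```

The new limit applies to the current shell and to processes started from it, so the indexer has to be launched from the same shell session or script.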
The resource limits can also be set when using the opengrok-sync tool, in the configuration of the given command, using the
limits: {RLIMIT_NOFILE: 2048}
directive in the configuration file. See the opengrok-sync documentation for more details and examples.
If you get a similar error to the open files limit, but for threads:
java.lang.OutOfMemoryError: unable to create new native thread
it might be due to strict security limits (for example a low per-user limit on the number of processes and threads), in which case the limits need to be increased.
The heap size limit for the web application should be derived from the size of the data generated by the indexer and should also reflect the size of the WFST structures generated by the Suggester in the web application. The former creates memory pressure especially for multi-project searches. Thus, for precise tuning it might be prudent to estimate the memory footprint of a single all-project search (using a memory profiler), determine how many requests the web application should serve simultaneously, multiply these two values, and make sure the heap limit is bigger than the result.
For Suggester data, it should be sufficient to compute the sum of the sizes of all *.wfst files under the data root and bump the heap limit by that value.
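The estimate described above can be sketched as shell arithmetic. The function names and the sample numbers in the usage comment are hypothetical; the per-search footprint has to come from your own profiling.

```shell
# Sum the sizes (in bytes) of all *.wfst files under the given data root.
wfst_total_bytes() {
    find "$1" -type f -name '*.wfst' -exec wc -c {} + |
        awk '!(NF == 2 && $2 == "total") { sum += $1 } END { printf "%.0f\n", sum }'
}

# Heap estimate in MB: (one all-project search) x (concurrent requests)
# plus the Suggester data, rounded up to whole megabytes.
webapp_heap_estimate_mb() {
    search_mb=$1     # footprint of a single all-project search, in MB
    concurrent=$2    # number of simultaneous requests to plan for
    wfst_bytes=$3    # result of wfst_total_bytes
    echo $(( search_mb * concurrent + (wfst_bytes + 1048575) / 1048576 ))
}

# Usage (sample numbers only): 2000 MB per search, 4 concurrent requests.
# webapp_heap_estimate_mb 2000 4 "$(wfst_total_bytes /data)"
```

The result is a lower bound for -Xmx rather than an exact figure; leaving some headroom on top of it is a reasonable precaution.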
The web application utilizes several thread pools. These are usually sized based on the number of online CPUs (cores) in the system. By default Tomcat allows only 200 or so threads for the basic Connector. The more CPUs (cores) there are in the system, the higher the chance that the limit will be reached, so it might be necessary to bump the limit.
Also, when using the per-project workflow, there are usually many indexer processes running in parallel. Each of them makes several RESTful API calls. Combined, these can lead to many threads being created in the web application.
Configuration snippet example:
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443"
maxThreads="1024" />
There is also the maxConnections attribute.
Tomcat by default supports only small deployments. For bigger ones you
might need to increase its heap (assuming 64-bit Java).
It will most probably be the same for other containers as well.
For Tomcat you can easily get this done by creating $CATALINA_BASE/bin/setenv.sh:
# cat $CATALINA_BASE/bin/setenv.sh
JAVA_OPTS="$JAVA_OPTS -server"
# OpenGrok memory boost to cover all-project searches
# (7 MB * 247 projects + 300 MB for cache should be enough)
# 64-bit Java allows for more so let's use 8GB to be on the safe side.
# We might need to allow more for concurrent all-project searches.
JAVA_OPTS="$JAVA_OPTS -Xmx8g"
export JAVA_OPTS
With Tomcat you might also hit the limit on HTTP header size (the header is used to send the project list when requesting search results). To raise it, increase (or add) maxHttpHeaderSize in conf/server.xml, for example:
<Connector port="8888" protocol="HTTP/1.1"
connectionTimeout="20000"
maxHttpHeaderSize="65536"
redirectPort="8443" />
Refer to docs of other containers for more info on how to achieve the same.
Failure to do so will result in HTTP 400 errors after the first query, with the error "Error parsing HTTP request header".
The same tuning can be done in Apache (handy when Apache runs as a reverse proxy in front of Tomcat) with the LimitRequestLine and LimitRequestFieldSize directives:
LimitRequestLine 65536
LimitRequestFieldSize 65536
and also bump the default limit of responsefieldsize:
LoadModule proxy_module libexec/mod_proxy.so
LoadModule proxy_http_module libexec/mod_proxy_http.so
<IfModule mod_proxy.c>
# The number of seconds Apache httpd waits for data sent by / to the backend.
# This should match the `interactiveCommandTimeout` setting in OpenGrok.
ProxyTimeout 600
ProxyPass /source/ http://localhost:8080/source/ responsefieldsize=16384
ProxyPass /source http://localhost:8080/source/ responsefieldsize=16384
ProxyPassReverse /source/ http://localhost:8080/source/
ProxyPassReverse /source http://localhost:8080/source/
</IfModule>
If multi-project search is performed frequently, it might be good to warm up file system cache after each reindex. This can be done e.g. with https://github.com/hoytech/vmtouch
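A sketch of such a warm-up step, meant to run after a reindex. It assumes vmtouch is installed and that the Lucene indexes live in the index directory under the data root; the function name is hypothetical. The -t option asks vmtouch to touch the pages of the files so that they are loaded into the file system cache.

```shell
# Load the index files under the given data root into the file system cache.
warm_index_cache() {
    index_dir="$1/index"
    if ! command -v vmtouch >/dev/null 2>&1; then
        echo "vmtouch not found; skipping cache warm-up" >&2
        return 1
    fi
    # vmtouch walks the directory recursively and touches every file in it.
    vmtouch -t "$index_dir"
}

# Usage, e.g. as the last step of the reindex script:
# warm_index_cache /data
```

This only helps if the index fits into the memory left over after the web application heap; otherwise the pages are evicted again before they are used.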
It is recommended to store the Suggester data on SSD/Flash. This benefits both the suggester rebuild operation (that happens during reindex and also periodically) as well as web application startup (performs suggester initialization).