
Installation and configuration


Installation

Harvester can be installed with or without root privilege; the steps below install it without root privilege, inside a virtual environment.

$ # setup virtual environment (N.B. if your system supports conda instead of virtualenv, see "How to use conda" in the Misc section)
$ virtualenv harvester # or python -m venv harvester (for python 3)
$ cd harvester
$ . bin/activate

$ # install additional python packages if missing
$ pip install pip --upgrade

$ # install panda components
$ pip install git+git://github.com/PanDAWMS/panda-harvester.git

$ # copy sample setup and config files
$ mv etc/sysconfig/panda_harvester.rpmnew.template  etc/sysconfig/panda_harvester
$ mv etc/panda/panda_common.cfg.rpmnew.template etc/panda/panda_common.cfg
$ mv etc/panda/panda_harvester.cfg.rpmnew.template etc/panda/panda_harvester.cfg

Setup and system configuration files

Several parameters need to be adjusted in the setup file (etc/sysconfig/panda_harvester) and in two config files (etc/panda/panda_common.cfg and etc/panda/panda_harvester.cfg). panda_harvester.cfg can also be hosted remotely (see Remote configuration files).

The following parameters need to be modified in etc/sysconfig/panda_harvester.

| Name | Description |
| --- | --- |
| PANDA_HOME | Config files must be under $PANDA_HOME/etc |
| PYTHONPATH | Must contain the pandacommon package and the site-packages directory where the pandaharvester package is available |

  • Example

export PANDA_HOME=$VIRTUAL_ENV
export PYTHONPATH=$VIRTUAL_ENV/lib/python2.7/site-packages/pandacommon:$VIRTUAL_ENV/lib/python2.7/site-packages

The logdir needs to be set in etc/panda/panda_common.cfg. It is recommended to use a non-NFS directory to avoid buffering. Here are additional explanations for logging parameters.

| Name | Description |
| --- | --- |
| logdir | A directory for log files |

  • Example

logdir = /var/log/panda

The following list shows the parameters that need to be adjusted in etc/panda/panda_harvester.cfg. You can use $XYZ or ${XYZ} if you want to set those parameters through environment variables (see the sketch after the table).

| Name | Description |
| --- | --- |
| master.uname | User name of the daemon process |
| master.gname | Group name of the daemon process |
| db.database_filename | Filename of the local database |
| db.verbose | Set True to dump all SQL queries in the log file |
| pandacon.ca_cert | CERN CA certificate file |
| pandacon.cert_file | A grid proxy file to access the panda server |
| pandacon.key_file | The same as pandacon.cert_file |
| qconf.configFile | The queue configuration file. See the next section for details |
| qconf.queueList | The list of PandaQueues for which the harvester instance works |
| credmanager.moduleName | The module name of the credential manager |
| credmanager.className | The class name of the credential manager |
| credmanager.certFile | A grid proxy without VOMS extension. NoVomsCredManager generates the VOMS proxy from this file |
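
For illustration, here is a minimal sketch of the corresponding entries in panda_harvester.cfg. All values below (user/group names, file paths, and the queue name) are placeholders to adapt; the authoritative starting point is the template copied during installation.

[master]
# user/group names the daemon process runs under
uname = harvester
gname = harvester

[db]
# local database file; the path is a placeholder
database_filename = $PANDA_HOME/var/harvester.db
verbose = False

[pandacon]
ca_cert = /etc/pki/tls/certs/CERN-bundle.pem
# the same grid proxy is used for cert_file and key_file
cert_file = /path/to/grid_proxy
key_file = /path/to/grid_proxy

[qconf]
# queue configuration file and the PandaQueues this instance serves
configFile = panda_queueconfig.json
queueList = MY_PANDA_QUEUE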

Concerning agent optimization, see the next section.

lockInterval, xyzInterval, and maxJobsXYZ

Most agents define lockInterval and xyzInterval (where 'xyz' is 'check', 'trigger', and so on, depending on the agent's actions) in panda_harvester.cfg. Each agent runs multiple threads in parallel, and each thread processes job and/or worker objects independently: it first retrieves objects from the database, processes them, and finally releases them.

lockInterval defines how long the objects are kept for a thread after they are retrieved. During that period other threads cannot touch the objects. Another thread can take the objects once lockInterval has passed, which is useful when harvester is restarted after being killed while objects were not properly released. Note that lockInterval must be longer than the processing time of each thread; otherwise multiple threads would try to process the same objects concurrently.

xyzInterval, on the other hand, defines how often the objects are processed, i.e. once the objects are released by a thread, they are processed again after xyzInterval has passed.

maxJobsXYZ defines how many job objects are retrieved by a single thread. A large maxJobsXYZ generally doesn't make sense, since jobs are processed sequentially by the thread and its processing time simply becomes longer. A large maxJobsXYZ can also be problematic in terms of memory usage, since many job objects are loaded into RAM from the database before being processed.
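
As an illustration, the monitor agent could be tuned along these lines in panda_harvester.cfg. This is a sketch only: the parameter names follow the xyz pattern described above, and the numbers are placeholders, not recommendations.

[monitor]
# a thread keeps its objects locked for at most 10 min
lockInterval = 600
# released objects are checked again after 15 min
checkInterval = 900
# each thread fetches at most 100 jobs per cycle
maxJobsCheck = 100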

Queue configuration file

Plug-ins for each PandaQueue are configured in the queue configuration file. The filename is defined in qconf.configFile, and the file has to be put in the $PANDA_HOME/etc/panda directory and/or at a URL (see remote configuration files). This file might be integrated into the information system JSON in the future, but for now it has to be created manually. Here are examples of the queue configuration file for the grid and for HPC. The content is a JSON dump of the following structure:

{
    "PandaQueueName1": {
        "QueueAttributeName1": ValueQ_1,
        "QueueAttributeName2": ValueQ_2,
        ...
        "QueueAttributeNameN": ValueQ_N,
        "Agent1": {
            "AgentAttribute1": ValueA_1,
            "AgentAttribute2": ValueA_2,
            ...
            "AgentAttributeM": ValueA_M
        },
        "Agent2": {
            ...
        },
        ...
        "AgentX": {
            ...
        }
    },
    "PandaQueueName2": {
        ...
    },
    ...
    "PandaQueueNameY": {
        ...
    }
}

Here is the list of queue attributes.

| Name | Description |
| --- | --- |
| prodSourceLabel | Source label of the queue. managed for production |
| nQueueLimitJob | The max number of jobs pre-fetched and queued, i.e. jobs in starting state |
| nQueueLimitWorker | The max number of workers queued in the batch system, i.e. workers in submitted, pending, or idle state |
| maxWorkers | The max number of workers. maxWorkers - nQueueLimitWorker is the number of running workers |
| maxNewWorkersPerCycle | The max number of workers that can be submitted in a single submission cycle. 0 (the default) means unlimited |
| truePilot | Set true to suppress heartbeats for jobs in running, transferring, finished, or failed state |
| runMode | self (the default) to submit workers based on nQueueLimit* and maxWorkers; slave to be centrally controlled by panda |
| allowJobMixture | If true, jobs from different tasks can be given to a single worker |
| mapType | Mapping between jobs and workers. NoJob: workers themselves get jobs directly from Panda after they are submitted. OneToOne: 1 job x 1 worker. OneToMany: 1xN, aka the multiple-consumer mode. ManyToOne: Nx1, aka the multi-job pilot mode. Harvester prefetches jobs except for NoJob |
| useJobLateBinding | true if the queue uses job-level late-binding. Note that for job-level late-binding harvester prefetches jobs to pass them to workers when those workers get CPUs, so mapType must not be NoJob. If this flag is false or omitted, jobs are submitted together with workers |

Agent is one of preparator, submitter, workMaker, messenger, stager, monitor, and sweeper. Two agent parameters, name and module, are mandatory to define the class name and module name of the agent. Roughly speaking,

from agentModule import agentName
agent = agentName()

is internally invoked. Other agent attributes are set on the agent instance as instance variables. Parameters for plug-ins are described in this page.
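
For instance, a single-queue configuration file might look like the following. This is a sketch only: the queue name and attribute values are placeholders, and the plug-in class/module names are illustrative and should be replaced with the actual plug-ins chosen for your site.

{
    "MY_PANDA_QUEUE": {
        "prodSourceLabel": "managed",
        "nQueueLimitJob": 10,
        "nQueueLimitWorker": 10,
        "maxWorkers": 100,
        "mapType": "OneToOne",
        "submitter": {
            "name": "SlurmSubmitter",
            "module": "pandaharvester.harvestersubmitter.slurm_submitter"
        },
        "monitor": {
            "name": "SlurmMonitor",
            "module": "pandaharvester.harvestermonitor.slurm_monitor"
        }
    }
}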

init.d script

An example init.d script is available at etc/rc.d/init.d/panda_harvester.rpmnew.template. You need to change VIRTUAL_ENV in the script and rename it to panda_harvester. Change the log and lock files if necessary. Then, to start/stop harvester:

$ etc/rc.d/init.d/panda_harvester start
$ etc/rc.d/init.d/panda_harvester stop

Misc

How to set up virtualenv if unavailable by default

  • For NERSC
$ module load python
$ module load virtualenv
  • For others
$ pip install virtualenv --user

See https://virtualenv.pypa.io/en/stable/installation/ for more details.

How to install python-daemon on Edison@NERSC

$ module load python
$ cd harvester
$ . bin/activate
$ pip install --index-url=http://pypi.python.org/simple/ --trusted-host pypi.python.org  python-daemon

How to install rucio-client on Edison@NERSC (Required only if RucioStager is used)

$ cd harvester
$ . bin/activate
$ pip install --index-url=http://pypi.python.org/simple/ --trusted-host pypi.python.org rucio-clients
$ cat etc/rucio.cfg.atlas.client.template | grep -v ca_cert > etc/rucio.cfg
$ echo "ca_cert = /etc/pki/tls/certs/CERN-bundle.pem" >> etc/rucio.cfg
$ echo "auth_type = x509_proxy" >> etc/rucio.cfg
$
$ # For tests
$ export X509_USER_PROXY=...
$ export RUCIO_ACCOUNT=...
$ rucio ping

How to install local panda-harvester package

$ cd panda-harvester
$ rm -rf dist; python setup.py sdist; pip install dist/pandaharvester-*.tar.gz --upgrade

How to use conda

If your system supports conda instead of virtualenv, set up conda before using pip.

  • For Cori@NERSC
$ module load python
$ conda create --name harvester python
$ mkdir harvester
$ cd harvester
$ source activate harvester

Auto restart

It is possible to automatically restart harvester when it dies by using supervisord, which can be installed via pip.

$ pip install supervisor

An example supervisord configuration file is available at etc/panda/panda_supervisord.cfg.rpmnew.template. You need to rename it to panda_supervisord.cfg and change the logfile, pidfile, and command parameters accordingly (a sketch of the program section is shown at the end of this subsection). The command parameter uses the init.d script. PROGNAME in the init.d script needs to be changed to

PROGNAME='python -u '${SITE_PACKAGES_PATH}'/pandaharvester/harvesterbody/master.py --foreground'

because applications run under supervisord must be executed in the foreground, i.e. not daemonized. To start supervisord:

$ supervisord -c etc/panda/panda_supervisord.cfg

then harvester is automatically started. To stop/start harvester,

$ supervisorctl stop panda-harvester
$ supervisorctl start panda-harvester

To stop supervisord

$ supervisorctl shutdown

Harvester is automatically stopped when supervisord is stopped.
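
For reference, the program section in the supervisord configuration might look roughly like this. It is a hypothetical sketch: the install path is a placeholder, and the template in etc/panda is the authoritative starting point.

[program:panda-harvester]
; the init.d script, with PROGNAME modified as above so the process stays in the foreground
command = /opt/harvester/etc/rc.d/init.d/panda_harvester start
; restart harvester automatically if it dies
autorestart = true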

High-performance configuration

It is possible to configure harvester instances with a more powerful database backend (MariaDB) and multi-processing based on Apache+WSGI. Note that Apache is used to launch multiple harvester processes, so you don't have to use apache messengers for communication between harvester and workers unless that is needed.

MariaDB setup

First you need to create the HARVESTER database and the harvester account in MariaDB, e.g.

$ mysql -u root
MariaDB > create database HARVESTER;
MariaDB > CREATE USER 'harvester'@'localhost' IDENTIFIED BY 'password';
MariaDB > GRANT ALL PRIVILEGES ON HARVESTER.* TO 'harvester'@'localhost';

Note that harvester tables are automatically created when the harvester instance starts, so you don't have to make them yourself. Then edit /etc/my.cnf if you need to optimize the database, e.g.,

[mysqld]
max_allowed_packet=1024M

Harvester uses mysql-connector to access MariaDB.

$ pip install https://dev.mysql.com/get/Downloads/Connector-Python/mysql-connector-python-2.1.6.tar.gz
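
To verify that the connector and the account work, a quick check along these lines can be run (a sketch; fill in the actual password):

import mysql.connector

# connect with the account and schema created above
conn = mysql.connector.connect(host='localhost', user='harvester',
                               password='FIXME', database='HARVESTER')
cur = conn.cursor()
cur.execute('SELECT VERSION()')
print(cur.fetchone())
conn.close()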

Apache setup

First, make sure that httpd and mod_wsgi are installed on your node. An example httpd config file is available at etc/panda/panda_harvester-httpd.conf.rpmnew.template; it needs to be renamed to panda_harvester-httpd.conf before being edited. At least User and Group need to be modified. In the httpd.conf there is a line like

   WSGIDaemonProcess pandahvst_daemon processes=2 threads=2 home=${VIRTUAL_ENV}

which defines the number of processes and the number of threads in each process. These numbers can be increased if necessary.
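
For orientation, the surrounding context in the template looks roughly like this. This is a hypothetical sketch: the actual directives and the WSGI script path come from the template itself, and 26080 is the port used in the test below.

Listen 26080
<VirtualHost *:26080>
    # two daemon processes with two threads each
    WSGIDaemonProcess pandahvst_daemon processes=2 threads=2 home=${VIRTUAL_ENV}
    WSGIProcessGroup pandahvst_daemon
    # map the /entry URL to the harvester frontend WSGI script (path is illustrative)
    WSGIScriptAlias /entry ${VIRTUAL_ENV}/lib/python2.7/site-packages/pandaharvester/harvesterbody/frontend.wsgi
</VirtualHost>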

Changes to panda_harvester.cfg

The following changes are required in panda_harvester.cfg

[db]
# engine sqlite or mariadb
engine = mariadb
# user name
user = harvester
# password
password = FIXME
# schema
schema = HARVESTER 

where engine should be set to mariadb and password should be changed accordingly.

[frontend]
# type
type = apache

where type should be set to apache. Note that the port number for apache is defined in panda_harvester-httpd.conf.

How to start/stop harvester

Use panda_harvester-apachectl to start or stop harvester. An example apachectl script is available at etc/rc.d/init.d/panda_harvester-apachectl.rpmnew.template. You need to change VIRTUAL_ENV in the script and rename it to panda_harvester-apachectl. Then:

$ etc/rc.d/init.d/panda_harvester-apachectl start
$ etc/rc.d/init.d/panda_harvester-apachectl stop

To test,

$ curl http://localhost:26080/entry -H "Content-Type: application/json" -d '{"methodName":"test", "workerID":123, "data":"none"}'

which will return a message like 'workerID=123 not found in DB'.

Remote configuration files

It is possible to load system and/or queue configuration files via http/https. This is typically useful to maintain a centralized pool of configuration files, so that it is easy to see which configuration each harvester instance is running with. Two environment variables, HARVESTER_INSTANCE_CONFIG_URL and HARVESTER_QUEUE_CONFIG_URL, define the URLs for the system config and queue config files, respectively; a sketch of how to set them is shown at the end of this section. If those variables are set, the harvester instance loads the config files from those URLs and then overwrites parameters with the values specified in local config files, if any. Sensitive information like the database password should be stored only in local config files.

System config files are read only when the harvester instance is launched, while queue config files are read every 10 min, so that queue configuration can be changed dynamically while the instance is running. Note that the remote queue config file is periodically cached in the database by Cacher, which is automatically started when the harvester instance is launched, so you don't have to do anything manually. However, when you edit the remote queue config file and then want to run unit tests which don't run Cacher, you have to cache it manually using cacherTest.py.

$ python lib/python*/site-packages/pandaharvester/harvestertest/cacherTest.py
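
For example, the two environment variables could be set as follows before launching the instance (the URLs are placeholders for your own configuration pool):

$ export HARVESTER_INSTANCE_CONFIG_URL=https://example.org/configs/panda_harvester.cfg
$ export HARVESTER_QUEUE_CONFIG_URL=https://example.org/configs/panda_queueconfig.json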