ESGFNode|ConfiguringMetricsService

Stephen Pascoe edited this page Apr 9, 2014 · 7 revisions
Wiki Reorganisation
This page has been classified for reorganisation. It has been given the category REVISE.
This page contains useful content but needs revision. It may contain out of date or inaccurate content.

Configuring the metrics service

The ESGF metrics service has the following components:

  • Access logging filter: writes access log records to the metrics database.
  • Access log service: retrieves access log records from the metrics database.
  • Metrics database: implemented in the node database.

Configuring the TDS access logging filter

The TDS access logging filter intercepts file accesses to the TDS webapp and logs each access to the metrics database. The filter is implemented in the following jar files in $CATALINA_HOME/webapps/thredds/WEB-INF/lib:

esgf-node-manager-common-<version>.jar
esgf-node-manager-filters-<version>.jar
commons-dbutils-<version>.jar
commons-dbcp-<version>.jar
commons-pool-<version>.jar
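
As a quick check that the jar files listed above are in place, a minimal Python sketch follows. The function name missing_jars is hypothetical and not part of any ESGF package, and the wildcard patterns simply stand in for the <version> placeholders.

```python
import glob
import os

# One wildcard pattern per jar listed above; "*" stands in for <version>.
REQUIRED_JAR_PATTERNS = [
    "esgf-node-manager-common-*.jar",
    "esgf-node-manager-filters-*.jar",
    "commons-dbutils-*.jar",
    "commons-dbcp-*.jar",
    "commons-pool-*.jar",
]


def missing_jars(lib_dir):
    """Return the patterns that match no file in lib_dir (hypothetical helper)."""
    return [
        pattern
        for pattern in REQUIRED_JAR_PATTERNS
        if not glob.glob(os.path.join(lib_dir, pattern))
    ]


if __name__ == "__main__":
    # Assumes CATALINA_HOME is set in the environment.
    catalina_home = os.environ.get("CATALINA_HOME", "")
    lib_dir = os.path.join(catalina_home, "webapps", "thredds", "WEB-INF", "lib")
    for pattern in missing_jars(lib_dir):
        print("missing:", pattern)
```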

The following configuration should be placed in $CATALINA_HOME/webapps/thredds/WEB-INF/web.xml, immediately after the authorization filter configuration (the ordering is important). Replace DATABASE_USER and DATABASE_PASSWORD with the credentials for your PostgreSQL database. Note: it's a good idea to use the same user and password that created the metrics database (see esgf_node_manager_initialize below):

  <!--  -->
  <!-- web.xml entry for the esg node filter    -->
  <!--  -->
  <filter>
    <filter-name>AccessLoggingFilter</filter-name>
    <filter-class>esg.node.filters.AccessLoggingFilter</filter-class>
    <init-param>
      <param-name>db.driver</param-name>
      <param-value>org.postgresql.Driver</param-value>
    </init-param>
    <init-param>
      <param-name>db.protocol</param-name>
      <param-value>jdbc:postgresql:</param-value>
    </init-param>
    <init-param>
      <param-name>db.host</param-name>
      <param-value>localhost</param-value>
    </init-param>
    <init-param>
      <param-name>db.port</param-name>
      <param-value>5432</param-value>
    </init-param>
    <init-param>
      <param-name>db.database</param-name>
      <param-value>esgcet</param-value>
    </init-param>
    <init-param>
      <param-name>db.user</param-name>
      <param-value>DATABASE_USER</param-value>
    </init-param>
    <init-param>
      <param-name>db.password</param-name>
      <param-value>DATABASE_PASSWORD</param-value>
    </init-param>
    <init-param>
      <param-name>extensions</param-name>
      <param-value>.nc</param-value>
    </init-param>
  </filter>
  <filter-mapping>
    <filter-name>AccessLoggingFilter</filter-name>
    <url-pattern>/*</url-pattern>
  </filter-mapping>
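
Before restarting Tomcat it can help to check the filter entry for leftover placeholders. The Python sketch below reads the init-param values from an abbreviated copy of the entry and assembles a JDBC URL from them. The helper names are hypothetical, and the way the URL is composed (protocol, "//", host, ":", port, "/", database) is an assumption about how the filter uses these parameters, not something confirmed by this page. A complete web.xml usually declares an XML namespace, which this fragment-level sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the filter entry above, without the enclosing web-app element.
FILTER_XML = """
<filter>
  <filter-name>AccessLoggingFilter</filter-name>
  <filter-class>esg.node.filters.AccessLoggingFilter</filter-class>
  <init-param><param-name>db.protocol</param-name><param-value>jdbc:postgresql:</param-value></init-param>
  <init-param><param-name>db.host</param-name><param-value>localhost</param-value></init-param>
  <init-param><param-name>db.port</param-name><param-value>5432</param-value></init-param>
  <init-param><param-name>db.database</param-name><param-value>esgcet</param-value></init-param>
  <init-param><param-name>db.user</param-name><param-value>DATABASE_USER</param-value></init-param>
  <init-param><param-name>db.password</param-name><param-value>DATABASE_PASSWORD</param-value></init-param>
</filter>
"""


def read_init_params(filter_xml):
    """Map each param-name to its param-value for one filter element."""
    root = ET.fromstring(filter_xml)
    return {
        param.findtext("param-name").strip(): param.findtext("param-value").strip()
        for param in root.findall("init-param")
    }


def jdbc_url(params):
    """Assumed composition of the connection URL from the db.* parameters."""
    return "%s//%s:%s/%s" % (
        params["db.protocol"],
        params["db.host"],
        params["db.port"],
        params["db.database"],
    )


def unreplaced_placeholders(params):
    """Names of parameters that still hold the template placeholder values."""
    return [
        name
        for name, value in params.items()
        if value in ("DATABASE_USER", "DATABASE_PASSWORD")
    ]
```

With the template values left in place, unreplaced_placeholders reports db.user and db.password, which is a sign the entry has not been customised yet.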

Note: The parameters defined in the init-param elements above may also be defined in the ESGF properties file $ESG_HOME/esgf.properties ($ESG_HOME defaults to /esg). Parameters defined in web.xml override values in esgf.properties.
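
The override rule can be pictured as a simple merge in which web.xml values win. The Python sketch below only illustrates that precedence and is not the node manager's own code. The properties parser is deliberately simplified (key=value lines only) and both function names are hypothetical.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comments.

    Simplified: real Java properties files also allow ':' separators and
    line continuations, which this sketch ignores.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def effective_config(properties, init_params):
    """Start from esgf.properties values, then let web.xml init-params override."""
    merged = dict(properties)
    merged.update(init_params)
    return merged


if __name__ == "__main__":
    esgf_properties = parse_properties("db.host=db.example.org\ndb.port=5432\n")
    web_xml_params = {"db.host": "localhost"}
    print(effective_config(esgf_properties, web_xml_params))
```

Here db.host resolves to localhost (from web.xml) while db.port falls back to the value in esgf.properties. The host name db.example.org is a made-up value for illustration.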

Configuring the access log service

The access log service is part of the esgf-node-manager webapp. It is distributed as a WAR file, esgf-node-manager.war, which may be downloaded and installed with the esg-node installer script. The service should be configured to limit visibility to selected users (see "Securing the access log service" below).

A Python client is provided. It is installed with esg-node. Alternatively, if the Python easy_install script is available (e.g., if the ESG publisher package is already installed), it can be installed with:

easy_install -f http://www-pcmdi.llnl.gov/dist/externals 'esgf_node_manager>=0.2.0'

The client requires a proxy certificate as obtained from myproxy-logon. To run:

esgf_accesslog --service-url https://host.domain.gov/esgf-node-manager/accesslog starttime:endtime

Configuring the metrics database

The esgf_node_manager_initialize script creates the database schema for the metrics service. Note: use the same DATABASE_USER and DATABASE_PASSWORD as in the TDS filter configuration:

esgf_node_manager_initialize -c --dburl myname:[email protected]:5432/esgcet

Securing the access log service

The access log service must be secured to limit the visibility of the metrics logs to selected users. All files should be owned by the user that runs tomcat, usually 'tomcat'.

  • Add a new resource to conf/server.xml. In the GlobalNamingResources element, add:

        <Resource name="MetricsUserDatabase" auth="Container"
              type="org.apache.catalina.UserDatabase"
              description="User database that can be updated and saved"
              factory="org.apache.catalina.users.MemoryUserDatabaseFactory"
              pathname="conf/metrics-users.xml" />
    
  • List the distinguished names (DNs) of users who can read the metrics accesslog service in conf/metrics-users.xml. Note that the ordering of the DN components is important. Tomcat must be restarted for any changes to take effect.

    <?xml version='1.0' encoding='utf-8'?>
    
  • Use the metrics-users.xml database for this webapp. Add the file conf/Catalina/localhost/esgf-node-manager.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    
  • Require client authentication over SSL when the service is called. Add the following at the end of webapps/esgf-node-manager/WEB-INF/web.xml:

    <security-constraint>
      <web-resource-collection>
        <web-resource-name>Access Logger</web-resource-name>
        <url-pattern>/accesslog/*</url-pattern>
      </web-resource-collection>
      <auth-constraint>
        <role-name>metrics</role-name>
      </auth-constraint>
      <user-data-constraint>
        <transport-guarantee>CONFIDENTIAL</transport-guarantee>
      </user-data-constraint>
    </security-constraint>

    <login-config>
      <auth-method>CLIENT-CERT</auth-method>
      <realm-name>Access Logger</realm-name>
    </login-config>

    <security-role>
      <role-name>metrics</role-name>
    </security-role>
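
For the user list step above, the following Python sketch writes out a candidate conf/metrics-users.xml from a list of DNs. It assumes the file follows Tomcat's standard tomcat-users.xml layout, with a role named metrics to match the role-name in the security constraint. The function name, the example DNs, and the dummy password value are all hypothetical, so compare the result against a known working node before relying on it.

```python
from xml.sax.saxutils import quoteattr


def metrics_users_xml(dns, role="metrics"):
    """Render a tomcat-users style document granting `role` to each DN.

    Assumption: with CLIENT-CERT authentication the username is the
    certificate subject DN, and the password value is only a dummy.
    """
    lines = [
        "<?xml version='1.0' encoding='utf-8'?>",
        "<tomcat-users>",
        "  <role rolename=%s/>" % quoteattr(role),
    ]
    for dn in dns:
        # quoteattr escapes characters that are special in XML attributes.
        lines.append(
            "  <user username=%s password=%s roles=%s/>"
            % (quoteattr(dn), quoteattr("null"), quoteattr(role))
        )
    lines.append("</tomcat-users>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical DNs; component order must match the certificate exactly.
    print(metrics_users_xml([
        "CN=Jane Doe, OU=Example Unit, O=Example Org",
        "CN=John Roe, OU=Example Unit, O=Example Org",
    ]))
```

Keeping the DNs in a script or list like this makes it easier to spot ordering mistakes in the DN components before Tomcat is restarted.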

Getting high-level metrics from the database

The following query can be used to get user, file and volume download metrics:

SELECT 
  EXTRACT (YEAR FROM (TIMESTAMP WITH TIME ZONE 'epoch' + fixed_log.date_fetched * INTERVAL '1 second')) as year, 
  EXTRACT (MONTH FROM (TIMESTAMP WITH TIME ZONE 'epoch' + fixed_log.date_fetched * INTERVAL '1 second')) as month, 
  count(*) as downloads, 
  count(distinct url) as files, 
  count(distinct user_id_hash) as users, 
  to_char(sum(fixed_log.size)/1024/1024/1024, '9,999,999.99') as gb 
FROM (
  SELECT 
    file.url, 
    log.user_id_hash, 
    max(log.date_fetched) as date_fetched, 
    max(file.size) as size 
  FROM 
    esgf_node_manager.access_logging as log join 
    file_version as file on (log.url LIKE '%.nc' AND regexp_replace(log.url, E'^.*/(cmip5/.*\\.nc)$', E'\\1') = file.url) 
  where log.success and log.duration > 1000 
  group by file.url,log.user_id_hash
) as fixed_log 
group by year,month order by year,month;
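
To make the logic of the query easier to follow, here is a rough Python equivalent that works on in-memory rows instead of the database. It is a sketch only: the row dictionaries and the file_sizes mapping are stand-ins for the access_logging and file_version tables, the function name is hypothetical, and months are computed in UTC, whereas the SQL uses the database session's time zone.

```python
import re
from collections import defaultdict
from datetime import datetime, timezone

# Same pattern as the regexp_replace call: keep only the cmip5/... part of the URL.
CMIP5_PATH = re.compile(r"^.*/(cmip5/.*\.nc)$")


def monthly_metrics(log_rows, file_sizes):
    """Aggregate downloads, files, users and GB per (year, month).

    log_rows: dicts with url, user_id_hash, date_fetched (epoch seconds),
    success and duration. file_sizes: file url -> size in bytes.
    """
    # Inner query: one entry per (file, user), keeping the latest fetch time.
    latest = {}
    for row in log_rows:
        if not (row["success"] and row["duration"] > 1000):
            continue
        if not row["url"].endswith(".nc"):
            continue
        file_url = CMIP5_PATH.sub(r"\1", row["url"])
        if file_url not in file_sizes:
            continue  # no matching file_version row, so the join drops it
        key = (file_url, row["user_id_hash"])
        latest[key] = max(latest.get(key, row["date_fetched"]), row["date_fetched"])

    # Outer query: group the de-duplicated entries by year and month.
    buckets = defaultdict(
        lambda: {"downloads": 0, "files": set(), "users": set(), "bytes": 0}
    )
    for (file_url, user), fetched in latest.items():
        when = datetime.fromtimestamp(fetched, tz=timezone.utc)
        bucket = buckets[(when.year, when.month)]
        bucket["downloads"] += 1
        bucket["files"].add(file_url)
        bucket["users"].add(user)
        bucket["bytes"] += file_sizes[file_url]

    return {
        month: {
            "downloads": b["downloads"],
            "files": len(b["files"]),
            "users": len(b["users"]),
            "gb": b["bytes"] / 1024 ** 3,
        }
        for month, b in sorted(buckets.items())
    }
```

Because the inner step keeps one entry per file and user, repeated downloads of the same file by the same user count once, in the month of the most recent fetch.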