Harvesters not terminating #227

Closed
sjbruce opened this issue Jul 11, 2024 · 5 comments

sjbruce commented Jul 11, 2024

CKAN version 1.6.0

Describe the bug
Harvest jobs on fresh installs of CKAN 1.6.0 do not appear to terminate on their own as they did in previous versions.

The current job has been running for well over an hour, although it has inserted all datasets correctly.

However, the process appears to fail before the indexes are updated, as the home page shows a dataset count of zero and no E*Vs are listed as having any datasets attached to them.

The datasets page does show the datasets, E*Vs, responsible organizations, tags, resource types, licenses, and formats.

The map shows dataset extents and the filters appear to be working properly.

Log outputs for the ckan and harvester containers are attached.

Steps to reproduce
Steps to reproduce the behavior:

  • Created a WAF Harvester (harvester config below)
  • Ran Harvester

Expected behavior
The harvester should have run and produced a set of results detailing how many datasets were added, updated, deleted, etc.

Additional details
image

Configuration:

{
  "default_tags": [],
  "default_extras": {
    "encoding": "utf8",
    "h_source_id": "{harvest_source_id}",
    "h_source_url": "{harvest_source_url}",
    "h_source_title": "{harvest_source_title}",
    "h_job_id": "{harvest_job_id}",
    "h_object_id": "{harvest_object_id}"
  },
  "override_extras": false,
  "clean_tags": true,
  "validator_profiles": ["iso19115"],
  "harvest_iso_categories": false
}
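
For readers unfamiliar with the {harvest_source_id}-style values in default_extras: they are placeholders that get filled in for each harvested object. The sketch below assumes the substitution behaves like plain Python string formatting. The helper name fill_extras and the sample IDs are hypothetical, and this is not the extension's actual code.

```python
import json

# A trimmed copy of the configuration above.
CONFIG = json.loads("""
{
  "default_extras": {
    "encoding": "utf8",
    "h_source_id": "{harvest_source_id}",
    "h_job_id": "{harvest_job_id}"
  }
}
""")


def fill_extras(default_extras, **context):
    """Substitute {placeholder} fields in string values (hypothetical helper)."""
    return {
        key: value.format(**context) if isinstance(value, str) else value
        for key, value in default_extras.items()
    }


filled = fill_extras(
    CONFIG["default_extras"],
    harvest_source_id="src-123",  # made-up value for illustration
    harvest_job_id="job-456",     # made-up value for illustration
)
print(filled)
```

Values without placeholders, such as "utf8", pass through unchanged.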

CKAN Container & Harvester Logs:

ckan.log
ckan_harvesters.log

sjbruce (Author) commented Jul 11, 2024

I should note that the harvester configuration above is a direct lift from the harvester configuration of a 1.5.0 deployment of CKAN.

fostermh (Member) commented Jul 11, 2024

Is the ckan_run_harvester container running? Are the cron jobs in this container executing?

You can run the harvester cleanup manually by executing ckan --config=/srv/app/ckan.ini harvester run or by clicking 'stop' in the GUI.

See /contrib/docker/crontab for a list of cron jobs that are run in the ckan_run_harvester container.

It could be related to container permissions. The ckan_run_harvester container must be run as root.
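
As a rough sketch, the manual run quoted above can be wrapped in a small helper that builds the command either for use inside the container or from the host. Running it from the host through docker exec, and the container being named exactly ckan_run_harvester, are assumptions; the helper name is hypothetical.

```python
import shlex


def harvester_run_command(config="/srv/app/ckan.ini", container=None):
    """Build the argv for the manual harvester run quoted above."""
    cmd = ["ckan", f"--config={config}", "harvester", "run"]
    if container:
        # Assumption: the command is issued from the host via `docker exec`
        # and the container name matches the one given here.
        cmd = ["docker", "exec", container] + cmd
    return cmd


inside = harvester_run_command()
from_host = harvester_run_command(container="ckan_run_harvester")
print(shlex.join(inside))
print(shlex.join(from_host))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the container name has been confirmed with docker ps.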

sjbruce (Author) commented Jul 11, 2024

ckan_run_harvester is running but there don't appear to be any cron jobs running or indeed scheduled.

The Dockerfile does have a line to copy the crontab file to the container, and the file is in /srv/app/src/ckan/contrib/docker, but if I look at /etc/crontabs/root it simply lists the instructions to run cron jobs in the /etc/periodic/ sub-directories, all of which are empty.

It doesn't look like the cron jobs are installed.
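
A minimal sketch of the check described above: given the text of a crontab, list the active (uncommented) entries and see whether any mention the harvester. The sample text only imitates a stock root crontab that defers to /etc/periodic; it is illustrative, not copied from the container, and the function names are hypothetical.

```python
def active_cron_lines(crontab_text):
    """Return the non-empty, non-comment lines of a crontab."""
    return [
        line.strip()
        for line in crontab_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]


def harvester_jobs(crontab_text, keyword="harvester"):
    """Return active crontab lines that mention the given keyword."""
    return [line for line in active_cron_lines(crontab_text) if keyword in line]


# Illustrative sample: only run-parts entries, no harvester commands.
sample_root_crontab = """\
# do daily/weekly/monthly maintenance
# min   hour    day     month   weekday command
*/15    *       *       *       *       run-parts /etc/periodic/15min
0       *       *       *       *       run-parts /etc/periodic/hourly
"""

found = harvester_jobs(sample_root_crontab)
print(found)  # → []
```

Inside the container, the same functions could be fed the contents of /etc/crontabs/root; an empty result would match the symptom that no harvester cron jobs are installed.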

Running the command above produces a complaint about "SECRET_KEY", which likely means that part or all of this comes down to not having run the ckan generate config command to grab the appropriate key values, or not having executed the commented-out commands at the top of the .env file.

I note that those commands will fail on Windows/WSL due to some low-level nonsense on that part. I'll work around it and rebuild the containers to see if that makes a difference.

I imagine it'll let the command above run, but I don't expect it'll change anything with the cron jobs themselves.

fostermh (Member) commented

There are a couple of issues here.

Line 20 in ckan-run-harvester-entrypoint.sh should be cat /srv/app/src/ckan/contrib/docker/crontab | crontab -

While CKAN can read its config from environment variables, the command line tools cannot. So in order for all the cron job tasks to work, we need to update the ckan.ini.

Uncomment the following lines in your ckan.ini in the container:

ckan.plugins = envvars
              stats
              text_view
              image_view
              recline_view
              datastore
              datapusher
              scheming_datasets
              scheming_organizations
              scheming_groups
              scheming_nerf_index
              fluent
              harvest
              ckan_harvester
              csw_harvester
              waf_harvester
              doc_harvester
              ckan_schema_harvester
              spatial_metadata
              spatial_query
              spatial_harvest_metadata_api
              cioos_harvest
              cioos_theme
              ckan_cioos_harvester
              dcat
              structured_data
              resource_proxy
              geo_view
              geojson_view
              wmts_view
              ckan_spatial_harvester
              datastream_harvester
              #geonetwork_harvester

#   module-path:file to schemas being used
scheming.dataset_schemas = ckanext.scheming:cioos_siooc_schema.json
scheming.presets = ckanext.scheming:presets.json
                   ckanext.fluent:presets.json
scheming.dataset_fallback = true
scheming.organization_schemas = ckanext.scheming:organization.json
scheming.group_schemas = ckanext.scheming:group.json

It is odd that the fetch and gather containers work while the run container does not...
This config settings issue would also account for odd indexing problems.
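
One way to confirm that the settings listed above are actually uncommented is a quick scan of the ini text for active "key =" lines. This is a minimal sketch: the list of required keys is taken from the snippet above, the function name is hypothetical, and the sample text is made up for illustration.

```python
import re

# Keys taken from the ckan.ini snippet above.
REQUIRED_KEYS = ["ckan.plugins", "scheming.dataset_schemas", "scheming.presets"]


def missing_settings(ini_text, required=REQUIRED_KEYS):
    """Return required keys that have no active (uncommented) 'key =' line."""
    active = set()
    for line in ini_text.splitlines():
        # A commented or indented continuation line will not match here.
        match = re.match(r"^([A-Za-z0-9_.]+)\s*=", line)
        if match:
            active.add(match.group(1))
    return [key for key in required if key not in active]


# Made-up sample in which the plugin list is still commented out.
sample_ini = """\
[app:main]
#ckan.plugins = envvars
#               harvest
scheming.presets = ckanext.scheming:presets.json
"""

missing = missing_settings(sample_ini)
print(missing)  # → ['ckan.plugins', 'scheming.dataset_schemas']
```

In the container, the text would come from /srv/app/ckan.ini, the config path used in the command quoted earlier in the thread.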

fostermh (Member) commented

Note that there appears to be some odd behaviour when updating the frequency of a harvest job. The change shows up in the GUI after hitting save, but the time of the next harvest job run is not adjusted in the database until the next time the job runs. This means that when going from weekly to always frequency, for example, the new setting will not take effect until the next scheduled run, potentially a week away. To update sooner, you will need to run the harvest manually to ensure the database is updated to the new settings.
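
The behaviour described above can be pictured with a toy model. Everything in it is hypothetical (the class, its field names, and the intervals) and it is not the harvest extension's code; it only illustrates a stored next-run time that is recomputed when a job runs rather than when the frequency is saved.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical intervals for the two frequencies mentioned above.
INTERVALS = {"WEEKLY": timedelta(weeks=1), "ALWAYS": timedelta(0)}


@dataclass
class HarvestSource:
    frequency: str
    next_run: datetime

    def save_frequency(self, frequency):
        # Models the reported behaviour: only the setting changes here.
        self.frequency = frequency

    def run(self, now):
        # The next run time is recomputed only when a job actually runs.
        self.next_run = now + INTERVALS[self.frequency]


now = datetime(2024, 7, 11, 12, 0)
source = HarvestSource("WEEKLY", next_run=now + timedelta(weeks=1))

source.save_frequency("ALWAYS")
stale_next_run = source.next_run   # still a week away
source.run(now)                    # a manual run refreshes the schedule
fresh_next_run = source.next_run

print(stale_next_run, fresh_next_run)
```

In this model, the manual run is what brings the stored schedule in line with the new frequency, which matches the workaround suggested above.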

@sjbruce sjbruce closed this as completed Oct 18, 2024