A fix for the harvesting regression/tests introduced in 10836 #10990

Merged
1 commit merged into develop from 10989-harvesting-regression
Nov 1, 2024

Conversation

@landreev (Contributor) commented Oct 30, 2024

What this PR does / why we need it:

In one of the last commits in PR #10836, while addressing feedback from review, I rearranged and tried to clean up some validation and sanitizing code. Unfortunately, that introduced an error when importing harvested datasets (specifically, metadata-poor datasets created from oai_dc records, where sanitizing invalid values or filling in missing required values is usually necessary). This PR fixes the regression.

The tests that are failing in develop have passed in https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/view/change-requests/job/PR-10990/1/.

Which issue(s) this PR closes:

Special notes for your reviewer:

Suggestions on how to test this:

Very straightforward; trying to harvest anything from demo.dataverse.org using oai_dc is going to fail in develop as of now.
This configuration, for example:

server url: https://demo.dataverse.org/oai
set: controlTestSet
metadata format: oai_dc 

There are 7 datasets in the set; all 7 will fail in a develop build; all 7 should succeed with this PR.
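
If you prefer to set this up via the API rather than the UI, here is a rough sketch (the endpoints are the harvesting clients API as I remember it; $SERVER_URL and $API_TOKEN are placeholders, the client nickname and collection alias are made up, and your instance may require additional fields such as archiveUrl):

    # Placeholder values; a superuser API token is required:
    export SERVER_URL=http://localhost:8080
    export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

    # Create the client (nickname "demoTestClient" and collection alias "root" are hypothetical):
    curl -H "X-Dataverse-key:$API_TOKEN" -X POST -H "Content-Type: application/json" \
      "$SERVER_URL/api/harvest/clients/demoTestClient" \
      -d '{"dataverseAlias": "root", "harvestUrl": "https://demo.dataverse.org/oai", "set": "controlTestSet", "metadataFormat": "oai_dc"}'

    # Kick off the harvest:
    curl -H "X-Dataverse-key:$API_TOKEN" -X POST "$SERVER_URL/api/harvest/clients/demoTestClient/run"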

The 2 API tests that are failing in the develop branch:

testHarvestingClientRun_AllowHarvestingMissingCVV_False
testHarvestingClientRun_AllowHarvestingMissingCVV_True

should now be passing.
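
(If it's more convenient, both can be run in a single invocation; this assumes a Maven Surefire version recent enough to support the Class#method1+method2 syntax:)

    mvn test -Dtest=HarvestingClientsIT#testHarvestingClientRun_AllowHarvestingMissingCVV_False+testHarvestingClientRun_AllowHarvestingMissingCVV_True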

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@landreev marked this pull request as ready for review October 31, 2024 00:14
@landreev added the labels Feature: Harvesting and Size: 3 (a percentage of a sprint; 2.1 hours) on Oct 31, 2024
@cmbz added the label FY25 Sprint 9 (2024-10-23 - 2024-11-06) on Oct 31, 2024
@qqmyers (Member) left a comment


Looks good - I see it addresses the bad values which were the source of the test failures.

@ofahimIQSS (Contributor) commented Oct 31, 2024

PR looks good but I'm having trouble passing the 2 API tests.

Here's what was done:

  1. Removed the ./docker-dev-volumes directory
  2. Built the PR locally
  3. Created a collection > added a Harvesting Client with:
    server url: https://demo.dataverse.org/oai
    set: controlTestSet
    metadata format: oai_dc
  4. Ran the harvest and verified that it ran successfully
  5. Ran the API tests: mvn test -Dtest=HarvestingClientsIT#testHarvestingClientRun_AllowHarvestingMissingCVV_False
    mvn test -Dtest=HarvestingClientsIT#testHarvestingClientRun_AllowHarvestingMissingCVV_True

Issue: Both API tests are failing with a similar error:
Note: I also tried running the tests after a fresh build with no data (before creating a client and harvesting any data) and they still failed.

(screenshots of the failing test output attached)

@landreev (Contributor Author)

Note that you do not want to perform steps 3 and 4 before step 5. That will result in the API tests failing, because the tests will try to harvest datasets that have already been harvested and exist locally on your system. (We should probably modify and improve the tests so that they work without the expectation that the datasets do not yet exist in the database; but that would be outside of this PR. They were written to run under Jenkins, where the database is always blank. For now, make sure to remove the harvesting client and the associated datasets before running the API tests.)
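
For example, something like this should work for removing the client via the API (a sketch; "myClientNickname" is whatever nickname you used, and a superuser API token in $API_TOKEN is assumed):

    # Delete the harvesting client; double-check afterwards that its harvested datasets are gone too:
    curl -H "X-Dataverse-key:$API_TOKEN" -X DELETE "$SERVER_URL/api/harvest/clients/myClientNickname"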

But, since you reported that

I also tried running the scripts after a fresh build with no data (Before creating and getting harvested data) and it still failed.

there must be something else going on. Please copy and paste all the console output from these tests, plus any error messages from server.log around the time of failures, plus (probably most importantly) the dedicated log files left from the failed harvesting runs. These will be in the same directory as server.log, and named like harvest_<8 CHARACTER STRING>_<DATE STAMP>.log
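
On a standard non-Docker install that would be something like the following (assuming the default payara domain directory; adjust for your setup):

    ls -lt /usr/local/payara6/glassfish/domains/domain1/logs/harvest_*.log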


@ofahimIQSS (Contributor)

Just adding console log outputs for the API tests.

FYI @sekmiller
Console Log Outputs.docx

@sekmiller (Contributor)

@landreev @ofahimIQSS it's not the harvest that's failing - at least not when I test it locally. What's failing is a search run on "q": "metadataSource:h45c7704" - it's not getting any results when it's expected to return all of the harvested datasets.

@landreev (Contributor Author) commented Nov 1, 2024

@sekmiller @ofahimIQSS - that's exactly what I suspected was taking place. I got it to fail on my local build once too, and it was just that: harvests successful, but no hits from the search engine. Due, I'm assuming, to the (relatively) recent autocommit/softcommit changes in the Solr config, which in practice added delays between indexing something and seeing it in search results.

The tests in question have passed on Jenkins 3 times in a row in the meantime.

@ofahimIQSS Could you post the actual harvesting logs, to confirm for sure that that's what's happening on your local build as well?

It would be great to resolve and merge this asap, seeing how these tests are failing on Jenkins for all PRs.

@landreev (Contributor Author) commented Nov 1, 2024

@ofahimIQSS @sekmiller (I just had another guess, though, as to why it's failing on our local builds and not on Jenkins - one that has nothing to do with soft commits or indexing delays; let me verify.)

@ofahimIQSS (Contributor)

@landreev I looked in /usr/local/payara6/glassfish/domains/domain1/logs/ and couldn't find the harvesting logs, since the script didn't get to the point of adding a client. I'm assuming that's the reason they weren't generated.

@landreev (Contributor Author) commented Nov 1, 2024

@ofahimIQSS

I looked in /usr/local/payara6/glassfish/domains/domain1/logs/ and couldn't find the harvesting logs since the script didn't get up to the point of adding a client. I'm assuming thats the reason it wasn't generated.

No, you definitely went way past creating the clients. According to the error messages, your tests ran the actual harvests, so the harvest logs must be there.
Are you looking in the right place? Was that test run in a Docker instance?

Strictly speaking, we don't need to see your logs to know that the actual harvests succeeded during your run, similar to what @sekmiller saw on his instance, because there are things like this in the console output that you posted:

{
    "status": "OK",
    "data": {
        "nickName": "h45c7704",
        "dataverseAlias": "dvac1a4953",
        "type": "oai",
        "style": "default",
        "harvestUrl": "https://demo.dataverse.org/oai",
        "archiveUrl": "https://demo.dataverse.org",
        "archiveDescription": "This Dataset is harvested from our partners. Clicking the link will take you directly to the archival source of the data.",
        "metadataFormat": "oai_dc",
        "set": "controlTestSet2",
        "schedule": "none",
        "status": "inActive",
        "allowHarvestingMissingCVV": true,
        "lastHarvest": "Fri Nov 01 13:50:43 UTC 2024",
        "lastResult": "SUCCESS",
        "lastSuccessful": "Fri Nov 01 13:50:43 UTC 2024",
        "lastNonEmpty": "Fri Nov 01 13:50:43 UTC 2024",
        "lastDatasetsHarvested": 8,
        "lastDatasetsDeleted": 0,
        "lastDatasetsFailed": 0
    }
}

(the above is the output of /api/harvest/clients/...)
But please do find these logs - it is super important to be able to look at the right server logs when testing virtually anything.

@landreev (Contributor Author) commented Nov 1, 2024

@sekmiller @ofahimIQSS
The actual answer is very simple: these tests require an extra feature flag, and are failing without it.
If you look at https://github.com/gdcc/dataverse-ansible/blob/develop/tasks/dataverse-optional-settings.yml - the configuration Jenkins uses:

- name: set dataverse.feature.index-harvested-metadata-source for harvesting tests
  become: yes
  become_user: "{{ dataverse.payara.user }}"
  shell: '{{ payara_dir }}/bin/asadmin create-jvm-options "-Ddataverse.feature.index-harvested-metadata-source=true"'
  when: dataverse.api.test_suite == true

If you look at the line that Stephen posted earlier:

What's failing is there's a search run on "q": "metadataSource:h45c7704" ...

i.e., the test expects to find these datasets in the search engine under metadataSource = the (randomly-generated) name of the harvesting client. But, without the feature flag/jvm option above, all harvested content is indexed under metadataSource = Harvested.
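
On a local build, the equivalent of what that ansible task does would be something along these lines (a sketch; $PAYARA_DIR and $SERVER_URL are placeholders, the nickname in the search query is the randomly-generated one from this particular run, and Payara needs a restart for the JVM option to take effect):

    # Set the same JVM option Jenkins sets, then restart Payara:
    $PAYARA_DIR/bin/asadmin create-jvm-options "-Ddataverse.feature.index-harvested-metadata-source=true"

    # After re-running the harvest, the datasets should now be findable under the client nickname:
    curl "$SERVER_URL/api/search?q=metadataSource:h45c7704"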

@ofahimIQSS (Contributor) commented Nov 1, 2024

Might be a little late to the party but here are the files requested. I will create a PR to document how to access the Harvesting logs for future reference.

harvest_cleanup_ha6ead1a_2024-11-01T16-37-53.txt
harvest_cleanup_hdbb23ed_2024-11-01T16-38-34.txt
harvest_ha6ead1a_2024-11-01T16-37-53.log
harvest_hdbb23ed_2024-11-01T16-38-34.log
server.log

Edit: #10996 --- Doc on accessing harvesting logs PR

@landreev (Contributor Author) commented Nov 1, 2024

@ofahimIQSS

I will create a PR to document how to access the Harvesting logs for future reference.

That's a great idea!

... and yeah, the log files do confirm what we assumed was taking place - like this line at the end:

<message>Datasets created/updated: 8, datasets deleted: 0, datasets failed: 0</message>

i.e., the harvests were in fact successful, but the results were not indexed as the tests expected...

@sekmiller (Contributor) commented Nov 1, 2024

@landreev do you have another theory beyond the soft commit? Would it make sense to put a pause in before the search?

Sorry - missed the feature flag comment above.

@landreev (Contributor Author) commented Nov 1, 2024

Any chance this could be merged today? - just to have all the Jenkins tests passing again by next week.

@landreev (Contributor Author) commented Nov 1, 2024

(I just realized it was already pretty late :) - can definitely wait if need be)

@ofahimIQSS (Contributor)

I'm happy to report the 2 tests passed after setting the feature flag as mentioned above. Testing complete, merging the PR.
Testing of 10990.docx

@ofahimIQSS merged commit c685fcf into develop on Nov 1, 2024
15 of 16 checks passed
@ofahimIQSS deleted the 10989-harvesting-regression branch November 1, 2024 22:20
@ofahimIQSS removed their assignment Nov 1, 2024
@pdurbin added this to the 6.5 milestone Nov 2, 2024
Labels
Feature: Harvesting; FY25 Sprint 9 (2024-10-23 - 2024-11-06); Size: 3 (a percentage of a sprint; 2.1 hours)
Projects
Status: Done 🧹
Development

Successfully merging this pull request may close these issues.

PR 10836 broke harvesting tests, needs an urgent fix
6 participants