There's something strange going on with hierarchies on Works. #2500
Comments
My initial thought is that the candidate causes for this kind of thing include:
|
in /pipeline/matcher/scripts Looking at
|
Also in
This property is set from the Sierra MARC field 001 on any documents that contain fields of type 774, which can be seen here (VPN required) |
Oddly, in works-merged ( There are no redirectSources in the version in works-merged, so nothing has been merged into it. |
Another possibility in cases like this is that there was a source data problem and, although it has since been resolved, the pipeline has hit a snag and the record has not been updated past a certain stage. (It doesn't look like this is the case here.) The final (indexed) version has the following times: The merged version has the following times in its The identified version So at every stage, it has the same source modified time, which suggests that they are all built from the same version of the source data. Similarly, mergedTime matches between |
Just to be sure, I've checked the DLQs. The merger has some (7) messages on its DLQ, five of which are So this points to a problem somewhere in the relation embedder (including the path concatenator) |
The path concatenator is where that paucity of data mentioned in the first comment is resolved. When faced with a path It's not obvious how this could result in what we are seeing here, but it's worth a look. This has highlighted to me that the path concatenator hampers observability. |
This is interesting. The James Gardiner Collection has suffered the same fate, and in the same position. If you visit the preceding or following sibling, and refresh the page (the hierarchy is not reloaded when you just click on the link), then the Drawings of Childbirth entry in the tree changes to James Gardiner. This makes it look like there's something special about the "parent" record that's making it grab hold of these two collections. |
Using the message injector, I pushed b62y2grn onto id_minter_output. If this does not fix the problem, then there is something awry in the processing of that document itself. If it does, then either there is something wrong with vsyg427x or it was some strange ephemeral thing I may not be able to reproduce. I will eventually do the same with ctu3s9j3 and vsyg427x to get everything back to normal, but I want to check the behaviour one at a time. |
I wonder if something funny happened in the Batcher. See logs All those 332963i/3329263i.nn paths are part of the Rigshospitalet tree. In this log in particular, they are all in an input batch with SAFPA/CB/2/41/43. However, the selectors created look correct: the Batcher correctly identifies three distinct trees and provides the selectors for them. |
b62y2grn has gone all the way through now, and the problem is still there. |
So I tried vsyg427x, with no success, though now the Rigshospitalet content is the one that appears in that slot. I assume that the last one to be processed gets presented there. James Gardiner is still stuck in the tree if you go straight to it. |
Got it! This is an interesting one, so I thought I'd just exclaim in here, then explain in the next comment. |
In the path concatenator, when a record is trying to find its parent or its children, it splits off the head or tail of its own path, then constructs a wildcard query for it. So, given All paths are constructed with this When we have a path with just one segment (i.e. it is the root of its own tree), it still attempts the same query, using the split path. However, the head of the path, as used by the query, is not the whole of the path (because it's looking for the parent of this record), but an empty string. So the parent query looks for (admittedly this has always been inefficient, and it has now proven to be incorrect as well) The Enter So now, any document that is ingested as the root of its own tree is erroneously turned into a child of |
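The failure mode described above can be illustrated with a minimal Python sketch. It is not the path concatenator's code, and the exact query is not shown in this thread. The `*/<head>` pattern form, the function names, and the sample paths (including one with a trailing slash) are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase


def parent_pattern(path):
    """Build the wildcard used to look for a parent (assumed form: '*/<head>')."""
    segments = path.split("/")
    # Assumption: a single-segment path leaves nothing to act as the head,
    # so the head used by the query is an empty string.
    head = segments[0] if len(segments) > 1 else ""
    return "*/" + head


def find_parents(path, all_paths, guard=False):
    """Return paths matching the parent wildcard; `guard` skips root records."""
    if guard and "/" not in path:
        # A root of its own tree has no parent, so no query is needed.
        return []
    pattern = parent_pattern(path)
    return [p for p in all_paths if p != path and fnmatchcase(p, pattern)]


# Hypothetical data: "X/Y/" stands in for any path that a bare '*/' can match.
paths = ["A/B", "B/C", "X/Y/", "ROOT"]

normal = find_parents("B/C", paths)                    # pattern is '*/B'
buggy = find_parents("ROOT", paths)                    # pattern degenerates to '*/'
guarded = find_parents("ROOT", paths, guard=True)      # no query issued
```

With these assumed inputs, the unguarded lookup for `ROOT` picks up `X/Y/` as a parent, while the guarded version returns nothing. Skipping the parent query for single-segment paths would also address the inefficiency noted in the comment.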
This will need some reindexing to fully fix, but if we're lucky, then it has only hit the ones mentioned above, so I can do a targeted reindex of just those. I'll do some analysis today to see how widespread it is. |
There are 48 records that have been affected by this, so it's simple enough to just push those through when we're ready.
|
How did you find those records? |
By running the query in that comment on works-merged-2023-11-09 in pipeline-2023-11-09 (using the developer console in ES). I found hzg8xkez earlier in the thread with a similar query containing Then I turned it into a list of ids for use in the message injector mentioned here
I think I will also have to reload hzg8xkez, to get it back into its rightful position, and purge whichever cuckoo is sitting in its place under vsyg427x at the time. (so it's actually 49) |
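The step of turning the query results into a list of ids could look something like the sketch below. It assumes the standard Elasticsearch search response shape (`hits.hits[]._id`), that the document `_id` is the work id, and that the message injector accepts one id per line. None of those details are confirmed in this thread, and the sample response is made up from ids mentioned above.

```python
def ids_from_search_response(response):
    """Collect the _id of every hit in an Elasticsearch search response body."""
    return [hit["_id"] for hit in response["hits"]["hits"]]


# Hypothetical response, trimmed to the fields this sketch reads.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "b62y2grn"},
            {"_id": "ctu3s9j3"},
        ]
    }
}

ids = ids_from_search_response(sample_response)

# Assumption: the injector takes newline-separated ids.
id_lines = "\n".join(ids)
```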
This is all fixed and working. However, you may still see the wrong tree due to caching. I have checked with a different browser, and both https://wellcomecollection.org/works/b62y2grn and https://wellcomecollection.org/works/ctu3s9j3 look OK. https://wellcomecollection.org/works/pss8y4yn shows that the James Gardiner Collection is back to normal and https://wellcomecollection.org/works/d6zfhy2b shows that the Rigshospitalet collection is back to normal, even if the cached version appears not to be. |
https://wellcome.slack.com/archives/C8X9YKM5X/p1702638185461429
https://wellcomecollection.org/works/b62y2grn is erroneously appearing as a child of https://wellcomecollection.org/works/vsyg427x within the FPA archive.