Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add KGL Pubfactory journals #3009

Merged
merged 14 commits into from
Jun 4, 2024

Conversation

brendan-oconnell
Copy link
Contributor

KGL PubFactory is a hosting platform for over 1300 academic journals. While individual PubFactory journal articles could previously be saved to Zotero using the Embedded Metadata translator, this new translator addresses a couple of issues:

@zoe-translates
Copy link
Collaborator

Hello Brendan :)

I added a few comments mostly about extracting stuff using selectors.

In addition, I think the regex in the translator metadata is a bit too complex to understand and debug (personally I had to rely on a visualizer), and it may miss some "multiple" links.

My observation is that with Pubfactory, the journal-article links all look like this:

domain; a path containing "view/journals/"; followed by intervening paths depending on the content type; followed by some path containing "xml"

This would translate to the RE (not valid JS syntax, but something to be put in Scaffold)

^https://[^/]+/view/journals/.+\.xml\b

which should capture both individual-article, issue-view, and the "overview" URLs.

But there's a catch with URLs like this:

https://journals.ametsoc.org/search?q1=climate+change

or https://journals.ametsoc.org/search?access=all&f_0=title&f_1=date&o_1=AND&q_0=climate+change&q_1=%257BFROM_DATE%253D%253D2022%257D&type_0=journalarticle (an advanced search).

For those I'm tempted to use the following (again for Scaffold)

^https://[^.]*journals\.[^/]+/search\b

This would catch domains like avmajournals.avma.org or journals.ametsoc.org from the test cases. I'm using a domain pattern that I'm not very sure of to restrict the matches, otherwise the only distinguishing factor would be the path "/search" which is too broad.

So the end result would be

^https://([^/]+/view/journals/.+\.xml|[^.]*journals\.[^/]+/search)\b

If my caution about the the "domain pattern" should be unnecessary, the regex could be further simplified to

^https://[^.]*journals\.[^/]+/(view/journals/.+\.xml|search)\b

How do you think?

@brendan-oconnell
Copy link
Contributor Author

brendan-oconnell commented Jul 6, 2023

Thanks for the extremely detailed and helpful comments! I've incorporated everything, including the new regex - I agree that my regex was overly complex and hard to read. Will submit a new commit soon, still working on another issue that turned up while I was doing testing.

@zoe-translates
Copy link
Collaborator

It appears that some (all?) Brill journals are using KGL Pubfactory too. Example: https://brill.com/view/journals/tpao/109/1-2/article-p1_1.xml Another candidate for migration.

@brendan-oconnell
Copy link
Contributor Author

Thanks for spotting this! Yes, at a glance all the Brill journals I looked at seem to use PubFactory as a platform. It doesn't appear that there's any migration necessary - I have the PubFactory translator already running in my browser extension for testing, and the PubFactory factory regex appears to be the top match for these Brill journal URLs, so it's defaulting to using the PubFactory translator anyway.

@adam3smith
Copy link
Collaborator

adam3smith commented Jul 17, 2023

Thanks for spotting this! Yes, at a glance all the Brill journals I looked at seem to use PubFactory as a platform. It doesn't appear that there's any migration necessary - I have the PubFactory translator already running in my browser extension for testing, and the PubFactory factory regex appears to be the top match for these Brill journal URL

  • I think this is fairly unstable, since the Brill translator also matches/still works and both have the same priority. I thought this worked alphabetical in that case, but could be by last modified data or purely random: in any case, we definitely don't want a case where two translators work on the same URL with the same priority
  • As a general rule, the priority of translators for CMS (like Atypon, KGL, Silverchair) that don't include a specific domain should be 200 to make it easier to override them if needed
  • In this case, if the performance of this translator is as good as the existing Brill one, we'd want to include 2-3 Brill tests (1-2 multiple & individual) and then delete the Brill translator.

@zoe-translates
Copy link
Collaborator

Oh, I should be more specific. I think the Brill journals use KGL Pubfactory, but I'm not sure about the books and reference works.

@brendan-oconnell
Copy link
Contributor Author

@zoe-translates yes, exactly - I was just looking at the journals. Brill publishes a ton of books, relatively few journals. We definitely still need the Brill translator for their books.

@adam3smith
Copy link
Collaborator

In that case, maybe just leaving the Brill translator handle all Brill content (which will happen once KGL is set to priority 200, I think) is the right move

@brendan-oconnell
Copy link
Contributor Author

OK, I've set KGL PubFactory translator as priority 200, so that the Brill translator handles all Brill content, as you suggested. Thanks!

@brendan-oconnell
Copy link
Contributor Author

There's another issue with the translator that I haven't been able to resolve: tests are running successfully in Scaffold and returning multiple, but when I test in the browser, the Connector is only saving basic metadata for multiple, i.e. Title and Abstract. This translator mostly relies on EM, so I looked into what's happening with the EM translator, and it seems that the Connector isn't able to read anything from the page's meta tags, and so is falling back on function addLowQualityMetadata.

I've found a couple possible explanations for this, but haven't figured out how to resolve the issue. One is simply extremely slow page load times: from the Network monitor in Firefox, I'm getting DOMContentLoaded: 3.10 s load: 3.90 s for an individual article page, e.g. https://journals.ametsoc.org/view/journals/amsm/59/1/amsmonographs-d-18-0011.1.xml. So it's possible that when running doWeb for multiple, the page hasn't fully loaded, so requestDocument(url) is only returning some parts of the page, causing EM to fall back on function addLowQualityMetadata. Solutions to this problem are discussed here (https://groups.google.com/g/zotero-dev/c/yIVWbu0xNlQ/m/THjnupckAwAJ) and here (https://groups.google.com/g/zotero-dev/c/EnC5teZkSP8/m/dmygGUMIBQAJ), with gist being "Use the dev tools to figure out how the site is generating the content dynamically and then do the same thing in your code."

I've tried doing this, but can't find an API request or JSON that the page is being generated from when I look at Network traffic.

Another possibility: I found this thread from you, @adam3smith, on a problem with the AMS translator, which no longer seems to exist: https://groups.google.com/g/zotero-dev/c/2gCD5-mM_Cg/m/tbJH6siaUWkJ. If we're indeed talking about the same AMS here, it appears that this could be the same issue with the content-type meta tag being improperly placed in the page. In that case, it would appear that there isn't a solution for this issue.

@zoe-translates
Copy link
Collaborator

zoe-translates commented Jul 18, 2023

I'm afraid that there's little you can do to give it a fix here, because the page you cited as an example wasn't valid HTML. Even if Firefox could "fix" it and display the page as if it were valid, its implementation of the JS API couldn't parse it in the intended way (although in this case there's no "correct" solution). And this is not just with Firefox -- neither Chromium nor Safari could do it.

In the EM translator, you have

	var metaTags = doc.head.getElementsByTagName("meta");
	Z.debug("Embedded Metadata: found " + metaTags.length + " meta tags.");
	if (forceLoadRDF /* check if this is called from doWeb */ && !metaTags.length) {
		if (doc.head) {
			Z.debug(doc.head.innerHTML
				.replace(/<style[^<]+(?:<\/style>|\/>)/ig, '')
				.replace(/<link[^>]+>/ig, '')
				.replace(/(?:\s*[\r\n]\s*)+/g, '\n')
			);
		}
		else {
			Z.debug("Embedded Metadata: No head tag");
		}
	}

Here the head property is on the HTMLDocument object obtained by parsing the fetched document into DOM. Because the input is malformed, the <meta> elements didn't go under head but body, so doc.head.getElementsByTagName("meta") doesn't get you anything (whereas doc.body.getElementsByTagName("meta") would have).

@brendan-oconnell
Copy link
Contributor Author

Thanks for this explanation! When I look at the Inspector on the fully loaded page in the browser (https://journals.ametsoc.org/view/journals/amsm/59/1/amsmonographs-d-18-0011.1.xml), it appears that the <meta> tags are actually (correctly) in the head, and if I run var metaTags = document.head.getElementsByTagName("meta"); in the Console, it does return the correct meta elements. And I should have mentioned earlier - the translator, and by extension EM, works fine for single items. So do you know why this is only an issue for multiple?

@zoe-translates
Copy link
Collaborator

The Inspector shows a live version of the DOM, which isn't just the result of parsing the document source. First the browser will try to parse the source, and then apply sanitization, to "fix" the errors it can fix. And on top of that, the document may be mutated by JS code executed by the browser.

But the parser features exposed by the JS API does just that, the parsing. There's no browser, and therefore, there's no extra processing (sanitization of anomalies in the source), and no execution of the document's JS code after that.

So when translation is performed on the live document, the translator code has access to the live DOM via the browser. But when it must fetch some additional source and parse it using JS API (via Zotero functions like requestDocument), it will be looking at something quite different -- without the processing that turns the "raw" DOM into the "browserable" one.

The latter is what happens when the code processes multiple items.

One workaround is to apply some "fixing" by ourselves in the translator. For example:

function scrape(doc, url) {
// move meta tags to head for EM
// this is a fix for some malformed HTML
for (let meta of doc.body.querySelectorAll('meta')) {
doc.head.appendChild(meta);
}

Here the code is dealing with exactly the same problem (malformed input -> <meta> that should have been in <head> ends up in <body>).

But this must be done carefully, if ever, lest it mutates the DOM of a live page.

(This hack is no longer necessary for the aboveshown translator because the sources have been fixed.)

Comment on lines +139 to +140
let em = await translator.getTranslatorObject();
em.itemType = 'journalArticle';
Copy link
Collaborator

@zoe-translates zoe-translates Jul 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a PR, introducing a flag to Embedded Metadata.js that tells it to look for the <meta> tags in the body in addition to the head: PR #3083.

If that one gets merged, here you could add

if (doc.querySelector("body > meta")) {
	em.searchForMetaTagsInBody = true;
}

and that would be all you need to do in order to work around the bad HTML.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this extremely clear explanation! This all makes sense now. So I'll wait to make a new PR with searchForMetaTagsInBody, once your PR is merged.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're welcome <3 that PR is an "if" though. If it gets merged, you could do that. But if not, there's probably not going to be too much ill consequence with the "copy body meta tags to head" hack, if you restrict the hack to the scraping of multiples and leave the live-scraping of a doc unchanged (i.e. relying on the browser's sanitization).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Merged now!)

@brendan-oconnell
Copy link
Contributor Author

Hey, thanks for clarifying that - given that this translator doesn't address an urgent issue, I don't mind letting it sit for awhile and see if your PR gets merged. Otherwise, I'll try implementing the more hack-y solution.

PubFactory Journals.js Outdated Show resolved Hide resolved
PubFactory Journals.js Outdated Show resolved Hide resolved
@AbeJellinek AbeJellinek merged commit b947ce6 into zotero:master Jun 4, 2024
1 check failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants