Add KGL Pubfactory journals #3009

brendan-oconnell · 2023-03-29T14:10:05Z

KGL PubFactory is a hosting platform for over 1300 academic journals. While individual PubFactory journal articles could previously be saved to Zotero using the Embedded Metadata translator, this new translator addresses a couple of issues:

supports multiple for journal issue landing pages, e.g. https://journals.ashs.org/hortsci/view/journals/hortsci/58/5/hortsci.58.issue-5.xml
removes word "Abstract" from abstractNote for journals that include the word Abstract in the Abstract field, e.g. https://journals.ametsoc.org/view/journals/atsc/79/12/JAS-D-22-0008.1.xml. This addresses American Meteorological Society: No longer uses Atypon #2997

PubFactory Journals.js

zoe-translates · 2023-07-04T05:47:16Z

Hello Brendan :)

I added a few comments mostly about extracting stuff using selectors.

In addition, I think the regex in the translator metadata is a bit too complex to understand and debug (personally I had to rely on a visualizer), and it may miss some "multiple" links.

My observation is that with Pubfactory, the journal-article links all look like this:

domain; a path containing "view/journals/"; followed by intervening paths depending on the content type; followed by some path containing "xml"

This would translate to the RE (not valid JS syntax, but something to be put in Scaffold)

^https://[^/]+/view/journals/.+\.xml\b

which should capture both individual-article, issue-view, and the "overview" URLs.

But there's a catch with URLs like this:

https://journals.ametsoc.org/search?q1=climate+change

or https://journals.ametsoc.org/search?access=all&f_0=title&f_1=date&o_1=AND&q_0=climate+change&q_1=%257BFROM_DATE%253D%253D2022%257D&type_0=journalarticle (an advanced search).

For those I'm tempted to use the following (again for Scaffold)

^https://[^.]*journals\.[^/]+/search\b

This would catch domains like avmajournals.avma.org or journals.ametsoc.org from the test cases. I'm using a domain pattern that I'm not very sure of to restrict the matches, otherwise the only distinguishing factor would be the path "/search" which is too broad.

So the end result would be

^https://([^/]+/view/journals/.+\.xml|[^.]*journals\.[^/]+/search)\b

If my caution about the the "domain pattern" should be unnecessary, the regex could be further simplified to

^https://[^.]*journals\.[^/]+/(view/journals/.+\.xml|search)\b

How do you think?

PubFactory Journals.js

brendan-oconnell · 2023-07-06T14:04:31Z

Thanks for the extremely detailed and helpful comments! I've incorporated everything, including the new regex - I agree that my regex was overly complex and hard to read. Will submit a new commit soon, still working on another issue that turned up while I was doing testing.

…leaner code

zoe-translates · 2023-07-12T14:04:09Z

It appears that some (all?) Brill journals are using KGL Pubfactory too. Example: https://brill.com/view/journals/tpao/109/1-2/article-p1_1.xml Another candidate for migration.

brendan-oconnell · 2023-07-17T12:28:17Z

Thanks for spotting this! Yes, at a glance all the Brill journals I looked at seem to use PubFactory as a platform. It doesn't appear that there's any migration necessary - I have the PubFactory translator already running in my browser extension for testing, and the PubFactory factory regex appears to be the top match for these Brill journal URLs, so it's defaulting to using the PubFactory translator anyway.

adam3smith · 2023-07-17T12:52:30Z

Thanks for spotting this! Yes, at a glance all the Brill journals I looked at seem to use PubFactory as a platform. It doesn't appear that there's any migration necessary - I have the PubFactory translator already running in my browser extension for testing, and the PubFactory factory regex appears to be the top match for these Brill journal URL

I think this is fairly unstable, since the Brill translator also matches/still works and both have the same priority. I thought this worked alphabetical in that case, but could be by last modified data or purely random: in any case, we definitely don't want a case where two translators work on the same URL with the same priority
As a general rule, the priority of translators for CMS (like Atypon, KGL, Silverchair) that don't include a specific domain should be 200 to make it easier to override them if needed
In this case, if the performance of this translator is as good as the existing Brill one, we'd want to include 2-3 Brill tests (1-2 multiple & individual) and then delete the Brill translator.

zoe-translates · 2023-07-17T13:36:58Z

Oh, I should be more specific. I think the Brill journals use KGL Pubfactory, but I'm not sure about the books and reference works.

brendan-oconnell · 2023-07-17T13:48:18Z

@zoe-translates yes, exactly - I was just looking at the journals. Brill publishes a ton of books, relatively few journals. We definitely still need the Brill translator for their books.

adam3smith · 2023-07-17T13:51:40Z

In that case, maybe just leaving the Brill translator handle all Brill content (which will happen once KGL is set to priority 200, I think) is the right move

brendan-oconnell · 2023-07-18T08:27:42Z

OK, I've set KGL PubFactory translator as priority 200, so that the Brill translator handles all Brill content, as you suggested. Thanks!

brendan-oconnell · 2023-07-18T08:41:37Z

There's another issue with the translator that I haven't been able to resolve: tests are running successfully in Scaffold and returning multiple, but when I test in the browser, the Connector is only saving basic metadata for multiple, i.e. Title and Abstract. This translator mostly relies on EM, so I looked into what's happening with the EM translator, and it seems that the Connector isn't able to read anything from the page's meta tags, and so is falling back on function addLowQualityMetadata.

I've found a couple possible explanations for this, but haven't figured out how to resolve the issue. One is simply extremely slow page load times: from the Network monitor in Firefox, I'm getting DOMContentLoaded: 3.10 s load: 3.90 s for an individual article page, e.g. https://journals.ametsoc.org/view/journals/amsm/59/1/amsmonographs-d-18-0011.1.xml. So it's possible that when running doWeb for multiple, the page hasn't fully loaded, so requestDocument(url) is only returning some parts of the page, causing EM to fall back on function addLowQualityMetadata. Solutions to this problem are discussed here (https://groups.google.com/g/zotero-dev/c/yIVWbu0xNlQ/m/THjnupckAwAJ) and here (https://groups.google.com/g/zotero-dev/c/EnC5teZkSP8/m/dmygGUMIBQAJ), with gist being "Use the dev tools to figure out how the site is generating the content dynamically and then do the same thing in your code."

I've tried doing this, but can't find an API request or JSON that the page is being generated from when I look at Network traffic.

Another possibility: I found this thread from you, @adam3smith, on a problem with the AMS translator, which no longer seems to exist: https://groups.google.com/g/zotero-dev/c/2gCD5-mM_Cg/m/tbJH6siaUWkJ. If we're indeed talking about the same AMS here, it appears that this could be the same issue with the content-type meta tag being improperly placed in the page. In that case, it would appear that there isn't a solution for this issue.

zoe-translates · 2023-07-18T11:08:54Z

I'm afraid that there's little you can do to give it a fix here, because the page you cited as an example wasn't valid HTML. Even if Firefox could "fix" it and display the page as if it were valid, its implementation of the JS API couldn't parse it in the intended way (although in this case there's no "correct" solution). And this is not just with Firefox -- neither Chromium nor Safari could do it.

In the EM translator, you have

	var metaTags = doc.head.getElementsByTagName("meta");
	Z.debug("Embedded Metadata: found " + metaTags.length + " meta tags.");
	if (forceLoadRDF /* check if this is called from doWeb */ && !metaTags.length) {
		if (doc.head) {
			Z.debug(doc.head.innerHTML
				.replace(/<style[^<]+(?:<\/style>|\/>)/ig, '')
				.replace(/<link[^>]+>/ig, '')
				.replace(/(?:\s*[\r\n]\s*)+/g, '\n')
			);
		}
		else {
			Z.debug("Embedded Metadata: No head tag");
		}
	}

Here the head property is on the HTMLDocument object obtained by parsing the fetched document into DOM. Because the input is malformed, the <meta> elements didn't go under head but body, so doc.head.getElementsByTagName("meta") doesn't get you anything (whereas doc.body.getElementsByTagName("meta") would have).

brendan-oconnell · 2023-07-19T14:17:59Z

Thanks for this explanation! When I look at the Inspector on the fully loaded page in the browser (https://journals.ametsoc.org/view/journals/amsm/59/1/amsmonographs-d-18-0011.1.xml), it appears that the <meta> tags are actually (correctly) in the head, and if I run var metaTags = document.head.getElementsByTagName("meta"); in the Console, it does return the correct meta elements. And I should have mentioned earlier - the translator, and by extension EM, works fine for single items. So do you know why this is only an issue for multiple?

zoe-translates · 2023-07-20T00:37:09Z

The Inspector shows a live version of the DOM, which isn't just the result of parsing the document source. First the browser will try to parse the source, and then apply sanitization, to "fix" the errors it can fix. And on top of that, the document may be mutated by JS code executed by the browser.

But the parser features exposed by the JS API does just that, the parsing. There's no browser, and therefore, there's no extra processing (sanitization of anomalies in the source), and no execution of the document's JS code after that.

So when translation is performed on the live document, the translator code has access to the live DOM via the browser. But when it must fetch some additional source and parse it using JS API (via Zotero functions like requestDocument), it will be looking at something quite different -- without the processing that turns the "raw" DOM into the "browserable" one.

The latter is what happens when the code processes multiple items.

One workaround is to apply some "fixing" by ourselves in the translator. For example:

translators/Philosopher's Imprint.js

Lines 82 to 87 in cc7c3c7

    
           function scrape(doc, url) { 
        
           	// move meta tags to head for EM 
        
           	// this is a fix for some malformed HTML 
        
           	for (let meta of doc.body.querySelectorAll('meta')) { 
        
           		doc.head.appendChild(meta); 
        
           	}

Here the code is dealing with exactly the same problem (malformed input -> <meta> that should have been in <head> ends up in <body>).

But this must be done carefully, if ever, lest it mutates the DOM of a live page.

(This hack is no longer necessary for the aboveshown translator because the sources have been fixed.)

zoe-translates · 2023-07-20T05:40:24Z

PubFactory Journals.js

+	let em = await translator.getTranslatorObject();
+	em.itemType = 'journalArticle';


I made a PR, introducing a flag to Embedded Metadata.js that tells it to look for the <meta> tags in the body in addition to the head: PR #3083.

If that one gets merged, here you could add

if (doc.querySelector("body > meta")) { em.searchForMetaTagsInBody = true; }

and that would be all you need to do in order to work around the bad HTML.

Thanks for this extremely clear explanation! This all makes sense now. So I'll wait to make a new PR with searchForMetaTagsInBody, once your PR is merged.

You're welcome <3 that PR is an "if" though. If it gets merged, you could do that. But if not, there's probably not going to be too much ill consequence with the "copy body meta tags to head" hack, if you restrict the hack to the scraping of multiples and leave the live-scraping of a doc unchanged (i.e. relying on the browser's sanitization).

(Merged now!)

brendan-oconnell · 2023-07-20T15:07:44Z

Hey, thanks for clarifying that - given that this translator doesn't address an urgent issue, I don't mind letting it sit for awhile and see if your PR gets merged. Otherwise, I'll try implementing the more hack-y solution.

PubFactory Journals.js

And update tests.

brendan-oconnell added 6 commits March 21, 2023 16:41

First commit - added detectWeb and EM import translator

983694b

Added comments

3796153

added selector for multiple

856d3e3

Added todo for add attached PDFs

f2dfbcf

Added PDF saving, test cases, removed Abstract from abstract field

0dfaa93

Deleted out-of-date AMS translator file

1e988ba

zoe-translates reviewed Jul 4, 2023

View reviewed changes