Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate/integrate FW #1277

Open
dstillman opened this issue Mar 16, 2017 · 32 comments
Open

Deprecate/integrate FW #1277

dstillman opened this issue Mar 16, 2017 · 32 comments

Comments

@dstillman
Copy link
Member

@owcz has been writing lots of new translators using FW. That's great (thanks!), but before we go too much further with that, I wanted to get @adam3smith's and @zuphilip's thoughts on FW. I know @simonster was never really a fan of it, both because it's limited in what it can do and because he thought that for people who knew basic JS it might actually be harder than writing a translator without it, but you guys have worked with translators more recently, so we'd be interested in your perspective. If we do think there are benefits, though, I think we should consider finding a way to achieve those in the main architecture rather than having this large chunk of code duplicated into lots of different translators.

@adam3smith
Copy link
Collaborator

adam3smith commented Mar 16, 2017

@owcz is very insistent that FW translators are easier to maintain/write. That certainly used to be the case -- my first 10 translators were all FW -- but I'm curious (actually) why they still think that's the case.

With the ZU.xpathText function as well as @zuphilip 's Common code blocks, I find regular translators just as quick, including for scrapers, and they don't have any of the limitations (and ideosyncracies, which I assume Simon disliked), so I could certainly live without FW easily (and I don't use it anymore).

In my view, we could just put up a regular translator skeleton for scrapes (along the line of what Philipp already has for other types) that emulates FW with regular js code. Using a good template also has the advantage of increasing readability/maintainability. I find the translators that follow Philipp's codeblocks particularly easy to work with.

Philipp will chime in himself, I'm sure, but I know he's never been a fan of FW.

@owcz
Copy link
Contributor

owcz commented Mar 16, 2017

I agree with the principle of FW: removing boilerplate and making the translator process more approachable to those who may be put off by curly braces. I wouldn't say I'm insistent about it—I certainly see its limitations—but in my estimation, it would be easier to train (and assist/review) contributors to focus on hunting XPaths for FW translators than it would be to familiarize each with JS and the Zotero libraries (especially without more attention to their documentation).

In particular, Wikipedia will soon rely on Zotero's translators for handling automatic reference parsing in its new wikicode-less user interface. Accordingly, while the major academic sources are handled quite well, there will be much stress over all kinds of popular sites with non-compliant metadata complicating an otherwise large leap in productivity. If FW didn't exist, I certainly would have wanted someone to create it.

@zuphilip
Copy link
Contributor

Sebastian told quite accurate my views on FW. To be fair, I have also to say, that I didn't spend much time to learn the FW really (e.g. the after-hooks I just have seen some days ago). So, personally I will not create any new FW-translators, and moreover I mostly also skip reviewing them or updating them. Also to keep the framework up-to-date, we have to change it in each translator, which seems also not ideal.

In general I prefer good documentation to over any mystery-functions (and conventions by name). Maybe we should start to update the documentation for translator coding or write a new one (and I would prefer to do that in GitHub directly). My common code blocks for translator coding could maybe expanded to a general scraping routine with XPath imitating what FW does (BTW we could also use the wiki here in the official repo, if you want to activate that).

The current PR by @owcz share another thing which I am a little sceptical: They mostly deal with blogs or magazines. For example this blogpost http://www.digitalspy.com/gaming/xbox-one/news/a661615/rare-replay-review-roundup-one-of-the-best-collections-in-gaming-history/ seems to give already quite good data with the Embedded Metadata (EM) translator. Is EM now working and activated with Citoid? (I remember there were some problems...) Moreover, it looks that the data extraction can be improved by supporting JSON-LD and/or schema.org. I would rather prefer to minimize the number of special translators for blogs and try to improve the general translators like EM to support the majority of blogs better.

@owcz
Copy link
Contributor

owcz commented Mar 16, 2017

Citoid+EM is blocked, I believe, by zotero/translation-server/issues/31 (T140539 at WM), but I imagine the goal is to get it up eventually. None of my recent submissions fit fully under EM. Some mostly work with EM apart for one or two key fields (usually author or date) and thus need separate translators. (I went over their XPaths with Sebastian some time ago and he found those not standard enough to merit integrating into EM.) Digital Spy, for instance, is mostly fine with EM, apart for where it butchers its creator field (and, of course, no multi). Some of the translators would be resolved with Schema, but I was under the impression that was stalled for the time being (#366). I'm a fan of using the GitHub wiki too, if there is reciprocated interest.

@dstillman
Copy link
Member Author

dstillman commented Mar 16, 2017

Interesting question from @fcheslack: any reason we're still preferring XPaths in scrapers to CSS selectors (via querySelector and querySelectorAll), which are now universally supported? Obviously using embedded metadata and import translators is better, but when that's not possible, seems like query selectors would make scraper development even easier — and another good reason to deprecate the FW.

@adam3smith
Copy link
Collaborator

No reason I'm aware of. This should work already, right? So we could just re-write a simple translator to see how it looks?

@dstillman
Copy link
Member Author

Yup.

@zuphilip
Copy link
Contributor

How can we then write ZU.xpathText as an nice one liner? E.g.

item.abstractNote = ZU.xpathText(doc, '//div[@class="abstract"]');

(We have AFAIK a lot of lines looking like this.)

@dstillman
Copy link
Member Author

dstillman commented Mar 16, 2017

item.abstractNote = doc.querySelector('div.abstract').textContent;

(It's also possible we could have some magic that automatically grabbed textContent from a passed DOM element, but that might be a bad idea.)

@dstillman
Copy link
Member Author

(And the div isn't required, of course — just .abstract would be fine, and fast, because class selectors are highly efficient.)

@zuphilip
Copy link
Contributor

Yes, that is also what I came up. But often these elements are not present on every page. Because querySelector is returning null when there is no matching element, this code can give an error. We would need to check first that the output is not null and only then can access the textContent, i.e. more variables, more lines...

@dstillman
Copy link
Member Author

True. So then magic textContent extraction from a passed DOM element might not be a bad idea. Or could add a helper function to make it more explicit. In any case, definitely not an argument against switching to CSS selectors.

@dstillman
Copy link
Member Author

dstillman commented Mar 16, 2017

Magic extraction would also be helpful if we ever do things that require querySelectorAll, so you could do something like item.abstractNote = doc.querySelectorAll('#foo .bar')[1]; — a helper function like item.abstractNote = ZU.selectorText(doc, '#foo .bar'); wouldn't work in that case, unless it also took an optional index. (Using an index that way seems pretty brittle, though, so maybe this isn't an issue.)

For a helper function, we might be able to add a function into translator scope with doc automatically passed, so you could just do something like item.abstractNote = text('.abstract');.

@dstillman
Copy link
Member Author

Actually, now that I write that out, text('.abstract') and text('#foo .bar', 1) seem pretty nice.

@egh
Copy link

egh commented May 21, 2017

Let me know if you need anything from me. I'm sure there are some major missing conceptual elements in the framework, the code is not well documented or commented, etc., so I am glad to help if I can.

@adam3smith
Copy link
Collaborator

@egh -- I think the main thinking is that Zotero helper functions have gotten much better (in particular ZU.xpath and ZU.xpathText), that site metadata across the internet has gotten so much better that it makes sense to call the embedded metadata translator almost everywhere as a basis for translators, and that as Dan points out FW currently causes massive code duplication (the same translator with and without FW is about 25 vs. 5 kb).
This is really nothing against FW -- it's a neat tool and I don't think I'd have started working on translators without it. But I do think that we should be able to achieve its main benefits by offering a couple of well-commented templates at this point.

@egh
Copy link

egh commented May 30, 2017

@adam3smith I wasn't really clear - I totally understand the point of phasing it out, and that it doesn't fit it that well with the zotero ecosystem. But if you do decide otherwise, let me know if there's anything I can do to help update or otherwise integrate

It makes me happy to know that I was able to help get you involved in translators! Thanks.

@owcz
Copy link
Contributor

owcz commented Jul 8, 2017

Is there an estimate on that text() function? Getting ready to write some translators but would rather give the work some longevity by using the new methods

@adam3smith
Copy link
Collaborator

I don't see ZU.xpath and ZU.xpathText getting deprecated so you can just use those. Follow https://github.com/zuphilip/translators/wiki/Common-code-blocks-for-translators for best practices and easy review.
I have on my agenda to more fully spell these out as templates so they can actually replace FW but no ETA.

I find xpaths very intuitive by now, so adding the utility function using CSS selectors isn't high priority for me; not sure what Dan's thinking is on this.

@owcz
Copy link
Contributor

owcz commented Jul 8, 2017

I can handle writing without FW but I also teach others, so it's in my interest to simplify the writing process.

One of FW's benefits was in multiples:

FW.MultiScraper({
itemType         : 'multiple',
detect           : FW.Url().match(/\/news\//), // news page
choices          : {
  titles  :  FW.Xpath('//div[@class="title"]/h2/a').text().trim(),
  urls    :  FW.Xpath('//div[@class="title"]/h2/a').key("href").trim()
  }
});

which is much easier to visually parse than the common code block:

> function getSearchResults(doc, checkOnly) {
	var items = {};
	var found = false;
	//TODO: adjust the xpath
	var rows = ZU.xpath(doc, '//a[contains(@href, "/article/")]');
	for (var i=0; i<rows.length; i++) {
		//TODO: check and maybe adjust
		var href = rows[i].href;
		//TODO: check and maybe adjust
		var title = ZU.trimInternal(rows[i].textContent);
		if (!href || !title) continue;
		if (checkOnly) return true;
		found = true;
		items[href] = title;
	}
	return found ? items : false;
}


function doWeb(doc, url) {
	if (detectWeb(doc, url) == "multiple") {
>		Zotero.selectItems(getSearchResults(doc, false), function (items) {
>			if (!items) {
>				return true;
>			}
>			var articles = [];
>			for (var i in items) {
>				articles.push(i);
>			}
>			ZU.processDocuments(articles, scrape);
		});
	} else {
		scrape(doc, url);
	}
}

especially given that the XPath/selector (to point to the multi's URLs) are almost always the only part of the multi that changes, would it be sensible/possible to break these two loops into two default subroutines/functions as well?

@dstillman
Copy link
Member Author

dstillman commented Jul 8, 2017

I can work on this. I think CSS selectors and DOM methods are much easier and more familiar to people, and they can be generated using the developer tools built into the browser. I don't think new translators should use XPaths (except for XML documents).

Any new functions will need to be included as polyfills for now, since they won't exist in 4.0. One catch is that, as polyfills, they won't have doc automatically bound, so they'll need to take it as an argument. But the real functions can easily be made to work with either signature, so we could start writing these with doc (which actually makes porting from ZU.xpathText() more straightforward) and then remove it once the built-in functions exist.

I've just spent a few minutes playing around with this with various translators, and it seems most things can be done with two helper functions:

function attr(doc, selector, attr, index) {
	if (index > 0) {
		var elem = doc.querySelectorAll(selector).item(index);
		return elem ? elem.getAttribute(attr) : null;
	}
	var elem = doc.querySelector(selector);
	return elem ? elem.getAttribute(attr) : null;
}

function text(doc, selector, index) {
	if (index > 0) {
		var elem = doc.querySelectorAll(selector).item(index);
		return elem ? elem.textContent : null;
	}
	var elem = doc.querySelector(selector);
	return elem ? elem.textContent : null;
}

(index can be omitted for both to default to 0.)

It's trivial to replace current uses of ZU.xpath()/ZU.xpathText() (or the FW equivalents) that I found with these, and they generally become shorter and/or cleaner. text() can even be used in places where we now use doc.getElementById(…).textContent. For loops that used ZU.xpath(), doc.querySelectorAll() can be used instead.

If you want to use these as a polyfill, here's some minified code:

// attr()/text()
function attr(doc,selector,attr,index){if(index>0){var elem=doc.querySelectorAll(selector).item(index);return elem?elem.getAttribute(attr):null}var elem=doc.querySelector(selector);return elem?elem.getAttribute(attr):null}function text(doc,selector,index){if(index>0){var elem=doc.querySelectorAll(selector).item(index);return elem?elem.textContent:null}var elem=doc.querySelector(selector);return elem?elem.textContent:null}

If you do try these, let me know if you run into any trouble or if there are things we can optimize further.

@zuphilip
Copy link
Contributor

zuphilip commented Jul 9, 2017

We can simplify the two functions into one and reduce the code duplication a little further. Moreover, I would suggest to use a less generic name for the function, e.g.

/* 
 * Query the CSS path `selector` in the document `doc`
 * possibly at only element at index `index` and returns either
 *  i) a specific attribute `attr`
 *  ii) the text under this node
 *  iii) `null` if there is no matching node/attribute
*/
function returnCssPathValue(doc, selector, attr, index) {
	var elem = index > 0 ? doc.querySelectorAll(selector).item(index) : doc.querySelector(selector);
	if (attr) {
		return elem ? elem.getAttribute(attr) : null;
	} else {
		return elem ? elem.textContent : null;
	}
}

My first notes here are: 1) What happens when we call getAttribute on a node list (i.e. index is not given)? 2) We also use quite often relative xpath and this seems to be more problematic with queryselector, see https://developer.rackspace.com/blog/using-querySelector-on-elements/ .

@dstillman
Copy link
Member Author

dstillman commented Jul 9, 2017

I strongly disagree on the name — the whole point of this is to be short and easy to type and read, since it can appear many times in a translator.

As for combining the two, I don't see much point to that. The goal is to make translators as easy to write as possible. An extra tiny function in the translator architecture (where this will end up) doesn't matter at all, and it seems easier to me to just specify whether you want an attribute or text than to mix concerns and worry about an additional positional argument (which you would also always have to set as null to use an index for text).

  1. What happens when we call getAttribute on a node list (i.e. index is not given)?

I'm not sure what you mean by that. You can't call getAttribute on a NodeList. Example?

  1. We also use quite often relative xpath and this seems to be more problematic with queryselector

Not really problematic — you have to remember that the selectors are still from the root (matching only elements below the given element), but most of the time that won't matter. But I think the functions above should work fine with an element instead of a doc (though we should rename the parameter to reflect that). We can leave in support for an optional first element in the built-in versions even after doc becomes unnecessary.

@owcz
Copy link
Contributor

owcz commented Jul 9, 2017

I agree with keeping the name short, but I also agree with combining the functions (such that attr() works like text() when no attribute is specified, i.e., the function defaults to .textContent over .getAttribute). It's a proprietary function anyway, so might as well make it simpler.

@owcz owcz mentioned this issue Jul 10, 2017
@zuphilip
Copy link
Contributor

zuphilip commented Jul 10, 2017

The function text is claimed to be already present in Zotero.Utilities, also I don't know if this list is up-to-date: https://www.zotero.org/support/dev/translators/functions .

Shouldn't you test on the existing of index rather than index > 0? E.g. for the first div-element I have to call text(doc, 'div', 0) because the index in item is zero-based.

Also if you don't want to combine the two function, you can still write it more condense without the code duplication, e.g.

function text(doc, selector, index) {
	var elem = index ? doc.querySelectorAll(selector).item(index) : doc.querySelector(selector);
	return elem ? elem.textContent : null;
}

The attr function:

I'm not sure what you mean by that. You can't call getAttribute on a NodeList. Example?

Yes, but I could call the function with these parameters attr(doc, 'div', 'class') which would then result in error. We could either use 0 as default index in attr or not returning anything but giving a debug message. However, currently this case is not covered in your code above.

Update: Sorry, my fault, I didn't see that you use querySelector in this case which will return the first element.

@dstillman
Copy link
Member Author

The function text is claimed to be already present in Zotero.Utilities, also I don't know if this list is up-to-date

I assume that was a mistake. There's a text2html function.

Shouldn't you test on the existing of index rather than index > 0? E.g. for the first div-element I have to call text(doc, 'div', 0) because the index in item is zero-based.

You didn't cross this out, but you don't need to include the , 0 for the same reason — it will use querySelector, which returns the first element. But testing for just index is indeed sufficient.

Also if you don't want to combine the two function, you can still write it more condense without the code duplication

Sure. I wrote these quickly as examples, but condensed versions would be fine:

function attr(docOrElem, selector, attr, index) {
	var elem = index ? docOrElem.querySelectorAll(selector).item(index) : docOrElem.querySelector(selector);
	return elem ? elem.getAttribute(attr) : null;
}

function text(docOrElem, selector, index) {
	var elem = index ? docOrElem.querySelectorAll(selector).item(index) : docOrElem.querySelector(selector);
	return elem ? elem.textContent : null;
}

Minified:

// attr()/text() v2
function attr(docOrElem,selector,attr,index){var elem=index?docOrElem.querySelectorAll(selector).item(index):docOrElem.querySelector(selector);return elem?elem.getAttribute(attr):null}function text(docOrElem,selector,index){var elem=index?docOrElem.querySelectorAll(selector).item(index):docOrElem.querySelector(selector);return elem?elem.textContent:null}

@avram
Copy link
Contributor

avram commented Jul 11, 2017

The function text is claimed to be already present in Zotero.Utilities, also I don't know if this list is up-to-date

I think I generated the first version of that page, and it's quite possible that my script misunderstood text2html as text, as @dstillman noted.

@zuphilip
Copy link
Contributor

Ok, I fixed the documentation: text2html is now in the list of functions.

@zuphilip
Copy link
Contributor

@dstillman With CSS-paths one cannot restrict on the text value of the nodes. We have to do this sometimes for scraping the information out of some unstructured websites. For example there can be a table where the first column contains the label and the second the value. Then we need to find the correct row according to the label and then go to its value. See e.g.

var inventors = ZU.xpath(doc, '//td[contains(text(), "Inventors")]/following-sibling::td/b|//th[contains(text(), "Inventors")]/following-sibling::td/b');
. Any ideas how to handle such cases in the future without xpaths? Do we need another helper function?

@dstillman
Copy link
Member Author

Well, we're not deprecating use of XPath or the XPath functions, just FW, since attr()/text() plus some item templates should be equivalent (and easier) for most translators. So that example wouldn't have to change.

The main reason to discourage XPaths would just be that they're another, more complicated, lesser-known technology than CSS selectors and ES6, which would use something like this:

Array.from(document.querySelectorAll('th, td'))
    .filter(el => el.textContent.startsWith('Inventors'))[0]
    .parentNode.querySelector('td:last-child')

I personally find that much more comfortable than XPaths. The main downside is that it would require a try/catch to avoid errors for optional values.

Maybe there are a few more common patterns that we could encapsulate this sort of logic for (e.g, a doc.getElementsByText(), which could take a CSS selector (th, td) and an exact string or regexp to filter by (/^Inventors/), but generally it's probably a better idea to just let people use standard JavaScript instead of learning our custom functions. Or people can keep using XPaths (which, after all, the Chrome devtools can still generate for you).

dstillman added a commit to zotero/zotero that referenced this issue Dec 27, 2017
The current document is automatically used (but can still be provided as
the first argument to avoid accidental bugs during the transition).

Closes #1323
Addresses zotero/translators#1277
dstillman added a commit to zotero/zotero-connectors that referenced this issue Dec 27, 2017
This adds attr()/text() to the translator sandbox for zotero/translators#1277.
@dstillman
Copy link
Member Author

In the next client/connector versions, attr()/text() will be automatically added to the translator sandbox, but we'll need to keep the polyfills in place until we drop support for 4.0 and people have upgraded to the latest versions.

@owcz
Copy link
Contributor

owcz commented Jul 8, 2018

I've rewritten all of my translators that had been pending for the last year because they used Framework

Should be good for review now, if you have time to take a look

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

6 participants