Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Sacramento Bee #1363

Merged
merged 6 commits into from
Jul 15, 2017
Merged

Add Sacramento Bee #1363

merged 6 commits into from
Jul 15, 2017

Conversation

owcz
Copy link
Contributor

@owcz owcz commented Jul 8, 2017

tests text()/attr() polyfill from #1277


// Authors
var authorMetadata = doc.querySelectorAll('.ng_byline_name');
if (authorMetadata) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

querySelectorAll() always returns a NodeList, even if it's empty, so you would want to use authorMetadata.length here.

item.title = attr(doc,'[property="og:title"]','content');
item.date = text(doc,'.published-date');
item.abstractNote = text(doc,'#content-body- p');
item.tags = attr(doc,'meta[name="keywords"]','content').split(", ");
Copy link
Member

@dstillman dstillman Jul 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

attr() will return null if the selector isn't found, which would produce an error, so it'd be safer to assign it to a keywords variable and then do if (keywords) { item.tags = keywords.split(", "); }.

Copy link
Collaborator

@adam3smith adam3smith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Some small things.


function scrape(doc, url) {
var item = new Zotero.Item("newspaperArticle");
item.websiteTitle = "The Sacramento Bee";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

function detectWeb(doc, url) {
if (url.match(/article\d+/)) {
return "newspaperArticle";
} else if (url.match(/\/(news|sports|entertainment)\//)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be more comfortable if you implement a version of a getSearchResults function (using querySelectorAll instead of ZU.xpath, of course). These matches cover a broad set of pages and we really want to avoid false positives. There's a reason @zuphilip puts them in all translators.

Also, use .search(/regex/)!= -1 for efficient tests of strings against regexs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.search is more efficient than .match?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, because it just has to return true/false vs. an actual match. See https://jsperf.com/exec-vs-match-vs-test-vs-search/5 (indexOf is dramatically more efficient, so you could also use that for this one since you don't really need a regex, just three separate strings)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mozilla actually makes this distinction explicitin in the docs

When you want to know whether a pattern is found and also its index in a string use search() (if you only want to know it exists, use the similar test() method on the RegExp prototype, which returns a boolean); for more information (but slower execution) use match() (similar to the regular expression exec() method).

Copy link
Contributor Author

@owcz owcz Jul 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I poked around and it looks like test() becomes (15%) more efficient than multiple indexOf()s when comparing for multiple strings

return "multiple";
} else if (url.indexOf("/search/?q=") != -1) {
return "multiple";
} else return null;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(no need for an explicit return null or false here)

item.tags = keywords.split(", ");
}
item.attachments.push({
title: "The Sacramento Bee snapshot",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

capitalize "snapshot" per convention.

"creatorType": "author"
}
],
"date": "January 08, 2015 1:09 PM",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use ZU.strToISO to avoid importing English dates into non-English locales.

function detectWeb(doc, url) {
if (url.search(/article\d+/) != -1) {
return "newspaperArticle";
} else if (url.search(/(\/((news|sports|entertainment)\/)|(search\/\?q=))|sacbee\.com\/?$/) != -1 && getSearchResults(doc, true)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@owcz -- this here is half the reason we want the getSearchResults function (or something like it): by including it in detectWeb, you make sure you don't get false positives. Even if you're quite confident that you have the site covered as it is now, imagine they change their CMS -- with things as you had them, you could get false positives in all sorts of places, including on article pages where the generic metadata translator might perform OK otherwise.

@adam3smith
Copy link
Collaborator

This is good to merge, but I'll hold off for a little to see if the discussion in #1277 changes anything.

test() is most efficient for testing regex, more efficient than multiple indexOf()s: https://jsperf.com/zotero/1
@@ -40,9 +40,9 @@
function attr(doc,selector,attr,index){if(index>0){var elem=doc.querySelectorAll(selector).item(index);return elem?elem.getAttribute(attr):null}var elem=doc.querySelector(selector);return elem?elem.getAttribute(attr):null}function text(doc,selector,index){if(index>0){var elem=doc.querySelectorAll(selector).item(index);return elem?elem.textContent:null}var elem=doc.querySelector(selector);return elem?elem.textContent:null}

function detectWeb(doc, url) {
if (url.search(/article\d+/) != -1) {
if (/article\d+/.test(url) != false) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for != false -- that's what if does aready.

@adam3smith adam3smith merged commit ae41654 into zotero:master Jul 15, 2017
@adam3smith
Copy link
Collaborator

Cool, thanks!

zuphilip pushed a commit to zuphilip/translators that referenced this pull request Mar 28, 2018
* use test() in detectWeb
test() is most efficient for testing regex, more efficient than multiple indexOf()s: https://jsperf.com/zotero/1
zuphilip pushed a commit to zuphilip/translators that referenced this pull request Mar 28, 2018
* use test() in detectWeb
test() is most efficient for testing regex, more efficient than multiple indexOf()s: https://jsperf.com/zotero/1
@owcz owcz deleted the sacbee branch July 8, 2018 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants