-
Notifications
You must be signed in to change notification settings - Fork 756
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Sacramento Bee #1363
Add Sacramento Bee #1363
Conversation
Sacramento Bee.js
Outdated
|
||
// Authors | ||
var authorMetadata = doc.querySelectorAll('.ng_byline_name'); | ||
if (authorMetadata) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
querySelectorAll()
always returns a NodeList
, even if it's empty, so you would want to use authorMetadata.length
here.
Sacramento Bee.js
Outdated
item.title = attr(doc,'[property="og:title"]','content'); | ||
item.date = text(doc,'.published-date'); | ||
item.abstractNote = text(doc,'#content-body- p'); | ||
item.tags = attr(doc,'meta[name="keywords"]','content').split(", "); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
attr()
will return null
if the selector isn't found, which would produce an error, so it'd be safer to assign it to a keywords
variable and then do if (keywords) { item.tags = keywords.split(", "); }
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Some small things.
Sacramento Bee.js
Outdated
|
||
function scrape(doc, url) { | ||
var item = new Zotero.Item("newspaperArticle"); | ||
item.websiteTitle = "The Sacramento Bee"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
item.publicationTitle
cf. https://aurimasv.github.io/z2csl/typeMap.xml#map-newspaperArticle
Sacramento Bee.js
Outdated
function detectWeb(doc, url) { | ||
if (url.match(/article\d+/)) { | ||
return "newspaperArticle"; | ||
} else if (url.match(/\/(news|sports|entertainment)\//)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be more comfortable if you implement a version of a getSearchResults
function (using querySelectorAll
instead of ZU.xpath
, of course). These matches cover a broad set of pages and we really want to avoid false positives. There's a reason @zuphilip puts them in all translators.
Also, use .search(/regex/)!= -1
for efficient tests of strings against regexs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.search
is more efficient than .match
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, because it just has to return true/false vs. an actual match. See https://jsperf.com/exec-vs-match-vs-test-vs-search/5 (indexOf
is dramatically more efficient, so you could also use that for this one since you don't really need a regex, just three separate strings)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mozilla actually makes this distinction explicitin in the docs
When you want to know whether a pattern is found and also its index in a string use search() (if you only want to know it exists, use the similar test() method on the RegExp prototype, which returns a boolean); for more information (but slower execution) use match() (similar to the regular expression exec() method).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I poked around and it looks like test()
becomes (15%) more efficient than multiple indexOf()
s when comparing for multiple strings
Sacramento Bee.js
Outdated
return "multiple"; | ||
} else if (url.indexOf("/search/?q=") != -1) { | ||
return "multiple"; | ||
} else return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(no need for an explicit return null or false here)
Sacramento Bee.js
Outdated
item.tags = keywords.split(", "); | ||
} | ||
item.attachments.push({ | ||
title: "The Sacramento Bee snapshot", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
capitalize "snapshot" per convention.
Sacramento Bee.js
Outdated
"creatorType": "author" | ||
} | ||
], | ||
"date": "January 08, 2015 1:09 PM", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use ZU.strToISO
to avoid importing English dates into non-English locales.
Sacramento Bee.js
Outdated
function detectWeb(doc, url) { | ||
if (url.search(/article\d+/) != -1) { | ||
return "newspaperArticle"; | ||
} else if (url.search(/(\/((news|sports|entertainment)\/)|(search\/\?q=))|sacbee\.com\/?$/) != -1 && getSearchResults(doc, true)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@owcz -- this here is half the reason we want the getSearchResults
function (or something like it): by including it in detectWeb
, you make sure you don't get false positives. Even if you're quite confident that you have the site covered as it is now, imagine they change their CMS -- with things as you had them, you could get false positives in all sorts of places, including on article pages where the generic metadata translator might perform OK otherwise.
This is good to merge, but I'll hold off for a little to see if the discussion in #1277 changes anything. |
test() is most efficient for testing regex, more efficient than multiple indexOf()s: https://jsperf.com/zotero/1
Sacramento Bee.js
Outdated
@@ -40,9 +40,9 @@ | |||
function attr(doc,selector,attr,index){if(index>0){var elem=doc.querySelectorAll(selector).item(index);return elem?elem.getAttribute(attr):null}var elem=doc.querySelector(selector);return elem?elem.getAttribute(attr):null}function text(doc,selector,index){if(index>0){var elem=doc.querySelectorAll(selector).item(index);return elem?elem.textContent:null}var elem=doc.querySelector(selector);return elem?elem.textContent:null} | |||
|
|||
function detectWeb(doc, url) { | |||
if (url.search(/article\d+/) != -1) { | |||
if (/article\d+/.test(url) != false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need for != false
-- that's what if
does aready.
Cool, thanks! |
* use test() in detectWeb test() is most efficient for testing regex, more efficient than multiple indexOf()s: https://jsperf.com/zotero/1
* use test() in detectWeb test() is most efficient for testing regex, more efficient than multiple indexOf()s: https://jsperf.com/zotero/1
tests text()/attr() polyfill from #1277