Added www.telegraph.co.uk and us.cnn.com scraper (#1)
JBGruber committed Sep 3, 2021
1 parent f184c1e commit 75a8c00
Showing 5 changed files with 83 additions and 12 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,7 +1,7 @@
Package: paperboy
Title: Comprehensive collection of news media scrapers
Version: 0.0.1.9000
-Date: 2021-09-02
+Date: 2021-09-03
Authors@R: person("Johannes", "Gruber", email = "[email protected]",
role = c("aut", "cre"))
Description: A comprehensive collection of webscraping scripts for news media sites.
12 changes: 8 additions & 4 deletions R/deliver_cnn_com.R
@@ -19,20 +19,21 @@ pb_deliver_paper.edition_cnn_com <- function(x, verbose = NULL, ...) {

# datetime
datetime <- html %>%
-rvest::html_elements("[name=\"pubdate\"]") %>%
+rvest::html_elements("[name=\"pubdate\"],[name=\"parsely-pub-date\"]") %>%
rvest::html_attr("content") %>%
lubridate::as_datetime()

# headline
headline <- html %>%
-rvest::html_elements(".pg-headline,.headline>h1,[id*=\"video-headline\"]") %>%
+rvest::html_elements(".pg-headline,.headline>h1,[id*=\"video-headline\"],.headline__text") %>%
rvest::html_text2()

# author
author <- html %>%
rvest::html_elements("[name=\"author\"]") %>%
rvest::html_attr("content") %>%
-toString()
+toString() %>%
+gsub("^By\\s", "", .)

# text
text <- html %>%
@@ -42,7 +43,8 @@ pb_deliver_paper.edition_cnn_com <- function(x, verbose = NULL, ...) {

if (nchar(text) == 0) {
text <- html %>%
-rvest::html_elements("article") %>%
+rvest::html_elements("article,.article__main") %>%
+rvest::html_elements("p") %>%
rvest::html_text2() %>%
paste(collapse = "\n")
}
@@ -73,3 +75,5 @@ pb_deliver_paper.edition_cnn_com <- function(x, verbose = NULL, ...) {
normalise_df() %>%
return()
}

+pb_deliver_paper.us_cnn_com <- pb_deliver_paper.edition_cnn_com
66 changes: 66 additions & 0 deletions R/deliver_telegraph_co_uk.R
@@ -0,0 +1,66 @@

pb_deliver_paper.www_telegraph_co_uk <- function(x, verbose = NULL, ...) {

. <- NULL

if (is.null(verbose)) verbose <- getOption("paperboy_verbose")

if (!"tbl_df" %in% class(x))
stop("Wrong object passed to internal deliver function: ", class(x))

if (verbose) message("\t...", nrow(x), " articles from ", x$domain[1])

pb <- make_pb(x)

purrr::map_df(x$content_raw, function(cont) {

if (verbose) pb$tick()
html <- rvest::read_html(cont)

# datetime
datetime <- html %>%
rvest::html_element("[itemprop=\"datePublished\"]") %>%
{
out <- rvest::html_attr(., "content")
if (is.na(out)) {
out <- rvest::html_attr(., "datetime")
}
out
} %>%
as.POSIXct(format = "%Y-%m-%dT%H:%M%z")

# headline
headline <- html %>%
rvest::html_elements("[property=\"og:title\"]") %>%
rvest::html_attr("content")

# author
author <- html %>%
rvest::html_elements("[class*=\"byline__author\"]") %>%
rvest::html_attr("content") %>%
toString() %>%
gsub("^By\\s", "", .)

# text
text <- html %>%
rvest::html_elements("[class*=\"article-body-text\"]") %>%
rvest::html_text2() %>%
paste(collapse = "\n")

# type
content_type <- html %>%
rvest::html_element("[property=\"og:type\"]") %>%
rvest::html_attr("content")

tibble::tibble(
datetime,
author,
headline,
text,
content_type
)
}) %>%
cbind(x) %>%
normalise_df() %>%
return()
}
11 changes: 6 additions & 5 deletions README.md
@@ -60,9 +60,10 @@ therefore often encounter this warning:
``` r
pb_deliver("google.com")
#> Warning in
-#> pb_deliver_paper.default(u, verbose
-#> = verbose, ...): No method for
-#> www.google.com yet. Url ignored.
+#> pb_deliver_paper.default(u,
+#> verbose = verbose, ...): No
+#> method for www.google.com yet.
+#> Url ignored.
```

If you enter a vector of multiple URLs, the unsupported ones will be
@@ -107,7 +108,7 @@ it via a pull request.
| pagesix.com | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
| theguardian.com | ![](https://img.shields.io/badge/status-gold-%23ffd700.svg) | Johannes B. Gruber | |
| time.com | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
-| us.cnn.com | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
+| us.cnn.com | ![](https://img.shields.io/badge/status-gold-%23ffd700.svg) | Johannes B. Gruber | |
| washingtonpost.com | ![](https://img.shields.io/badge/status-silver-%23C0C0C0.svg) | Johannes B. Gruber | [#3](https://github.com/JBGruber/paperboy/issues/3) |
| wsj.com | ![](https://img.shields.io/badge/status-gold-%23ffd700.svg) | Johannes B. Gruber | |
| www.boston.com | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
@@ -119,7 +120,7 @@ it via a pull request.
| www.latimes.com | ![](https://img.shields.io/badge/status-gold-%23ffd700.svg) | Johannes B. Gruber | |
| www.msnbc.com | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
| www.sfgate.com | ![](https://img.shields.io/badge/status-gold-%23ffd700.svg) | Johannes B. Gruber | |
-| www.telegraph.co.uk | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
+| www.telegraph.co.uk | ![](https://img.shields.io/badge/status-gold-%23ffd700.svg) | Johannes B. Gruber | |
| www.thelily.com | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
| www.thismorningwithgordondeal.com | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
| www.tribpub.com | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
4 changes: 2 additions & 2 deletions inst/status.csv
@@ -17,7 +17,7 @@
"pagesix.com","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"
"theguardian.com","![](https://img.shields.io/badge/status-gold-%23ffd700.svg)","Johannes B. Gruber",""
"time.com","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"
-"us.cnn.com","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"
+"us.cnn.com","![](https://img.shields.io/badge/status-gold-%23ffd700.svg)","Johannes B. Gruber",""
"washingtonpost.com","![](https://img.shields.io/badge/status-silver-%23C0C0C0.svg)","Johannes B. Gruber","[#3](https://github.com/JBGruber/paperboy/issues/3)"
"wsj.com","![](https://img.shields.io/badge/status-gold-%23ffd700.svg)","Johannes B. Gruber",""
"www.boston.com","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"
@@ -29,7 +29,7 @@
"www.latimes.com","![](https://img.shields.io/badge/status-gold-%23ffd700.svg)","Johannes B. Gruber",""
"www.msnbc.com","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"
"www.sfgate.com","![](https://img.shields.io/badge/status-gold-%23ffd700.svg)","Johannes B. Gruber",""
-"www.telegraph.co.uk","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"
+"www.telegraph.co.uk","![](https://img.shields.io/badge/status-gold-%23ffd700.svg)","Johannes B. Gruber",""
"www.thelily.com","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"
"www.thismorningwithgordondeal.com","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"
"www.tribpub.com","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"
