Added sfgate.com scraper (#1)
JBGruber committed Aug 31, 2021
1 parent 5506594 commit 2a963a4
Showing 5 changed files with 65 additions and 9 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,7 +1,7 @@
Package: paperboy
Title: Comprehensive collection of news media scrapers
Version: 0.0.1.9000
-Date: 2021-08-27
+Date: 2021-08-31
Authors@R: person("Johannes", "Gruber", email = "[email protected]",
role = c("aut", "cre"))
Description: A comprehensive collection of webscraping scripts for news media sites.
4 changes: 2 additions & 2 deletions R/deliver_cnn_com.R
@@ -48,7 +48,7 @@ pb_deliver_paper.edition_cnn_com <- function(x, verbose = NULL, ...) {
}

# type
-type <- html %>%
+content_type <- html %>%
rvest::html_element("[property=\"og:title\"]") %>%
rvest::html_attr("content") %>%
toString() %>% {
@@ -66,7 +66,7 @@ pb_deliver_paper.edition_cnn_com <- function(x, verbose = NULL, ...) {
author,
headline,
text,
-type
+content_type
)
}) %>%
cbind(x) %>%
53 changes: 53 additions & 0 deletions R/deliver_sfgate_com.R
@@ -0,0 +1,53 @@

pb_deliver_paper.www_sfgate_com <- function(x, verbose = NULL, ...) {

  . <- NULL

  if (is.null(verbose)) verbose <- getOption("paperboy_verbose")

  if (!"tbl_df" %in% class(x))
    stop("Wrong object passed to internal deliver function: ", class(x))

  if (verbose) message("\t...", nrow(x), " articles from ", x$domain[1])

  pb <- make_pb(x)

  purrr::map_df(x$content_raw, function(cont) {

    if (verbose) pb$tick()
    html <- rvest::read_html(cont)

    # datetime
    datetime <- html %>%
      rvest::html_elements("[name=\"sailthru.date\"]") %>%
      rvest::html_attr("content") %>%
      lubridate::as_datetime()

    # headline
    headline <- html %>%
      rvest::html_elements("[property=\"sailthru.title\"]") %>%
      rvest::html_attr("content")

    # author
    author <- html %>%
      rvest::html_elements("[name=\"sailthru.author\"]") %>%
      rvest::html_attr("content") %>%
      toString()

    # text
    text <- html %>%
      rvest::html_elements("p") %>%
      rvest::html_text2() %>%
      paste(collapse = "\n")

    tibble::tibble(
      datetime,
      author,
      headline,
      text
    )
  }) %>%
    cbind(x) %>%
    normalise_df() %>%
    return()
}
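
The extraction steps in the new sfgate.com scraper can be tried in isolation on a small inline page. The sketch below is not part of the commit: the HTML string is a made-up stand-in rather than real sfgate.com markup, and the variable `no_match` is introduced only for illustration. It also shows what a selector returns when nothing matches, which is presumably why the author field is passed through `toString()`.

``` r
library(rvest)

# Hypothetical stand-in for one element of x$content_raw (assumed markup).
cont <- '<html><head>
<meta name="sailthru.date" content="2021-08-31T12:00:00Z">
<meta name="sailthru.author" content="Jane Doe">
</head><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>'

html <- read_html(cont)

# Attribute selectors pick <meta> tags by their `name` attribute;
# html_attr() then reads the `content` attribute of each match.
datetime <- html %>%
  html_elements("[name=\"sailthru.date\"]") %>%
  html_attr("content") %>%
  lubridate::as_datetime()

author <- html %>%
  html_elements("[name=\"sailthru.author\"]") %>%
  html_attr("content") %>%
  toString()

# All <p> elements are joined into one newline-separated string.
text <- html %>%
  html_elements("p") %>%
  html_text2() %>%
  paste(collapse = "\n")

# A selector that matches nothing yields a zero-length character vector;
# toString() collapses that to a single empty string.
no_match <- html %>%
  html_elements("[name=\"sailthru.title\"]") %>%
  html_attr("content")

length(no_match)
toString(no_match)
```

A zero-length field passed to `tibble::tibble()` alongside length-one fields gives a zero-row result, so fields that may be absent on some pages are safer when collapsed to a single value first.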
11 changes: 7 additions & 4 deletions README.md
@@ -59,8 +59,11 @@ therefore often encounter this warning:

``` r
pb_deliver("google.com")
-#> Warning in pb_deliver_paper.default(u, verbose = verbose, ...): No method for
-#> www.google.com yet. Url ignored.
+#> Warning in
+#> pb_deliver_paper.default(u,
+#> verbose = verbose, ...): No
+#> method for www.google.com yet.
+#> Url ignored.
```

If you enter a vector of multiple URLs, the unsupported ones will be
@@ -106,7 +109,7 @@ it via a pull request.
| theguardian.com | ![](https://img.shields.io/badge/status-gold-%23ffd700.svg) | Johannes B. Gruber | |
| time.com | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
| us.cnn.com | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
-| washingtonpost.com | ![](https://img.shields.io/badge/status-silver-%23C0C0C0.svg) | Johannes B. Gruber | [#2](https://github.com/JBGruber/paperboy/issues/3) |
+| washingtonpost.com | ![](https://img.shields.io/badge/status-silver-%23C0C0C0.svg) | Johannes B. Gruber | [#3](https://github.com/JBGruber/paperboy/issues/3) |
| wsj.com | ![](https://img.shields.io/badge/status-gold-%23ffd700.svg) | Johannes B. Gruber | |
| www.boston.com | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
| www.bostonglobe.com | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
@@ -116,7 +119,7 @@ it via a pull request.
| www.foxnews.com | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
| www.latimes.com | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
| www.msnbc.com | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
-| www.sfgate.com | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
+| www.sfgate.com | ![](https://img.shields.io/badge/status-gold-%23ffd700.svg) | Johannes B. Gruber | |
| www.telegraph.co.uk | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
| www.thelily.com | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
| www.thismorningwithgordondeal.com | ![](https://img.shields.io/badge/status-broken-%23D8634C) | Johannes B. Gruber | [#1](https://github.com/JBGruber/paperboy/issues/1) |
4 changes: 2 additions & 2 deletions inst/status.csv
@@ -18,7 +18,7 @@
"theguardian.com","![](https://img.shields.io/badge/status-gold-%23ffd700.svg)","Johannes B. Gruber",""
"time.com","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"
"us.cnn.com","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"
-"washingtonpost.com","![](https://img.shields.io/badge/status-silver-%23C0C0C0.svg)","Johannes B. Gruber","[#2](https://github.com/JBGruber/paperboy/issues/3)"
+"washingtonpost.com","![](https://img.shields.io/badge/status-silver-%23C0C0C0.svg)","Johannes B. Gruber","[#3](https://github.com/JBGruber/paperboy/issues/3)"
"wsj.com","![](https://img.shields.io/badge/status-gold-%23ffd700.svg)","Johannes B. Gruber",""
"www.boston.com","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"
"www.bostonglobe.com","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"
@@ -28,7 +28,7 @@
"www.foxnews.com","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"
"www.latimes.com","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"
"www.msnbc.com","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"
-"www.sfgate.com","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"
+"www.sfgate.com","![](https://img.shields.io/badge/status-gold-%23ffd700.svg)","Johannes B. Gruber",""
"www.telegraph.co.uk","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"
"www.thelily.com","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"
"www.thismorningwithgordondeal.com","![](https://img.shields.io/badge/status-broken-%23D8634C)","Johannes B. Gruber","[#1](https://github.com/JBGruber/paperboy/issues/1)"