Improved README
JBGruber committed Jul 13, 2021
1 parent 7cd995f commit c0377b7
Showing 4 changed files with 59 additions and 28 deletions.
1 change: 1 addition & 0 deletions .Rbuildignore
@@ -3,3 +3,4 @@
^README\.Rmd$
/tests/local-files
^\.github$
+^codecov\.yml$
32 changes: 23 additions & 9 deletions README.Rmd
@@ -11,14 +11,28 @@ knitr::opts_chunk$set(
  fig.path = "man/figures/README-",
  out.width = "100%"
)
+knit_print.tbl_df = function(x, ...) {
+  res = paste(c("", "", knitr::kable(x)), collapse = "\n")
+  knitr::asis_output(res)
+}
+registerS3method(
+  "knit_print", "tbl_df", knit_print.tbl_df,
+  envir = asNamespace("knitr")
+)
```

# paperboy

<!-- badges: start -->
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
+[![R-CMD-check](https://github.com/JBGruber/paperboy/workflows/R-CMD-check/badge.svg)](https://github.com/JBGruber/paperboy/actions)
+[![Codecov test coverage](https://codecov.io/gh/JBGruber/paperboy/branch/main/graph/badge.svg)](https://codecov.io/gh/JBGruber/paperboy?branch=main)
<!-- badges: end -->

+[![Twitter](https://img.shields.io/twitter/url/https/twitter.com/JohannesBGruber.svg?style=social&label=Follow%20%40JohannesBGruber)](https://twitter.com/JohannesBGruber)

The philosophy of `paperboy` is that the package is a comprehensive collection of webscraping scripts for news media sites.
Many data scientist and researchers write their own code when they have to retrieve news media content from websites.
At the end of research projects, this code is often collecting digital dust on researchers hard drives instead of being made public for others to use.
@@ -50,7 +64,7 @@ Notice, that the function had no problem reading the link, even though it was sh
`paperboy` is an unfinished and even highly experimental package at the moment.
You will therefore often encounter this warning:

-```{r nomethod}
+```{r nomethod, results="hide"}
deliver(url = "google.com")
```

@@ -71,19 +85,19 @@ tibble::tribble(
```

Since some outlets will give you additional information, the `misc` column was included so these can be retained.
-If you have a scaper you want to contribute, look in the list below if it already exists.
+If you have a scraper you want to contribute, look in the list below if it already exists.
If it does not yet exist, you can become a co-author of this package by adding it via a pull request.

-# Available Scrapers
+## Available Scrapers

```{r available, echo=FALSE}
tibble::tribble(
-~domain, ~status, ~author,
-"theguardian.com", "Broken", "Johannes B. Gruber",
-"huffingtonpost.co.uk", "Broken", "Johannes B. Gruber",
-"buzzfeed.com", "Broken", "Johannes B. Gruber",
-"forbes.com", "Broken", "Johannes B. Gruber",
-)
+~domain, ~status, ~author, ~note,
+"theguardian.com", "Broken", "Johannes B. Gruber", "[#1](https://github.com/JBGruber/paperboy/issues/1)",
+"huffingtonpost.co.uk", "Broken", "Johannes B. Gruber", "[#1](https://github.com/JBGruber/paperboy/issues/1)",
+"buzzfeed.com", "Broken", "Johannes B. Gruber", "[#1](https://github.com/JBGruber/paperboy/issues/1)",
+"forbes.com", "Broken", "Johannes B. Gruber", "[#1](https://github.com/JBGruber/paperboy/issues/1)",
+)
```

- **Gold**: Runs without any issues
40 changes: 21 additions & 19 deletions README.md
@@ -7,8 +7,13 @@

[![Lifecycle:
experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
+[![R-CMD-check](https://github.com/JBGruber/paperboy/workflows/R-CMD-check/badge.svg)](https://github.com/JBGruber/paperboy/actions)
+[![Codecov test
+coverage](https://codecov.io/gh/JBGruber/paperboy/branch/main/graph/badge.svg)](https://codecov.io/gh/JBGruber/paperboy?branch=main)
<!-- badges: end -->

+[![Twitter](https://img.shields.io/twitter/url/https/twitter.com/JohannesBGruber.svg?style=social&label=Follow%20%40JohannesBGruber)](https://twitter.com/JohannesBGruber)

The philosophy of `paperboy` is that the package is a comprehensive
collection of webscraping scripts for news media sites. Many data
scientist and researchers write their own code when they have to
@@ -39,12 +44,12 @@ links to a media article to the main function, `deliver`:
library(paperboy)
df <- deliver("https://tinyurl.com/386e98k5")
df
-#> # A tibble: 1 x 8
-#> url expanded_url domain datetime headline author text misc
-#> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <list>
-#> 1 NA NA NA NA NA NA NA <tibble [1 × 1]>
```

+| url | expanded\_url | domain | status | datetime | headline | author | text | misc |
+|:-------------------------------|:----------------------------------------------------------------------------------|:--------------------|:-------|:---------|:---------|:-------|:-----|:-----|
+| <https://tinyurl.com/386e98k5> | <https://www.theguardian.com/tv-and-radio/2021/jul/12/should-marge-divorce-homer> | www.theguardian.com | NA | NA | NA | NA | NA | 200 |

The returned `data.frame` contains important meta information about the
news items and their full text. Notice, that the function had no problem
reading the link, even though it was shortened. `paperboy` is an
@@ -55,7 +60,6 @@ therefore often encounter this warning:
deliver(url = "google.com")
#> Warning in deliver.default(u, ...): No method for www.google.com yet. Url
#> ignored.
-#> # A tibble: 0 x 0
```

If you enter a vector of multiple URLs, the unsupported ones will be
@@ -67,27 +71,25 @@ column will be different from `200` and contain `NA`s.

Every webscraper should retrieve a `tibble` with the following format:

-#> # A tibble: 2 x 9
-#> url expanded_url domain status datetime headline author text misc
-#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
-#> 1 charac… character charac… integer as.POSIX… charact… chara… char… list
-#> 2 the or… the full url the do… http s… publicat… the hea… the a… the … all othe…
+| url | expanded\_url | domain | status | datetime | headline | author | text | misc |
+|:------------------------------------|:--------------|:-----------|:-----------------|:---------------------|:-------------|:-----------|:--------------|:--------------------------------------------------------------------------|
+| character | character | character | integer | as.POSIXct | character | character | character | list |
+| the original url fed to the scraper | the full url | the domain | http status code | publication datetime | the headline | the author | the full text | all other information that can be consistently found on a specific outlet |

Since some outlets will give you additional information, the `misc`
-column was included so these can be retained. If you have a scaper you
+column was included so these can be retained. If you have a scraper you
want to contribute, look in the list below if it already exists. If it
does not yet exist, you can become a co-author of this package by adding
it via a pull request.

-# Available Scrapers
+## Available Scrapers

-#> # A tibble: 4 x 3
-#> domain status author
-#> <chr> <chr> <chr>
-#> 1 theguardian.com Broken Johannes B. Gruber
-#> 2 huffingtonpost.co.uk Broken Johannes B. Gruber
-#> 3 buzzfeed.com Broken Johannes B. Gruber
-#> 4 forbes.com Broken Johannes B. Gruber
+| domain | status | author | note |
+|:---------------------|:-------|:-------------------|:-----------------------------------------------------|
+| theguardian.com | Broken | Johannes B. Gruber | [\#1](https://github.com/JBGruber/paperboy/issues/1) |
+| huffingtonpost.co.uk | Broken | Johannes B. Gruber | [\#1](https://github.com/JBGruber/paperboy/issues/1) |
+| buzzfeed.com | Broken | Johannes B. Gruber | [\#1](https://github.com/JBGruber/paperboy/issues/1) |
+| forbes.com | Broken | Johannes B. Gruber | [\#1](https://github.com/JBGruber/paperboy/issues/1) |

- **Gold**: Runs without any issues
- **Silver**: Runs with some issues
14 changes: 14 additions & 0 deletions codecov.yml
@@ -0,0 +1,14 @@
+comment: false
+
+coverage:
+  status:
+    project:
+      default:
+        target: auto
+        threshold: 1%
+        informational: true
+    patch:
+      default:
+        target: auto
+        threshold: 1%
+        informational: true
