---
params:
  lesson: "Lesson 8"
  title: "Scraping and manipulating text strings"  
  bookchapter_name: "Cheat sheet for the `stringr` package"      
  bookchapter_section: "https://stringr.tidyverse.org/"    
  functions: "`str_which`,`str_detect`,`str_locate`,`str_view`,`str_sub`"
  packages: "`stringr`"      
  # end inputs ---------------------------------------------------------------
header-includes: \usepackage{float}
always_allow_html: yes
output:
  html_document:
    code_folding: show
---
  
```{r, setup, echo = FALSE, cache = FALSE, include = FALSE}
options(width=100)
knitr::opts_chunk$set(
  eval = T, # run all code
  echo = TRUE, # show code chunks in output 
  tidy = TRUE, # tidy up the code shown in the output
  message = FALSE,  # mask all messages
  warning = FALSE, # mask all warnings 
  comment = "",
  tidy.opts=list(width.cutoff=100), # set width of code chunks in output
  size="small" # set code chunk size
  )
```
\

<!-- install packages -->
```{r, load packages, eval=T, include=T, cache=F, message=F, warning=F, results='hide',echo=F}
# install.packages("pacman")
pacman::p_load(stringr,stringi,dplyr,reprex,xml2,rvest)
# reprex = for rendering text string in HTML    

```

<!-- ____________________________________________________________________________ -->
<!-- ____________________________________________________________________________ -->
<!-- ____________________________________________________________________________ -->
<!-- start body -->

# `r paste0(params$lesson,": ",params$title)`    
\  

Functions for `r params$lesson`  
`r params$functions`    
\    

Packages for `r params$lesson`          
`r params$packages`        
\    

# Agenda 

Use the `r params$packages` package to cut, substitute, print, and manipulate character strings in `R`. This is useful for scraping text from webpages, searching PDFs and text files for given characters and words, mining genomics data, etc.      

[`r params$bookchapter_name`](`r params$bookchapter_section`).      
\  

<!-- ----------------------- image --------------------------- -->
<div align="center">
  <img src="img/stringr.png" style="width:50%">
</div>
<!-- ----------------------- image --------------------------- -->
\    


<!--  end yaml template------------------------------------------------------- -->  

Install necessary packages
```{r}
# install.packages("pacman") # uncomment and install this first
pacman::p_load(stringr,stringi,dplyr,reprex,xml2,rvest)
```


First, we need some text data. Since this lesson is about strings, we'll scrape all the text from the strings chapter of the [R for Data Science textbook](https://r4ds.had.co.nz/strings.html) and use it as our text sample.      
```{r, eval=T}
library(xml2) # read html data
library(rvest) # select html elements

url <- "https://r4ds.had.co.nz/strings.html"
txt <- url %>% read_html() %>% html_text() # scrape web text from url
txt %>% str() # inspect the structure: a character vector of length 1
txt %>% str_length() # number of characters in the string
```
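
If you're offline, or just want to see what `read_html` and `html_text` do on something small, here is a minimal sketch that parses HTML held in a character string instead of a URL. The `html_snippet` content is made up for illustration, and `html_elements()` and `html_text2()` assume a reasonably recent version of `rvest` (1.0 or later).

```{r}
library(xml2)
library(rvest)

# hypothetical page fragment, so the example runs without a network connection
html_snippet <- "<html><body><h1>Strings</h1><p>Strings are fun.</p><p>Regex helps.</p></body></html>"

page <- read_html(html_snippet) # parse HTML supplied as a literal string
all_text <- page %>% html_text() # all text in the document as one string
paras <- page %>% html_elements("p") %>% html_text2() # one string per <p> element

length(all_text) # 1: the whole page collapses into a single string
paras # one element per paragraph
```

Selecting elements before extracting text keeps the pieces separate, which is often easier to work with than one long string.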


# Detecting strings       

Search for location of string patterns using `str_detect`, `str_which`, and `str_locate`   
```{r}
pat <- "strings" # string pattern to search for  
txt %>% str_detect(pat) # TRUE if the string contains the pattern, otherwise FALSE
txt %>% str_which(pat) # indices of the vector elements that contain the pattern
txt %>% str_locate(pat) # start and end position of the first instance of the pattern
txt %>% str_locate_all(pat) # start and end positions of all instances

```
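
Because `txt` is a single long string, the difference between these functions is easier to see on a vector with several elements. A small sketch, using a made-up vector `fruit_vec` and pattern `toy_pat` (both are illustrative names, not part of the lesson data):

```{r}
library(stringr)

fruit_vec <- c("apple pie", "banana split", "pineapple", "cherry") # hypothetical toy data
toy_pat <- "apple"

str_detect(fruit_vec, toy_pat) # one TRUE/FALSE per element
str_which(fruit_vec, toy_pat) # indices of the matching elements: 1 and 3
str_locate(fruit_vec, toy_pat) # start/end of the first match per element, NA when there is none
str_count(fruit_vec, toy_pat) # number of matches per element
```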

# Subsetting strings  

Subset and cut up strings into manageable pieces   
```{r, collapse = T}
# subset string portion based on char position  
txt %>% str_sub(
  txt %>% str_locate(pat) # use positions from above func  
)

# replace the characters at positions 1 to 2 with user text (on a copy, so txt stays intact)
txt_mod <- txt
str_sub(txt_mod, 1, 2) <- "INSERT TEXT AT POSITION"
```
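
The same ideas on a short string, where the result is easy to check by eye. This is a sketch with a made-up string `s`; note that the assignment form of `str_sub` overwrites the selected characters rather than inserting in front of them.

```{r}
library(stringr)

s <- "stringr makes strings simple" # hypothetical toy string

str_sub(s, 1, 7) # characters 1 to 7: "stringr"
str_sub(s, -6, -1) # negative positions count from the end: "simple"

s_mod <- s # work on a copy
str_sub(s_mod, 1, 7) <- "STRINGR" # overwrite characters 1 to 7
s_mod
```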
  
Shorten text with ellipsis to nth character  
```{r}
txt_short <- txt %>% str_trunc(20) # width includes the ellipsis, so it must be at least 3
txt_short
```
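
`str_trunc` can also cut from the left or the middle via its `side` argument. A short sketch on a made-up string (`lesson_title` is an illustrative name); in every case the result, ellipsis included, is `width` characters long.

```{r}
library(stringr)

lesson_title <- "Scraping and manipulating text strings" # hypothetical toy string

str_trunc(lesson_title, 20) # keep the start: "Scraping and mani..."
str_trunc(lesson_title, 20, side = "left") # keep the end
str_trunc(lesson_title, 20, side = "center") # keep both ends, cut the middle
```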

Return only the elements of the character vector that contain the pattern    
```{r, eval=F}
txt %>% str_subset(pat)   
```

Extract string patterns as characters  
```{r}
txt %>% str_extract(pat) # pull the first match out of the string
txt %>% str_extract_all(pat, simplify = FALSE) # extract all matches as a list; set simplify = TRUE to return a matrix
txt %>% str_match(pat) # first match as a matrix (plus one column per capture group, if any)
txt %>% str_match_all(pat) # all matches as a list of matrices
```
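
With a fixed pattern like `pat`, `str_extract` and `str_match` look almost identical. The difference shows up once the pattern contains capture groups. A minimal sketch, assuming a made-up vector `dates` and pattern `date_pat`:

```{r}
library(stringr)

# hypothetical toy data
dates <- c("released 2019-02-10", "no date here", "updated 2022-12-01 and 2023-01-15")
date_pat <- "(\\d{4})-(\\d{2})-(\\d{2})" # three capture groups: year, month, day

str_extract(dates, date_pat) # first full match per element, NA if none
str_extract_all(dates, date_pat) # list holding every match per element
str_match(dates, date_pat) # matrix: full match, then one column per capture group
```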

View an HTML rendering of the text using `str_view()`    
```{r}
# highlight the spaces in the first 100 characters
txt %>% str_sub(1,100) %>% 
  str_view(" ")
```
\  


Split the text into separate components at every instance of the pattern. Each component can then be handled on its own, e.g. with `str_sub`    
```{r, eval=T}
# split into matrix at every instance of pattern  
txt_split <- txt %>% str_split_fixed(pat, n = Inf)
txt_split %>% dim # get dimensions of matrix 
txt_split[1,20] # view 1st row and 20th column  

```
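
To see how `str_split` and `str_split_fixed` differ, try them on a short delimited string. A sketch with a made-up `csv_line`:

```{r}
library(stringr)

csv_line <- "id,name,score" # hypothetical toy string

str_split(csv_line, ",") # list with one character vector per input string
pieces <- str_split_fixed(csv_line, ",", n = 2) # matrix; n caps the number of pieces
pieces # the last piece keeps whatever was left over
```

`str_split` returns a list because each input string may split into a different number of pieces; `str_split_fixed` pads or caps to `n` columns so the result fits in a matrix.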

  
# Mutating and joining strings  
  
Replace pattern instances with new pattern    
```{r, eval=F}
repl <- "when you really need that coffee hit" # replacement character string
txt %>% 
  str_replace_all(pat,repl)

```
    
You'll notice that the first instance of the pattern in the text is capitalised, so the case-sensitive replacement doesn't match it and leaves it unchanged. We can tell `R` to match all instances of the pattern regardless of case by wrapping it in `regex()` with `ignore_case = TRUE`      
```{r, eval=F}
pat_all <- regex(pat, ignore_case = T)
pat_all
txt %>% str_replace_all(pat_all,repl)
```
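
The effect of `ignore_case` is easier to see on one sentence than on a whole webpage. A minimal sketch with a made-up `sentence`:

```{r}
library(stringr)

sentence <- "Strings are everywhere; strings are useful." # hypothetical toy string

# case-sensitive: only the lower-case instance is replaced
str_replace_all(sentence, "strings", "coffee")

# case-insensitive: both instances are replaced
str_replace_all(sentence, regex("strings", ignore_case = TRUE), "coffee")

# str_replace() only touches the first match
str_replace(sentence, regex("strings", ignore_case = TRUE), "coffee")
```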
  
# Further useful functions  
  
Duplicate string  
```{r}
# use a smaller piece of the split text
txt_s <- txt_split[1, 5]
txt_s %>% str_dup(3) # duplicate the string 3 times

```

Removing white space   
```{r}
txt_s %>% str_replace_all(" ","") # remove all spaces  
txt_s %>% str_trim(side="both") # strip white space from both ends  

```
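
Scraped text often has runs of spaces in the middle as well as at the ends. A short sketch comparing the options on a made-up string `messy`; `str_squish()` is another `stringr` function, not used elsewhere in this lesson.

```{r}
library(stringr)

messy <- "  too   many    spaces  " # hypothetical toy string

str_trim(messy) # strip the ends only: "too   many    spaces"
str_squish(messy) # strip the ends and collapse inner runs: "too many spaces"
str_replace_all(messy, "\\s+", "") # remove every run of white space: "toomanyspaces"
```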

## Alternative functions from `stringi` package      
```{r}
require(stringi)
txt_s %>% 
  stri_replace_all_charclass("\\p{WHITE_SPACE}","") # remove all white space, including inside the string
txt_s %>% str_replace_na() # stringr function: turn missing values (NA) into the string "NA"

```

## Including vectors within strings  

Insert the value of a variable into a character string without breaking the string apart  
```{r}
vect <- 1000
str_interp("For including vectors like this ${vect} when you can't break the character string") 
```


Useful when the string itself contains quotes that must be escaped, e.g. HTML tags        
```{r}
str_interp("<div style=\"color:#F90F40;\"> <strong> Total count </strong> ${vect} </div>")
```

Supply the values to interpolate as a list    
```{r}

str_interp("First value, ${v1}, Second value, ${v2*2}.",
  list(v1 = 10, v2 = 20)
)
```

Or take them from the columns of a data frame  
```{r}  
str_interp(
  "Values are $[.2f]{max(Sepal.Width)} and $[.2f]{min(Sepal.Width)}.",
  iris
)  

```
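
Recent versions of `stringr` also offer `str_glue()` and `str_glue_data()`, which use `{}` instead of `${}`; as far as I know, the `stringr` documentation now lists `str_interp()` as superseded by them, so check the help page for your installed version. A minimal sketch of the same interpolations, with a made-up variable `n_obs`:

```{r}
library(stringr)

n_obs <- 1000 # hypothetical value

str_glue("Total count: {n_obs}") # interpolate a variable
str_glue("Double the count: {n_obs * 2}") # any R expression works inside {}

# take the values from a list (or a data frame) instead of the environment
str_glue_data(list(v1 = 10, v2 = 20), "First value, {v1}, second value, {v2 * 2}.")
```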

# Regular expressions, i.e. regex    

You can find in-depth information on matching specific character patterns in strings using regular expressions in the [`R` for Data Science book](https://r4ds.had.co.nz/strings.html). There's also [a handy regex tool](https://regex101.com/r/ksY7HU/2) for live text parsing.
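
As a quick taste, here is a sketch of a few common regex building blocks (anchors, an optional character, and digit matching) applied with `stringr`. The vector `words_vec` is made up for illustration.

```{r}
library(stringr)

words_vec <- c("string", "strings", "stringr", "substring", "ring") # hypothetical toy data

str_subset(words_vec, "^string") # ^ anchors to the start: string, strings, stringr
str_subset(words_vec, "ring$") # $ anchors to the end: string, substring, ring
str_subset(words_vec, "^strings?$") # ? makes the "s" optional: string, strings

str_extract("Lesson 8, chapter 14", "\\d+") # first run of digits: "8"
str_extract_all("Lesson 8, chapter 14", "\\d+")[[1]] # every run of digits: "8" "14"
```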