-
Notifications
You must be signed in to change notification settings - Fork 3
/
open_search_exploration.Rmd
208 lines (129 loc) · 5.68 KB
/
open_search_exploration.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
---
title: "Open Search Exploration"
author: "Scott Brenstuhl"
date: "August 10, 2016"
output:
md_document:
variant: markdown_github
---
```{r libraries, alert=FALSE, message = FALSE, echo = FALSE}
library(readr)
library(tidyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(printr)
```
```{r echo = FALSE}
queries <- read_csv("data/all_queries.csv")
```
## Understanding the Data
This data is coming from Google analytics "Site Search" functionality, which is
hooked up to the SF Open Data portal's search page.
A lot of the useful information about this can be found at:
https://developers.google.com/analytics/devguides/reporting/core/dimsmets
#### ga.searchKeyword(Search Terms)
These are the words and keywords that have been entered into the search form at
https://data.sfgov.org/
This is what the bulk of our analysis will be covering so I will spare you the
summaries.
#### ga.searchStartPage(Search Page)
The page where users initiated an internal search.
aka Page on site where the user enters terms for a web search?
Pretty early on in the most common start page we get really specific pages:
```{r echo = FALSE}
# Do we need to multiply these by the searchUnique column for accurate counts?
# top20 <- table(queries$ga.searchStartPage) %>%
# sort(decreasing = TRUE) %>%
# head(20) %>%
# as.data.frame.table()
#
# names(top20) <- c('start_page', 'visits')
start_pages <- queries %>%
group_by(ga.searchStartPage) %>%
summarise(count = sum(ga.searchUniques)) %>%
arrange(desc(count))
head(start_pages)
```
So it's interesting to look at the root of the search page (dropping everything after the
second slash)
```{r echo=FALSE}
queries$startSearchRoot <- gsub('\\?.*', '', queries$ga.searchStartPage) %>%
gsub("'\\/", '', .) %>%
gsub("\\/.*", '', .)
startPageRoot <- queries %>%
group_by(startSearchRoot) %>%
summarise(count = sum(ga.searchUniques)) %>%
arrange(desc(count) )
head(startPageRoot, 20)
```
I'm very confused about how sometimes this is a specific data set? Basically it seems
like a lot of the start pages are specific data sets where there is not a search
widget such as in the ones below.
```{r}
mostly_datasets <- !grepl('\\?', queries$ga.searchStartPage) &
queries$ga.searchStartPage != "(entrance)" &
queries$ga.searchStartPage != "'/" &
str_count(queries$ga.searchStartPage, '/') > 1
head(queries[mostly_datasets,], 20)
# Uncomment and run to see all
# View(queries[mostly_datasets,])
```
The highest count of searchUnique for these is 27 (then 6 and trails off fast),
while there is roughly 1900 rows total.
#### ga.searchAfterDestinationPage(Search Destination Page)
The page that users visited after performing an internal search on the site.
```{r, echo=FALSE}
#head(unique(queries$ga.searchAfterDestinationPage),100)
head(sort(table(queries$ga.searchAfterDestinationPage), decreasing = TRUE), 20)
```
#### ga.searchUniques(Total Unique Searches)
The total number of times your site search was used. This excludes multiple
searches on the same keyword during the same session.
Oddly the most common number of searches is 0 (if it's been 0 times how is it
here??) After talking to Jason we have a theory that this is caused by when
people search for the same term multiple times in one session but do so from
different ga.searchStartPage or go to a different ga.searchAfterDestinationPage.
Support for this theory:
- If Jason pulls the same report without searchStartPage and
searchAfterDestinationPage then it is about 60K rows shorter (was working off
his memory)
- All terms with 0 searchUniques show up somewhere else with a searchUniques > 0
```{r}
filter(queries,ga.searchUniques == 0)$ga.searchKeyword %in%
filter(queries, !ga.searchUniques == 0)$ga.searchKeyword %>%
table()
```
Additional theory: Clicking back to a search page reruns the page and might
count it as a new search in the same session.
If our assumptions hold, we should be able to use these queries to look into
what terms people search and either don't find what they are looking for or to
narrow their explorations. (I think this is currently "what gets
searched the most", should likely figure out how to scale these based on the
values with > 0 searchUniques?)
```{r}
reran_searches <- filter(queries, ga.searchUniques == 0) %>%
group_by(search = ga.searchKeyword) %>%
summarise(count = n()) %>%
arrange(desc(count))
ggplot(head(reran_searches, 20), aes(x = reorder(search, -count), y = count)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = -45, hjust = 0)) +
labs(x = 'search terms')
```
#### ga.avgSearchResultViews(Results Pageviews / Search)
The average number of times people viewed a page as a result of a search.
#### ga.avgSearchDepth(Average Search Depth)
The average number of pages people viewed after performing a search.
#### ga.percentSearchRefinements(Search Refinements)
The percentage of the number of times a refinement (i.e., transition) occurs
between internal keywords search within a session.
#### ga.searchDuration(Time after Search)
The session duration when the site's internal search feature is used.
#### ga.searchExitRate(Search Exits)
The percentage of searches that resulted in an immediate exit from the property.
- What is the difference between searchAfterDestinationPage == (exit) and
searchExitRate == 100?
```{r}
#filter(queries,ga.searchAfterDestinationPage == '(exit)')
```