Merge pull request #6 from UtrechtUniversity/master

UU scripts

Showing 25 changed files with 422,482 additions and 100 deletions.
@@ -1,2 +1,12 @@
*.pyc
data_extractor/tests/data/BrowserHistory.json
data_extractor/.coverage
data_extractor/.ipynb_checkpoints/
data_extractor/google_search_history/POC.py
.coverage
.python-version
Lib/
Scripts/
.vscode/
pyvenv.cfg
@@ -1,21 +1,29 @@
 # Eyra Port POC

 ## Description
 This proof-of-concept shows how Python can be used, from within a web-browser,
 to extract data for research purposes. The end-user (data donator) needs only
 to have a modern web-browser.

 An application is provided which creates a basic user-interface for selecting a
-file. This file is then submitted to a Python interperter. The relevant file
+file. This file is then submitted to a Python interpreter. The relevant file
 API's of the web-browser have been wrapped so that the Python code does not
 need to be aware of running in a web-browser.

 Example code for the Python part of the project is provided in the
-`data_extractor` folder. This also contains a *README* file which explains how
-to modify this code.
+[data_extractor](data_extractor/) folder. This also contains a [README](data_extractor/README.md)
+which explains how to modify this code and presents two examples of data extraction scripts,
+namely for Google Semantic Location History and Google Search History data.

 To run the application execute the following command from the checkout:

     python3 -m http.server

-This launches a web-server that can be access on
+This launches a web-server that can be accessed on
 [http://localhost:8000](http://localhost:8000).

+## Contributors
+Port POC is developed by [Eyra](https://github.com/eyra). The example data extraction scripts for
+Google Semantic Location History and Google Search History data packages are developed by the
+[Research Engineering team](https://github.com/orgs/UtrechtUniversity/teams/research-engineering) of Utrecht University.
@@ -0,0 +1,15 @@
# .coveragerc to control coverage.py

[report]
# Regexes for lines to exclude from consideration
exclude_lines =
    # Have to re-enable the standard pragma
    pragma: no cover

    # Don't complain if non-runnable code isn't run:
    if 0:
    if __name__ == .__main__.:

[run]
omit = */main.py
@@ -1,4 +1,7 @@
 [FORMAT]

 # Maximum number of characters on a single line.
-max-line-length=100
+max-line-length=100
+
+# Maximum number of locals for function / method body
+max-locals=20
@@ -0,0 +1,120 @@
# Data Extractor

This is a basic template which can be used to write a data extractor.
The extraction logic should be placed in the `process` function within
`data_extractor/__init__.py`. An example has been provided.

The argument that the `process` function receives is a file-like object.
It can therefore be used with most regular Python libraries. The example
demonstrates this by using the `zipfile` module.
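
For orientation, here is a minimal sketch of what such a `process` function could look like. This is not the template's actual implementation: the return format (a dict with a `summary` string and a list of `data_frames`) is borrowed from the example extractors described below, and the DataFrame contents are invented for illustration.

```python
"""Minimal sketch of a `process` function (illustrative, not the shipped template)."""
import zipfile

import pandas as pd


def process(file_data):
    """List the contents of a donated ZIP archive.

    Args:
        file_data: file-like object pointing at a ZIP archive

    Returns:
        dict with a textual summary and a list of pandas DataFrames,
        mirroring the format used by the example extractors below
    """
    with zipfile.ZipFile(file_data) as zfile:
        names = zfile.namelist()
    data_frame = pd.DataFrame({"filename": names})
    return {
        "summary": f"The archive contains {len(names)} files.",
        "data_frames": [data_frame],
    }
```
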
This project makes use of [Poetry](https://python-poetry.org/), which makes
creating the required Python wheel a straightforward process. Install
Poetry with the following command: `pip install poetry`.
Then install the required Python packages with `poetry install`.

The behavior of the `process` function can be verified by running the
tests. The tests are located in the `tests` folder. To run the tests
execute: `poetry run pytest`.
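
As an illustration, a test along these lines could build a small Takeout archive in memory and check the shape of the `process` output. The sketch below targets the Google Search History extractor included in this pull request (its `process` function appears further down); the test name, the archive layout, and the import path are illustrative assumptions, not the contents of the shipped `tests` folder.

```python
"""Sketch of a pytest-style test for the Google Search History `process` function."""
import io
import json
import zipfile

from google_search_history import process  # assumed import path for the example package


def test_process_returns_summary_and_data_frames():
    # Build a minimal fake Takeout archive in memory.
    history = {"Browser History": [
        {"time_usec": 1_612_137_600_000_000,  # microseconds since epoch
         "page_transition": "LINK",
         "url": "https://nos.nl/artikel/example"},
        {"time_usec": 1_625_000_000_000_000,
         "page_transition": "LINK",
         "url": "https://example.com/other"},
    ]}
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zfile:
        zfile.writestr("Takeout/Chrome/BrowserHistory.json", json.dumps(history))
    buffer.seek(0)

    result = process(buffer)

    # The extractor returns a textual summary and one tidy DataFrame.
    assert set(result) == {"summary", "data_frames"}
    assert len(result["data_frames"]) == 1
```
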
To run the extraction code from the browser run:
`python3 -m http.server` from the project root folder (the one with `.git`).
This will start a webserver on [localhost](http://localhost:8000).
Opening a browser with that URL will initialize the application. After
it has been loaded a file can be selected. The output of the `process`
function will be displayed after a while (depending on the amount of
processing required and the speed of the machine).

# Examples

## Google Semantic Location History
In this example, we first create a simulated Google Semantic Location History
(GSLH) data download package (DDP). Subsequently, we extract relevant information
from the simulated DDP.

### Data simulation
Command:
`poetry run python google_semantic_location_history/simulation_gslh.py`

This creates a zipfile with the simulated Google Semantic Location
History data in `tests/data/Location History.zip`.

GSLH data is simulated using the Python libraries
[GenSON](<https://pypi.org/project/genson/>),
[Faker](<https://github.com/joke2k/faker>), and
[faker-schema](<https://pypi.org/project/faker-schema/>).
The simulation script first generates a JSON schema from a JSON object using
GenSON's `SchemaBuilder` class. The JSON object is derived from an
example GSLH data package, downloaded from Google Takeout. The GSLH data
package consists of monthly JSON files with information on e.g.
geolocations, addresses, and time spent in places and on activities. The
JSON schema describes the format of the JSON files, and can be used to
generate fake data with the same format. This is done by converting the
JSON schema to a custom schema expected by `faker-schema` in the form of
a dictionary, where the keys are field names and the values are the
types of the fields. The values represent available data types in Faker,
packed in so-called providers. `Faker` provides a wide variety of data
types via providers, for example for names, addresses, and geographical
data. This allows us to easily customize the faked data to our
specifications.
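
To make the schema-driven simulation concrete, here is a hedged sketch of the general pattern: GenSON infers a JSON schema from an example object, and faker-schema fills a field-to-provider mapping with fake values. The example object and the hand-written field mapping below are invented for illustration and are much simpler than the real GSLH package; only the library calls reflect the GenSON and faker-schema usage described above.

```python
"""Illustrative sketch of schema-driven data simulation (not the actual simulation_gslh.py)."""
from genson import SchemaBuilder
from faker_schema.faker_schema import FakerSchema

# 1. Infer a JSON schema from a (here: invented) example object.
example = {"placeVisit": {"location": {"address": "Some Street 1", "name": "Some Place"},
                          "duration": {"startTimestampMs": "1611400000000"}}}
builder = SchemaBuilder()
builder.add_object(example)
json_schema = builder.to_schema()  # describes the structure of the example object

# 2. Map field names to Faker provider names (hand-written here; the real
#    script derives this mapping from the generated JSON schema).
custom_schema = {"address": "address", "name": "company", "startTimestampMs": "unix_time"}

# 3. Generate a fake record that follows the custom schema.
faker = FakerSchema()
fake_record = faker.generate_fake(custom_schema)

print(json_schema)
print(fake_record)
```
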
### Data extraction
Command:
`poetry run python google_semantic_location_history/main.py`

This extracts and displays the relevant data and summary. It calls the
same `process` function as the web application would. To use the GSLH
data extraction script in the web application, one needs to specify this
in [pyworker.js](../pyworker.js) by changing
`data_extractor/__init__.py` into
`google_semantic_location_history/__init__.py`.

## Google Search History
In this example, we first create a simulated Google Search History
(GSH) DDP. Subsequently, we extract relevant information from the simulated DDP.

### Data simulation
Command:
`poetry run python google_search_history/simulation_gsh.py`

This will generate a dummy Google Takeout ZIP file containing a
simulated BrowserHistory.json file, and store it in `tests/data/takeout.zip`.

The BrowserHistory.json file mainly describes what
websites were visited by the user and when. To simulate (news) web
pages, you can either base them on real URLs (see
`tests/data/urldata.csv`) or create entirely fake ones using
[Faker](<https://github.com/joke2k/faker>). The timestamp of each web
visit is set to be before, during, or after a certain period (in this
case, the Dutch COVID-19 related curfew), and is randomly generated
using the [datetime](<https://docs.python.org/3/library/datetime.html>)
and [random](<https://docs.python.org/3/library/random.html>) libraries.
A minimal sketch of this approach is shown after the list of options below.

Note that, even though the script is seeded and will, therefore, always
yield the same outcome, there are various options to adapt the output
depending on your personal (research) goal. These options are:
- *n*: integer, size of BrowserHistory.json (i.e. number of web visits).
  Default=1000,
- *site_diff*: float, percentage of generated websites that should be 'news'
  sites. Default=0.15,
- *time_diff*: float, minimal percentage of web searches that were specifically
  made in the evening during the curfew period. Default=0.20,
- *seed*: integer, sets seed. Default=0,
- *fake*: boolean, determines if URLs are based on true URLs (False) or entirely
  fake (True). Default=False
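
The sketch below illustrates the simulation idea in a hedged form: it draws visit timestamps around the curfew window with `random` and `datetime`, fakes URLs with Faker, and writes a `BrowserHistory.json` with the fields the extraction script expects (`time_usec`, `page_transition`, `url`). The function name and parameters are illustrative; they are not guaranteed to match `simulation_gsh.py`, and the *site_diff*/*time_diff* options are omitted for brevity.

```python
"""Illustrative sketch of simulating a BrowserHistory.json (not the actual simulation_gsh.py)."""
import json
import random
import zipfile
from datetime import datetime, timedelta

from faker import Faker


def simulate_history(n=1000, seed=0, path="tests/data/takeout.zip"):
    """Write a dummy Takeout ZIP with `n` fake web visits around the curfew period."""
    random.seed(seed)
    Faker.seed(seed)
    fake = Faker()

    start = datetime(2021, 1, 23, 21)   # start of the Dutch curfew
    window = timedelta(days=180)        # spread visits before/during/after it
    visits = []
    for _ in range(n):
        offset = random.randint(-int(window.total_seconds()), int(window.total_seconds()))
        moment = start + timedelta(seconds=offset)
        visits.append({
            "time_usec": int(moment.timestamp() * 1e6),  # microseconds, as in Takeout
            "page_transition": "LINK",
            "url": fake.url(),
        })

    with zipfile.ZipFile(path, "w") as zfile:
        zfile.writestr("Takeout/Chrome/BrowserHistory.json",
                       json.dumps({"Browser History": visits}))


if __name__ == "__main__":
    simulate_history()
```
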
### Data extraction
Command:
`poetry run python google_search_history/main.py`

After running this script, the relevant data (i.e., an overview of the
number of visits to news vs. other websites before, during, and after
the curfew, and the corresponding time of day of the visits) are
extracted from the (simulated) takeout.zip and displayed in a dataframe
together with a textual summary. It calls the same `process` function as
the web application would. To use the GSH data extraction script in the
web application, specify this in [pyworker.js](../pyworker.js) by changing
`data_extractor/__init__.py` into
`google_search_history/__init__.py`.
@@ -0,0 +1,142 @@
""" Script to extract info from Google Browser History """

__version__ = '0.1.0'

import json
import re
import zipfile
from datetime import datetime
import pandas as pd
import pytz

ZONE = pytz.timezone('Europe/Amsterdam')
START = datetime(2021, 1, 23, 21).astimezone(ZONE)
END = datetime(2021, 4, 28, 4, 30).astimezone(ZONE)
TEXT = f"""
With this research we want to investigate how our news consumption \
behavior has changed during/after the COVID-19 related curfew. \
To examine this, we looked at your Google Search History.
First, we divided your browser history into three periods:
- before the start of the curfew (i.e., pages visited before \
{START.strftime('%A %d-%m-%Y %H:%M:%S')}),
- during the curfew (i.e., pages visited between \
{START.strftime('%A %d-%m-%Y %H:%M:%S')} and {END.strftime('%A %d-%m-%Y %H:%M:%S')}), and
- post curfew (i.e., pages visited after {END.strftime('%A %d-%m-%Y %H:%M:%S')}).
For each period, we counted how many times you visited a news \
website versus any other type of website (i.e., news/other). \
While counting, we also took the time of day \
(i.e., morning/afternoon/evening/night) into account.
"""
NEWSSITES = 'news.google.com|nieuws.nl|nos.nl|www.rtlnieuws.nl|nu.nl|\
at5.nl|ad.nl|bd.nl|telegraaf.nl|volkskrant.nl|parool.nl|\
metronieuws.nl|nd.nl|nrc.nl|rd.nl|trouw.nl'


def _calculate(dates):
    """Counts number of web searches per time unit (morning, afternoon,
    evening, night), per website-period combination
    Args:
        dates: dictionary, dates per website (news vs. other) per period
            (pre, during, post curfew)
    Returns:
        results: list, number of times websites are visited per unit of time
    """
    results = []
    for category in dates.keys():
        sub = {'morning': 0, 'afternoon': 0,
               'evening': 0, 'night': 0}
        sub['Curfew'], sub['Website'] = category.split('_')
        for date in dates[category]:
            hour = date.hour
            if 0 <= hour < 6:
                sub['night'] += 1
            elif 6 <= hour < 12:
                sub['morning'] += 1
            elif 12 <= hour < 18:
                sub['afternoon'] += 1
            elif 18 <= hour < 24:
                sub['evening'] += 1
        results.append(sub)
    return results


def _extract(data):
    """Extracts relevant data from browser history:
    - number of times websites (news vs. other) are visited
    - at which specific periods (pre, during, and post curfew),
    - on which specific times a day (morning, afternoon, evening, night).
    Args:
        data: BrowserHistory.json file
    Returns:
        results: list, number of times websites are visited per unit of time,
            per moment
        earliest: datetime, earliest web search
        latest: datetime, latest web search
    """
    # Count number of news vs. other websites per time period
    # (i.e., pre/during/after Dutch curfew)
    dates = {'before_news': [], 'during_news': [], 'post_news': [],
             'before_other': [], 'during_other': [], 'post_other': []}
    timestamps = [d["time_usec"]
                  for d in data["Browser History"] if d["page_transition"].lower() != "reload"]
    earliest = datetime.fromtimestamp(min(timestamps)/1e6).astimezone(ZONE)
    latest = datetime.fromtimestamp(max(timestamps)/1e6).astimezone(ZONE)
    for data_unit in data["Browser History"]:
        if data_unit["page_transition"].lower() != "reload":
            time = datetime.fromtimestamp(data_unit["time_usec"]/1e6).astimezone(ZONE)
            if time < START and re.findall(NEWSSITES, data_unit["url"]):
                dates['before_news'].append(time)
            elif time > END and re.findall(NEWSSITES, data_unit["url"]):
                dates['post_news'].append(time)
            elif time < START and not re.findall(NEWSSITES, data_unit["url"]):
                dates['before_other'].append(time)
            elif time > END and not re.findall(NEWSSITES, data_unit["url"]):
                dates['post_other'].append(time)
            elif re.findall(NEWSSITES, data_unit["url"]):
                dates['during_news'].append(time)
            elif not re.findall(NEWSSITES, data_unit["url"]):
                dates['during_other'].append(time)
    # Calculate times visited per time unit
    # (i.e., morning, afternoon, evening, night)
    results = _calculate(dates)
    return results, earliest, latest


def process(file_data):
    """ Opens BrowserHistory.json and returns relevant data pre, during,
    and post Dutch curfew
    Args:
        file_data: Takeout zipfile
    Returns:
        summary: summary of read file(s), earliest and latest web search
        data_frames: pd.DataFrame, overview of news vs. other searches
            per moment per time unit
    """
    # Read BrowserHistory.json
    with zipfile.ZipFile(file_data) as zfile:
        file_list = zfile.namelist()
        for name in file_list:
            if re.search('BrowserHistory.json', name):
                data = json.loads(zfile.read(name).decode("utf8"))
    # Extract pre/during/post website searches,
    # earliest webclick and latest webclick
    results, earliest, latest = _extract(data)
    # Make tidy dataframe of webclicks
    df_results = pd.melt(pd.json_normalize(results), ["Curfew", "Website"],
                         var_name="Time", value_name="Searches")
    data_frame = df_results.sort_values(
        ['Curfew', 'Website']).reset_index(drop=True)
    # Return output
    text = f"""{TEXT}
read_files: BrowserHistory.json
Your earliest web search was on {earliest.strftime('%A %d-%m-%Y at %H:%M:%S')},
The Dutch curfew took place between {START.strftime('%A %d-%m-%Y %H:%M:%S')} \
and {END.strftime('%A %d-%m-%Y %H:%M:%S')},
Your latest web search was on {latest.strftime('%A %d-%m-%Y at %H:%M:%S')}.
"""
    return {
        "summary": text,
        "data_frames": [
            data_frame
        ]
    }
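
To see this extractor's output outside the browser, something like the following would work, assuming the simulated `tests/data/takeout.zip` from the simulation step exists; `google_search_history/main.py` presumably does essentially this, though its exact contents are not shown in this diff.

```python
"""Sketch: run the Google Search History extractor on the simulated Takeout ZIP."""
from google_search_history import process  # assumed import path for the example package

if __name__ == "__main__":
    # Open the simulated donation package and run the same `process`
    # function the web application would call.
    with open("tests/data/takeout.zip", "rb") as file_data:
        result = process(file_data)
    print(result["summary"])
    print(result["data_frames"][0])
```
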