
Merge pull request #6 from UtrechtUniversity/master
UU scripts
vloothuis authored Oct 5, 2021
2 parents 2fbb793 + 67fe546 commit c5e6ad3
Showing 25 changed files with 422,482 additions and 100 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/on_pull_request.yml
@@ -33,12 +33,12 @@ jobs:

- name: Pylint
working-directory: data_extractor
run: poetry run pylint google_semantic_location_history
run: poetry run pylint google_semantic_location_history google_search_history

- name: Flake8
working-directory: data_extractor
run: poetry run flake8 google_semantic_location_history
run: poetry run flake8 google_semantic_location_history google_search_history

- name: Pytest
working-directory: data_extractor
run: poetry run pytest -v --cov=google_semantic_location_history --cov=data_extractor --cov-fail-under=80 tests/
run: poetry run pytest -v --cov=google_semantic_location_history --cov=google_search_history --cov=data_extractor --cov-fail-under=80 tests/
10 changes: 10 additions & 0 deletions .gitignore
@@ -1,2 +1,12 @@

*.pyc
data_extractor/tests/data/BrowserHistory.json
data_extractor/.coverage
data_extractor/.ipynb_checkpoints/
data_extractor/google_search_history/POC.py
.coverage
.python-version
Lib/
Scripts/
.vscode/
pyvenv.cfg
16 changes: 12 additions & 4 deletions README.md
@@ -1,21 +1,29 @@
# Eyra Port POC

## Description
This proof-of-concept shows how Python can be used, from within a web-browser,
to extract data for research purposes. The end-user (data donator) needs only
to have a modern web-browser.

An application is provided which creates a basic user-interface for selecting a
file. This file is then submitted to a Python interperter. The relevant file
file. This file is then submitted to a Python interpreter. The relevant file
API's of the web-browser have been wrapped so that the Python code does not
need to be aware of running in a web-browser.

Example code for the Python part of the project is provided in the
`data_extractor` folder. This also contains a *README* file which explains how
to modify this code.
[data_extractor](data_extractor/) folder. This also contains a [README](data_extractor/README.md)
which explains how to modify this code and presents two examples of data extraction scripts,
namely for Google Semantic Location History and Google Search History data.

To run the application execute the following command from the checkout:

python3 -m http.server

This launches a web-server that can be access on
This launches a web-server that can be accessed on
[http://localhost:8000](http://localhost:8000).


## Contributors
Port POC is developed by [Eyra](https://github.com/eyra). The example data extraction scripts for
the Google Semantic Location History and Google Search History data packages were developed by the
[Research Engineering team](https://github.com/orgs/UtrechtUniversity/teams/research-engineering) of Utrecht University.
15 changes: 15 additions & 0 deletions data_extractor/.coveragerc
@@ -0,0 +1,15 @@
# .coveragerc to control coverage.py

[report]
# Regexes for lines to exclude from consideration
exclude_lines =
# Have to re-enable the standard pragma
pragma: no cover

# Don't complain if non-runnable code isn't run:
if 0:
if __name__ == .__main__.:


[run]
omit = */main.py
5 changes: 4 additions & 1 deletion data_extractor/.pylintrc
@@ -1,4 +1,7 @@
[FORMAT]

# Maximum number of characters on a single line.
max-line-length=100
max-line-length=100

# Maximum number of locals for function / method body
max-locals=20
120 changes: 120 additions & 0 deletions data_extractor/README.md
@@ -0,0 +1,120 @@
# Data Extractor

This is a basic template which can be used to write a data extractor.
The extraction logic should be placed in the `process` function within
`data_extractor/__init__.py`. An example has been provided.

The argument that the `process` function receives is a file-like object.
It can therefore be used with most regular Python libraries. The example
demonstrates this by using the `zipfile` module.
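
For orientation, here is a minimal sketch of what such a `process` function could look like, assuming a ZIP file is selected by the user; the return value is only illustrative (the bundled examples show the exact structure the application expects):

```python
import zipfile


def process(file_data):
    """Toy example: report which files the uploaded ZIP archive contains."""
    with zipfile.ZipFile(file_data) as zfile:
        names = zfile.namelist()
    return {"summary": f"The archive contains {len(names)} files."}
```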

This project uses [Poetry](https://python-poetry.org/), which makes
creating the required Python wheel a straightforward process. Install
Poetry with `pip install poetry`,
then install the required Python packages with `poetry install`.

The behavior of the `process` function can be verified by running the
tests. The tests are located in the `tests` folder. To run them,
execute: `poetry run pytest`.
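
A hypothetical test for the sketch above could look as follows (the file name, path, and assertion are illustrative):

```python
# tests/test_process.py (hypothetical)
from data_extractor import process


def test_process_reports_archive_contents():
    # Uses the simulated data package described in the examples below.
    with open("tests/data/Location History.zip", "rb") as zip_file:
        result = process(zip_file)
    assert "summary" in result
```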

To run the extraction code from the browser, run
`python3 -m http.server` from the project root folder (the one containing `.git`).
This will start a web server at [http://localhost:8000](http://localhost:8000).
Opening that URL in a browser will initialize the application. After it
has loaded, a file can be selected. The output of the `process`
function will be displayed after a while (depending on the amount of
processing required and the speed of the machine).

# Examples

## Google Semantic Location History
In this example, we first create a simulated Google Semantic Location History
(GSLH) data download package (DDP). Subsequently, we extract relevant information
from the simulated DDP.

### Data simulation
Command:
`poetry run python google_semantic_location_history/simulation_gslh.py`

This creates a zipfile with the simulated Google Semantic Location
History data in `tests/data/Location History.zip`.

GSLH data is simulated using the Python libraries
[GenSON](<https://pypi.org/project/genson/>),
[Faker](<https://github.com/joke2k/faker>), and
[faker-schema](<https://pypi.org/project/faker-schema/>).
The simulation script first generates a JSON schema from a JSON object using
GenSON's `SchemaBuilder` class. The JSON object is derived from an
example GSLH data package, downloaded from Google Takeout. The GSLH data
package consists of monthly JSON files with information on, e.g.,
geolocations, addresses, and time spent in places and on activities. The
JSON schema describes the format of the JSON files, and can be used to
generate fake data with the same format. This is done by converting the
JSON schema to a custom schema expected by `faker-schema` in the form of
a dictionary, where the keys are field names and the values are the
types of the fields. The values represent available data types in Faker,
packed in so-called providers. `Faker` provides a wide variety of data
types via providers, for example for names, addresses, and geographical
data. This allows us to easily customize the faked data to our
specifications.
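
A rough sketch of these two steps, using a toy JSON object instead of a real GSLH package (the field names and Faker providers below are purely illustrative):

```python
from genson import SchemaBuilder
from faker_schema.faker_schema import FakerSchema

# 1. Infer a JSON schema from an example object (a toy stand-in here).
builder = SchemaBuilder()
builder.add_object({"address": "Some Street 1", "durationMinutes": 42})
json_schema = builder.to_schema()

# 2. Convert to the custom schema expected by faker-schema: a dictionary
#    mapping field names to Faker provider names (done by hand here).
custom_schema = {"address": "address", "durationMinutes": "random_int"}
fake_place = FakerSchema().generate_fake(custom_schema)
print(json_schema)
print(fake_place)
```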


### Data extraction
Command:
`poetry run python google_semantic_location_history/main.py`

This extracts and displays the relevant data and summary. It calls the
same `process` function as the web application would. To use the GSLH
data extraction script in the web application, one needs to specify this
in [pyworker.js](../pyworker.js) by changing
`data_extractor/__init__.py` into
`google_semantic_location_history/__init__.py`.

## Google Search History
In this example, we first create a simulated Google Search History
(GSH) DDP. Subsequently, we extract relevant information from the simulated DDP.

### Data simulation
Command:
`poetry run python google_search_history/simulation_gsh.py`

This will generate a dummy Google Takeout ZIP file containing a
simulated BrowserHistory.json file and store it in `tests/data/takeout.zip`.

The BrowserHistory.json file mainly describes which websites the user
visited and when. To simulate (news) web pages, you can either base
them on real URLs (see `tests/data/urldata.csv`) or create entirely
fake ones using [Faker](<https://github.com/joke2k/faker>). The timestamp
of each web visit is set to be before, during, or after a certain period
(in this case, the Dutch COVID-19-related curfew), and is randomly generated
using the [datetime](<https://docs.python.org/3/library/datetime.html>)
and [random](<https://docs.python.org/3/library/random.html>) libraries.
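
A simplified sketch of how such a timestamp can be drawn; the curfew window matches the one used in the extraction script, but the actual simulation may be structured differently:

```python
import random
from datetime import datetime, timedelta

START = datetime(2021, 1, 23, 21)    # start of the Dutch curfew
END = datetime(2021, 4, 28, 4, 30)   # end of the Dutch curfew


def random_visit_during_curfew(rng=random.Random(0)):
    """Draw a random visit timestamp inside the curfew window."""
    span = (END - START).total_seconds()
    return START + timedelta(seconds=rng.uniform(0, span))


print(random_visit_during_curfew())
```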

Note that, even though the script is seeded and will therefore always
yield the same outcome, there are several options to adapt the output
to your personal (research) goal; a small illustration of the defaults
follows the list. The options are:
- *n*: integer, size of BrowserHistory.json (i.e., the number of web visits).
Default=1000,
- *site_diff*: float, fraction of generated websites that should be 'news'
sites. Default=0.15,
- *time_diff*: float, minimum fraction of web searches that are specifically
made in the evening during the curfew period. Default=0.20,
- *seed*: integer, sets the random seed. Default=0,
- *fake*: boolean, determines whether URLs are based on real URLs (False) or are entirely
fake (True). Default=False
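
For instance, with the defaults the simulated data would roughly contain the following (a back-of-the-envelope illustration only; the script may apply these proportions differently):

```python
# Back-of-the-envelope illustration of the default option values; the
# simulation script itself may combine these proportions differently.
n, site_diff, time_diff = 1000, 0.15, 0.20
news_visits = round(n * site_diff)            # roughly 150 visits to news sites
evening_curfew_visits = round(n * time_diff)  # at least ~200 evening visits during the curfew
print(news_visits, evening_curfew_visits)
```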

### Data extraction
Command:
`poetry run python google_search_history/main.py`

After running this script, the relevant data (i.e., an overview of the
number of visits to news vs. other websites before, during, and after
the curfew, and the corresponding time of day of those visits) are
extracted from the (simulated) `takeout.zip` and displayed in a dataframe
together with a textual summary. It calls the same `process` function as
the web application would. To use the GSH data extraction script in the
web application, specify this in [pyworker.js](../pyworker.js) by changing
`data_extractor/__init__.py` into
`google_search_history/__init__.py`.
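
A hypothetical sketch of such a call (the bundled `main.py` may differ in its details; the path refers to the simulated file generated above):

```python
# Hypothetical driver script; the bundled main.py may differ.
from google_search_history import process

with open("tests/data/takeout.zip", "rb") as takeout:
    result = process(takeout)

print(result["summary"])
print(result["data_frames"][0])
```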

31 changes: 0 additions & 31 deletions data_extractor/README.rst

This file was deleted.

142 changes: 142 additions & 0 deletions data_extractor/google_search_history/__init__.py
@@ -0,0 +1,142 @@
""" Script to extract info from Google Browser History """

__version__ = '0.1.0'

import json
import re
import zipfile
from datetime import datetime
import pandas as pd
import pytz

ZONE = pytz.timezone('Europe/Amsterdam')
START = datetime(2021, 1, 23, 21).astimezone(ZONE)
END = datetime(2021, 4, 28, 4, 30).astimezone(ZONE)
TEXT = f"""
With this research we want to investigate how our news consumption \
behavior has changed during/after the COVID-19 related curfew. \
To examine this, we looked at your Google Search History.
First, we divided your browser history into three periods:
- before the start of the curfew (i.e., pages visited before \
{START.strftime('%A %d-%m-%Y %H:%M:%S')}),
- during the curfew (i.e., pages visited between \
{START.strftime('%A %d-%m-%Y %H:%M:%S')} and {END.strftime('%A %d-%m-%Y %H:%M:%S')}), and
- post curfew (i.e., pages visited after {END.strftime('%A %d-%m-%Y %H:%M:%S')}).
For each period, we counted how many times you visited a news \
website versus any other type of website (i.e., news/other). \
While counting, we also took the time of day \
(i.e., morning/afternoon/evening/night) into account.
"""
NEWSSITES = 'news.google.com|nieuws.nl|nos.nl|www.rtlnieuws.nl|nu.nl|\
at5.nl|ad.nl|bd.nl|telegraaf.nl|volkskrant.nl|parool.nl|\
metronieuws.nl|nd.nl|nrc.nl|rd.nl|trouw.nl'


def _calculate(dates):
"""Counts number of web searches per time unit (morning, afternoon,
evening, night), per website-period combination
Args:
dates: dictionary, dates per website (news vs. other) per period
(pre, during, post curfew)
Returns:
results: list, number of times websites are visited per unit of time
"""
results = []
for category in dates.keys():
sub = {'morning': 0, 'afternoon': 0,
'evening': 0, 'night': 0}
sub['Curfew'], sub['Website'] = category.split('_')
for date in dates[category]:
hour = date.hour
if 0 <= hour < 6:
sub['night'] += 1
elif 6 <= hour < 12:
sub['morning'] += 1
elif 12 <= hour < 18:
sub['afternoon'] += 1
elif 18 <= hour < 24:
sub['evening'] += 1
results.append(sub)
return results


def _extract(data):
"""Extracts relevant data from browser history:
- number of times websites (news vs. other) are visited
- at which specific periods (pre, during, and post curfew),
- on which specific times a day (morning, afternoon, evening, night).
Args:
data: BrowserHistory.json file
Returns:
results: list, number of times websites are visited per unit of time,
per moment
earliest: datetime, earliest web search
latest: datetime, latest web search
"""
# Count number of news vs. other websites per time period
# (i.e., pre/during/after Dutch curfew)
dates = {'before_news': [], 'during_news': [], 'post_news': [],
'before_other': [], 'during_other': [], 'post_other': []}
timestamps = [d["time_usec"]
for d in data["Browser History"] if d["page_transition"].lower() != "reload"]
earliest = datetime.fromtimestamp(min(timestamps)/1e6).astimezone(ZONE)
latest = datetime.fromtimestamp(max(timestamps)/1e6).astimezone(ZONE)
for data_unit in data["Browser History"]:
if data_unit["page_transition"].lower() != "reload":
time = datetime.fromtimestamp(data_unit["time_usec"]/1e6).astimezone(ZONE)
if time < START and re.findall(NEWSSITES, data_unit["url"]):
dates['before_news'].append(time)
elif time > END and re.findall(NEWSSITES, data_unit["url"]):
dates['post_news'].append(time)
elif time < START and not re.findall(NEWSSITES, data_unit["url"]):
dates['before_other'].append(time)
elif time > END and not re.findall(NEWSSITES, data_unit["url"]):
dates['post_other'].append(time)
elif re.findall(NEWSSITES, data_unit["url"]):
dates['during_news'].append(time)
elif not re.findall(NEWSSITES, data_unit["url"]):
dates['during_other'].append(time)
# Calculate times visited per time unit
# (i.e., morning, afternoon, evening, night)
results = _calculate(dates)
return results, earliest, latest


def process(file_data):
""" Opens BrowserHistory.json and return relevant data pre, during,
and post Dutch curfew
Args:
file_data: Takeout zipfile
Returns:
summary: summary of read file(s), earliest and latest web search
data_frames: pd.dataframe, overview of news vs. other searches
per moment per time unit
"""
# Read BrowserHistory.json
with zipfile.ZipFile(file_data) as zfile:
file_list = zfile.namelist()
for name in file_list:
if re.search('BrowserHistory.json', name):
data = json.loads(zfile.read(name).decode("utf8"))
# Extract pre/during/post website searches,
# earliest webclick and latest webclick
results, earliest, latest = _extract(data)
# Make tidy dataframe of webclicks
df_results = pd.melt(pd.json_normalize(results), ["Curfew", "Website"],
var_name="Time", value_name="Searches")
data_frame = df_results.sort_values(
['Curfew', 'Website']).reset_index(drop=True)
# Return output
text = f"""{TEXT}
read_files: BrowserHistory.json
Your earliest web search was on {earliest.strftime('%A %d-%m-%Y at %H:%M:%S')},
The Dutch curfew took place between {START.strftime('%A %d-%m-%Y %H:%M:%S')} \
and {END.strftime('%A %d-%m-%Y %H:%M:%S')},
Your latest web search was on {latest.strftime('%A %d-%m-%Y at %H:%M:%S')}.
"""
return {
"summary": text,
"data_frames": [
data_frame
]
}