
Merge pull request #6 from UtrechtUniversity/master
UU scripts
vloothuis authored Oct 5, 2021
2 parents 2fbb793 + 67fe546 commit c5e6ad3
Showing 25 changed files with 422,482 additions and 100 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/on_pull_request.yml
@@ -33,12 +33,12 @@ jobs:

- name: Pylint
working-directory: data_extractor
run: poetry run pylint google_semantic_location_history
run: poetry run pylint google_semantic_location_history google_search_history

- name: Flake8
working-directory: data_extractor
run: poetry run flake8 google_semantic_location_history
run: poetry run flake8 google_semantic_location_history google_search_history

- name: Pytest
working-directory: data_extractor
run: poetry run pytest -v --cov=google_semantic_location_history --cov=data_extractor --cov-fail-under=80 tests/
run: poetry run pytest -v --cov=google_semantic_location_history --cov=google_search_history --cov=data_extractor --cov-fail-under=80 tests/
10 changes: 10 additions & 0 deletions .gitignore
@@ -1,2 +1,12 @@

*.pyc
data_extractor/tests/data/BrowserHistory.json
data_extractor/.coverage
data_extractor/.ipynb_checkpoints/
data_extractor/google_search_history/POC.py
.coverage
.python-version
Lib/
Scripts/
.vscode/
pyvenv.cfg
16 changes: 12 additions & 4 deletions README.md
@@ -1,21 +1,29 @@
# Eyra Port POC

## Description
This proof-of-concept shows how Python can be used, from within a web-browser,
to extract data for research purposes. The end-user (data donator) needs only
to have a modern web-browser.

An application is provided which creates a basic user-interface for selecting a
file. This file is then submitted to a Python interperter. The relevant file
file. This file is then submitted to a Python interpreter. The relevant file
API's of the web-browser have been wrapped so that the Python code does not
need to be aware of running in a web-browser.

Example code for the Python part of the project is provided in the
`data_extractor` folder. This also contains a *README* file which explains how
to modify this code.
[data_extractor](data_extractor/) folder. This also contains a [README](data_extractor/README.md)
which explains how to modify this code and presents two examples of data extraction scripts,
namely for Google Semantic Location History and Google Search History data.

To run the application execute the following command from the checkout:

python3 -m http.server

This launches a web-server that can be access on
This launches a web-server that can be accessed on
[http://localhost:8000](http://localhost:8000).


## Contributors
Port POC is developed by [Eyra](https://github.com/eyra). The example data extraction scripts for
the Google Semantic Location History and Google Search History data packages were developed by the
[Research Engineering team](https://github.com/orgs/UtrechtUniversity/teams/research-engineering) of Utrecht University.
15 changes: 15 additions & 0 deletions data_extractor/.coveragerc
@@ -0,0 +1,15 @@
# .coveragerc to control coverage.py

[report]
# Regexes for lines to exclude from consideration
exclude_lines =
# Have to re-enable the standard pragma
pragma: no cover

# Don't complain if non-runnable code isn't run:
if 0:
if __name__ == .__main__.:


[run]
omit = */main.py
5 changes: 4 additions & 1 deletion data_extractor/.pylintrc
@@ -1,4 +1,7 @@
[FORMAT]

# Maximum number of characters on a single line.
max-line-length=100
max-line-length=100

# Maximum number of locals for function / method body
max-locals=20
120 changes: 120 additions & 0 deletions data_extractor/README.md
@@ -0,0 +1,120 @@
# Data Extractor

This is a basic template which can be used to write a data extractor.
The extraction logic should be placed in the `process` function within
`data_extractor/__init__.py`. An example has been provided.

The argument that the `process` function receives is a file-like object.
It can therefore be used with most regular Python libraries. The example
demonstrates this by using the `zipfile` module.
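
For orientation, here is a minimal sketch of what such a `process` function could look like, assuming a ZIP file is selected by the user; the return value is only illustrative (the bundled examples show the exact structure the application expects):

```python
import zipfile


def process(file_data):
    """Toy example: report which files the uploaded ZIP archive contains."""
    with zipfile.ZipFile(file_data) as zfile:
        names = zfile.namelist()
    return {"summary": f"The archive contains {len(names)} files."}
```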

This project uses [Poetry](https://python-poetry.org/), which makes
creating the required Python wheel a straightforward process. Install
Poetry with `pip install poetry`,
then install the required Python packages with `poetry install`.

The behavior of the `process` function can be verified by running the
tests. The tests are located in the `tests` folder. To run them,
execute: `poetry run pytest`.
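
A hypothetical test for the sketch above could look as follows (the file name, path, and assertion are illustrative):

```python
# tests/test_process.py (hypothetical)
from data_extractor import process


def test_process_reports_archive_contents():
    # Uses the simulated data package described in the examples below.
    with open("tests/data/Location History.zip", "rb") as zip_file:
        result = process(zip_file)
    assert "summary" in result
```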

To run the extraction code from the browser, run
`python3 -m http.server` from the project root folder (the one containing `.git`).
This will start a web server at [http://localhost:8000](http://localhost:8000).
Opening that URL in a browser will initialize the application. After it
has loaded, a file can be selected. The output of the `process`
function will be displayed after a while (depending on the amount of
processing required and the speed of the machine).

# Examples

## Google Semantic Location History
In this example, we first create a simulated Google Semantic Location History
(GSLH) data download package (DDP). Subsequently, we extract relevant information
from the simulated DDP.

### Data simulation
Command:
`poetry run python google_semantic_location_history/simulation_gslh.py`

This creates a zipfile with the simulated Google Semantic Location
History data in `tests/data/Location History.zip`.

GSLH data is simulated using the Python libraries
[GenSON](<https://pypi.org/project/genson/>),
[Faker](<https://github.com/joke2k/faker>), and
[faker-schema](<https://pypi.org/project/faker-schema/>).
The simulation script first generates a JSON schema from a JSON object using
GenSON's `SchemaBuilder` class. The JSON object is derived from an
example GSLH data package, downloaded from Google Takeout. The GSLH data
package consists of monthly JSON files with information on, e.g.,
geolocations, addresses, and time spent in places and on activities. The
JSON schema describes the format of the JSON files, and can be used to
generate fake data with the same format. This is done by converting the
JSON schema to a custom schema expected by `faker-schema` in the form of
a dictionary, where the keys are field names and the values are the
types of the fields. The values represent available data types in Faker,
packed in so-called providers. `Faker` provides a wide variety of data
types via providers, for example for names, addresses, and geographical
data. This allows us to easily customize the faked data to our
specifications.
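
A rough sketch of these two steps, using a toy JSON object instead of a real GSLH package (the field names and Faker providers below are purely illustrative):

```python
from genson import SchemaBuilder
from faker_schema.faker_schema import FakerSchema

# 1. Infer a JSON schema from an example object (a toy stand-in here).
builder = SchemaBuilder()
builder.add_object({"address": "Some Street 1", "durationMinutes": 42})
json_schema = builder.to_schema()

# 2. Convert to the custom schema expected by faker-schema: a dictionary
#    mapping field names to Faker provider names (done by hand here).
custom_schema = {"address": "address", "durationMinutes": "random_int"}
fake_place = FakerSchema().generate_fake(custom_schema)
print(json_schema)
print(fake_place)
```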


### Data extraction
Command:
`poetry run python google_semantic_location_history/main.py`

This extracts and displays the relevant data and summary. It calls the
same `process` function as the web application would. To use the GSLH
data extraction script in the web application, one needs to specify this
in [pyworker.js](../pyworker.js) by changing
`data_extractor/__init__.py` into
`google_semantic_location_history/__init__.py`.

## Google Search History
In this example, we first create a simulated Google Search History
(GSH) DDP. Subsequently, we extract relevant information from the simulated DDP.

### Data simulation
Command:
`poetry run python google_search_history/simulation_gsh.py`

This will generate a dummy Google Takeout ZIP file containing a
simulated BrowserHistory.json file and store it in `tests/data/takeout.zip`.

The BrowserHistory.json file mainly describes which websites the user
visited and when. To simulate (news) web pages, you can either base
them on real URLs (see `tests/data/urldata.csv`) or create entirely
fake ones using [Faker](<https://github.com/joke2k/faker>). The timestamp
of each web visit is set to be before, during, or after a certain period
(in this case, the Dutch COVID-19-related curfew), and is randomly generated
using the [datetime](<https://docs.python.org/3/library/datetime.html>)
and [random](<https://docs.python.org/3/library/random.html>) libraries.
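
A simplified sketch of how such a timestamp can be drawn; the curfew window matches the one used in the extraction script, but the actual simulation may be structured differently:

```python
import random
from datetime import datetime, timedelta

START = datetime(2021, 1, 23, 21)    # start of the Dutch curfew
END = datetime(2021, 4, 28, 4, 30)   # end of the Dutch curfew


def random_visit_during_curfew(rng=random.Random(0)):
    """Draw a random visit timestamp inside the curfew window."""
    span = (END - START).total_seconds()
    return START + timedelta(seconds=rng.uniform(0, span))


print(random_visit_during_curfew())
```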

Note that, even though the script is seeded and will therefore always
yield the same outcome, there are several options to adapt the output
to your personal (research) goal; a small illustration of the defaults
follows the list. The options are:
- *n*: integer, size of BrowserHistory.json (i.e., the number of web visits).
Default=1000,
- *site_diff*: float, fraction of generated websites that should be 'news'
sites. Default=0.15,
- *time_diff*: float, minimum fraction of web searches that are specifically
made in the evening during the curfew period. Default=0.20,
- *seed*: integer, sets the random seed. Default=0,
- *fake*: boolean, determines whether URLs are based on real URLs (False) or are entirely
fake (True). Default=False
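
For instance, with the defaults the simulated data would roughly contain the following (a back-of-the-envelope illustration only; the script may apply these proportions differently):

```python
# Back-of-the-envelope illustration of the default option values; the
# simulation script itself may combine these proportions differently.
n, site_diff, time_diff = 1000, 0.15, 0.20
news_visits = round(n * site_diff)            # roughly 150 visits to news sites
evening_curfew_visits = round(n * time_diff)  # at least ~200 evening visits during the curfew
print(news_visits, evening_curfew_visits)
```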

### Data extraction
Command:
`poetry run python google_search_history/main.py`

After running this script, the relevant data (i.e., an overview of the
number of visits to news vs. other websites before, during, and after
the curfew, and the corresponding time of day of those visits) are
extracted from the (simulated) `takeout.zip` and displayed in a dataframe
together with a textual summary. It calls the same `process` function as
the web application would. To use the GSH data extraction script in the
web application, specify this in [pyworker.js](../pyworker.js) by changing
`data_extractor/__init__.py` into
`google_search_history/__init__.py`.
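
A hypothetical sketch of such a call (the bundled `main.py` may differ in its details; the path refers to the simulated file generated above):

```python
# Hypothetical driver script; the bundled main.py may differ.
from google_search_history import process

with open("tests/data/takeout.zip", "rb") as takeout:
    result = process(takeout)

print(result["summary"])
print(result["data_frames"][0])
```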

31 changes: 0 additions & 31 deletions data_extractor/README.rst

This file was deleted.

142 changes: 142 additions & 0 deletions data_extractor/google_search_history/__init__.py
@@ -0,0 +1,142 @@
""" Script to extract info from Google Browser History """

__version__ = '0.1.0'

import json
import re
import zipfile
from datetime import datetime
import pandas as pd
import pytz

ZONE = pytz.timezone('Europe/Amsterdam')
START = datetime(2021, 1, 23, 21).astimezone(ZONE)
END = datetime(2021, 4, 28, 4, 30).astimezone(ZONE)
TEXT = f"""
With this research we want to investigate how our news consumption \
behavior has changed during/after the COVID-19 related curfew. \
To examine this, we looked at your Google Search History.
First, we divided your browser history into three periods:
- before the start of the curfew (i.e., pages visited before \
{START.strftime('%A %d-%m-%Y %H:%M:%S')}),
- during the curfew (i.e., pages visited between \
{START.strftime('%A %d-%m-%Y %H:%M:%S')} and {END.strftime('%A %d-%m-%Y %H:%M:%S')}), and
- post curfew (i.e., pages visited after {END.strftime('%A %d-%m-%Y %H:%M:%S')}).
For each period, we counted how many times you visited a news \
website versus any other type of website (i.e., news/other). \
While counting, we also took the time of day \
(i.e., morning/afternoon/evening/night) into account.
"""
NEWSSITES = 'news.google.com|nieuws.nl|nos.nl|www.rtlnieuws.nl|nu.nl|\
at5.nl|ad.nl|bd.nl|telegraaf.nl|volkskrant.nl|parool.nl|\
metronieuws.nl|nd.nl|nrc.nl|rd.nl|trouw.nl'


def _calculate(dates):
"""Counts number of web searches per time unit (morning, afternoon,
evening, night), per website-period combination
Args:
dates: dictionary, dates per website (news vs. other) per period
(pre, during, post curfew)
Returns:
results: list, number of times websites are visited per unit of time
"""
results = []
for category in dates.keys():
sub = {'morning': 0, 'afternoon': 0,
'evening': 0, 'night': 0}
sub['Curfew'], sub['Website'] = category.split('_')
for date in dates[category]:
hour = date.hour
if 0 <= hour < 6:
sub['night'] += 1
elif 6 <= hour < 12:
sub['morning'] += 1
elif 12 <= hour < 18:
sub['afternoon'] += 1
elif 18 <= hour < 24:
sub['evening'] += 1
results.append(sub)
return results


def _extract(data):
"""Extracts relevant data from browser history:
- number of times websites (news vs. other) are visited
- at which specific periods (pre, during, and post curfew),
- on which specific times a day (morning, afternoon, evening, night).
Args:
data: BrowserHistory.json file
Returns:
results: list, number of times websites are visited per unit of time,
per moment
earliest: datetime, earliest web search
latest: datetime, latest web search
"""
# Count number of news vs. other websites per time period
# (i.e., pre/during/after Dutch curfew)
dates = {'before_news': [], 'during_news': [], 'post_news': [],
'before_other': [], 'during_other': [], 'post_other': []}
timestamps = [d["time_usec"]
for d in data["Browser History"] if d["page_transition"].lower() != "reload"]
earliest = datetime.fromtimestamp(min(timestamps)/1e6).astimezone(ZONE)
latest = datetime.fromtimestamp(max(timestamps)/1e6).astimezone(ZONE)
for data_unit in data["Browser History"]:
if data_unit["page_transition"].lower() != "reload":
time = datetime.fromtimestamp(data_unit["time_usec"]/1e6).astimezone(ZONE)
if time < START and re.findall(NEWSSITES, data_unit["url"]):
dates['before_news'].append(time)
elif time > END and re.findall(NEWSSITES, data_unit["url"]):
dates['post_news'].append(time)
elif time < START and not re.findall(NEWSSITES, data_unit["url"]):
dates['before_other'].append(time)
elif time > END and not re.findall(NEWSSITES, data_unit["url"]):
dates['post_other'].append(time)
elif re.findall(NEWSSITES, data_unit["url"]):
dates['during_news'].append(time)
elif not re.findall(NEWSSITES, data_unit["url"]):
dates['during_other'].append(time)
# Calculate times visited per time unit
# (i.e., morning, afternoon, evening, night)
results = _calculate(dates)
return results, earliest, latest


def process(file_data):
""" Opens BrowserHistory.json and return relevant data pre, during,
and post Dutch curfew
Args:
file_data: Takeout zipfile
Returns:
summary: summary of read file(s), earliest and latest web search
data_frames: pd.dataframe, overview of news vs. other searches
per moment per time unit
"""
# Read BrowserHistory.json
with zipfile.ZipFile(file_data) as zfile:
file_list = zfile.namelist()
for name in file_list:
if re.search('BrowserHistory.json', name):
data = json.loads(zfile.read(name).decode("utf8"))
# Extract pre/during/post website searches,
# earliest webclick and latest webclick
results, earliest, latest = _extract(data)
# Make tidy dataframe of webclicks
df_results = pd.melt(pd.json_normalize(results), ["Curfew", "Website"],
var_name="Time", value_name="Searches")
data_frame = df_results.sort_values(
['Curfew', 'Website']).reset_index(drop=True)
# Return output
text = f"""{TEXT}
read_files: BrowserHistory.json
Your earliest web search was on {earliest.strftime('%A %d-%m-%Y at %H:%M:%S')},
The Dutch curfew took place between {START.strftime('%A %d-%m-%Y %H:%M:%S')} \
and {END.strftime('%A %d-%m-%Y %H:%M:%S')},
Your latest web search was on {latest.strftime('%A %d-%m-%Y at %H:%M:%S')}.
"""
return {
"summary": text,
"data_frames": [
data_frame
]
}