From e3d7471240676539678b518484c4d3be90dea201 Mon Sep 17 00:00:00 2001 From: Rantaharju Jarno Date: Tue, 5 Dec 2023 21:03:42 +0200 Subject: [PATCH] Include Google TakeOut example in input_formats Need to add activity example --- docs/input-formats.ipynb | 516 ++++++++++++++++++++++++++++++++++++++- 1 file changed, 514 insertions(+), 2 deletions(-) diff --git a/docs/input-formats.ipynb b/docs/input-formats.ipynb index 082444cc..f52581ea 100644 --- a/docs/input-formats.ipynb +++ b/docs/input-formats.ipynb @@ -10,7 +10,9 @@ "In principle, Niimpy can deal with any files of any format - you only need to convert them to a DataFrame. Still, it is very useful to have some common formats, so we present two standard formats with default readers:\n", "\n", "* **CSV files** are very standard and normal to create and understand, but in order to deal with them everything must be loaded into memory.\n", - "* **sqlite3 databases**, which requires sqlite3 to read, but provides more power for filtering and automatic processing without reading everything into memory." + "* **sqlite3 databases**, which requires sqlite3 to read, but provides more power for filtering and automatic processing without reading everything into memory.\n", + "* **Google TakeOut** provides a large selection of data in different formats. We provide readers for the most commonly used data types.\n", + "* **MHealth** is a common format for health data." ] }, { @@ -111,6 +113,516 @@ "sqlite3 files are highly recommended as a data storage format, since many common exploration options can be done within the database itself without reading the whole data into memory or writing an iterator. However, the interface is more difficult to use. Niimpy (before 2021-07) used this as its primary interface, but since then this interface has been de-emphasized. You can read more in [the database section](database.html), but this is only recommended if you need efficiency when using massive amounts of data." 
] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Google TakeOut\n", + "\n", + "Google TakeOut contains many different types of data, and new types are added as Google creates services or changes data storage methods. Readers are currently available for location data and activity data from the Google Fit app. For other data types, the user needs to convert them manually into a Niimpy-compatible Pandas DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " accuracy source device \\\n", + "timestamp \n", + "2016-08-12 19:29:43.821000+00:00 25 WIFI -577680260 \n", + "2016-08-12 19:30:49.531000+00:00 21 WIFI -577680260 \n", + "2016-08-12 19:31:49.531000+00:00 21 WIFI -577680260 \n", + "2016-08-12 21:15:55.295000+00:00 1500 CELL -577680260 \n", + "2016-08-12 21:16:33+00:00 8 GPS -577680260 \n", + "\n", + " placeid formfactor \\\n", + "timestamp \n", + "2016-08-12 19:29:43.821000+00:00 NaN NaN \n", + "2016-08-12 19:30:49.531000+00:00 NaN NaN \n", + "2016-08-12 19:31:49.531000+00:00 ChIJS_5Nmuz1jUYRGYf3QiiZco4 PHONE \n", + "2016-08-12 21:15:55.295000+00:00 NaN NaN \n", + "2016-08-12 21:16:33+00:00 NaN NaN \n", + "\n", + " latitude longitude inferred_latitude \\\n", + "timestamp \n", + "2016-08-12 19:29:43.821000+00:00 35.997488 -78.922194 NaN \n", + "2016-08-12 19:30:49.531000+00:00 35.997559 -78.922504 NaN \n", + "2016-08-12 19:31:49.531000+00:00 35.997559 -78.922504 60.187135 \n", + "2016-08-12 21:15:55.295000+00:00 36.000870 -78.923343 NaN \n", + "2016-08-12 21:16:33+00:00 35.997250 -78.923989 NaN \n", + "\n", + " inferred_longitude activity_type \\\n", + "timestamp \n", + "2016-08-12 19:29:43.821000+00:00 NaN NaN \n", + "2016-08-12 19:30:49.531000+00:00 NaN STILL \n", + "2016-08-12 19:31:49.531000+00:00 24.824478 STILL \n", + "2016-08-12 21:15:55.295000+00:00 NaN ON_FOOT \n", + "2016-08-12 21:16:33+00:00 NaN NaN \n", + "\n", + " activity_inference_confidence \\\n", + "timestamp \n", + "2016-08-12 19:29:43.821000+00:00 NaN \n", + "2016-08-12 19:30:49.531000+00:00 62.0 \n", + "2016-08-12 19:31:49.531000+00:00 62.0 \n", + "2016-08-12 21:15:55.295000+00:00 54.0 \n", + "2016-08-12 21:16:33+00:00 NaN \n", + "\n", + " user \n", + "timestamp \n", + "2016-08-12 19:29:43.821000+00:00 9e9cff5a-93a0-11ee-addd-b0dcef010c43 \n", + "2016-08-12 19:30:49.531000+00:00 9e9cff5a-93a0-11ee-addd-b0dcef010c43 \n", + "2016-08-12 19:31:49.531000+00:00 9e9cff5a-93a0-11ee-addd-b0dcef010c43 \n", + "2016-08-12 
21:15:55.295000+00:00 9e9cff5a-93a0-11ee-addd-b0dcef010c43 \n", + "2016-08-12 21:16:33+00:00 9e9cff5a-93a0-11ee-addd-b0dcef010c43 " + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import niimpy\n", + "import niimpy.config as config\n", + "import niimpy.preprocessing.location as nilo\n", + "\n", + "data = niimpy.reading.google_takeout.location_history(config.GOOGLE_TAKEOUT_PATH)\n", + "data = nilo.filter_location(\n", + " data,\n", + " latitude_column = \"latitude\",\n", + " longitude_column = \"longitude\",\n", + " remove_disabled=False, remove_network=False, remove_zeros=True\n", + ")\n", + "data.head()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " user device \\\n", + "timestamp \n", + "2016-08-31 00:00:00+00:00 9e9cff5a-93a0-11ee-addd-b0dcef010c43 -577680260 \n", + "\n", + " dist_total n_bins speed_average speed_variance \\\n", + "timestamp \n", + "2016-08-31 00:00:00+00:00 822.227218 6.0 2.013277 15.403444 \n", + "\n", + " speed_max variance log_variance \n", + "timestamp \n", + "2016-08-31 00:00:00+00:00 10.764752 0.000002 -13.007195 " + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "nilo.location_distance_features(data, {\n", + " \"latitude_column\": \"latitude\",\n", + " \"longitude_column\": \"longitude\",\n", + "})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Each subject downloads their Google TakeOut data as a separate zip file. The zipfile module, which is included in the Python standard library, is convenient for reading the data files contained in the zip file. For example, one could read the location data with the following code:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " latitudeE7 longitudeE7 accuracy source deviceTag \\\n", + "0 359974880 -789221943 25 WIFI -577680260 \n", + "1 359975588 -789225036 21 WIFI -577680260 \n", + "2 359975588 -789225036 21 WIFI -577680260 \n", + "3 360008703 -789233433 1500 CELL -577680260 \n", + "4 359972502 -789239894 8 GPS -577680260 \n", + "\n", + " timestamp \\\n", + "0 2016-08-12T19:29:43.821Z \n", + "1 2016-08-12T19:30:49.531Z \n", + "2 2016-08-12T19:31:49.531Z \n", + "3 2016-08-12T21:15:55.295Z \n", + "4 2016-08-12T21:16:33Z \n", + "\n", + " activity \\\n", + "0 NaN \n", + "1 [{'activity': [{'type': 'STILL', 'confidence':... \n", + "2 [{'activity': [{'type': 'STILL', 'confidence':... \n", + "3 [{'activity': [{'type': 'ON_FOOT', 'confidence... \n", + "4 NaN \n", + "\n", + " locationMetadata \\\n", + "0 NaN \n", + "1 NaN \n", + "2 [{'wifiScan': {'accessPoints': [{'mac': '12410... \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " placeId formFactor \\\n", + "0 NaN NaN \n", + "1 NaN NaN \n", + "2 ChIJS_5Nmuz1jUYRGYf3QiiZco4 PHONE \n", + "3 NaN NaN \n", + "4 NaN NaN \n", + "\n", + " inferredLocation \\\n", + "0 NaN \n", + "1 NaN \n", + "2 [{'timestamp': '2023-11-21T10:40:35.320Z', 'la... \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " activeWifiScan.accessPoints \n", + "0 NaN \n", + "1 NaN \n", + "2 [{'mac': '124103876652832', 'strength': -63, '... 
\n", + "3 NaN \n", + "4 NaN " + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from zipfile import ZipFile\n", + "import json\n", + "import pandas as pd\n", + "from niimpy import config\n", + "\n", + "zip_file = ZipFile(config.GOOGLE_TAKEOUT_PATH)\n", + "json_data = zip_file.read(\"Takeout/Location History/Records.json\")\n", + "json_data = json.loads(json_data)\n", + "data = pd.json_normalize(json_data[\"locations\"])\n", + "data = pd.DataFrame(data)\n", + "data.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Location data is stored in JSON format. Other types of data are stored in various formats and with different file structures. The user must find out how each type of data they need is stored and how it can be read in Python." + ] + }, { "attachments": {}, "cell_type": "markdown", @@ -138,7 +650,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.3" + "version": "3.11.4" } }, "nbformat": 4,