Merge pull request #1 from datacommonsorg/master
Update forked repo
clincoln8 authored Aug 3, 2020
2 parents 88ef6bc + 03026df commit 3f04b84
Showing 101 changed files with 10,097 additions and 492 deletions.
8 changes: 4 additions & 4 deletions cloudbuild.py.yaml
@@ -18,7 +18,7 @@ steps:
entrypoint: python3
args: ["-m", "unittest", "discover", "-v", "-s", "util/", "-p", "*_test.py"]
# In cloudbuild, everything happens under /workspace path
- env: ["PYTHONPATH=/workspace"]
+ env: ["PYTHONPATH=/workspace:/workspace/scripts"]
waitFor:
- python_install

@@ -27,7 +27,7 @@ steps:
entrypoint: python3
args: ["-m", "unittest", "discover", "-v", "-s", "scripts/", "-p", "*_test.py"]
# In cloudbuild, everything happens under /workspace path
- env: ["PYTHONPATH=/workspace"]
+ env: ["PYTHONPATH=/workspace:/workspace/scripts"]
waitFor:
- python_install

@@ -36,7 +36,7 @@ steps:
entrypoint: python3
args: ["-m", "unittest", "discover", "-v", "-s", "import-automation/", "-p", "*_test.py"]
# In cloudbuild, everything happens under /workspace path
- env: ["PYTHONPATH=/workspace"]
+ env: ["PYTHONPATH=/workspace:/workspace/scripts"]
waitFor:
- python_install

@@ -45,7 +45,7 @@ steps:
entrypoint: python3
args: ["-m", "yapf", "--recursive", "--diff", "--style", "google", "util/", "scripts/", "tools/", "docs/", "schema/"]
# In cloudbuild, everything happens under /workspace path
- env: ["PYTHONPATH=/workspace"]
+ env: ["PYTHONPATH=/workspace:/workspace/scripts"]
waitFor:
- python_install

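The only change in these steps is appending `/workspace/scripts` to `PYTHONPATH`, so tests discovered under `scripts/` can import modules relative to that directory. A minimal local equivalent of one test step — a sketch assuming it is run from the repository root, which stands in for `/workspace`:

```python
import os
import subprocess

# Mirror the cloudbuild step: put the repo root and scripts/ on PYTHONPATH.
root = os.getcwd()
env = dict(os.environ,
           PYTHONPATH=os.pathsep.join([root, os.path.join(root, 'scripts')]))

# Same unittest discovery invocation as the 'scripts/' step above.
subprocess.run(
    ['python3', '-m', 'unittest', 'discover', '-v', '-s', 'scripts/',
     '-p', '*_test.py'],
    env=env, check=True)
```
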
3 changes: 3 additions & 0 deletions docs/mcf_format.md
@@ -259,6 +259,9 @@ containedInPlace: geoId/06
letter is lower-cased.
[(EXAMPLE)](https://browser.datacommons.org/kg?dcid=healthOutcome)

- DCIDs are limited to 256 characters. A node whose DCID exceeds 256
  characters causes a syntax error when it is uploaded to Data Commons (see
  the sketch after this diff).

## MCF Types in Data Commons

We employ two main types of MCF nodes: instance MCF and template MCF. We may
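Given the 256-character limit added above, a minimal pre-upload check — an illustrative sketch, not part of the import tooling; `MAX_DCID_LENGTH` and `check_dcid` are names invented here:

```python
MAX_DCID_LENGTH = 256  # limit documented in mcf_format.md above

def check_dcid(dcid: str) -> str:
    """Returns the DCID unchanged if it is within the length limit."""
    if len(dcid) > MAX_DCID_LENGTH:
        raise ValueError(
            f'DCID is {len(dcid)} characters, exceeding the '
            f'{MAX_DCID_LENGTH}-character limit: {dcid[:40]}...')
    return dcid

check_dcid('geoId/06')  # OK; an over-long DCID would raise ValueError
```
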
142 changes: 142 additions & 0 deletions import-automation/executor/.gitignore
@@ -0,0 +1,142 @@
.idea
.DS_Store

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other info into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, when collaborators use different platforms, or when dependencies lack
# cross-platform support, pipenv may install dependencies that don't work or fail
# to install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

5 changes: 0 additions & 5 deletions import-automation/executor/README.md

This file was deleted.

11 changes: 11 additions & 0 deletions import-automation/executor/app.yaml
@@ -0,0 +1,11 @@
runtime: python37
entrypoint: gunicorn --timeout 600 -b :$PORT app.main:FLASK_APP
env_variables:
EXECUTOR_PRODUCTION: "True"
TMPDIR: "/tmp"
basic_scaling:
max_instances: 5
idle_timeout: 10m
# 2048 MB RAM and 4.8 GHz CPU.
# See https://cloud.google.com/appengine/docs/standard#instance_classes.
instance_class: B8
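
`app/main.py` is not among the files shown here; the sketch below shows the shape of the module the gunicorn entrypoint above expects — a Flask application object named `FLASK_APP`, plus the `main()` that `app/__main__.py` (next file) calls. The health-check route is purely illustrative:

```python
import flask

FLASK_APP = flask.Flask(__name__)

@FLASK_APP.route('/healthz')  # hypothetical route, for illustration only
def healthz():
    return 'ok'

def main():
    # Local development server; on App Engine, gunicorn serves FLASK_APP
    # according to the entrypoint in app.yaml instead.
    FLASK_APP.run(host='127.0.0.1', port=8080)
```
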
21 changes: 21 additions & 0 deletions import-automation/executor/app/__main__.py
@@ -0,0 +1,21 @@
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Entry point when run as a module.
See https://docs.python.org/3/library/__main__.html.
"""

from app import main

main.main()
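
With this entry point, the executor can also be started locally with `python3 -m app` from the executor directory, which runs the same `main.main()` call.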
102 changes: 102 additions & 0 deletions import-automation/executor/app/configs.py
@@ -0,0 +1,102 @@
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Configurations for the executor.
The app endpoints accept a configs field that allows customization of all the
configurations. See main.py.
"""

import os
import typing
import dataclasses

from google.cloud import logging


def _production():
return 'EXECUTOR_PRODUCTION' in os.environ


@dataclasses.dataclass
class ExecutorConfig:
"""Configurations for the executor."""
# ID of the Google Cloud project that hosts the executor. The project
# needs to enable App Engine, Cloud Storage, and Cloud Scheduler.
gcp_project_id: str = 'google.com:datcom-data'
# OAuth client ID used to authenticate with the import progress dashboard,
# which is protected by Identity-Aware Proxy. This can be found by going
# to the Identity-Aware Proxy page of the Google Cloud project that hosts
# the dashboard and clicking 'Edit OAuth Client'.
dashboard_oauth_client_id: str = ''
# Access token of the account used to authenticate with GitHub. This is not
# the account password. See
# https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token.
github_auth_access_token: str = ''
# Username of the account used to authenticate with GitHub.
github_auth_username: str = 'intrepiditee'
# Name of the repository that contains all the imports.
# On commits, this is the repository to which pull requests must be sent to
# trigger the executor; the source repositories of those pull requests are
# downloaded.
# On updates, this is the repository that is downloaded.
github_repo_name: str = 'data'
# Username of the owner of the repository that contains all the imports.
github_repo_owner_username: str = 'datacommonsorg'
# Manifest filename. The executor uses this name to identify manifests.
manifest_filename: str = 'manifest.json'
# Python module requirements filename. The executor installs the modules
# listed in these files before running the user scripts.
requirements_filename: str = 'requirements.txt'
# Name of the Cloud Storage bucket to store the generated data files.
storage_bucket_name: str = 'import-inputs'
# Name of the file that specifies the most recently generated data files
# of an import. These files are stored in the bucket at a level higher
# than the data files. For example,
# import-inputs
# -- scripts
# ---- us_fed
# ------ treasury
# -------- latest_version.txt
# -------- 2020_07_15T12_07_17_365264_07_00
# ---------- data.csv
# -------- 2020_07_14T12_07_12_552234_07_00
# ---------- data.csv
# The content of latest_version.txt would be a single line of
# '2020_07_15T12_07_17_365264_07_00'.
storage_version_filename: str = 'latest_version.txt'
# Types of inputs accepted by the Data Commons importer. These are
# also the accepted fields of an import_inputs value in the manifest.
import_input_types: typing.Tuple[str, ...] = ('template_mcf', 'cleaned_csv',
'node_mcf')
# ID of the location where Cloud Scheduler is hosted.
scheduler_location: str = 'us-central1'
# Maximum time, in seconds, a user script can run.
user_script_timeout: float = 600
# Maximum time venv creation can take in seconds.
venv_create_timeout: float = 600
# Maximum time downloading a file can take in seconds.
file_download_timeout: float = 600
# Maximum time downloading the repo can take in seconds.
repo_download_timeout: float = 600


def _setup_logging():
client = logging.Client(project=ExecutorConfig.gcp_project_id)
client.get_default_handler()
client.setup_logging()


if _production():
_setup_logging()
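
The `latest_version.txt` convention documented in `ExecutorConfig` can be consumed as below — a sketch assuming a recent `google-cloud-storage` client; the function name and example prefix are illustrative:

```python
from google.cloud import storage

def latest_version_prefix(bucket_name: str, import_prefix: str) -> str:
    """Resolves an import prefix to its most recent versioned folder."""
    bucket = storage.Client().bucket(bucket_name)
    blob = bucket.blob(f'{import_prefix}/latest_version.txt')
    version = blob.download_as_text().strip()  # single line, per the comment above
    return f'{import_prefix}/{version}'

# e.g. latest_version_prefix('import-inputs', 'scripts/us_fed/treasury')
# -> 'scripts/us_fed/treasury/2020_07_15T12_07_17_365264_07_00'
```

Since `ExecutorConfig` is a dataclass, the per-request overrides mentioned in the module docstring can presumably be applied with `dataclasses.replace(ExecutorConfig(), **overrides)`.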