Add Sonoma Data Scraper #57

Merged
merged 70 commits into from
Aug 18, 2020
Changes from 23 commits
10f0dfe
organization Merge CDM readme into readme
May 2, 2020
e5bceba
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
May 2, 2020
7db78f7
organization Move data models to own folder
May 2, 2020
05dbee7
organization Replace tabs with spaces
May 7, 2020
998800b
sonoma Get top level metadata
May 7, 2020
3fcb13f
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
May 9, 2020
ed855bd
sonoma Move scraper and collect metadata
May 10, 2020
fdab8a4
sonoma Add transmission types
May 10, 2020
8bd2081
sonoma Get cases, active, recovered, and death series
May 12, 2020
bd72db8
sonoma Get case data by age
May 12, 2020
ee5a8b7
sonoma Fix table numbers
May 16, 2020
7745e5b
sonoma Add test getter
May 16, 2020
e7ab26f
sonoma Factor out some common code
May 16, 2020
dc9b9fe
sonoma Add cases by race
May 16, 2020
af8bfe2
sonoma Add hospitalizations
May 17, 2020
adbe419
sonoma Add hospitalizations by gender
May 17, 2020
6b71193
sonoma Fix type error
May 17, 2020
627e82a
sonoma Redo definitions getter
May 17, 2020
a565a83
sonoma Add get_county function
May 17, 2020
358a441
sonoma Add docstrings
May 17, 2020
7dc3beb
sonoma Comment out hospitalizations by gender
May 17, 2020
6a4ead9
sonoma Add docstring for gender hospitalization
May 17, 2020
336e5ac
sonoma Remove unused variable
May 17, 2020
5297eeb
sonoma Replace findAll with find_all
May 19, 2020
5093fe3
sonoma Make newlines clearer
May 19, 2020
48dd3c1
sonoma Comment out hospitalizations
May 21, 2020
2a76315
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
May 21, 2020
a8ce742
sonoma Use better date parser
May 21, 2020
c9f3500
sonoma Improve transform cases function
May 21, 2020
058a555
sonoma Fix date formats, table selection, and number parsing
May 21, 2020
fd4e135
sonoma Use custom int parse function
May 21, 2020
8310ca0
sonoma Create custom FormatError exception
May 21, 2020
148eec8
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
May 23, 2020
ada9b2a
sonoma use template defaults for race
May 23, 2020
40b84e0
sonoma Fix test breakage
May 23, 2020
eb1a489
sonoma Use unique functions for age and gender
May 23, 2020
fdb2045
sonoma Transform age group names
May 23, 2020
1e1b0a8
sonoma Add error handling for gender and age transformations
May 23, 2020
1f3755a
sonoma Rename scraper file
May 23, 2020
96b81b5
sonoma Fix error handling for age
May 23, 2020
fb339b4
sonoma Fix typing errors
May 23, 2020
5d96031
sonoma Factor out getting section by title
May 28, 2020
fd09e5e
sonoma Correct deaths and cases aggregation
May 28, 2020
7770c89
sonoma Raise error for hospitalization change
May 28, 2020
6b4b69b
sonoma Add error for getting section by title
May 28, 2020
f1c7f05
sonoma Fix typing issue for age
May 28, 2020
5c9a9ed
sonoma Write parse table function
May 31, 2020
41d61c4
Fix typo
Jun 7, 2020
06163e2
sonoma Comment and typing fixes
Jun 7, 2020
ba6df28
Use raw string for regex
Jun 7, 2020
ad0e174
Merge branch 'sonoma' of github.com:sfbrigade/data-covid19-sfbayarea …
Jun 7, 2020
2bf3faf
sonoma Remove commented out code
Jun 7, 2020
bac1b5b
sonoma Remove unused variable
Jun 7, 2020
6a0ef8c
sonoma Add sonoma to init.py
Jun 8, 2020
15456e1
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
Jun 17, 2020
329f92d
sonoma Correct conventions for sonoma
Jun 17, 2020
3822877
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
Jul 30, 2020
f070125
Fix conflicts
Jul 30, 2020
d1aec84
Fix error import
Jul 30, 2020
1deaa9c
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
Aug 5, 2020
869418a
Fix linter errors and import
Aug 6, 2020
6ef13b4
Add type aliases
Aug 8, 2020
5fdc2aa
Use get cell function for cases
ldtcooper Aug 8, 2020
aed862f
Remove data model readme from main readme
ldtcooper Aug 11, 2020
898672d
Add readme link
ldtcooper Aug 11, 2020
a549ea4
Refactor test and gender functions
ldtcooper Aug 13, 2020
97b72c1
Refactor all transforn functions but cases
ldtcooper Aug 13, 2020
28df7be
Fix types
ldtcooper Aug 13, 2020
6ddf682
Add docstrings
ldtcooper Aug 13, 2020
4a92856
Use datetime attribute
ldtcooper Aug 13, 2020
156 changes: 155 additions & 1 deletion README.md
Expand Up @@ -9,4 +9,158 @@ To install this project, you can simply run `sh install.sh` in your terminal. Th
To run the scraper, you can use the run script by typing `sh run_scraper.sh` into your terminal. This will enable the virtual environment and run `scraper.py`. Once again, the virtual environment will not stay active after the script finishes running. If you want to run the scraper without the run script, enable the virtual environment, then run `python3 scraper.py`.

## Running the API
The best way to run the API right now is to run the command `FLASK_APP="app.py" FLASK_ENV=development flask run;`. Note that this is not the best way to run the scraper at this time.

## Data Model
The following sections document how the counties differ from the common data model (see the `data_models` directory), as we discover those differences while collecting their data.

### Ages

Please make sure to use the following age brackets for the different counties. Note that the brackets may also vary by whether you are scraping cases or deaths data:


#### San Francisco
##### Cases
"age": [
{"group": "18_and_under", "raw_count": -1 },
{"group": "18_to_30", "raw_count": -1 },
{"group": "31_to_40", "raw_count": -1 },
{"group": "41_to_50", "raw_count": -1 },
{"group": "51_to_60", "raw_count": -1 },
{"group": "61_to_70", "raw_count": -1 },
{"group": "71_to_80", "raw_count": -1 },
{"group": "81_and_older", "raw_count": -1}
]
##### Deaths
Data broken down by gender is not available on the json files, only on the dashboard.


#### Alameda
##### Cases
"age": [
{"group": "18_and_under", "raw_count": -1 },
{"group": "18_to_30", "raw_count": -1 },
{"group": "31_to_40", "raw_count": -1 },
{"group": "41_to_50", "raw_count": -1 },
{"group": "51_to_60", "raw_count": -1 },
{"group": "61_to_70", "raw_count": -1 },
{"group": "71_to_80", "raw_count": -1 },
{"group": "81_and_older", "raw_count": -1 },
{"group": "Unknown", "raw_count": -1 }
]
##### Deaths
Data broken down by gender is not available.


#### Sonoma
##### Cases
"age": [
{"group": "0_to_17", "raw_count": -1 },
{"group": "18_to_49", "raw_count": -1 },
{"group": "50_to_64", "raw_count": -1 },
{"group": "65_and_older", "raw_count": -1 },
{"group": "Unknown", "raw_count": -1 }
]
##### Deaths
Data broken down by gender is not available.


#### Santa Clara
##### Cases
"age": [
{"group": "20_and_under", "raw_count": -1 },
{"group": "21_to_30", "raw_count": -1 },
{"group": "31_to_40", "raw_count": -1 },
{"group": "41_to_50", "raw_count": -1 },
{"group": "51_to_60", "raw_count": -1 },
{"group": "61_to_70", "raw_count": -1 },
{"group": "71_to_80", "raw_count": -1 },
{"group": "81_to_90", "raw_count": -1 },
{"group": "90_and_older", "raw_count": -1 },
{"group": "Unknown", "raw_count": -1 }
]
##### Deaths
"age": [
{"group": "20_and_under", "raw_count": -1 },
{"group": "21_to_30", "raw_count": -1 },
{"group": "31_to_40", "raw_count": -1 },
{"group": "41_to_50", "raw_count": -1 },
{"group": "51_to_60", "raw_count": -1 },
{"group": "61_to_70", "raw_count": -1 },
{"group": "71_to_80", "raw_count": -1 },
{"group": "81_to_90", "raw_count": -1 },
{"group": "90_and_older", "raw_count": -1 }
]


#### San Mateo
##### Cases
"age": [
{"group": "0_to_19", "raw_count": -1 },
{"group": "20_to_29", "raw_count": -1 },
{"group": "30_to_39", "raw_count": -1 },
{"group": "40_to_49", "raw_count": -1 },
{"group": "50_to_59", "raw_count": -1 },
{"group": "60_to_69", "raw_count": -1 },
{"group": "70_to_79", "raw_count": -1 },
{"group": "80_to_89", "raw_count": -1 },
{"group": "90_and_older", "raw_count": -1 }
]
##### Deaths
"age": [
{"group": "0_to_19", "raw_count": -1 },
{"group": "20_to_29", "raw_count": -1 },
{"group": "30_to_39", "raw_count": -1 },
{"group": "40_to_49", "raw_count": -1 },
{"group": "50_to_59", "raw_count": -1 },
{"group": "60_to_69", "raw_count": -1 },
{"group": "70_to_79", "raw_count": -1 },
{"group": "80_to_89", "raw_count": -1 },
{"group": "90_and_older", "raw_count": -1 }
]


#### Contra Costa
##### Cases
"age": [
{"group": "0_to_20", "raw_count": -1 },
{"group": "21_to_40", "raw_count": -1 },
{"group": "41_to_60", "raw_count": -1 },
{"group": "61_to_80", "raw_count": -1 },
{"group": "81_to_100", "raw_count": -1 }
]
##### Deaths
Data broken down by gender is not available.


#### Marin
##### Cases and Deaths
"age": [
{"group": "0_to_18", "raw_count": -1 },
{"group": "19_to_34", "raw_count": -1 },
{"group": "35_to_49", "raw_count": -1 },
{"group": "50_to_64", "raw_count": -1 },
{"group": "65_and_older", "raw_count": -1 }
]



#### Solano
##### Cases and Deaths
"age": [
{"group": "0_to_18", "raw_count": -1 },
{"group": "19_to_64", "raw_count": -1 },
{"group": "65_and_older", "raw_count": -1 }
]


#### Napa
##### Cases
"age": [
{"group": "0_to_17", "raw_count": -1 },
{"group": "18_to_49", "raw_count": -1 },
{"group": "50_to_64", "raw_count": -1 },
{"group": "Over_64", "raw_count": -1 }
]
##### Deaths
Data broken down by gender is not available.
216 changes: 216 additions & 0 deletions data_scrapers/sonoma_county.py
@@ -0,0 +1,216 @@
#!/usr/bin/env python3
import requests
import json
import re
from datetime import datetime
from typing import List, Dict, Union
from bs4 import BeautifulSoup, element # type: ignore

def get_rows(tag: element.Tag) -> List[element.ResultSet]:
    """
    Gets all tr elements in a tag but the first, which is the header
    """
    return tag.findAll('tr')[1:]

def get_cells(row: element.ResultSet) -> List[str]:
    """
    Gets all th and td elements within a single tr element
    """
    return [el.text for el in row.findAll(['th', 'td'])]

def generate_update_time(soup: BeautifulSoup) -> str:
    """
    Generates a timestamp string (e.g. May 6, 2020 10:00 AM) for when the scraper is run
    """
    update_time_text = soup.find('time').text.strip()
    update_datetime = datetime.strptime(update_time_text, '%B %d, %Y %I:%M %p')
Collaborator

Over in news, I use dateutil for this:

date = dateutil.parser.parse(date_string)

strptime works, but ISO 8601 can come in a few different formats, and they could make changes that are still valid but don’t work with your strptime format string. dateutil.parser will handle them, though.

Also, we should make sure the output has a time zone.
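
A minimal sketch of the time zone point above, using only the standard library rather than `dateutil`. It assumes the page exposes an ISO 8601 timestamp. The name `to_aware_iso` and the fixed UTC-7 offset are hypothetical and chosen only for illustration; a real scraper would more likely use a named zone such as America/Los_Angeles.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Pacific Daylight Time as a fixed offset, for illustration only.
PACIFIC_DAYLIGHT = timezone(timedelta(hours=-7))

def to_aware_iso(timestamp: str, default_tz: timezone = PACIFIC_DAYLIGHT) -> str:
    """Parse an ISO 8601 string and return it with an explicit UTC offset."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # Naive input: attach the assumed local zone so the output is unambiguous.
        parsed = parsed.replace(tzinfo=default_tz)
    return parsed.isoformat()
```

Input that already carries an offset passes through unchanged, so the default zone only applies to naive timestamps.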

    return update_datetime.isoformat()

def get_source_meta(soup: BeautifulSoup) -> str:
    """
    Finds the 'Definitions' header on the page and gets all of the text in it
    """
    definitions_header = soup.find('h3', string='Definitions')
    definitions_text = definitions_header.find_parent().text
    return definitions_text.replace('\n', ' ')

# apologies for this horror of a output type
def transform_cases(cases_tag: element.Tag) -> Dict[str, List[Dict[str, Union[str, int]]]]:
Collaborator

I wonder if this result type is worth defining a type alias or maybe a data class for? It’s fairly complex.
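
One way the type alias idea could look, as a sketch. The alias names and the `empty_series` helper are hypothetical; only the nesting is taken from the annotation on `transform_cases`.

```python
from typing import Dict, List, Union

# Hypothetical alias names; the nesting mirrors the annotation on transform_cases.
TimeSeriesItem = Dict[str, Union[str, int]]
TimeSeries = List[TimeSeriesItem]
TimeSeriesCollection = Dict[str, TimeSeries]

def empty_series() -> TimeSeriesCollection:
    """Return the empty shape that transform_cases fills in."""
    return {'cases': [], 'deaths': [], 'recovered': [], 'active': []}
```

With aliases like these, the signature could read `transform_cases(cases_tag: element.Tag) -> TimeSeriesCollection`.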

"""
    Takes in a BeautifulSoup tag for the cases table and returns all cases
    (historic and active), deaths, and recoveries in the form:
    { 'cases': [], 'deaths': [], 'recovered': [], 'active': [] }
    Where each list contains dictionaries (representing each day's data)
    of form (example for cases):
    { 'date': '', 'cases': -1, 'cumul_cases': -1 }
    """
    cases = []
    cumul_cases = 0
    deaths = []
    cumul_deaths = 0
    recovered = []
    cumul_recovered = 0
    active = []
    cumul_active = 0
    rows = get_rows(cases_tag)
    for row in rows:
        row_cells = row.findAll(['th', 'td'])
Collaborator

Should this be using get_cells()?

I played around with the idea of having a generic function to get the rows and cells from a tag, loop through them, and take a function to do any extra cleaning/transformation on them, but that made things more complicated and hard-to-read IMO.

Or, would a better abstraction be something like (warning: untested code):

PUNCTUATION = re.compile(r'[\s,;*"\'()\]]+')
def parse_table(table_tag) -> List[Dict[str, str]]:
    """
    Return a list of dicts representing the data rows of an HTML table tag.
    The dict keys are based on the header row's text, but simplified to be
    all lower-case with no spaces.
    """
    header_row, *rows = table_tag.find_all('tr')
    # NOTE: could create a namedtuple if we want fancy attribute access instead
    headings = [PUNCTUATION.sub('_', cell.text.strip().lower())
                for cell in header_row.find_all(['th', 'td'])]
    results = []
    for row in rows:
        pairs = zip(headings, row.find_all(['th', 'td']))
        # NOTE: could auto-parse ints when the cell text isnumeric() or == '-'
        results.append({heading: cell.text.strip()
                        for heading, cell in pairs})
    return results

# later on...
for row in parse_table(cases_tag):
    active_cases = int(row['active'])

It doesn’t go so far as to try to rename or restructure things, but it at least protects you from columns being reordered and maybe makes access a little simpler.
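
The same idea can be sketched without BeautifulSoup, working on cell text that has already been extracted. Everything here is illustrative: `rows_to_dicts`, `parse_count`, and the `NON_WORD` pattern are hypothetical names, and the pattern is a simpler stand-in for `PUNCTUATION` above.

```python
import re
from typing import Dict, List

# Hypothetical pattern: collapse any run of non-word characters into one underscore.
NON_WORD = re.compile(r'\W+')

def rows_to_dicts(headings: List[str], rows: List[List[str]]) -> List[Dict[str, str]]:
    """Key each row's cell text by its normalized column heading."""
    keys = [NON_WORD.sub('_', heading.strip().lower()).strip('_') for heading in headings]
    return [dict(zip(keys, (cell.strip() for cell in row))) for row in rows]

def parse_count(text: str) -> int:
    """Parse a cell as an int, ignoring commas and treating a lone dash as zero."""
    cleaned = text.strip().replace(',', '')
    return 0 if cleaned in ('-', '–') else int(cleaned)

# Columns are looked up by name, so reordering them on the page would not matter.
table = rows_to_dicts(['Date', 'Active Cases', 'New Cases'], [['5/1', '1,204', '–']])
active_cases = parse_count(table[0]['active_cases'])
new_cases = parse_count(table[0]['new_cases'])
```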

        # print(type(row_cells))
        date = row_cells[0].text.replace('/', '-')

        # instead of 0, this dashboard reports the string '-'
        active_cases, new_infected, dead, recoveries = [0 if el.text == '–' else int(el.text) for el in row_cells[1:]]

        cumul_cases += new_infected
        cases.append({ 'date': date, 'cases': new_infected, 'cumul_cases': cumul_cases })

        new_deaths = dead - cumul_deaths
        deaths.append({ 'date': date, 'deaths': new_deaths, 'cumul_deaths': dead })

        new_recovered = recoveries - cumul_recovered
        recovered.append({ 'date': date, 'recovered': new_recovered, 'cumul_recovered': recoveries })

        new_active = active_cases - cumul_active
        active.append({ 'date': date, 'active': new_active, 'cumul_active': active_cases })

    return { 'cases': cases, 'deaths': deaths, 'recovered': recovered, 'active': active }

def transform_transmission(transmission_tag: element.Tag) -> Dict[str, int]:
    """
    Takes in a BeautifulSoup tag for the transmissions table and breaks it into
    a dictionary of type:
    {'community': -1, 'from_contact': -1, 'travel': -1, 'unknown': -1}
    """
    transmissions = {}
    rows = get_rows(transmission_tag)
    # turns the transmission categories on the page into the ones we're using
    transmission_type_conversion = {'Community': 'community', 'Close Contact': 'from_contact', 'Travel': 'travel', 'Under Investigation': 'unknown'}
    for row in rows:
        type, number, _pct = get_cells(row)
        if type not in transmission_type_conversion:
            raise FutureWarning('The transmission type {0} was not found in transmission_type_conversion'.format(type))
        type = transmission_type_conversion[type]
        transmissions[type] = int(number)
    return transmissions

def transform_tests(tests_tag: element.Tag) -> Dict[str, int]:
    tests = {}
    rows = get_rows(tests_tag)
    for row in rows:
        result, number, _pct = get_cells(row)
        lower_res = result.lower()
        tests[lower_res] = int(number.replace(',', ''))
    return tests

def generic_transform(tag: element.Tag) -> Dict[str, int]:
    """
    Transform function for tables which don't require any special processing.
    Takes in a BeautifulSoup tag for a table and returns a dictionary
    in which the keys are strings and the values integers
    """
    categories = {}
    rows = get_rows(tag)
    for row in rows:
        cat, cases, _pct = get_cells(row)
        categories[cat] = int(cases)
    return categories

def get_unknown_race(race_eth_tag: element.Tag) -> int:
    """
    Gets the notes under the 'Cases by race and ethnicity' table to find the
    number of cases where the person's race is unknown
    """
    parent = race_eth_tag.parent
    note = parent.find('p').text
    matches = re.search(r'(\d+) \(\d{1,3}%\) missing race/ethnicity', note)
    if not matches:
        raise FutureWarning('The format of the note with unknown race data has changed')
    return int(matches.groups()[0])

def transform_race_eth(race_eth_tag: element.Tag) -> Dict[str, int]:
    """
    Takes in the BeautifulSoup tag for the cases by race/ethnicity table and
    transforms it into an object of form:
    'race_eth': {'Asian': -1, 'Latinx_or_Hispanic': -1, 'Other': -1, 'White':-1, 'Unknown': -1}
    NB: These are the only races reported separately by Sonoma county at this time
    """
    race_cases = {}
Collaborator

In Alameda, I believe @elaguerta loaded the defaults from the data model JSON so that any race/ethnicities we don’t have values for come through as -1 or 0 instead of being unlisted. We should probably do the same here.
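
A rough sketch of what filling in defaults could look like. The `RACE_ETH_DEFAULTS` keys and the `with_defaults` helper are hypothetical; in practice the defaults would be read from the data model JSON rather than hard-coded.

```python
from typing import Dict

# Hypothetical defaults for illustration; the real ones live in the data model JSON.
RACE_ETH_DEFAULTS: Dict[str, int] = {
    'Asian': -1,
    'Latinx_or_Hispanic': -1,
    'Other': -1,
    'White': -1,
    'Unknown': -1,
}

def with_defaults(scraped: Dict[str, int], defaults: Dict[str, int]) -> Dict[str, int]:
    """Return the scraped values, with every unreported category set to its default."""
    unexpected = set(scraped) - set(defaults)
    if unexpected:
        # A category the data model does not know about should fail loudly.
        raise ValueError('Categories missing from the data model: {0}'.format(sorted(unexpected)))
    return {**defaults, **scraped}
```

Starting from the defaults means a category the county stops reporting still appears in the output, with the placeholder value, instead of silently disappearing.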

    race_transform = {'Asian/Pacific Islander, non-Hispanic': 'Asian', 'Hispanic/Latino': 'Latinx_or_Hispanic', 'Other*, non-Hispanic': 'Other', 'White, non-Hispanic': 'White'}
    rows = get_rows(race_eth_tag)
    for row in rows:
        group_name, cases, _pct = get_cells(row)
        if group_name not in race_transform:
            raise FutureWarning('The racial group {0} is new in the data -- please adjust the scraper accordingly'.format(group_name))
        internal_name = race_transform[group_name]
        race_cases[internal_name] = int(cases)
    race_cases['Unknown'] = get_unknown_race(race_eth_tag)
    return race_cases

def transform_total_hospitalizations(hospital_tag: element.Tag) -> Dict[str, int]:
    """
    Takes in a BeautifulSoup tag of the cases by hospitalization table and
    returns a dictionary with the numbers of hospitalized and non-hospitalized
    cases
    """
    hospitalizations = {}
    rows = get_rows(hospital_tag)
    for row in rows:
        hospitalized, number, _pct = get_cells(row)
        if hospitalized == 'Yes':
Collaborator

Maybe worth a case-insensitive comparison, just in case this changes in the future.

Collaborator

I thought we would use the chhs as a common source for all hospitalization records. It looks like this is where most counties are getting their information, and we can be sure to have some sort of standard across counties by going straight to the chhs API. This is referenced in issue #29, and it is why the data_model doesn't currently include a hospitalization sub-table: I thought we could add it separately from the individual county scrapers. I decided to exclude the hospitalization data available through SF county, to avoid confusion when we do get to #29.

            hospitalizations['hospitalized'] = int(number)
        else:
            hospitalizations['not_hospitalized'] = int(number)
    return hospitalizations

def transform_gender_hospitalizations(hospital_tag: element.Tag) -> Dict[str, float]:
    """
    Takes in a BeautifulSoup tag representing the percent of cases hospitalized
    by gender and returns a dictionary of those percentages in float form
    e.g. 9% is 0.09
    """
    hospitalized = {}
    rows = get_rows(hospital_tag)
    for row in rows:
        gender, no, yes = get_cells(row)
        yes_int = int(yes.replace('%', ''))
        hospitalized[gender] = (yes_int / 100)
Collaborator

Would it be useful to lower-case gender here?

    return hospitalized

def get_county() -> Dict:
    """Main method for populating county data .json"""
    url = 'https://socoemergency.org/emergency/novel-coronavirus/coronavirus-cases/'
    page = requests.get(url)
Collaborator

Might be useful to call page.raise_for_status() just in case you got an error page.

    sonoma_soup = BeautifulSoup(page.content, 'html.parser')
    tables = sonoma_soup.findAll('table')[4:] # we don't need the first three tables

    try:
        # we have a lot more data here than we are using
        hist_cases, cases_by_source, cases_by_race, total_tests, cases_by_region, region_guide, hospitalized, underlying_cond, symptoms, cases_by_gender, underlying_cond_by_gender, hospitalized_by_gender, symptoms_female, symptoms_male, symptoms_desc, cases_by_age, symptoms_by_age, underlying_cond_by_age = tables
Collaborator

This is super convenient, but I worry about the order changing here. Maybe safer to call sonoma_soup.find(re.compile(r'^h\d$'), string='whatever') for each?

    except ValueError:
        raise FutureWarning('The number of values on the page has changed -- please adjust the scraper')

    model = {
        'name': 'Sonoma County',
        'update_time': generate_update_time(sonoma_soup),
        'source': url,
        'meta_from_source': get_source_meta(sonoma_soup),
        'meta_from_baypd': 'Racial "Other" category includes "Black/African American, American Indian/Alaska Native, and Other"',
        'series': transform_cases(hist_cases),
        'case_totals': {
            'transmission_cat': transform_transmission(cases_by_source),
            'age_group': generic_transform(cases_by_age),
Collaborator

This gets us the wrong format:

        "age_group": {
            "0-17": 58,
            "18-49": 190,
            "50-64": 78,
            "65 and Above": 47,
            "Under Investigation": 0
        },

But we should have a list, and the group names should be slightly different: https://github.com/sfbrigade/data-covid19-sfbayarea/tree/master/data_models#cases-2

[
	{"group": "0_to_17", "raw_count": -1 },
	{"group": "18_to_49", "raw_count": -1 },
	{"group": "50_to_64", "raw_count": -1 },
	{"group": "65_and_older", "raw_count": -1 },
	{"group": "Unknown", "raw_count": -1 }
]

Collaborator

Our data_model.json had inconsistent use of "age_group" values (only 1 instance was a list). I changed all the "age_group" tables to be lists in #52, in progress.
As far as using the group names, I realized I need to go do that for Alameda County and SF county. And we will probably have to manually order the list by group-name, since they're often coming through as unordered .json objects (in curly braces).
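
A sketch of the conversion discussed in this thread: from the dictionary the scraper currently produces to the ordered list the data model expects. The page labels and group names are the ones quoted above; the `AGE_GROUPS` table and the `to_age_list` helper are hypothetical names.

```python
from typing import Dict, List, Tuple, Union

# (label on the page, group name in the data model), in the order the list should appear.
AGE_GROUPS: List[Tuple[str, str]] = [
    ('0-17', '0_to_17'),
    ('18-49', '18_to_49'),
    ('50-64', '50_to_64'),
    ('65 and Above', '65_and_older'),
    ('Under Investigation', 'Unknown'),
]

def to_age_list(raw: Dict[str, int]) -> List[Dict[str, Union[str, int]]]:
    """Convert {page label: count} into the data model's ordered list of groups."""
    unexpected = set(raw) - {label for label, _ in AGE_GROUPS}
    if unexpected:
        # A new bracket on the page means the mapping needs a human decision.
        raise ValueError('Unrecognized age groups: {0}'.format(sorted(unexpected)))
    # Iterating over AGE_GROUPS, not raw, fixes the order; missing groups default to -1.
    return [{'group': group, 'raw_count': raw.get(label, -1)} for label, group in AGE_GROUPS]
```

Because the order comes from the mapping rather than from the scraped dictionary, the output stays sorted even if the page lists the brackets differently.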

            'race_eth': transform_race_eth(cases_by_race),
            'gender': generic_transform(cases_by_gender)
Collaborator

This is outputting different keys than we have for hospitalizations below. We should unify them.

Here it’s plural:

        "gender": {
            "Males": 178,
            "Females": 195
        }

In hospitalizations it’s singular:

    "hospitalizations": {
        "hospitalized_cases": {
            "hospitalized": 35,
            "not_hospitalized": 338
        },
        "gender": {
            "Male": 0.08,
            "Female": 0.1
        }
    }
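
One possible way to unify the two sets of keys, as a sketch. The lower-case singular target keys are an assumption, not something the data model is confirmed to use, and `normalize_gender_keys` is a hypothetical helper.

```python
from typing import Dict, TypeVar

Number = TypeVar('Number', int, float)

# Assumed target keys (lower-case, singular); adjust to whatever the data model specifies.
GENDER_KEYS = {
    'male': 'male',
    'males': 'male',
    'female': 'female',
    'females': 'female',
}

def normalize_gender_keys(counts: Dict[str, Number]) -> Dict[str, Number]:
    """Map page labels such as 'Males' or 'Female' onto one consistent set of keys."""
    normalized: Dict[str, Number] = {}
    for label, value in counts.items():
        key = label.strip().lower()  # case-insensitive match
        if key not in GENDER_KEYS:
            raise ValueError('Unrecognized gender label: {0}'.format(label))
        normalized[GENDER_KEYS[key]] = value
    return normalized
```

Running both the case totals and the hospitalization percentages through the same function would give the two tables identical keys.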

        },
        'tests_totals': {
            'tests': transform_tests(total_tests),
        },
        'hospitalizations': {
            'hospitalized_cases': transform_total_hospitalizations(hospitalized),
            'gender': transform_gender_hospitalizations(hospitalized_by_gender)
        }
    }
    return model

if __name__ == '__main__':
    print(json.dumps(get_county(), indent=4))