NCES_PublicSchoolStats_import_refresh #1142

Closed (wants to merge 3 commits)
scripts/us_nces/README.md (176 additions)
# US: National Center for Education Statistics

## About the Dataset
This dataset contains population estimates from the National Center for Education Statistics (NCES) in the US for:
- Private School - 1997-2019
- School District - 2010-2023
- Public Schools - 2010-2023

The population is categorized by various attributes and their combinations:

1. Count of Students categorized by Race.
2. Count of Students categorized by School Grade, from Pre-Kindergarten to Grade 13.
3. Count of Students categorized by Pupil/Teacher Ratio.
4. Count of Full-Time Equivalent (FTE) Teachers.
5. Count of Students by a combination of Race and School Grade.
6. Count of Students by a combination of Race and Gender.
7. Count of Students by a combination of School Grade and Gender.
8. Count of Students by a combination of Race, School Grade, and Gender.
9. Count of Students under Ungraded Classes, Ungraded Students, and Adult Education Students.

The place properties of schools are listed below
(the available properties depend on the type of school):

1. `County Number`
2. `School Name`
3. `School ID - NCES Assigned`
4. `Lowest Grade Offered`
5. `Highest Grade Offered`
6. `Phone Number`
7. `ANSI/FIPS State Code`
8. `Location Address 1`
9. `Location City`
10. `Location ZIP`
11. `Magnet School`
12. `Charter School`
13. `School Type`
14. `Title I School Status`
15. `National School Lunch Program`
16. `School Level (SY 2017-18 onward)`
17. `State School ID`
18. `State Agency ID`
19. `State Abbr`
20. `Agency Name`
21. `Location ZIP4`
22. `Agency ID - NCES Assigned`
23. `School Level`
24. `State Name`


### Download URL
The data in .csv format can be downloaded from https://nces.ed.gov/ccd/elsi/tableGenerator.aspx.


#### API Output
The attributes used for the import are as follows:
| Attribute            | Description                                          |
|----------------------|------------------------------------------------------|
| time                 | The year of the population estimates provided.       |
| geo                  | The area of the population estimates provided.       |
| Race                 | The number of students categorized by race.          |
| Gender               | The number of students categorized by gender.        |
| School Grade         | The number of students categorized by school grade.  |
| Full-Time Equivalent | The count of full-time equivalent (FTE) teachers.    |



### Import Procedure

#### Downloading the input files using scripts
- Four scripts are used to download the input files:
  - fetch_ncid.py
  - download_config.py
  - download_file_details.py
  - download.py

##### fetch_ncid.py script
- This script retrieves National Center for Education Statistics (NCES) IDs for a given school and year. It uses Selenium to automate interacting with the webpage, selecting options and then extracting the corresponding IDs.

##### download_config.py script
- The download_config.py script has all the configurations required to download the file.
- It includes the filter, the download URL, the maximum number of tries, etc. The values are the same in all cases.

##### download_file_details.py script
- The download_file_details.py script has values for "default column", "columns to be downloaded", and "key columns".
- Every input file can only accommodate 60 columns, so for Public Schools multiple input files will be downloaded. All these input files share a common column, called the "Key Column", which acts as the primary key.
- In "Public columns to be downloaded", create a list of columns.
  - e.g. `PUBLIC_COLUMNS = ["State Name [Public School]", "State Abbr [Public School]", "School Name [Public School]"]`
- Steps to add columns to the list:
  - Under "Select Table Columns", select the "Information" tab.
  - Expand the "Basic Information" section.
  - Right-click the desired column checkbox and choose Inspect.
  - In the elements panel, check the number assigned to "value" and confirm the column under the list corresponding to that value.
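Once the partial Public School files are downloaded, they can be joined back together on the shared key column. A minimal offline sketch of that join, using made-up column names and school IDs (the real key column and helper function names may differ in the scripts):

```python
# Hypothetical key column shared by all downloaded Public School files.
KEY_COLUMN = "School ID - NCES Assigned [Public School]"

# Two stand-in tables, each mimicking one downloaded input file keyed by school ID.
part_a = {
    "010000500870": {"School Name [Public School]": "Albertville High"},
    "010000500871": {"School Name [Public School]": "Evans Elementary"},
}
part_b = {
    "010000500870": {"State Abbr [Public School]": "AL"},
    "010000500871": {"State Abbr [Public School]": "AL"},
}


def merge_on_key(*parts: dict) -> dict:
    """Merges per-file column dictionaries on the shared key column."""
    merged = {}
    for part in parts:
        for key, columns in part.items():
            merged.setdefault(key, {}).update(columns)
    return merged


merged = merge_on_key(part_a, part_b)
print(len(merged["010000500870"]))  # 2 columns joined for this school
```

Because every partial file carries the key column, the merge is order-independent and tolerates files covering different column subsets.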

##### download.py script
- The download.py script is the main script. It takes the import_name and the years to be downloaded, then downloads, extracts, and places the input CSVs in the "input_files" folder under the corresponding school directory.


### Command to Download Input Files
- `/bin/python3 scripts/us_nces/demographics/download.py --import_name={"PrivateSchool" (or) "District" (or) "PublicSchool"} --years_to_download="{select from the available years mentioned under each school type}"`

For example: `/bin/python3 scripts/us_nces/demographics/download.py --import_name="PublicSchool" --years_to_download="2023"`
- The input_files folder containing all the files will be present in:
`scripts/us_nces/demographics/public_school/input_files`
- Note: Give one year at a time for District and Public Schools, as they have a large number of column values.


#### Cleaned Data
import_name specifies the school type being processed:
- "private_school"
- "district_school"
- "public_school"

Cleaned data will be saved as CSV files at the following paths.
- private_school -> [private_school/output_files/us_nces_demographics_private_school.csv]
- district_school -> [school_district/output_files/us_nces_demographics_district_school.csv]
- public_school -> [public_school/output_files/us_nces_demographics_public_school.csv]

The columns of the CSV files are as follows:
- school_state_code
- year
- sv_name
- observation
- scaling_factor
- unit
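To illustrate the layout of the cleaned CSV, here is a minimal sketch that writes one row with the columns above; the row values (school code, statistical variable name, observation) are made up for illustration and do not come from the actual import:

```python
import csv
import io

# Columns of the cleaned CSV, as listed in this README.
COLUMNS = [
    "school_state_code", "year", "sv_name", "observation",
    "scaling_factor", "unit"
]

# One hypothetical observation row; values are illustrative only.
ROWS = [{
    "school_state_code": "nces/010000500870",
    "year": "2023",
    "sv_name": "Count_Student",
    "observation": "620",
    "scaling_factor": "",
    "unit": "",
}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(ROWS)
print(buffer.getvalue().splitlines()[0])
# school_state_code,year,sv_name,observation,scaling_factor,unit
```

Empty `scaling_factor` and `unit` cells are kept so every row has the same shape regardless of which statistical variable it reports.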


#### MCFs and Template MCFs
- private_school -> [private_school/output_files/us_nces_demographics_private_school.mcf]
[private_school/output_files/us_nces_demographics_private_school.tmcf]


- district_school -> [school_district/output_files/us_nces_demographics_district_school.mcf],
[school_district/output_files/us_nces_demographics_district_school.tmcf]


- public_school -> [public_school/output_files/us_nces_demographics_public_school.mcf],
[public_school/output_files/us_nces_demographics_public_school.tmcf]


#### Cleaned Place
"import_name" consists the type of school being executed.
- "private_school"
- "district_school"
- "public_school"

Cleaned place data will be saved as CSV files at the following paths.
- private_school:
[private_school/output_place/us_nces_demographics_private_place.csv]
- district_school:
[school_district/output_place/us_nces_demographics_district_place.csv]
- public_school:
[public_school/output_place/us_nces_demographics_public_place.csv]

If duplicate school IDs are present in the school place data, they are saved in the same output path as the CSV and TMCF files.
- [scripts/us_nces/demographics/private_school/output_place/dulicate_id_us_nces_demographics_private_place.csv]
- [scripts/us_nces/demographics/school_district/output_place/dulicate_id_us_nces_demographics_district_place.csv]
- [scripts/us_nces/demographics/public_school/output_place/dulicate_id_us_nces_demographics_public_place.csv]


#### Template MCFs Place
- private_school:
[private_school/output_place/us_nces_demographics_private_place.tmcf]
- district_school:
[school_district/output_place/us_nces_demographics_district_place.tmcf]
- public_school:
[public_school/output_place/us_nces_demographics_public_place.tmcf]

### Running Tests

Run the test cases:

- `/bin/python3 -m unittest scripts/us_nces/demographics/private_school/process_test.py`
- `/bin/python3 -m unittest scripts/us_nces/demographics/school_district/process_test.py`
- `/bin/python3 -m unittest scripts/us_nces/demographics/public_school/process_test.py`

scripts/us_nces/common/dcid__mcf_existance.py (44 additions)
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""_summary_
Script to check the property/dcid/nodes existance in datacommons.org.
"""
import datacommons


def check_dcid_existence(nodes: list) -> dict:
"""
    Checks the existence of dcid nodes in autopush.
True: Available
False: Unavailable
Args:
nodes (list): Dcid Nodes List
Returns:
dict: Status dictionary.
"""
# pylint: disable=protected-access
nodes_response = datacommons.get_property_values(
nodes,
"typeOf",
out=True,
value_type=None,
limit=datacommons.utils._MAX_LIMIT)
# pylint: enable=protected-access
node_status = {}
    for node, value in nodes_response.items():
        # A node exists if 'typeOf' returned any values.
        node_status[node] = bool(value)
return node_status
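The mapping logic above can be sketched offline as follows; the response dict below is a stand-in for what `datacommons.get_property_values` would return, so no network call is made (the node names are made up):

```python
# Stand-in for the API response: each node maps to its 'typeOf' values.
# A node with any values exists in the KG; an empty list means it does not.
nodes_response = {
    "Count_Student": ["StatisticalVariable"],
    "Count_Student_MadeUp": [],
}

# Same interpretation as check_dcid_existence: True when values exist.
node_status = {node: bool(value) for node, value in nodes_response.items()}
print(node_status)
# {'Count_Student': True, 'Count_Student_MadeUp': False}
```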
scripts/us_nces/common/dcid_existance.py (131 additions)
import datacommons as dc
import time
import logging
import requests_cache
import urllib

dc.utils._API_ROOT = 'http://api.datacommons.org'


def dc_api_wrapper(function,
args: dict,
retries: int = 3,
retry_secs: int = 5,
use_cache: bool = False,
api_root: str = None):
    '''Returns the result from the DC API call function.
    Retries the function in case of errors.

    Args:
        function: The DataCommons API function.
        args: Dictionary with all the arguments for the DataCommons API function.
        retries: Number of retries in case of HTTP errors.
        retry_secs: Interval in seconds between retries for which the caller is blocked.
        use_cache: If True, uses the request cache for faster responses.
        api_root: The API server to use. Default is 'http://api.datacommons.org'.
          To use autopush with more recent data, set it to
          'http://autopush.api.datacommons.org'.

    Returns:
        The response from the DataCommons API call.
    '''
if api_root:
dc.utils._API_ROOT = api_root
logging.debug(f'Setting DC API root to {api_root} for {function}')
if not retries or retries <= 0:
retries = 1
if not requests_cache.is_installed():
requests_cache.install_cache(expires_after=300)
cache_context = None
if use_cache:
cache_context = requests_cache.enabled()
logging.debug(f'Using requests_cache for DC API {function}')
    else:
        cache_context = requests_cache.disabled()
        logging.debug(f'Not using requests_cache for DC API {function}')
with cache_context:
for attempt in range(0, retries):
try:
logging.debug(
f'Invoking DC API {function} with {args}, retries={retries}'
)
return function(**args)
except KeyError:
# Exception in case of API error.
return None
            except urllib.error.URLError:
                # Exception when the server is overloaded; retry after a delay.
                if attempt >= retries - 1:
                    raise RuntimeError(
                        f'DC API {function} failed after {retries} attempts')
                logging.debug(
                    f'Retrying API {function} after {retry_secs} secs...')
                time.sleep(retry_secs)
return None


def dc_api_batched_wrapper(function, dcids: list, args: dict,
config: dict) -> dict:
    '''Returns the merged dictionary result of the function call over all dcids.
    It batches the dcids to make multiple calls to the DC API and merges all
    the results.

    Args:
        function: DC API to be invoked. It should accept dcids as one of the
          arguments and should return a dictionary keyed by dcid.
        dcids: List of dcids to be passed to the function.
        args: Additional arguments for the function call.
        config: Dictionary of DC API configuration settings.
          The supported settings are:
            dc_api_batch_size: Number of dcids to invoke per API call.
            dc_api_retries: Number of times an API call can be retried.
            dc_api_retry_secs: Interval in seconds between retries.
            dc_api_use_cache: Enable/disable the request cache for the DC API call.
            dc_api_root: The server to use for the DC API calls.

    Returns:
        Merged function return values across all dcids.
    '''
api_result = {}
index = 0
num_dcids = len(dcids)
api_batch_size = config.get('dc_api_batch_size', 10)
logging.info(
f'Calling DC API {function} on {len(dcids)} dcids in batches of {api_batch_size} with args: {args}...'
)
while index < num_dcids:
        # Process dcids in batches.
        dcids_batch = dcids[index:index + api_batch_size]
index += api_batch_size
args['dcids'] = dcids_batch
batch_result = dc_api_wrapper(function, args,
config.get('dc_api_retries', 3),
config.get('dc_api_retry_secs', 5),
config.get('dc_api_use_cache', False),
config.get('dc_api_root', None))
if batch_result:
api_result.update(batch_result)
logging.debug(f'Got DC API result for {function}: {batch_result}')
return api_result


def dc_api_get_defined_dcids(dcids: list, config: dict) -> dict:
    '''Returns a dict mapping each dcid to True if it is defined in the
    DataCommons KG and False otherwise.
    Uses the property_value API to look up 'typeOf' for each dcid.
    '''
api_function = dc.get_property_values
args = {
'prop': 'typeOf',
'out': True,
}
api_result = dc_api_batched_wrapper(api_function, dcids, args, config)
response = {}
    # Mark dcids with no 'typeOf' result as not defined in the KG.
    for dcid in dcids:
        response[dcid] = bool(api_result.get(dcid))
return response
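The batching loop in `dc_api_batched_wrapper` can be sketched in isolation as follows; `fake_api` below is a stand-in for a real DC API call, and the dcid values are made up:

```python
def batched(dcids: list, batch_size: int) -> list:
    """Returns consecutive slices of at most batch_size dcids."""
    return [dcids[i:i + batch_size] for i in range(0, len(dcids), batch_size)]


def fake_api(dcids: list) -> dict:
    # Stand-in for a DC API call: maps each dcid to a dummy 'typeOf' value.
    return {dcid: ["Place"] for dcid in dcids}


# Merge per-batch results into one dictionary, as the wrapper does.
api_result = {}
all_dcids = [f"geoId/{i:02d}" for i in range(25)]
for batch in batched(all_dcids, 10):  # batch sizes: 10, 10, 5
    api_result.update(fake_api(batch))
print(len(api_result))  # 25
```

Because each batch result is keyed by dcid, merging with `dict.update` reassembles the full response regardless of how the dcids were split.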