# US: National Center for Education Statistics

## About the Dataset
This dataset has population estimates from the National Center for Education Statistics (NCES) in the US for:
- Private Schools - 1997-2019
- School Districts - 2010-2023
- Public Schools - 2010-2023

The population is categorized on various attributes and their combinations:

1. Count of Students categorized by Race.
2. Count of Students categorized by School Grade, from Pre-Kindergarten to Grade 13.
3. Count of Students categorized by Pupil/Teacher Ratio.
4. Count of Full-Time Equivalent (FTE) Teachers.
5. Count of Students by a combination of Race and School Grade.
6. Count of Students by a combination of Race and Gender.
7. Count of Students by a combination of School Grade and Gender.
8. Count of Students by a combination of Race, School Grade and Gender.
9. Count of Students under Ungraded Classes, Ungraded Students and Adult Education Students.

The place properties of schools are given below (which properties are available depends on the type of school):
1. `County Number`
2. `School Name`
3. `School ID - NCES Assigned`
4. `Lowest Grade Offered`
5. `Highest Grade Offered`
6. `Phone Number`
7. `ANSI/FIPS State Code`
8. `Location Address 1`
9. `Location City`
10. `Location ZIP`
11. `Magnet School`
12. `Charter School`
13. `School Type`
14. `Title I School Status`
15. `National School Lunch Program`
16. `School Level (SY 2017-18 onward)`
17. `State School ID`
18. `State Agency ID`
19. `State Abbr`
20. `Agency Name`
21. `Location ZIP4`
22. `Agency ID - NCES Assigned`
23. `School Level`
24. `State Name`

### Download URL
The data in .csv format can be downloaded from https://nces.ed.gov/ccd/elsi/tableGenerator.aspx.

#### API Output
The attributes used for the import are as follows:

| Attribute | Description |
|---|---|
| time | The year of the population estimates provided. |
| geo | The area of the population estimates provided. |
| Race | The number of students categorized by race. |
| Gender | The number of students categorized by gender. |
| School Grade | The number of students categorized by school grade. |
| Full-Time Equivalent | The count of teachers available. |

### Import Procedure

#### Downloading the input files using scripts
- There are four scripts used to download the input files:
  - fetch_ncid.py
  - download_config.py
  - download_file_details.py
  - download.py

##### fetch_ncid.py script
- The script defines a Python function that retrieves National Center for Education Statistics (NCES) IDs for a given school type and year. It automates interaction with the web page using Selenium to select options and then extracts the corresponding IDs.
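For reference, below is a minimal sketch of the kind of Selenium automation fetch_ncid.py performs. The element locators and the helper name are illustrative assumptions, not the script's actual code; only the ELSi URL comes from this README.

```python
# Illustrative sketch only: the URL is the ELSi table generator named in this
# README, but the element locators below are assumptions, not the selectors
# used by fetch_ncid.py.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select


def fetch_ncid_sketch(school_type: str, year: str) -> list:
    """Open the ELSi table generator, pick a school type and year,
    and collect the NCES column IDs exposed on the page."""
    driver = webdriver.Chrome()
    try:
        driver.get("https://nces.ed.gov/ccd/elsi/tableGenerator.aspx")
        # Hypothetical dropdown IDs; inspect the page to find the real ones.
        Select(driver.find_element(By.ID, "school_type_dropdown")
              ).select_by_visible_text(school_type)
        Select(driver.find_element(By.ID, "year_dropdown")
              ).select_by_visible_text(year)
        # Collect the "value" attribute of each column checkbox.
        checkboxes = driver.find_elements(By.CSS_SELECTOR,
                                          "input[type='checkbox']")
        return [box.get_attribute("value") for box in checkboxes]
    finally:
        driver.quit()
```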

##### download_config.py script
- The download_config.py script has all the configuration required to download the files.
- It includes the filter, the download URL, the maximum number of tries, etc. The values are the same in all cases.
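As an illustration of that layout, a hypothetical outline of such a configuration module is shown below. The constant names and values are assumptions, not the actual contents of download_config.py; only the URL and year ranges come from this README.

```python
# Hypothetical outline only; the real download_config.py defines its own
# names and values.
DOWNLOAD_URL = "https://nces.ed.gov/ccd/elsi/tableGenerator.aspx"
MAX_TRIES = 3            # maximum download attempts per file (illustrative)
RETRY_INTERVAL_SECS = 5  # wait between attempts (illustrative)
AVAILABLE_YEARS = {
    "PrivateSchool": "1997-2019",
    "District": "2010-2023",
    "PublicSchool": "2010-2023",
}
```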

##### download_file_details.py script
- The download_file_details.py script has values for the "default column", "columns to be downloaded" and "key columns".
- Every input file can only accommodate 60 columns, so for Public Schools multiple input files are downloaded. All of these input files share a common column, the "Key Column", which acts as the primary key (see the merge sketch after this list).
- In "Public columns to be downloaded", create a list of columns, e.g.
  `PUBLIC_COLUMNS = ["State Name [Public School]", "State Abbr [Public School]", "School Name [Public School]"]`
- Steps to add columns to the list:
  - Under "Select Table Columns", select the "Information" tab.
  - Expand the "BasicInformation" hit area.
  - Right-click on the desired column checkbox and choose Inspect.
  - From the elements on the right-hand side, check the number assigned to "value" and add the column to the list that corresponds to that value.
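Conceptually, the per-batch CSVs for a year are recombined on that key column. The pandas sketch below illustrates the idea; the key-column name is an assumption, since the real one is defined in download_file_details.py.

```python
# Conceptual sketch: recombine the per-batch CSVs for one year on the shared
# key column. The key-column name below is an assumption.
import glob

import pandas as pd

KEY_COLUMN = "School ID - NCES Assigned [Public School]"  # assumed name

paths = sorted(glob.glob("public_school/input_files/*.csv"))
merged = pd.read_csv(paths[0])
for path in paths[1:]:
    merged = merged.merge(pd.read_csv(path), on=KEY_COLUMN, how="outer")
```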

##### download.py script
- The download.py script is the main script. It takes the import_name and the years to be downloaded, then downloads, extracts and places the input CSVs in the "input_files" folder under the desired school directory.

### Command to Download input file
- `/bin/python3 scripts/us_nces/demographics/download.py --import_name={"PrivateSchool" | "District" | "PublicSchool"} --years_to_download="{select from the years available for each school type}"`

For example: `/bin/python3 scripts/us_nces/demographics/download.py --import_name="PublicSchool" --years_to_download="2023"`.
- The input_files folder containing all the files will be present in:
  `scripts/us_nces/demographics/public_school/input_files`
- Note: Give one year at a time for District and Public Schools, as there are a large number of column values.

#### Cleaned Data
import_name indicates the school type being processed:
- "private_school"
- "district_school"
- "public_school"

Cleaned data will be saved as a CSV file at the following paths:
- private_school -> [private_school/output_files/us_nces_demographics_private_school.csv]
- district_school -> [school_district/output_files/us_nces_demographics_district_school.csv]
- public_school -> [public_school/output_files/us_nces_demographics_public_school.csv]

The columns of the CSV files are as follows:
- school_state_code
- year
- sv_name
- observation
- scaling_factor
- unit
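As a quick sanity check, the cleaned CSV can be inspected directly. The snippet below is only a suggestion; the path and column names are the ones listed above for the public-school import.

```python
import pandas as pd

# Path and column names as listed above for the public-school import.
df = pd.read_csv(
    "public_school/output_files/us_nces_demographics_public_school.csv")
print(df[["school_state_code", "year", "sv_name", "observation"]].head())
```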

#### MCFs and Template MCFs
- private_school -> [private_school/output_files/us_nces_demographics_private_school.mcf],
  [private_school/output_files/us_nces_demographics_private_school.tmcf]
- district_school -> [school_district/output_files/us_nces_demographics_district_school.mcf],
  [school_district/output_files/us_nces_demographics_district_school.tmcf]
- public_school -> [public_school/output_files/us_nces_demographics_public_school.mcf],
  [public_school/output_files/us_nces_demographics_public_school.tmcf]

#### Cleaned Place
"import_name" indicates the type of school being executed:
- "private_school"
- "district_school"
- "public_school"

Cleaned place data will be saved as a CSV file at the following paths:
- private_school: [private_school/output_place/us_nces_demographics_private_place.csv]
- district_school: [school_district/output_place/us_nces_demographics_district_place.csv]
- public_school: [public_school/output_place/us_nces_demographics_public_place.csv]

If there are duplicate School IDs present in School Place, they will be saved in the same output path as the CSV and TMCF files (the sketch after this list shows one way such duplicates can be spotted):
- [scripts/us_nces/demographics/private_school/output_place/dulicate_id_us_nces_demographics_private_place.csv]
- [scripts/us_nces/demographics/school_district/output_place/dulicate_id_us_nces_demographics_district_place.csv]
- [scripts/us_nces/demographics/public_school/output_place/dulicate_id_us_nces_demographics_public_place.csv]
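The snippet below is purely illustrative of how such duplicates could be flagged with pandas; whether the place CSV uses "school_state_code" as its key is an assumption, since the place schema is not listed here.

```python
# Illustrative only: flag rows that share a school ID in the place CSV.
# Whether the place CSV keys on "school_state_code" is an assumption.
import pandas as pd

places = pd.read_csv(
    "public_school/output_place/us_nces_demographics_public_place.csv")
duplicates = places[places.duplicated(subset=["school_state_code"], keep=False)]
print(f"{len(duplicates)} rows share a school ID with another row")
```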

#### Template MCFs Place
- private_school: [private_school/output_place/us_nces_demographics_private_place.tmcf]
- district_school: [school_district/output_place/us_nces_demographics_district_place.tmcf]
- public_school: [public_school/output_place/us_nces_demographics_public_place.tmcf]

### Running Tests

Run the test cases:

- `/bin/python3 -m unittest scripts/us_nces/demographics/private_school/process_test.py`
- `/bin/python3 -m unittest scripts/us_nces/demographics/school_district/process_test.py`
- `/bin/python3 -m unittest scripts/us_nces/demographics/public_school/process_test.py`
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Script to check the existence of properties/dcids/nodes in datacommons.org."""
import datacommons


def check_dcid_existence(nodes: list) -> dict:
    """
    Checks the existence of dcid nodes in autopush.
    True: Available
    False: Unavailable
    Args:
        nodes (list): List of dcid nodes.
    Returns:
        dict: Status dictionary keyed by node.
    """
    # pylint: disable=protected-access
    nodes_response = datacommons.get_property_values(
        nodes,
        "typeOf",
        out=True,
        value_type=None,
        limit=datacommons.utils._MAX_LIMIT)
    # pylint: enable=protected-access
    node_status = {}
    for node, value in nodes_response.items():
        # A node with no 'typeOf' values is not defined in the KG.
        if value == []:
            node_status[node] = False
        else:
            node_status[node] = True
    return node_status
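A small usage sketch for the helper above; the dcids are well-known Data Commons nodes used purely for illustration.

```python
# Usage sketch for check_dcid_existence; the dcids below are illustrative.
status = check_dcid_existence(["Count_Person", "Count_Student", "NotARealDcid"])
for node, exists in status.items():
    print(node, "exists in the KG" if exists else "not found")
```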
import datacommons as dc
import time
import logging
import requests_cache
import urllib.error

dc.utils._API_ROOT = 'http://api.datacommons.org'


def dc_api_wrapper(function,
                   args: dict,
                   retries: int = 3,
                   retry_secs: int = 5,
                   use_cache: bool = False,
                   api_root: str = None):
    '''Returns the result from the DC API function call.
    Retries the function in case of errors.
    Args:
        function: The DataCommons API function.
        args: Dictionary with all the arguments for the DataCommons API function.
        retries: Number of retries in case of HTTP errors.
        retry_secs: Interval in seconds between retries for which the caller is blocked.
        use_cache: If True, uses the request cache for faster responses.
        api_root: The API server to use. Default is 'http://api.datacommons.org'.
            To use autopush with more recent data, set it to
            'http://autopush.api.datacommons.org'.
    Returns:
        The response from the DataCommons API call.
    '''
    if api_root:
        dc.utils._API_ROOT = api_root
        logging.debug(f'Setting DC API root to {api_root} for {function}')
    if not retries or retries <= 0:
        retries = 1
    if not requests_cache.is_installed():
        requests_cache.install_cache(expire_after=300)
    cache_context = None
    if use_cache:
        cache_context = requests_cache.enabled()
        logging.debug(f'Using requests_cache for DC API {function}')
    else:
        cache_context = requests_cache.disabled()
        logging.debug(f'Not using requests_cache for DC API {function}')
    with cache_context:
        for attempt in range(0, retries):
            try:
                logging.debug(
                    f'Invoking DC API {function} with {args}, retries={retries}'
                )
                return function(**args)
            except KeyError:
                # Exception in case of API error.
                return None
            except urllib.error.URLError:
                # Exception when the server is overloaded; retry after a delay.
                if attempt >= retries - 1:
                    raise RuntimeError(
                        f'DC API {function} failed after {retries} attempts')
                logging.debug(
                    f'Retrying API {function} after {retry_secs} secs...')
                time.sleep(retry_secs)
    return None


def dc_api_batched_wrapper(function, dcids: list, args: dict,
                           config: dict) -> dict:
    '''Returns the merged dictionary result of the function called over all dcids.
    It batches the dcids to make multiple calls to the DC API and merges all results.
    Args:
        function: DC API to be invoked. It should have dcids as one of the arguments
            and should return a dictionary with dcid as the key.
        dcids: List of dcids to be invoked with the function.
        args: Additional arguments for the function call.
        config: Dictionary of DC API configuration settings.
            The supported settings are:
                dc_api_batch_size: Number of dcids to invoke per API call.
                dc_api_retries: Number of times an API call can be retried.
                dc_api_retry_secs: Interval in seconds between retries.
                dc_api_use_cache: Enable/disable the request cache for the DC API call.
                dc_api_root: The server to use for the DC API calls.
    Returns:
        Merged function return values across all dcids.
    '''
    api_result = {}
    index = 0
    num_dcids = len(dcids)
    api_batch_size = config.get('dc_api_batch_size', 10)
    logging.info(
        f'Calling DC API {function} on {len(dcids)} dcids in batches of '
        f'{api_batch_size} with args: {args}...')
    while index < num_dcids:
        # Process dcids in batches.
        dcids_batch = [
            # strip_namespace(x) for x in dcids[index:index + api_batch_size]
            x for x in dcids[index:index + api_batch_size]
        ]
        index += api_batch_size
        args['dcids'] = dcids_batch
        batch_result = dc_api_wrapper(function, args,
                                      config.get('dc_api_retries', 3),
                                      config.get('dc_api_retry_secs', 5),
                                      config.get('dc_api_use_cache', False),
                                      config.get('dc_api_root', None))
        if batch_result:
            api_result.update(batch_result)
            logging.debug(f'Got DC API result for {function}: {batch_result}')
    return api_result


def dc_api_get_defined_dcids(dcids: list, config: dict) -> dict:
    '''Returns a dict mapping each dcid to whether it is defined in the DataCommons KG.
    Uses the property_value API to look up 'typeOf' for each dcid.
    dcids with no 'typeOf' in the KG are marked False in the response.
    '''
    api_function = dc.get_property_values
    args = {
        'prop': 'typeOf',
        'out': True,
    }
    api_result = dc_api_batched_wrapper(api_function, dcids, args, config)
    response = {}
    # Mark dcids with no result as not defined in the KG.
    for dcid in dcids:
        if dcid in api_result and api_result[dcid]:
            response[dcid] = True
        else:
            response[dcid] = False
    return response
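A usage sketch for the batched lookup above; the config keys mirror the ones read by dc_api_batched_wrapper, and the dcids are illustrative.

```python
# Usage sketch: check which dcids are defined in the KG, in batches.
config = {
    "dc_api_batch_size": 10,
    "dc_api_retries": 3,
    "dc_api_retry_secs": 5,
    "dc_api_use_cache": False,
}
defined = dc_api_get_defined_dcids(["Count_Person", "Count_Student"], config)
print(defined)  # e.g. {'Count_Person': True, 'Count_Student': True}
```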