NCES_PublicSchoolStats_import_refresh #1142
Closed
+550,879
−0
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,155 @@ | ||
# US: National Center for Education Statistics | ||
|
||
## About the Dataset | ||
This dataset contains student population estimates published by the National Center for Education Statistics (NCES) in the US for: | ||
- Private School - 1997-2019 | ||
- School District - 2010-2023 | ||
- Public Schools - 2010-2023 | ||
|
||
The population is categorized on various attributes and their combinations: | ||
|
||
1. Count of Students categorised based on Race. | ||
2. Count of Students categorised based on School Grade from Pre Kindergarten to Grade 13. | ||
3. Count of Students categorised based on Pupil/Teacher Ratio. | ||
4. Count of Full-Time Equivalent (FTE) Teachers. | ||
5. Count of Students with a combination of Race and School Grade. | ||
6. Count of Students with a combination of Race and Gender. | ||
7. Count of Students with a combination of School Grade and Gender. | ||
8. Count of Students with a combination of Race, School Grade and Gender. | ||
9. Count of Students under Ungraded Classes, Ungraded Students and Adult Education Students. | ||
|
||
The place properties of schools are listed below. | ||
(Which properties are available depends on the type of school.) | ||
|
||
1. `County Number` | ||
2. `School Name` | ||
3. `School ID - NCES Assigned` | ||
4. `Lowest Grade Offered` | ||
5. `Highest Grade Offered` | ||
6. `Phone Number` | ||
7. `ANSI/FIPS State Code` | ||
8. `Location Address 1` | ||
9. `Location City` | ||
10. `Location ZIP` | ||
11. `Magnet School` | ||
12. `Charter School` | ||
13. `School Type` | ||
14. `Title I School Status` | ||
15. `National School Lunch Program` | ||
16. `School Level (SY 2017-18 onward)` | ||
17. `State School ID` | ||
18. `State Agency ID` | ||
19. `State Abbr` | ||
20. `Agency Name` | ||
21. `Location ZIP4` | ||
22. `Agency ID - NCES Assigned` | ||
23. `School Level` | ||
24. `State Name` | ||
|
||
|
||
### Download URL | ||
The data is downloadable in .csv format from https://nces.ed.gov/ccd/elsi/tableGenerator.aspx. | ||
|
||
|
||
#### API Output | ||
The attributes used for the import are as follows: | ||
| Attribute | Description | | ||
|-------------------------------------------------------|------------------------------------------------------------------------------------------------| | ||
| time | The Year of the population estimates provided. | | ||
| geo | The Area of the population estimates provided. | | ||
| Race | The Number of Students categorised under race. | | ||
| Gender | The Number of Students categorised under Gender. | | ||
| School Grade | The Number of Students categorised under School Grade. | | ||
| Full-Time Equivalent | The Count of Teachers Available. | | ||
|
||
|
||
|
||
#### Cleaned Data | ||
`import_name` identifies the type of school being processed: | ||
- "private_school" | ||
- "district_school" | ||
- "public_school" | ||
|
||
Cleaned data is saved as a CSV file at the following paths: | ||
- private_school -> [private_school/output_files/us_nces_demographics_private_school.csv] | ||
- district_school -> [school_district/output_files/us_nces_demographics_district_school.csv] | ||
- public_school -> [public_school/output_files/us_nces_demographics_public_school.csv] | ||
|
||
The columns of the CSV files are as follows: | ||
- school_state_code | ||
- year | ||
- sv_name | ||
- observation | ||
- scaling_factor | ||
- unit | ||
|
||
|
||
#### MCFs and Template MCFs | ||
- private_school -> [private_school/output_files/us_nces_demographics_private_school.mcf], | ||
[private_school/output_files/us_nces_demographics_private_school.tmcf] | ||
|
||
|
||
- district_school -> [school_district/output_files/us_nces_demographics_district_school.mcf], | ||
[school_district/output_files/us_nces_demographics_district_school.tmcf] | ||
|
||
|
||
- public_school -> [public_school/output_files/us_nces_demographics_public_school.mcf], | ||
[public_school/output_files/us_nces_demographics_public_school.tmcf] | ||
|
||
|
||
#### Cleaned Place | ||
`import_name` identifies the type of school being processed: | ||
- "private_school" | ||
- "district_school" | ||
- "public_school" | ||
|
||
Cleaned place data is saved as a CSV file at the following paths: | ||
- private_school: | ||
[private_school/output_place/us_nces_demographics_private_place.csv] | ||
- district_school: | ||
[school_district/output_place/us_nces_demographics_district_place.csv] | ||
- public_school: | ||
[public_school/output_place/us_nces_demographics_public_place.csv] | ||
|
||
If duplicate School IDs are present in the school place data, they are saved in the same output directory as the place CSV and TMCF files: | ||
- [scripts/us_nces/demographics/private_school/output_place/dulicate_id_us_nces_demographics_private_place.csv] | ||
- [scripts/us_nces/demographics/school_district/output_place/dulicate_id_us_nces_demographics_district_place.csv] | ||
- [scripts/us_nces/demographics/public_school/output_place/dulicate_id_us_nces_demographics_public_place.csv] | ||
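
The duplicate handling described above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the logic of the import scripts: the column name `school_id` and the helper `split_duplicate_ids` are assumptions made for the example. Whether the real scripts keep one copy of a duplicated ID in the main file is not stated here; this sketch moves every occurrence to the duplicates list.

```python
from collections import Counter


def split_duplicate_ids(rows, id_column="school_id"):
    """Split rows into (unique_rows, duplicate_rows) by a hypothetical ID column.

    Every row whose ID appears more than once is treated as a duplicate.
    """
    counts = Counter(row[id_column] for row in rows)
    unique_rows = [row for row in rows if counts[row[id_column]] == 1]
    duplicate_rows = [row for row in rows if counts[row[id_column]] > 1]
    return unique_rows, duplicate_rows


# Hypothetical place rows; the real files have many more columns.
rows = [
    {"school_id": "A1", "school_name": "North School"},
    {"school_id": "B2", "school_name": "East School"},
    {"school_id": "A1", "school_name": "North School Annex"},
]
unique_rows, duplicate_rows = split_duplicate_ids(rows)
```

The two lists could then be written to the place CSV and the duplicate-ID CSV respectively.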
|
||
|
||
#### Template MCFs Place | ||
- private_school: | ||
[private_school/output_place/us_nces_demographics_private_place.tmcf] | ||
- district_school: | ||
[school_district/output_place/us_nces_demographics_district_place.tmcf] | ||
- public_school: | ||
[public_school/output_place/us_nces_demographics_public_place.tmcf] | ||
|
||
### Running Tests | ||
|
||
Run the test cases | ||
|
||
- `/bin/python3 -m unittest scripts/us_nces/demographics/private_school/process_test.py` | ||
- `/bin/python3 -m unittest scripts/us_nces/demographics/school_district/process_test.py` | ||
- `/bin/python3 -m unittest scripts/us_nces/demographics/public_school/process_test.py` | ||
|
||
|
||
|
||
### Import Procedure | ||
|
||
The script below downloads the data and extracts it. | ||
|
||
`/bin/python3 scripts/us_nces/demographics/download.py` | ||
|
||
For Example: `/bin/python3 scripts/us_nces/demographics/download.py --import_name="PublicSchool" --years_to_download="2023"`. | ||
- The input_files folder containing all the files will be present in: | ||
`scripts/us_nces/demographics/public_school/input_files` | ||
|
||
|
||
The scripts below clean the data and generate the final csv, mcf and tmcf files. | ||
- for Private Schools: | ||
`/bin/python3 scripts/us_nces/demographics/private_school/process.py` | ||
- for School Districts: | ||
`/bin/python3 scripts/us_nces/demographics/school_district/process.py` | ||
- for Public Schools: | ||
`/bin/python3 scripts/us_nces/demographics/public_school/process.py` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
# Downloading input files for US: National Center for Education Statistics | ||
|
||
## Dataset Source | ||
- The dataset is extracted from source: https://nces.ed.gov/ccd/elsi/tableGenerator.aspx | ||
|
||
### Process to download the input files | ||
|
||
#### Downloading the input files manually. | ||
- Select the following from each drop down | ||
- Select A Table Row -> Select the type of School to download ("Private" or "District" or "Public") | ||
- Select Years -> Select the available years | ||
- Select Table Columns -> The school properties are categorized into different tabs. Select the required properties under each tab. | ||
- Select Filter -> Select "All 50 States + DC" | ||
- Click on "Create Table" and it shows all the columns selected. | ||
- Click on the "CSV" button and download the file. | ||
- A zip file will be downloaded which has to be extracted. | ||
|
||
#### Downloading the input files using scripts. | ||
- There are 4 scripts created to download the input files. | ||
- fetch_ncid.py | ||
- download_config.py | ||
- download_file_details.py | ||
- download.py | ||
|
||
##### fetch_ncid.py script | ||
- The fetch_ncid.py script retrieves National Center for Education Statistics (NCES) IDs for a given school and year. It uses Selenium to automate the interaction with the webpage (selecting options) and then extracts the corresponding IDs. | ||
|
||
##### download_config.py script | ||
- The download_config.py script has all the configurations required to download the file. | ||
- It defines the filter, download URL, maximum number of tries, etc. The values are the same in all cases. | ||
|
||
##### download_file_details.py script | ||
- The download_file_details.py script has values for "default column", "columns to be downloaded" and "key columns". | ||
- Every input file can accommodate at most 60 columns, so for Public Schools multiple input files are downloaded. All these input files share a common column called the "Key Column", which acts as the primary key. | ||
- In the "Public columns to be downloaded" create a list of columns. | ||
- ex: PUBLIC_COLUMNS = ["State Name [Public School]", "State Abbr [Public School]", "School Name [Public School]"] | ||
- Steps to add columns to the list. | ||
- Under "Select Table Columns" | ||
- select the "Information" tab | ||
- expand the hit area "BasicInformation" | ||
- right click on the desired column checkbox and choose inspect | ||
- from the elements on the right hand side, check the number assigned to "value" and confirm that the column added to the list corresponds to that value. | ||
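
The 60-column limit and the shared key column described above suggest a split-then-merge pattern, which could be sketched as follows. This is a hedged illustration, not the code in download_file_details.py: the helpers `chunk_columns` and `merge_on_key` and the sample column names are hypothetical, and the sketch assumes the key column counts toward the 60-column limit.

```python
def chunk_columns(columns, key_columns, max_columns=60):
    """Split columns into batches of at most max_columns, each with the key columns."""
    non_key = [c for c in columns if c not in key_columns]
    per_batch = max_columns - len(key_columns)
    if per_batch <= 0:
        raise ValueError("max_columns must exceed the number of key columns")
    return [
        list(key_columns) + non_key[i:i + per_batch]
        for i in range(0, len(non_key), per_batch)
    ]


def merge_on_key(tables, key_column):
    """Merge lists of row dicts from several downloads on a shared key column."""
    merged = {}
    for table in tables:
        for row in table:
            merged.setdefault(row[key_column], {}).update(row)
    return list(merged.values())


# Hypothetical column names: one key column plus 130 data columns.
columns = ["School ID"] + [f"Column {i}" for i in range(1, 131)]
batches = chunk_columns(columns, ["School ID"])

# Two hypothetical downloads that share the key column.
tables = [
    [{"School ID": "1", "State Name": "Ohio"}],
    [{"School ID": "1", "School Name": "North School"}],
]
merged_rows = merge_on_key(tables, "School ID")
```

Each batch corresponds to one downloaded input file; merging on the key column rebuilds one wide row per school.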
|
||
##### download.py script | ||
- The download.py script is the main script. It considers the import_name and year to be downloaded. It downloads, extracts and places the input csv in "input_files" folder under the desired school directory. | ||
|
||
|
||
### Command to Download input file | ||
- `/bin/python3 scripts/us_nces/demographics/download.py --import_name={"PrivateSchool"(or)"District"(or)"PublicSchool"} --years_to_download= "{select the available years mentioned under each school type}"` | ||
|
||
For Example: `/bin/python3 scripts/us_nces/demographics/download.py --import_name="PublicSchool" --years_to_download="2023"`. | ||
- The input_files folder containing all the files will be present in: | ||
`scripts/us_nces/demographics/public_school/input_files` | ||
- Note: Give one year at a time for District and Public Schools, as there are a large number of column values. | ||
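
Following the note above, a small driver could issue one download command per year. The sketch below only builds and prints the commands; the helper `build_download_commands` is hypothetical, while the script path and the `--import_name` and `--years_to_download` flags are taken from the command shown above.

```python
import shlex

DOWNLOAD_SCRIPT = "scripts/us_nces/demographics/download.py"


def build_download_commands(import_name, years):
    """Return one argument list per year for the download script."""
    return [
        [
            "/bin/python3",
            DOWNLOAD_SCRIPT,
            f"--import_name={import_name}",
            f"--years_to_download={year}",
        ]
        for year in years
    ]


commands = build_download_commands("PublicSchool", [2022, 2023])
for command in commands:
    # Print only; pass each list to subprocess.run(command, check=True) to execute.
    print(shlex.join(command))
```

Running the years sequentially keeps each request small, at the cost of a longer total download time.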
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
# Copyright 2022 Google LLC | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
"""
Script to check the existence of properties/dcids/nodes in datacommons.org.
"""
import datacommons


def check_dcid_existence(nodes: list) -> dict:
    """
    Checks the existence of dcid nodes in autopush.
    True: Available
    False: Unavailable
    Args:
        nodes (list): Dcid Nodes List
    Returns:
        dict: Status dictionary.
    """
    # pylint: disable=protected-access
    nodes_response = datacommons.get_property_values(
        nodes,
        "typeOf",
        out=True,
        value_type=None,
        limit=datacommons.utils._MAX_LIMIT)
    # pylint: enable=protected-access
    node_status = {}
    # A node with no typeOf values is treated as not defined.
    for node, value in nodes_response.items():
        if value == []:
            node_status[node] = False
        else:
            node_status[node] = True
    return node_status
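
The status mapping in `check_dcid_existence` can be tried without a network call. The sketch below is an offline illustration only: `to_node_status` is a hypothetical helper, and the response dict is hand-written to stand in for what `get_property_values` returns (a dict from dcid to a list of values). The dcids are illustrative.

```python
def to_node_status(nodes_response):
    """Map each dcid to True when the lookup returned at least one value."""
    return {node: bool(values) for node, values in nodes_response.items()}


# Hand-written stand-in for an API response; no network call is made.
fake_response = {
    "Count_Student": ["StatisticalVariable"],
    "Count_Student_NotARealVariable": [],
}
status = to_node_status(fake_response)
# dcids that would still need to be defined before the import.
missing = [node for node, exists in status.items() if not exists]
```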
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,131 @@ | ||
import datacommons as dc
import time
import logging
import requests_cache
import urllib.error

dc.utils._API_ROOT = 'http://api.datacommons.org'


def dc_api_wrapper(function,
                   args: dict,
                   retries: int = 3,
                   retry_secs: int = 5,
                   use_cache: bool = False,
                   api_root: str = None):
    '''Returns the result from the DC API call function.
    Retries the function in case of errors.

    Args:
      function: The DataCommons API function.
      args: dictionary with all the arguments for the DataCommons API function.
      retries: Number of retries in case of HTTP errors.
      retry_secs: Interval in seconds between retries for which caller is blocked.
      use_cache: If True, uses request cache for faster response.
      api_root: The API server to use. Default is 'http://api.datacommons.org'.
        To use autopush with more recent data, set it to 'http://autopush.api.datacommons.org'

    Returns:
      The response from the DataCommons API call.
    '''
    if api_root:
        dc.utils._API_ROOT = api_root
        logging.debug(f'Setting DC API root to {api_root} for {function}')
    if not retries or retries <= 0:
        retries = 1
    if not requests_cache.is_installed():
        requests_cache.install_cache(expire_after=300)
    cache_context = None
    if use_cache:
        cache_context = requests_cache.enabled()
        logging.debug(f'Using requests_cache for DC API {function}')
    else:
        cache_context = requests_cache.disabled()
        logging.debug(f'Not using requests_cache for DC API {function}')
    with cache_context:
        for attempt in range(0, retries):
            try:
                logging.debug(
                    f'Invoking DC API {function} with {args}, retries={retries}'
                )
                return function(**args)
            except KeyError:
                # Exception in case of API error.
                return None
            except urllib.error.URLError:
                # Exception when server is overloaded, retry after a delay.
                # The last attempt has index retries - 1, so give up there.
                if attempt >= retries - 1:
                    raise RuntimeError(
                        f'DC API {function} failed after {retries} attempts')
                else:
                    logging.debug(
                        f'Retrying API {function} after {retry_secs}...')
                    time.sleep(retry_secs)
    return None
|
||
|
||
def dc_api_batched_wrapper(function, dcids: list, args: dict,
                           config: dict) -> dict:
    '''Returns the dictionary result for the function call on all dcids.
    It batches the dcids to make multiple calls to the DC API and merges all results.

    Args:
      function: DC API to be invoked. It should have dcids as one of the arguments
        and should return a dictionary with dcid as the key.
      dcids: List of dcids to be invoked with the function.
      args: Additional arguments for the function call.
      config: dictionary of DC API configuration settings.
        The supported settings are:
          dc_api_batch_size: Number of dcids to invoke per API call.
          dc_api_retries: Number of times an API can be retried.
          dc_api_retry_secs: Interval in seconds between retries.
          dc_api_use_cache: Enable/disable request cache for the DC API call.
          dc_api_root: The server to use for the DC API calls.

    Returns:
      Merged function return values across all dcids.
    '''
    api_result = {}
    index = 0
    num_dcids = len(dcids)
    api_batch_size = config.get('dc_api_batch_size', 10)
    logging.info(
        f'Calling DC API {function} on {len(dcids)} dcids in batches of {api_batch_size} with args: {args}...'
    )
    while index < num_dcids:
        # dcids in batches.
        dcids_batch = [
            # strip_namespace(x) for x in dcids[index:index + api_batch_size]
            x for x in dcids[index:index + api_batch_size]
        ]
        index += api_batch_size
        args['dcids'] = dcids_batch
        batch_result = dc_api_wrapper(function, args,
                                      config.get('dc_api_retries', 3),
                                      config.get('dc_api_retry_secs', 5),
                                      config.get('dc_api_use_cache', False),
                                      config.get('dc_api_root', None))
        if batch_result:
            api_result.update(batch_result)
            logging.debug(f'Got DC API result for {function}: {batch_result}')
    return api_result
|
||
|
||
def dc_api_get_defined_dcids(dcids: list, config: dict) -> dict:
    '''Returns a dict mapping each dcid to True if it is defined in the
    DataCommons KG and to False otherwise.
    Uses the property_value API to lookup 'typeOf' for each dcid.
    '''
    api_function = dc.get_property_values
    args = {
        'prop': 'typeOf',
        'out': True,
    }
    api_result = dc_api_batched_wrapper(api_function, dcids, args, config)
    response = {}
    # Mark dcids with missing or empty results as not defined in KG.
    for dcid in dcids:
        if dcid in api_result and api_result[dcid]:
            response[dcid] = True
        else:
            response[dcid] = False
    return response
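
The retry and batching behaviour of the wrappers above can be exercised offline with a stub in place of the Data Commons API. The sketch below is a simplified, self-contained illustration rather than the wrappers themselves: `call_with_retries`, `call_in_batches` and `flaky_lookup` are hypothetical names, caching is left out, and the stub simulates one overloaded-server error before succeeding.

```python
import time
import urllib.error


def call_with_retries(function, args, retries=3, retry_secs=0.0):
    """Call function(**args), retrying on URLError for up to `retries` attempts."""
    retries = max(1, retries)
    for attempt in range(retries):
        try:
            return function(**args)
        except urllib.error.URLError:
            # Re-raise once the final attempt has failed.
            if attempt == retries - 1:
                raise
            time.sleep(retry_secs)


def call_in_batches(function, dcids, batch_size=10, **kwargs):
    """Invoke `function` on slices of `dcids` and merge the returned dicts."""
    merged = {}
    for start in range(0, len(dcids), batch_size):
        batch = dcids[start:start + batch_size]
        result = call_with_retries(function, {"dcids": batch, **kwargs})
        if result:
            merged.update(result)
    return merged


calls = []


def flaky_lookup(dcids, prop):
    """Stub API: fails on the first call, then echoes one value per dcid."""
    calls.append(list(dcids))
    if len(calls) == 1:
        raise urllib.error.URLError("simulated overload")
    return {dcid: [prop] for dcid in dcids}


result = call_in_batches(flaky_lookup, ["a", "b", "c"], batch_size=2, prop="typeOf")
```

Here the first batch is attempted twice (one failure, one success) and the second batch once, so the stub records three calls in total.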
Could you please keep only one README.md file and move the download steps you wrote in "README_Download.md" into it?