Commit 4fc488c — NCES combined pr

shapateriya committed Dec 18, 2024 (1 parent: 5c49edd)

Showing 51 changed files with 37,170 additions and 0 deletions.
176 changes: 176 additions & 0 deletions scripts/us_nces/README.md
# US: National Center for Education Statistics

## About the Dataset
This dataset contains population estimates from the National Center for Education Statistics (NCES) in the US for:
- Private Schools - 1997-2019
- School Districts - 2010-2023
- Public Schools - 2010-2023

The population is categorized by various attributes and their combinations:

1. Count of Students categorized by Race.
2. Count of Students categorized by School Grade, from Pre-Kindergarten to Grade 13.
3. Count of Students categorized by Pupil/Teacher Ratio.
4. Count of Full-Time Equivalent (FTE) Teachers.
5. Count of Students by a combination of Race and School Grade.
6. Count of Students by a combination of Race and Gender.
7. Count of Students by a combination of School Grade and Gender.
8. Count of Students by a combination of Race, School Grade and Gender.
9. Count of Students in Ungraded Classes, Ungraded Students and Adult Education Students.

The Place properties of schools are listed below.
(The available properties depend on the type of school.)

1. `County Number`
2. `School Name`
3. `School ID - NCES Assigned`
4. `Lowest Grade Offered`
5. `Highest Grade Offered`
6. `Phone Number`
7. `ANSI/FIPS State Code`
8. `Location Address 1`
9. `Location City`
10. `Location ZIP`
11. `Magnet School`
12. `Charter School`
13. `School Type`
14. `Title I School Status`
15. `National School Lunch Program`
16. `School Level (SY 2017-18 onward)`
17. `State School ID`
18. `State Agency ID`
19. `State Abbr`
20. `Agency Name`
21. `Location ZIP4`
22. `Agency ID - NCES Assigned`
23. `School Level`
24. `State Name`

### Download URL
The data is downloadable in .csv format from https://nces.ed.gov/ccd/elsi/tableGenerator.aspx.


#### API Output
The attributes used for the import are as follows:

| Attribute            | Description                                         |
|----------------------|-----------------------------------------------------|
| time                 | The year of the population estimate.                |
| geo                  | The area of the population estimate.                |
| Race                 | The number of students categorized by race.         |
| Gender               | The number of students categorized by gender.       |
| School Grade         | The number of students categorized by school grade. |
| Full-Time Equivalent | The count of teachers available.                    |



### Import Procedure

#### Downloading the input files using scripts
- There are four scripts used to download the input files:
  - fetch_ncid.py
  - download_config.py
  - download_file_details.py
  - download.py

##### fetch_ncid.py script
- This script is a Python function that retrieves National Center for Education Statistics (NCES) IDs for a given school type and year. It uses Selenium to automate interacting with the webpage, selecting options, and then extracts the corresponding IDs.

##### download_config.py script
- The download_config.py script contains all the configuration required to download the files.
- It includes the filter, download URL, maximum tries, etc. The values are the same in all cases.
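A hypothetical sketch of what these settings might look like. The names, keys, and values below are illustrative only and are not taken from the actual download_config.py:

```python
# Illustrative shape of the download configuration; the actual keys and
# values in download_config.py may differ.
DOWNLOAD_CONFIG = {
    "download_url": "https://nces.ed.gov/ccd/elsi/tableGenerator.aspx",
    "max_tries": 3,
    "filter": {"year": None, "school_type": None},  # filled in per run
}


def get_config(year: str, school_type: str) -> dict:
    """Return a copy of the config with the per-run filter filled in."""
    config = dict(DOWNLOAD_CONFIG)
    config["filter"] = {"year": year, "school_type": school_type}
    return config
```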

##### download_file_details.py script
- The download_file_details.py script holds the values for "default columns", "columns to be downloaded" and "key columns".
- Each input file can accommodate only 60 columns, so for Public Schools multiple input files are downloaded. All of these input files share a common column, the "Key Column", which acts as the primary key.
- In the "Public columns to be downloaded" section, create a list of columns, e.g.:
  `PUBLIC_COLUMNS = ["State Name [Public School]", "State Abbr [Public School]", "School Name [Public School]"]`
- Steps to add columns to the list:
  - Under "Select Table Columns", select the "Information" tab.
  - Expand the "Basic Information" hit area.
  - Right-click on the desired column checkbox and choose Inspect.
  - In the Elements panel on the right-hand side, check the number assigned to "value" and add the column to the list corresponding to that value.
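The 60-column limit and key-column behavior described above can be sketched as follows. `split_columns` is a hypothetical helper written for illustration, not a function in the actual scripts; the key-column name mirrors the README's "School ID - NCES Assigned" property:

```python
# Hypothetical illustration of splitting a long column list into batches of
# at most 60 columns, each carrying the key column(s) as a primary key.
MAX_COLUMNS_PER_FILE = 60


def split_columns(columns, key_columns, max_cols=MAX_COLUMNS_PER_FILE):
    """Split columns into batches of at most max_cols, each prefixed with key_columns."""
    per_batch = max_cols - len(key_columns)
    return [key_columns + columns[i:i + per_batch]
            for i in range(0, len(columns), per_batch)]


key = ["School ID - NCES Assigned [Public School]"]
columns = [f"Column {i} [Public School]" for i in range(100)]
batches = split_columns(columns, key)
```

Every batch then shares the key column, so the downloaded files can later be joined back on it.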

##### download.py script
- The download.py script is the main script. It takes the import_name and the years to be downloaded, then downloads, extracts and places the input CSVs in the "input_files" folder under the corresponding school directory.


### Command to download the input files
- `/bin/python3 scripts/us_nces/demographics/download.py --import_name={"PrivateSchool"|"District"|"PublicSchool"} --years_to_download="{one of the available years listed under each school type}"`

For example: `/bin/python3 scripts/us_nces/demographics/download.py --import_name="PublicSchool" --years_to_download="2023"`
- The input_files folder containing all the files will be created at:
  `scripts/us_nces/demographics/public_school/input_files`
- Note: Download one year at a time for District and Public Schools, as they have a large number of column values.


#### Cleaned Data
import_name indicates the type of school being processed:
- "private_school"
- "district_school"
- "public_school"

Cleaned data will be saved as a CSV file within the following paths.
- private_school -> [private_school/output_files/us_nces_demographics_private_school.csv]
- district_school -> [school_district/output_files/us_nces_demographics_district_school.csv]
- public_school -> [public_school/output_files/us_nces_demographics_public_school.csv]

The columns of the CSV files are as follows:
- school_state_code
- year
- sv_name
- observation
- scaling_factor
- unit


#### MCFs and Template MCFs
- private_school -> [private_school/output_files/us_nces_demographics_private_school.mcf]
[private_school/output_files/us_nces_demographics_private_school.tmcf]


- district_school -> [school_district/output_files/us_nces_demographics_district_school.mcf],
[school_district/output_files/us_nces_demographics_district_school.tmcf]


- public_school -> [public_school/output_files/us_nces_demographics_public_school.mcf],
[public_school/output_files/us_nces_demographics_public_school.tmcf]


#### Cleaned Place
"import_name" indicates the type of school being processed:
- "private_school"
- "district_school"
- "public_school"

Cleaned place data will be saved as a CSV file at the following paths.
- private_school:
[private_school/output_place/us_nces_demographics_private_place.csv]
- district_school:
[school_district/output_place/us_nces_demographics_district_place.csv]
- public_school:
[public_school/output_place/us_nces_demographics_public_place.csv]

If duplicate School IDs are present in the School Place data, they are saved in the same output path as the CSV and TMCF files:
- [scripts/us_nces/demographics/private_school/output_place/dulicate_id_us_nces_demographics_private_place.csv]
- [scripts/us_nces/demographics/school_district/output_place/dulicate_id_us_nces_demographics_district_place.csv]
- [scripts/us_nces/demographics/public_school/output_place/dulicate_id_us_nces_demographics_public_place.csv]


#### Template MCFs Place
- private_school:
[private_school/output_place/us_nces_demographics_private_place.tmcf]
- district_school:
[school_district/output_place/us_nces_demographics_district_place.tmcf]
- public_school:
[public_school/output_place/us_nces_demographics_public_place.tmcf]

### Running Tests

Run the test cases:

- `/bin/python3 -m unittest scripts/us_nces/demographics/private_school/process_test.py`
- `/bin/python3 -m unittest scripts/us_nces/demographics/school_district/process_test.py`
- `/bin/python3 -m unittest scripts/us_nces/demographics/public_school/process_test.py`

44 changes: 44 additions & 0 deletions scripts/us_nces/common/dcid__mcf_existance.py
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""_summary_
Script to check the property/dcid/nodes existance in datacommons.org.
"""
import datacommons


def check_dcid_existence(nodes: list) -> dict:
"""
Checks the existance of dcid nodes in autopush.
True: Available
False: Unavailable
Args:
nodes (list): Dcid Nodes List
Returns:
dict: Status dictionary.
"""
# pylint: disable=protected-access
nodes_response = datacommons.get_property_values(
nodes,
"typeOf",
out=True,
value_type=None,
limit=datacommons.utils._MAX_LIMIT)
# pylint: enable=protected-access
    node_status = {node: bool(value) for node, value in nodes_response.items()}
return node_status
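A minimal illustration of the status mapping above, using a hypothetical hard-coded response in place of a live `datacommons.get_property_values` call (node names and types are made up):

```python
# Hypothetical API response: each node maps to its list of 'typeOf' values;
# an empty list means the node is not defined in the KG.
nodes_response = {
    "Count_Student": ["StatisticalVariable"],
    "Count_Student_NotARealNode": [],
}
node_status = {node: bool(values) for node, values in nodes_response.items()}
```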
131 changes: 131 additions & 0 deletions scripts/us_nces/common/dcid_existance.py
import datacommons as dc
import time
import logging
import requests_cache
import urllib.error

dc.utils._API_ROOT = 'http://api.datacommons.org'


def dc_api_wrapper(function,
args: dict,
retries: int = 3,
retry_secs: int = 5,
use_cache: bool = False,
api_root: str = None):
    '''Returns the result from the DC API call function.
    Retries the function in case of errors.
    Args:
        function: The DataCommons API function.
        args: Dictionary with all the arguments for the DataCommons API function.
        retries: Number of retries in case of HTTP errors.
        retry_secs: Interval in seconds between retries for which the caller is blocked.
        use_cache: If True, uses the request cache for faster responses.
        api_root: The API server to use. Default is 'http://api.datacommons.org'.
            To use autopush with more recent data, set it to
            'http://autopush.api.datacommons.org'.
    Returns:
        The response from the DataCommons API call.
    '''
if api_root:
dc.utils._API_ROOT = api_root
logging.debug(f'Setting DC API root to {api_root} for {function}')
if not retries or retries <= 0:
retries = 1
if not requests_cache.is_installed():
requests_cache.install_cache(expires_after=300)
cache_context = None
if use_cache:
cache_context = requests_cache.enabled()
logging.debug(f'Using requests_cache for DC API {function}')
    else:
        cache_context = requests_cache.disabled()
        logging.debug(f'Not using requests_cache for DC API {function}')
with cache_context:
for attempt in range(0, retries):
try:
logging.debug(
f'Invoking DC API {function} with {args}, retries={retries}'
)
return function(**args)
            except KeyError:
                # Exception in case of API error.
                return None
            except urllib.error.URLError:
                # Exception when the server is overloaded; retry after a delay.
                if attempt >= retries - 1:
                    raise RuntimeError(
                        f'DC API {function} failed after {retries} attempts')
                logging.debug(
                    f'Retrying API {function} after {retry_secs} secs...')
                time.sleep(retry_secs)
return None
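The retry behavior can be exercised without network access using a stub that fails twice before succeeding. `retry_call` is a simplified standalone version of the loop above (no caching or API-root handling), written only for illustration:

```python
import time
import urllib.error


def retry_call(function, args, retries=3, retry_secs=0):
    """Simplified version of dc_api_wrapper's retry loop (no caching)."""
    for attempt in range(retries):
        try:
            return function(**args)
        except urllib.error.URLError:
            if attempt >= retries - 1:
                raise RuntimeError(f'failed after {retries} attempts')
            time.sleep(retry_secs)


state = {"calls": 0}


def flaky(x):
    """Raises URLError on the first two calls, then succeeds."""
    state["calls"] += 1
    if state["calls"] < 3:
        raise urllib.error.URLError("server overloaded")
    return {"x": x}


result = retry_call(flaky, {"x": 1})
```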


def dc_api_batched_wrapper(function, dcids: list, args: dict,
config: dict) -> dict:
    '''Returns the merged dictionary result of the function called over all dcids.
    It batches the dcids to make multiple calls to the DC API and merges all results.
    Args:
        function: DC API to be invoked. It should have dcids as one of the arguments
            and should return a dictionary with dcid as the key.
        dcids: List of dcids the function is to be invoked with.
        args: Additional arguments for the function call.
        config: Dictionary of DC API configuration settings.
            The supported settings are:
                dc_api_batch_size: Number of dcids to invoke per API call.
                dc_api_retries: Number of times an API call can be retried.
                dc_api_retry_secs: Interval in seconds between retries.
                dc_api_use_cache: Enable/disable the request cache for the DC API calls.
                dc_api_root: The server to use for the DC API calls.
    Returns:
        Merged function return values across all dcids.
    '''
api_result = {}
index = 0
num_dcids = len(dcids)
api_batch_size = config.get('dc_api_batch_size', 10)
logging.info(
f'Calling DC API {function} on {len(dcids)} dcids in batches of {api_batch_size} with args: {args}...'
)
while index < num_dcids:
        # dcids in batches.
        dcids_batch = dcids[index:index + api_batch_size]
index += api_batch_size
args['dcids'] = dcids_batch
batch_result = dc_api_wrapper(function, args,
config.get('dc_api_retries', 3),
config.get('dc_api_retry_secs', 5),
config.get('dc_api_use_cache', False),
config.get('dc_api_root', None))
if batch_result:
api_result.update(batch_result)
logging.debug(f'Got DC API result for {function}: {batch_result}')
return api_result
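The batching logic above boils down to slicing the dcid list in fixed-size steps; a standalone sketch with made-up dcids:

```python
def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


dcids = [f"dcid{i}" for i in range(7)]
groups = list(batches(dcids, 3))
```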


def dc_api_get_defined_dcids(dcids: list, config: dict) -> dict:
    '''Returns a dict mapping each dcid to True if it is defined in the
    DataCommons KG, else False.
    Uses the property_values API to look up 'typeOf' for each dcid.
    '''
api_function = dc.get_property_values
args = {
'prop': 'typeOf',
'out': True,
}
api_result = dc_api_batched_wrapper(api_function, dcids, args, config)
response = {}
# Remove empty results for dcids not defined in KG.
for dcid in dcids:
if dcid in api_result and api_result[dcid]:
response[dcid] = True
else:
response[dcid] = False
return response
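The response shaping at the end of the function above can be shown with a hypothetical merged API result (dcids and types are made up):

```python
# Hypothetical merged result from the batched 'typeOf' lookups; a dcid absent
# from the result, or present with an empty list, is not defined in the KG.
api_result = {
    "geoId/06": ["State"],
    "nces/bogus": [],
}
dcids = ["geoId/06", "nces/bogus", "nces/missing"]
response = {d: bool(api_result.get(d)) for d in dcids}
```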