# US: National Center for Education Statistics

## About the Dataset
This dataset has population estimates from the National Center for Education Statistics (NCES) in the US for:
- Private Schools - 1997-2019
- School Districts - 2010-2023
- Public Schools - 2010-2023

The population is categorized on various attributes and their combinations:

1. Count of Students categorized by Race.
2. Count of Students categorized by School Grade, from Pre-Kindergarten to Grade 13.
3. Count of Students categorized by Pupil/Teacher Ratio.
4. Count of Full-Time Equivalent (FTE) Teachers.
5. Count of Students by a combination of Race and School Grade.
6. Count of Students by a combination of Race and Gender.
7. Count of Students by a combination of School Grade and Gender.
8. Count of Students by a combination of Race, School Grade and Gender.
9. Count of Students under Ungraded Classes, Ungraded Students and Adult Education Students.

The place properties of schools are given below (which properties are available depends on the type of school):
1. `County Number`
2. `School Name`
3. `School ID - NCES Assigned`
4. `Lowest Grade Offered`
5. `Highest Grade Offered`
6. `Phone Number`
7. `ANSI/FIPS State Code`
8. `Location Address 1`
9. `Location City`
10. `Location ZIP`
11. `Magnet School`
12. `Charter School`
13. `School Type`
14. `Title I School Status`
15. `National School Lunch Program`
16. `School Level (SY 2017-18 onward)`
17. `State School ID`
18. `State Agency ID`
19. `State Abbr`
20. `Agency Name`
21. `Location ZIP4`
22. `Agency ID - NCES Assigned`
23. `School Level`
24. `State Name`

### Download URL
The data in .csv format can be downloaded from https://nces.ed.gov/ccd/elsi/tableGenerator.aspx.

#### API Output
The attributes used for the import are as follows:

| Attribute | Description |
|---|---|
| time | The year of the population estimates provided. |
| geo | The area of the population estimates provided. |
| Race | The number of students categorized by race. |
| Gender | The number of students categorized by gender. |
| School Grade | The number of students categorized by school grade. |
| Full-Time Equivalent | The count of teachers available. |

### Import Procedure

#### Downloading the input files using scripts
- There are four scripts used to download the input files:
  - fetch_ncid.py
  - download_config.py
  - download_file_details.py
  - download.py

##### fetch_ncid.py script
- The script defines a Python function that retrieves National Center for Education Statistics (NCES) IDs for a given school type and year. It automates interaction with the web page using Selenium to select options and then extracts the corresponding IDs.
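For reference, below is a minimal sketch of the kind of Selenium automation fetch_ncid.py performs. The element locators and the helper name are illustrative assumptions, not the script's actual code; only the ELSi URL comes from this README.

```python
# Illustrative sketch only: the URL is the ELSi table generator named in this
# README, but the element locators below are assumptions, not the selectors
# used by fetch_ncid.py.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select


def fetch_ncid_sketch(school_type: str, year: str) -> list:
    """Open the ELSi table generator, pick a school type and year,
    and collect the NCES column IDs exposed on the page."""
    driver = webdriver.Chrome()
    try:
        driver.get("https://nces.ed.gov/ccd/elsi/tableGenerator.aspx")
        # Hypothetical dropdown IDs; inspect the page to find the real ones.
        Select(driver.find_element(By.ID, "school_type_dropdown")
              ).select_by_visible_text(school_type)
        Select(driver.find_element(By.ID, "year_dropdown")
              ).select_by_visible_text(year)
        # Collect the "value" attribute of each column checkbox.
        checkboxes = driver.find_elements(By.CSS_SELECTOR,
                                          "input[type='checkbox']")
        return [box.get_attribute("value") for box in checkboxes]
    finally:
        driver.quit()
```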

##### download_config.py script
- The download_config.py script has all the configuration required to download the files.
- It includes the filter, the download URL, the maximum number of tries, etc. The values are the same in all cases.
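As an illustration of that layout, a hypothetical outline of such a configuration module is shown below. The constant names and values are assumptions, not the actual contents of download_config.py; only the URL and year ranges come from this README.

```python
# Hypothetical outline only; the real download_config.py defines its own
# names and values.
DOWNLOAD_URL = "https://nces.ed.gov/ccd/elsi/tableGenerator.aspx"
MAX_TRIES = 3            # maximum download attempts per file (illustrative)
RETRY_INTERVAL_SECS = 5  # wait between attempts (illustrative)
AVAILABLE_YEARS = {
    "PrivateSchool": "1997-2019",
    "District": "2010-2023",
    "PublicSchool": "2010-2023",
}
```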

##### download_file_details.py script
- The download_file_details.py script has values for the "default column", "columns to be downloaded" and "key columns".
- Every input file can only accommodate 60 columns, so for Public Schools multiple input files are downloaded. All of these input files share a common column, the "Key Column", which acts as the primary key (see the merge sketch after this list).
- In "Public columns to be downloaded", create a list of columns, e.g.
  `PUBLIC_COLUMNS = ["State Name [Public School]", "State Abbr [Public School]", "School Name [Public School]"]`
- Steps to add columns to the list:
  - Under "Select Table Columns", select the "Information" tab.
  - Expand the "BasicInformation" hit area.
  - Right-click on the desired column checkbox and choose Inspect.
  - From the elements on the right-hand side, check the number assigned to "value" and add the column to the list that corresponds to that value.
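Conceptually, the per-batch CSVs for a year are recombined on that key column. The pandas sketch below illustrates the idea; the key-column name is an assumption, since the real one is defined in download_file_details.py.

```python
# Conceptual sketch: recombine the per-batch CSVs for one year on the shared
# key column. The key-column name below is an assumption.
import glob

import pandas as pd

KEY_COLUMN = "School ID - NCES Assigned [Public School]"  # assumed name

paths = sorted(glob.glob("public_school/input_files/*.csv"))
merged = pd.read_csv(paths[0])
for path in paths[1:]:
    merged = merged.merge(pd.read_csv(path), on=KEY_COLUMN, how="outer")
```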

##### download.py script
- The download.py script is the main script. It takes the import_name and the years to be downloaded, then downloads, extracts and places the input CSVs in the "input_files" folder under the desired school directory.

### Command to Download input file
- `/bin/python3 scripts/us_nces/demographics/download.py --import_name={"PrivateSchool" | "District" | "PublicSchool"} --years_to_download="{select from the years available for each school type}"`

For example: `/bin/python3 scripts/us_nces/demographics/download.py --import_name="PublicSchool" --years_to_download="2023"`.
- The input_files folder containing all the files will be present in:
  `scripts/us_nces/demographics/public_school/input_files`
- Note: Give one year at a time for District and Public Schools, as there are a large number of column values.

#### Cleaned Data
import_name indicates the school type being processed:
- "private_school"
- "district_school"
- "public_school"

Cleaned data will be saved as a CSV file at the following paths:
- private_school -> [private_school/output_files/us_nces_demographics_private_school.csv]
- district_school -> [school_district/output_files/us_nces_demographics_district_school.csv]
- public_school -> [public_school/output_files/us_nces_demographics_public_school.csv]

The columns of the CSV files are as follows:
- school_state_code
- year
- sv_name
- observation
- scaling_factor
- unit
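As a quick sanity check, the cleaned CSV can be inspected directly. The snippet below is only a suggestion; the path and column names are the ones listed above for the public-school import.

```python
import pandas as pd

# Path and column names as listed above for the public-school import.
df = pd.read_csv(
    "public_school/output_files/us_nces_demographics_public_school.csv")
print(df[["school_state_code", "year", "sv_name", "observation"]].head())
```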

#### MCFs and Template MCFs
- private_school -> [private_school/output_files/us_nces_demographics_private_school.mcf],
  [private_school/output_files/us_nces_demographics_private_school.tmcf]
- district_school -> [school_district/output_files/us_nces_demographics_district_school.mcf],
  [school_district/output_files/us_nces_demographics_district_school.tmcf]
- public_school -> [public_school/output_files/us_nces_demographics_public_school.mcf],
  [public_school/output_files/us_nces_demographics_public_school.tmcf]

#### Cleaned Place
"import_name" indicates the type of school being executed:
- "private_school"
- "district_school"
- "public_school"

Cleaned place data will be saved as a CSV file at the following paths:
- private_school: [private_school/output_place/us_nces_demographics_private_place.csv]
- district_school: [school_district/output_place/us_nces_demographics_district_place.csv]
- public_school: [public_school/output_place/us_nces_demographics_public_place.csv]

If there are duplicate School IDs present in School Place, they will be saved in the same output path as the CSV and TMCF files (the sketch after this list shows one way such duplicates can be spotted):
- [scripts/us_nces/demographics/private_school/output_place/dulicate_id_us_nces_demographics_private_place.csv]
- [scripts/us_nces/demographics/school_district/output_place/dulicate_id_us_nces_demographics_district_place.csv]
- [scripts/us_nces/demographics/public_school/output_place/dulicate_id_us_nces_demographics_public_place.csv]
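The snippet below is purely illustrative of how such duplicates could be flagged with pandas; whether the place CSV uses "school_state_code" as its key is an assumption, since the place schema is not listed here.

```python
# Illustrative only: flag rows that share a school ID in the place CSV.
# Whether the place CSV keys on "school_state_code" is an assumption.
import pandas as pd

places = pd.read_csv(
    "public_school/output_place/us_nces_demographics_public_place.csv")
duplicates = places[places.duplicated(subset=["school_state_code"], keep=False)]
print(f"{len(duplicates)} rows share a school ID with another row")
```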

#### Template MCFs Place
- private_school: [private_school/output_place/us_nces_demographics_private_place.tmcf]
- district_school: [school_district/output_place/us_nces_demographics_district_place.tmcf]
- public_school: [public_school/output_place/us_nces_demographics_public_place.tmcf]

### Running Tests

Run the test cases:

- `/bin/python3 -m unittest scripts/us_nces/demographics/private_school/process_test.py`
- `/bin/python3 -m unittest scripts/us_nces/demographics/school_district/process_test.py`
- `/bin/python3 -m unittest scripts/us_nces/demographics/public_school/process_test.py`
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Script to check the existence of properties/dcids/nodes in datacommons.org."""
import datacommons


def check_dcid_existence(nodes: list) -> dict:
    """
    Checks the existence of dcid nodes in autopush.
    True: Available
    False: Unavailable
    Args:
        nodes (list): List of dcid nodes.
    Returns:
        dict: Status dictionary keyed by node.
    """
    # pylint: disable=protected-access
    nodes_response = datacommons.get_property_values(
        nodes,
        "typeOf",
        out=True,
        value_type=None,
        limit=datacommons.utils._MAX_LIMIT)
    # pylint: enable=protected-access
    node_status = {}
    for node, value in nodes_response.items():
        # A node with no 'typeOf' values is not defined in the KG.
        if value == []:
            node_status[node] = False
        else:
            node_status[node] = True
    return node_status
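A small usage sketch for the helper above; the dcids are well-known Data Commons nodes used purely for illustration.

```python
# Usage sketch for check_dcid_existence; the dcids below are illustrative.
status = check_dcid_existence(["Count_Person", "Count_Student", "NotARealDcid"])
for node, exists in status.items():
    print(node, "exists in the KG" if exists else "not found")
```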
import datacommons as dc
import time
import logging
import requests_cache
import urllib.error

dc.utils._API_ROOT = 'http://api.datacommons.org'


def dc_api_wrapper(function,
                   args: dict,
                   retries: int = 3,
                   retry_secs: int = 5,
                   use_cache: bool = False,
                   api_root: str = None):
    '''Returns the result from the DC API function call.
    Retries the function in case of errors.
    Args:
        function: The DataCommons API function.
        args: Dictionary with all the arguments for the DataCommons API function.
        retries: Number of retries in case of HTTP errors.
        retry_secs: Interval in seconds between retries for which the caller is blocked.
        use_cache: If True, uses the request cache for faster responses.
        api_root: The API server to use. Default is 'http://api.datacommons.org'.
            To use autopush with more recent data, set it to
            'http://autopush.api.datacommons.org'.
    Returns:
        The response from the DataCommons API call.
    '''
    if api_root:
        dc.utils._API_ROOT = api_root
        logging.debug(f'Setting DC API root to {api_root} for {function}')
    if not retries or retries <= 0:
        retries = 1
    if not requests_cache.is_installed():
        requests_cache.install_cache(expire_after=300)
    cache_context = None
    if use_cache:
        cache_context = requests_cache.enabled()
        logging.debug(f'Using requests_cache for DC API {function}')
    else:
        cache_context = requests_cache.disabled()
        logging.debug(f'Not using requests_cache for DC API {function}')
    with cache_context:
        for attempt in range(0, retries):
            try:
                logging.debug(
                    f'Invoking DC API {function} with {args}, retries={retries}'
                )
                return function(**args)
            except KeyError:
                # Exception in case of API error.
                return None
            except urllib.error.URLError:
                # Exception when the server is overloaded; retry after a delay.
                if attempt >= retries - 1:
                    raise RuntimeError(
                        f'DC API {function} failed after {retries} attempts')
                logging.debug(
                    f'Retrying API {function} after {retry_secs} secs...')
                time.sleep(retry_secs)
    return None


def dc_api_batched_wrapper(function, dcids: list, args: dict,
                           config: dict) -> dict:
    '''Returns the merged dictionary result of the function called over all dcids.
    It batches the dcids to make multiple calls to the DC API and merges all results.
    Args:
        function: DC API to be invoked. It should have dcids as one of the arguments
            and should return a dictionary with dcid as the key.
        dcids: List of dcids to be invoked with the function.
        args: Additional arguments for the function call.
        config: Dictionary of DC API configuration settings.
            The supported settings are:
                dc_api_batch_size: Number of dcids to invoke per API call.
                dc_api_retries: Number of times an API call can be retried.
                dc_api_retry_secs: Interval in seconds between retries.
                dc_api_use_cache: Enable/disable the request cache for the DC API call.
                dc_api_root: The server to use for the DC API calls.
    Returns:
        Merged function return values across all dcids.
    '''
    api_result = {}
    index = 0
    num_dcids = len(dcids)
    api_batch_size = config.get('dc_api_batch_size', 10)
    logging.info(
        f'Calling DC API {function} on {len(dcids)} dcids in batches of '
        f'{api_batch_size} with args: {args}...')
    while index < num_dcids:
        # Process dcids in batches.
        dcids_batch = [
            # strip_namespace(x) for x in dcids[index:index + api_batch_size]
            x for x in dcids[index:index + api_batch_size]
        ]
        index += api_batch_size
        args['dcids'] = dcids_batch
        batch_result = dc_api_wrapper(function, args,
                                      config.get('dc_api_retries', 3),
                                      config.get('dc_api_retry_secs', 5),
                                      config.get('dc_api_use_cache', False),
                                      config.get('dc_api_root', None))
        if batch_result:
            api_result.update(batch_result)
            logging.debug(f'Got DC API result for {function}: {batch_result}')
    return api_result


def dc_api_get_defined_dcids(dcids: list, config: dict) -> dict:
    '''Returns a dict mapping each dcid to whether it is defined in the DataCommons KG.
    Uses the property_value API to look up 'typeOf' for each dcid.
    dcids with no 'typeOf' in the KG are marked False in the response.
    '''
    api_function = dc.get_property_values
    args = {
        'prop': 'typeOf',
        'out': True,
    }
    api_result = dc_api_batched_wrapper(api_function, dcids, args, config)
    response = {}
    # Mark dcids with no result as not defined in the KG.
    for dcid in dcids:
        if dcid in api_result and api_result[dcid]:
            response[dcid] = True
        else:
            response[dcid] = False
    return response
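A usage sketch for the batched lookup above; the config keys mirror the ones read by dc_api_batched_wrapper, and the dcids are illustrative.

```python
# Usage sketch: check which dcids are defined in the KG, in batches.
config = {
    "dc_api_batch_size": 10,
    "dc_api_retries": 3,
    "dc_api_retry_secs": 5,
    "dc_api_use_cache": False,
}
defined = dc_api_get_defined_dcids(["Count_Person", "Count_Student"], config)
print(defined)  # e.g. {'Count_Person': True, 'Count_Student': True}
```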