Update/integration iqair devices #3995
Conversation
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (5)
src/workflows/airqo_etl_utils/schema/bam_measurements.json (1)

`7-11`: **Consider adding a description field for the network attribute.**

The addition of the `network` field is consistent with the schema updates across other measurement files. However, unlike other fields (e.g., `pm2_5`, `no2`), this field lacks a description of its purpose and expected values.

Consider adding a description field to improve schema documentation:

```diff
 {
   "name": "network",
   "type": "STRING",
   "mode": "NULLABLE",
+  "description": "Identifies the network associated with the BAM measurement (e.g., 'airqo', 'iqair')."
 },
```
src/workflows/airqo_etl_utils/schema/mobile_measurements.json (1)

`7-11`: **Schema changes are consistent, but consider documenting the network values.**

The addition of the `network` field maintains consistency with other measurement schemas. However, to ensure proper data validation and usage:

- Consider documenting the allowed/expected network values
- Consider adding data validation rules if specific network values are expected

Consider creating centralized schema documentation that defines:

- Valid network values across all schemas
- Relationships between networks in different schemas
- Data validation rules for network values (a minimal validation sketch follows this list)
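As a starting point for such validation rules, a lightweight allow-list check in the ETL layer could reject unexpected network values before loading. This is a minimal sketch; the `ALLOWED_NETWORKS` set and the `validate_network_values` helper are hypothetical, and the actual set of valid values would come from the centralized documentation proposed above:

```python
import pandas as pd

# Hypothetical allow-list; the real set of valid networks would come from
# centralized schema documentation or configuration.
ALLOWED_NETWORKS = {"airqo", "iqair"}

def validate_network_values(data: pd.DataFrame) -> pd.DataFrame:
    """Raise if the 'network' column contains values outside the allow-list."""
    unexpected = set(data["network"].dropna().unique()) - ALLOWED_NETWORKS
    if unexpected:
        raise ValueError(f"Unexpected network values: {sorted(unexpected)}")
    return data

# Usage: fail fast before the frame reaches BigQuery.
df = pd.DataFrame({"network": ["airqo", "iqair", None]})
validate_network_values(df)  # passes; an unlisted value would raise
```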
src/workflows/airqo_etl_utils/data_validator.py (1)

Line range hint `761-780`: **Network-specific data processing looks good, but consider extracting the PM calculations.**

The changes correctly handle:

- Timestamp conversion when outliers are not removed
- Network-specific mean calculations for PM2.5 and PM10

However, the PM calculations could be extracted into a dedicated method for better maintainability:

```diff
-        is_airqo_network = data["network"] == "airqo"
-
-        pm2_5_mean = data.loc[is_airqo_network, ["s1_pm2_5", "s2_pm2_5"]].mean(
-            axis=1
-        )
-        pm10_mean = data.loc[is_airqo_network, ["s1_pm10", "s2_pm10"]].mean(axis=1)
-
-        data.loc[is_airqo_network, "pm2_5_raw_value"] = pm2_5_mean
-        data.loc[is_airqo_network, "pm2_5"] = pm2_5_mean
-        data.loc[is_airqo_network, "pm10_raw_value"] = pm10_mean
-        data.loc[is_airqo_network, "pm10"] = pm10_mean
+        data = self._calculate_network_pm_means(data)
+
+    @staticmethod
+    def _calculate_network_pm_means(data: pd.DataFrame) -> pd.DataFrame:
+        """Calculate PM2.5 and PM10 means for AirQo network data."""
+        is_airqo_network = data["network"] == "airqo"
+
+        pm2_5_mean = data.loc[is_airqo_network, ["s1_pm2_5", "s2_pm2_5"]].mean(axis=1)
+        pm10_mean = data.loc[is_airqo_network, ["s1_pm10", "s2_pm10"]].mean(axis=1)
+
+        data.loc[is_airqo_network, "pm2_5_raw_value"] = pm2_5_mean
+        data.loc[is_airqo_network, "pm2_5"] = pm2_5_mean
+        data.loc[is_airqo_network, "pm10_raw_value"] = pm10_mean
+        data.loc[is_airqo_network, "pm10"] = pm10_mean
+
+        return data
```

src/workflows/airqo_etl_utils/schema/latest_measurements.json (1)
`7-11`: **Schema change is consistent with related tables.**

The network field addition maintains consistency with other measurement schemas, which is crucial for data warehouse design.

Consider documenting the relationship between the tenant and network fields in the schema, as they appear to be related hierarchical identifiers.
src/workflows/airqo_etl_utils/schema/data_warehouse.json (1)

`7-11`: **Consider adding a field description for network context.**

While the field addition is correct, adding a description would help clarify its purpose and expected values, similar to how other fields in this schema have descriptions:

```diff
 {
   "name": "network",
   "type": "STRING",
-  "mode": "NULLABLE"
+  "mode": "NULLABLE",
+  "description": "Identifies the network to which the device or measurement belongs"
 },
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📒 Files selected for processing (8)

- src/workflows/airqo_etl_utils/airqo_utils.py (2 hunks)
- src/workflows/airqo_etl_utils/data_validator.py (1 hunk)
- src/workflows/airqo_etl_utils/schema/airqo_mobile_measurements.json (1 hunk)
- src/workflows/airqo_etl_utils/schema/bam_measurements.json (2 hunks)
- src/workflows/airqo_etl_utils/schema/data_warehouse.json (1 hunk)
- src/workflows/airqo_etl_utils/schema/devices.json (2 hunks)
- src/workflows/airqo_etl_utils/schema/latest_measurements.json (1 hunk)
- src/workflows/airqo_etl_utils/schema/mobile_measurements.json (2 hunks)
🔇 Additional comments (3)
src/workflows/airqo_etl_utils/data_validator.py (1)

`75-78`: **Improved timestamp handling looks good!**

The changes enhance timezone handling by:

- Including the '+' character in the regex pattern to support timezone offset notation
- Ensuring UTC conversion with `utc=True` (see the sketch after this list)
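To make the conversion behavior concrete, here is a minimal, self-contained sketch of parsing mixed ISO 8601 timestamps with `pd.to_datetime(..., utc=True)`. The sample values and the `format="ISO8601"` argument (available in pandas >= 2.0) are illustrative assumptions, not lines from data_validator.py:

```python
import pandas as pd

# Timestamps in several ISO 8601 variants: 'Z' suffix, explicit '+' offset,
# and a naive value with no timezone information.
timestamps = pd.Series(
    [
        "2024-01-15T10:30:00Z",
        "2024-01-15T10:30:00+03:00",
        "2024-01-15T10:30:00",
    ]
)

# utc=True normalizes everything to a single tz-aware UTC dtype; naive values
# are treated as already UTC. format="ISO8601" accepts the mixed suffix styles
# in one pass.
parsed = pd.to_datetime(timestamps, format="ISO8601", utc=True)
print(parsed)
# 0   2024-01-15 10:30:00+00:00
# 1   2024-01-15 07:30:00+00:00
# 2   2024-01-15 10:30:00+00:00
```

With `utc=True`, the '+03:00' value is shifted to 07:30 UTC, which is why supporting the '+' offset notation in the regex matters.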
Let's verify the timestamp formats in the codebase:

✅ Verification successful

Timestamp handling changes are correctly implemented and safe

The verification confirms that:

- The regex pattern change to include '+' is compatible with all timestamp formats in the codebase, which primarily use the 'Z' suffix (UTC) format
- The `utc=True` parameter in `pd.to_datetime()` is consistently used across the codebase for timezone handling
- The changes maintain backward compatibility while improving timezone offset support
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```bash
#!/bin/bash
# Search for timestamp patterns to ensure compatibility
rg -g '*.{py,json}' '[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}[+-][0-9]{4}'
```

Length of output: 89

Script:

```bash
#!/bin/bash
# Let's search for different timestamp patterns in the codebase
# Search for ISO format timestamps with Z suffix
rg -g '*.{py,json}' '[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\.?[0-9]*Z'

# Search for timestamp string handling in Python files
rg -g '*.py' 'to_datetime.*utc'

# Look for datetime format patterns
rg -g '*.py' 'datetime.*format|strftime|strptime'

# Check the data_validator.py file for context around the timestamp handling
cat src/workflows/airqo_etl_utils/data_validator.py
```

Length of output: 35916
src/workflows/airqo_etl_utils/airqo_utils.py (1)

`1043`: **Network-specific device retrieval looks good!**

The change ensures consistency by explicitly filtering for AirQo network devices.
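For context, network-scoped filtering of this kind typically looks like the minimal sketch below. The `get_devices` helper and the sample records are hypothetical illustrations, not the actual call site in airqo_utils.py:

```python
# Hypothetical illustration of network-scoped device retrieval; the real
# code in airqo_utils.py may differ in names and signature.
def get_devices(devices: list[dict], network: str = "airqo") -> list[dict]:
    """Return only the devices that belong to the given network."""
    return [d for d in devices if d.get("network") == network]

devices = [
    {"device_id": "aq_01", "network": "airqo"},
    {"device_id": "iq_07", "network": "iqair"},
]
print(get_devices(devices, network="airqo"))
# [{'device_id': 'aq_01', 'network': 'airqo'}]
```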
src/workflows/airqo_etl_utils/schema/data_warehouse.json (1)

`2-337`: **Verify data consistency across related tables.**

The schema changes look good overall, but we should ensure data consistency across related tables.

Let's check for related schema files that might need similar updates:

✅ Verification successful

The previous search revealed several measurement-related schema files. Let's examine the content of the most relevant ones to verify schema consistency and check for any ETL code that might be using these schemas. This will help us determine if the schema changes need to be propagated to other files.

After examining the schema files, I notice some differences in field presence and structure between data_warehouse.json and the related measurement schemas. We should check how these schemas are used in the ETL code and any data transformation logic to ensure the changes are properly handled.

Schema changes are consistent with related tables

After analyzing the schema files and ETL code, I can confirm that the changes in data_warehouse.json are consistent with the schema structure used across related measurement tables. The BigQuery API validates schema consistency during data loading, and the schema changes align with the existing field types and modes in latest_measurements.json and other measurement tables.

Key points verified:

- Schema validation is enforced through the `validate_data()` method in bigquery_api.py (a minimal sketch of such a check follows this list)
- Field types (STRING, FLOAT, TIMESTAMP) are consistent across measurement tables
- All fields are consistently marked as NULLABLE
- Common measurement fields (pm2_5, pm10, etc.) maintain consistent data types and descriptions
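To make the first bullet concrete, here is a minimal sketch of what schema-driven validation before a BigQuery load can look like. The `validate_data` name echoes the bullet above, but the body is an assumed illustration, not the implementation in bigquery_api.py:

```python
import json
import pandas as pd

def validate_data(data: pd.DataFrame, schema_path: str) -> pd.DataFrame:
    """Illustrative check: keep only columns declared in the JSON schema
    and fail fast on REQUIRED columns that are missing or null."""
    with open(schema_path) as f:
        schema = json.load(f)

    declared = [field["name"] for field in schema]
    required = [f["name"] for f in schema if f.get("mode") == "REQUIRED"]

    missing_required = [c for c in required if c not in data.columns]
    if missing_required:
        raise ValueError(f"Missing REQUIRED columns: {missing_required}")

    for column in required:
        if data[column].isna().any():
            raise ValueError(f"REQUIRED column '{column}' contains nulls")

    # Drop any columns the destination table does not declare.
    return data[[c for c in declared if c in data.columns]]
```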
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```bash
#!/bin/bash
# Search for other measurement-related schema files
fd -e json . | grep -i "measurements\|devices"

# Search for any hardcoded references to these table schemas
rg -l "data_warehouse.json\|latest_measurements.json\|airqo_mobile_measurements.json"
```

Length of output: 736

Script:

```bash
#!/bin/bash
# Let's check the content of these schema files to compare their structure
for file in src/workflows/airqo_etl_utils/schema/{latest_measurements,measurements,airqo_mobile_measurements}.json; do
  echo "=== $file ==="
  cat "$file"
  echo -e "\n\n"
done

# Also check for any ETL code that might be using these schemas
rg -l "schema.*json" src/workflows/airqo_etl_utils/
```

Length of output: 12508

Script:

```bash
#!/bin/bash
# Let's check how these schemas are being used in the ETL code
echo "=== Checking bigquery_api.py for schema usage ==="
cat src/workflows/airqo_etl_utils/bigquery_api.py

# Let's also check if there are any data transformation functions that might be affected
rg -l "transform.*data|process.*measurements" src/workflows/airqo_etl_utils/
```

Length of output: 35405
In src/workflows/airqo_etl_utils/schema/devices.json:

```json
{
  "name": "network",
  "type": "STRING",
  "mode": "REQUIRED"
},
```
💡 Codebase verification

Migration strategy needed for REQUIRED network field

Based on the codebase analysis, while there is code handling network-related operations, there is no explicit data migration strategy for the new REQUIRED network field. Critical observations:

- The API code (`airqo_api.py`) shows network is used in device retrieval operations
- The utils code (`airqo_utils.py`) shows the network field is accessed with a fallback: `device.get("network", None)`
- No SQL migration scripts were found to update existing records
- No default network assignment logic exists for historical data

Action items needed:

- Create a data migration script to populate the network field for existing device records (a sketch follows this list)
- Update the device retrieval logic to handle the REQUIRED constraint
- Remove the `get("network", None)` fallback, as it won't be needed with a REQUIRED field
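One possible shape for the first action item is sketched below: a BigQuery DML backfill run before tightening the schema. The table identifier is a placeholder, and defaulting null rows to 'airqo' is an assumption that would need confirmation per device source:

```python
from google.cloud import bigquery

# Placeholder identifier; substitute the real project/dataset/table.
TABLE = "my-project.my_dataset.devices"

client = bigquery.Client()
query = f"""
    UPDATE `{TABLE}`
    SET network = 'airqo'  -- assumed default; verify per device source
    WHERE network IS NULL
"""
job = client.query(query)
job.result()  # Wait for the backfill to finish before tightening the schema.
print(f"Rows updated: {job.num_dml_affected_rows}")
```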
🔗 Analysis chain

Verify data migration strategy for the REQUIRED network field.

While making the `network` field REQUIRED in the devices schema is logically correct (as every device should belong to a network), this change requires careful consideration:

- Existing devices in the database will need to have this field populated
- Default values may need to be established for historical data

Let's verify the potential impact on existing device records:
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```bash
#!/bin/bash
# Description: Check for existing device records that might be affected by the new REQUIRED field

# Search for any BigQuery schema migration scripts
fd -t f -e sql . | xargs rg -l "ALTER.*devices.*network"

# Search for any data migration utilities
rg -l "devices.*network" --type py
```

Length of output: 219

Script:

```bash
#!/bin/bash
# Let's examine the content of these files to check for any data migration logic

# Check airqo_utils.py for device-related functions
rg -A 5 "devices.*network" src/workflows/airqo_etl_utils/airqo_utils.py

# Check airqo_api.py for network field handling
rg -A 5 "devices.*network" src/workflows/airqo_etl_utils/airqo_api.py

# Look for any default network values or migration logic
ast-grep --pattern 'network = $_'

# Search for any BigQuery update statements
rg -i "update.*devices.*set.*network"
```

Length of output: 2317
"name": "network", | ||
"type": "STRING", | ||
"mode": "NULLABLE" | ||
}, |
💡 Codebase verification

Schema update is safe, but ETL code needs updating

The ETL pipeline uses dynamic column population through `populate_missing_columns`, which safely handles new fields. However, there are a few places that need attention:

- Update the `data_columns` list in `airqo_utils.py` to include the new 'network' field for comprehensive data processing
- Review column mappings in `AirQoDataUtils.Device_Field_Mapping` to ensure proper handling of the network field
- Consider adding network field population logic in the data transformation sections where tenant is being set (a sketch of the column-population pattern follows this list)
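For reference, dynamic column population of the kind described above is commonly implemented like the sketch below. The function name mirrors the one cited, but the body is an assumed illustration rather than the repository's code:

```python
import numpy as np
import pandas as pd

def populate_missing_columns(data: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Add any expected columns that are absent, filled with NaN, so that
    downstream schema validation and BigQuery loading see a stable shape."""
    for column in columns:
        if column not in data.columns:
            data[column] = np.nan
    return data

# Usage: guarantee the new 'network' column exists even for older payloads.
df = pd.DataFrame({"device_id": ["aq_01"], "pm2_5": [12.4]})
df = populate_missing_columns(df, ["device_id", "pm2_5", "network"])
print(df.columns.tolist())  # ['device_id', 'pm2_5', 'network']
```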
🔗 Analysis chain
Network field addition looks good, but consider data migration impact.
The addition of the network field is well-structured and consistent with other schema files. The placement after the tenant field is logical from a data organization perspective.
Let's verify if there are any existing ETL jobs that might need updating:
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```bash
#!/bin/bash
# Search for ETL jobs that might need updating for the new network field
rg -l "airqo_mobile_measurements" | grep -i "etl\|pipeline"
```

Length of output: 190

Script:

```bash
#!/bin/bash
# Check the content of ETL files for mobile measurements processing
echo "=== Checking airqo_utils.py ==="
rg -A 5 "airqo_mobile_measurements" src/workflows/airqo_etl_utils/airqo_utils.py

echo -e "\n=== Checking bigquery_api.py ==="
rg -A 5 "airqo_mobile_measurements" src/workflows/airqo_etl_utils/bigquery_api.py

echo -e "\n=== Checking config.py ==="
rg -A 5 "airqo_mobile_measurements" src/workflows/airqo_etl_utils/config.py
```

Length of output: 1496

Script:

```bash
#!/bin/bash
# Let's check for any data transformation or processing logic that might need updating
echo "=== Checking for data processing logic in airqo_utils.py ==="
ast-grep --pattern 'def $_(self, $$$) {
  $$$
  airqo_mobile_measurements
  $$$
}' src/workflows/airqo_etl_utils/airqo_utils.py

# Also check for any direct column references
echo -e "\n=== Checking for column mappings or transformations ==="
rg -B2 -A2 "columns\s*=|schema\s*=" src/workflows/airqo_etl_utils/airqo_utils.py
```

Length of output: 3135
Description

Clean up datetime conversion and general code cleanup.
Summary by CodeRabbit

- New Features
- Bug Fixes
- Documentation