Update/integration iqair devices #3995

Conversation

@NicholasTurner23 (Contributor) commented Dec 4, 2024

Description

Clean up datetime conversion and perform general code cleanup.

Summary by CodeRabbit

  • New Features

    • Enhanced data handling for timestamps and device data extraction.
    • Added a new "network" field to multiple JSON schemas, allowing for better categorization of measurements and devices.
  • Bug Fixes

    • Improved logic for cleaning low-cost sensor data and formatting data types.
  • Documentation

    • Updated comments and docstrings for clarity in data processing methods.

@coderabbitai bot (Contributor) commented Dec 4, 2024

📝 Walkthrough

The pull request introduces several modifications to the AirQoDataUtils and DataValidationUtils classes in the airqo_utils.py and data_validator.py files, respectively. Key updates include refined handling of the timestamp column in data processing methods and enhanced logic for device data extraction based on network specifications. Additionally, new fields related to the network have been added to various JSON schema files, ensuring that the data structures accommodate network-related information while maintaining their existing definitions.

Changes

| File Path | Change Summary |
|---|---|
| src/workflows/airqo_etl_utils/airqo_utils.py | Modified clean_low_cost_sensor_data and extract_devices_deployment_logs methods for improved timestamp handling and device filtering by network. |
| src/workflows/airqo_etl_utils/data_validator.py | Updated format_data_types and remove_outliers methods for better timestamp formatting and validation. |
| src/workflows/airqo_etl_utils/schema/airqo_mobile_measurements.json | Added new field network of type STRING, mode NULLABLE. |
| src/workflows/airqo_etl_utils/schema/bam_measurements.json | Added new field network of type STRING, mode NULLABLE. |
| src/workflows/airqo_etl_utils/schema/data_warehouse.json | Added new field network of type STRING, mode NULLABLE. |
| src/workflows/airqo_etl_utils/schema/devices.json | Added new field network of type STRING, mode REQUIRED. |
| src/workflows/airqo_etl_utils/schema/latest_measurements.json | Added new field network of type STRING, mode NULLABLE. |
| src/workflows/airqo_etl_utils/schema/mobile_measurements.json | Added new field network of type STRING, mode NULLABLE. |

Possibly related PRs

Suggested reviewers

  • Baalmart
  • BenjaminSsempala

🎉 In the realm of data, changes unfold,
With timestamps refined and stories retold.
Networks now woven in schemas so bright,
Enhancing the flow, bringing data to light.
From devices to logs, all neatly aligned,
A tapestry of insights, beautifully designed! 🌟


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (5)
src/workflows/airqo_etl_utils/schema/bam_measurements.json (1)

7-11: Consider adding a description field for the network attribute.

The addition of the network field is consistent with the schema updates across other measurement files. However, unlike other fields (e.g., pm2_5, no2), this field lacks a description of its purpose and expected values.

Consider adding a description field to improve schema documentation:

  {
    "name": "network",
    "type": "STRING",
    "mode": "NULLABLE",
+   "description": "Identifies the network associated with the BAM measurement (e.g., 'airqo', 'iqair')."
  },
src/workflows/airqo_etl_utils/schema/mobile_measurements.json (1)

7-11: Schema changes are consistent but consider documenting the network values.

The addition of the network field maintains consistency with other measurement schemas. However, to ensure proper data validation and usage:

  1. Consider documenting the allowed/expected network values
  2. Consider adding data validation rules if specific network values are expected

Consider creating a centralized schema documentation that defines:

  • Valid network values across all schemas
  • Relationships between networks in different schemas
  • Data validation rules for network values
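
To make the suggestion concrete, here is a minimal sketch of such a validation rule; ALLOWED_NETWORKS and validate_network_values are hypothetical names, not part of this PR:

```python
import pandas as pd

# Hypothetical allow-list; the real values would come from the proposed
# centralized schema documentation (e.g. 'airqo', 'iqair').
ALLOWED_NETWORKS = {"airqo", "iqair"}

def validate_network_values(data: pd.DataFrame) -> pd.DataFrame:
    """Raise if a non-null 'network' value falls outside the allow-list."""
    invalid = data["network"].notna() & ~data["network"].isin(ALLOWED_NETWORKS)
    if invalid.any():
        raise ValueError(
            f"Unexpected network values: {sorted(data.loc[invalid, 'network'].unique())}"
        )
    return data
```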
src/workflows/airqo_etl_utils/data_validator.py (1)

Line range hint 761-780: Network-specific data processing looks good, but consider extracting the PM calculations.

The changes correctly handle:

  1. Timestamp conversion when outliers are not removed
  2. Network-specific mean calculations for PM2.5 and PM10

However, the PM calculations could be extracted into a separate method for better maintainability.

Consider extracting the PM calculations into a dedicated method:

-            is_airqo_network = data["network"] == "airqo"
-
-            pm2_5_mean = data.loc[is_airqo_network, ["s1_pm2_5", "s2_pm2_5"]].mean(
-                axis=1
-            )
-            pm10_mean = data.loc[is_airqo_network, ["s1_pm10", "s2_pm10"]].mean(axis=1)
-
-            data.loc[is_airqo_network, "pm2_5_raw_value"] = pm2_5_mean
-            data.loc[is_airqo_network, "pm2_5"] = pm2_5_mean
-            data.loc[is_airqo_network, "pm10_raw_value"] = pm10_mean
-            data.loc[is_airqo_network, "pm10"] = pm10_mean
+            data = self._calculate_network_pm_means(data)
+
+    @staticmethod
+    def _calculate_network_pm_means(data: pd.DataFrame) -> pd.DataFrame:
+        """Calculate PM2.5 and PM10 means for AirQo network data."""
+        is_airqo_network = data["network"] == "airqo"
+        
+        pm2_5_mean = data.loc[is_airqo_network, ["s1_pm2_5", "s2_pm2_5"]].mean(axis=1)
+        pm10_mean = data.loc[is_airqo_network, ["s1_pm10", "s2_pm10"]].mean(axis=1)
+        
+        data.loc[is_airqo_network, "pm2_5_raw_value"] = pm2_5_mean
+        data.loc[is_airqo_network, "pm2_5"] = pm2_5_mean
+        data.loc[is_airqo_network, "pm10_raw_value"] = pm10_mean
+        data.loc[is_airqo_network, "pm10"] = pm10_mean
+        
+        return data
src/workflows/airqo_etl_utils/schema/latest_measurements.json (1)

7-11: Schema change is consistent with related tables.

The network field addition maintains consistency with other measurement schemas, which is crucial for data warehouse design.

Consider documenting the relationship between tenant and network fields in the schema, as they appear to be related hierarchical identifiers.

src/workflows/airqo_etl_utils/schema/data_warehouse.json (1)

7-11: Consider adding field description for network context.

While the field addition is correct, adding a description would help clarify its purpose and expected values, similar to how other fields in this schema have descriptions.

Consider adding a description like this:

  {
    "name": "network",
    "type": "STRING",
-    "mode": "NULLABLE"
+    "mode": "NULLABLE",
+    "description": "Identifies the network to which the device or measurement belongs"
  }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 50c8a62 and 9b2d45b.

📒 Files selected for processing (8)
  • src/workflows/airqo_etl_utils/airqo_utils.py (2 hunks)
  • src/workflows/airqo_etl_utils/data_validator.py (1 hunks)
  • src/workflows/airqo_etl_utils/schema/airqo_mobile_measurements.json (1 hunks)
  • src/workflows/airqo_etl_utils/schema/bam_measurements.json (2 hunks)
  • src/workflows/airqo_etl_utils/schema/data_warehouse.json (1 hunks)
  • src/workflows/airqo_etl_utils/schema/devices.json (2 hunks)
  • src/workflows/airqo_etl_utils/schema/latest_measurements.json (1 hunks)
  • src/workflows/airqo_etl_utils/schema/mobile_measurements.json (2 hunks)
🔇 Additional comments (3)
src/workflows/airqo_etl_utils/data_validator.py (1)

75-78: Improved timestamp handling looks good!

The changes enhance timezone handling by:

  1. Including the '+' character in the regex pattern to support timezone offset notation
  2. Ensuring UTC conversion with utc=True
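
A minimal sketch of the two behaviors listed above, using made-up sample timestamps (the actual regex and conversion live in format_data_types):

```python
import pandas as pd

# With utc=True, both 'Z'-suffixed and '+HH:MM' offset notations are
# normalized onto a single tz-aware UTC datetime64 column; without it,
# mixed offsets would yield object dtype or raise in newer pandas.
timestamps = pd.Series(["2024-12-04T10:00:00Z", "2024-12-04T13:00:00+03:00"])
converted = pd.to_datetime(timestamps, utc=True)
print(converted.dt.tz)  # UTC; both rows represent 10:00:00+00:00
```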

Let's verify the timestamp formats in the codebase:

✅ Verification successful

Timestamp handling changes are correctly implemented and safe

The verification confirms that:

  1. The regex pattern change to include '+' is compatible with all timestamp formats in the codebase, which primarily use 'Z' suffix (UTC) format
  2. The utc=True parameter in pd.to_datetime() is consistently used across the codebase for timezone handling
  3. The changes maintain backward compatibility while improving timezone offset support
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for timestamp patterns to ensure compatibility
rg -g '*.{py,json}' '[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}[+-][0-9]{4}'

Length of output: 89


Script:

#!/bin/bash
# Let's search for different timestamp patterns in the codebase
# Search for ISO format timestamps with Z suffix
rg -g '*.{py,json}' '[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\.?[0-9]*Z'

# Search for timestamp string handling in Python files
rg -g '*.py' 'to_datetime.*utc'

# Look for datetime format patterns
rg -g '*.py' 'datetime.*format|strftime|strptime'

# Check the data_validator.py file for context around the timestamp handling
cat src/workflows/airqo_etl_utils/data_validator.py

Length of output: 35916

src/workflows/airqo_etl_utils/airqo_utils.py (1)

1043-1043: Network-specific device retrieval looks good!

The change ensures consistency by explicitly filtering for AirQo network devices.
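
For illustration, a small sketch of the filtering pattern (the device records here are made up; the actual retrieval goes through the devices API with the network passed explicitly):

```python
# Illustrative device records shaped like the updated devices.json schema,
# where each device carries a "network" field.
devices = [
    {"device_id": "aq_001", "network": "airqo"},
    {"device_id": "iq_001", "network": "iqair"},
]

# Keep only AirQo network devices, as the deployment-log extraction now does.
airqo_devices = [d for d in devices if d.get("network") == "airqo"]
print(airqo_devices)  # [{'device_id': 'aq_001', 'network': 'airqo'}]
```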

src/workflows/airqo_etl_utils/schema/data_warehouse.json (1)

2-337: Verify data consistency across related tables.

The schema changes look good overall, but we should ensure data consistency across related tables.

Let's check for related schema files that might need similar updates:

✅ Verification successful

The previous search revealed several measurement-related schema files. Let's examine the content of the most relevant ones to verify schema consistency and check for any ETL code that might be using these schemas. This will help us determine if the schema changes need to be propagated to other files.


After examining the schema files, I notice some differences in field presence and structure between data_warehouse.json and the related measurement schemas. We should check how these schemas are used in the ETL code and any data transformation logic to ensure the changes are properly handled.


Schema changes are consistent with related tables

After analyzing the schema files and ETL code, I can confirm that the changes in data_warehouse.json are consistent with the schema structure used across related measurement tables. The BigQuery API validates schema consistency during data loading, and the schema changes align with the existing field types and modes in latest_measurements.json and other measurement tables.

Key points verified:

  • Schema validation is enforced through the validate_data() method in bigquery_api.py
  • Field types (STRING, FLOAT, TIMESTAMP) are consistent across measurement tables
  • All fields are consistently marked as NULLABLE
  • Common measurement fields (pm2_5, pm10, etc.) maintain consistent data types and descriptions
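
As a rough sketch of the kind of schema-driven check described above (the helper names are hypothetical stand-ins, not the actual validate_data() implementation in bigquery_api.py):

```python
import json
import pandas as pd

def load_schema_modes(path: str) -> dict:
    """Map each field in a BigQuery JSON schema file to its declared mode."""
    with open(path) as f:
        return {field["name"]: field["mode"] for field in json.load(f)}

def check_against_schema(data: pd.DataFrame, schema_path: str) -> None:
    """Hypothetical check: REQUIRED fields must exist and be non-null;
    columns absent from the schema are reported."""
    modes = load_schema_modes(schema_path)
    for name, mode in modes.items():
        if mode == "REQUIRED" and (name not in data.columns or data[name].isna().any()):
            raise ValueError(f"Required field '{name}' is missing or contains nulls")
    extra = set(data.columns) - set(modes)
    if extra:
        print(f"Columns not present in schema: {sorted(extra)}")
```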
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for other measurement-related schema files
fd -e json . | grep -i "measurements\|devices"

# Search for any hardcoded references to these table schemas
rg -l "data_warehouse.json\|latest_measurements.json\|airqo_mobile_measurements.json"

Length of output: 736


Script:

#!/bin/bash
# Let's check the content of these schema files to compare their structure
for file in src/workflows/airqo_etl_utils/schema/{latest_measurements,measurements,airqo_mobile_measurements}.json; do
    echo "=== $file ==="
    cat "$file"
    echo -e "\n\n"
done

# Also check for any ETL code that might be using these schemas
rg -l "schema.*json" src/workflows/airqo_etl_utils/

Length of output: 12508


Script:

#!/bin/bash
# Let's check how these schemas are being used in the ETL code
echo "=== Checking bigquery_api.py for schema usage ==="
cat src/workflows/airqo_etl_utils/bigquery_api.py

# Let's also check if there are any data transformation functions that might be affected
rg -l "transform.*data|process.*measurements" src/workflows/airqo_etl_utils/

Length of output: 35405

Comment on lines +7 to +11
  {
    "name": "network",
    "type": "STRING",
    "mode": "REQUIRED"
  },

💡 Codebase verification

Migration strategy needed for REQUIRED network field

Based on the codebase analysis, while there is code handling network-related operations, there's no explicit data migration strategy for the new REQUIRED network field. Critical observations:

  1. The API code (airqo_api.py) shows network is used in device retrieval operations
  2. The utils code (airqo_utils.py) shows network field is accessed with a fallback: device.get("network", None)
  3. No SQL migration scripts were found to update existing records
  4. No default network assignment logic exists for historical data

Action items needed:

  • Create a data migration script to populate the network field for existing device records
  • Update the device retrieval logic to handle the REQUIRED constraint
  • Remove the get("network", None) fallback, as it won't be needed once the field is REQUIRED
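
As a hedged sketch of the first action item (the table path and the 'airqo' default are assumptions, not taken from this PR):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Assumption: historical device records without a network belong to "airqo".
# Replace the table path with the real project.dataset.table before running.
backfill = """
UPDATE `my_project.my_dataset.devices`
SET network = 'airqo'
WHERE network IS NULL
"""
client.query(backfill).result()  # blocks until the DML job completes
```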
🔗 Analysis chain

Verify data migration strategy for REQUIRED network field.

While making the network field REQUIRED in the devices schema is logically correct (as every device should belong to a network), this change requires careful consideration:

  1. Existing devices in the database will need to have this field populated
  2. Default values may need to be established for historical data

Let's verify the potential impact on existing device records:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for existing device records that might be affected by the new REQUIRED field

# Search for any BigQuery schema migration scripts
fd -t f -e sql . | xargs rg -l "ALTER.*devices.*network"

# Search for any data migration utilities
rg -l "devices.*network" --type py

Length of output: 219


Script:

#!/bin/bash
# Let's examine the content of these files to check for any data migration logic

# Check airqo_utils.py for device-related functions
rg -A 5 "devices.*network" src/workflows/airqo_etl_utils/airqo_utils.py

# Check airqo_api.py for network field handling
rg -A 5 "devices.*network" src/workflows/airqo_etl_utils/airqo_api.py

# Look for any default network values or migration logic
ast-grep --pattern 'network = $_' 

# Search for any BigQuery update statements
rg -i "update.*devices.*set.*network"

Length of output: 2317

Comment on lines +7 to +11
"name": "network",
"type": "STRING",
"mode": "NULLABLE"
},

💡 Codebase verification

Schema update is safe, but ETL code needs updating

The ETL pipeline uses dynamic column population through populate_missing_columns, which safely handles new fields. However, there are a few places that need attention:

  • Update data_columns list in airqo_utils.py to include the new 'network' field for comprehensive data processing
  • Review column mappings in AirQoDataUtils.Device_Field_Mapping to ensure proper handling of the network field
  • Consider adding network field population logic in data transformation sections where tenant is being set
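
A minimal sketch of the dynamic-column pattern referred to above (the helper mirrors the populate_missing_columns name, but this body is an assumption, not the repository's implementation):

```python
import numpy as np
import pandas as pd

def populate_missing_columns(data: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Ensure every expected column exists, filling absent ones with NaN
    so schema-aligned BigQuery loads do not fail on missing fields."""
    for col in cols:
        if col not in data.columns:
            data[col] = np.nan
    return data

# With 'network' added to the expected columns, rows from sources that do
# not report it still load with a NULL network value.
df = pd.DataFrame({"device_id": ["aq_001"], "pm2_5": [12.3]})
df = populate_missing_columns(df, ["device_id", "pm2_5", "network"])
```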
🔗 Analysis chain

Network field addition looks good, but consider data migration impact.

The addition of the network field is well-structured and consistent with other schema files. The placement after the tenant field is logical from a data organization perspective.

Let's verify if there are any existing ETL jobs that might need updating:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for ETL jobs that might need updating for the new network field
rg -l "airqo_mobile_measurements" | grep -i "etl\|pipeline"

Length of output: 190


Script:

#!/bin/bash
# Check the content of ETL files for mobile measurements processing
echo "=== Checking airqo_utils.py ==="
rg -A 5 "airqo_mobile_measurements" src/workflows/airqo_etl_utils/airqo_utils.py

echo -e "\n=== Checking bigquery_api.py ==="
rg -A 5 "airqo_mobile_measurements" src/workflows/airqo_etl_utils/bigquery_api.py

echo -e "\n=== Checking config.py ==="
rg -A 5 "airqo_mobile_measurements" src/workflows/airqo_etl_utils/config.py

Length of output: 1496


Script:

#!/bin/bash
# Let's check for any data transformation or processing logic that might need updating
echo "=== Checking for data processing logic in airqo_utils.py ==="
ast-grep --pattern 'def $_(self, $$$) {
  $$$
  airqo_mobile_measurements
  $$$
}' src/workflows/airqo_etl_utils/airqo_utils.py

# Also check for any direct column references
echo -e "\n=== Checking for column mappings or transformations ==="
rg -B2 -A2 "columns\s*=|schema\s*=" src/workflows/airqo_etl_utils/airqo_utils.py

Length of output: 3135

@Baalmart merged commit 369cafd into airqo-platform:staging on Dec 4, 2024 (46 checks passed).
@Baalmart mentioned this pull request on Dec 4, 2024.