Update/integration iqair devices #3995

Conversation

@NicholasTurner23 (Contributor) commented Dec 4, 2024

Description

Clean up datetime conversion and perform general code cleanup.

Summary by CodeRabbit

  • New Features

    • Enhanced data handling for timestamps and device data extraction.
    • Added a new "network" field to multiple JSON schemas, allowing for better categorization of measurements and devices.
  • Bug Fixes

    • Improved logic for cleaning low-cost sensor data and formatting data types.
  • Documentation

    • Updated comments and docstrings for clarity in data processing methods.

@coderabbitai bot (Contributor) commented Dec 4, 2024

📝 Walkthrough

The pull request introduces several modifications to the AirQoDataUtils and DataValidationUtils classes in the airqo_utils.py and data_validator.py files, respectively. Key updates include refined handling of the timestamp column in data processing methods and enhanced logic for device data extraction based on network specifications. Additionally, new fields related to the network have been added to various JSON schema files, ensuring that the data structures accommodate network-related information while maintaining their existing definitions.

Changes

| File Path | Change Summary |
|---|---|
| src/workflows/airqo_etl_utils/airqo_utils.py | Modified clean_low_cost_sensor_data and extract_devices_deployment_logs methods for improved timestamp handling and device filtering by network. |
| src/workflows/airqo_etl_utils/data_validator.py | Updated format_data_types and remove_outliers methods for better timestamp formatting and validation. |
| src/workflows/airqo_etl_utils/schema/airqo_mobile_measurements.json | Added new field network of type STRING, mode NULLABLE. |
| src/workflows/airqo_etl_utils/schema/bam_measurements.json | Added new field network of type STRING, mode NULLABLE. |
| src/workflows/airqo_etl_utils/schema/data_warehouse.json | Added new field network of type STRING, mode NULLABLE. |
| src/workflows/airqo_etl_utils/schema/devices.json | Added new field network of type STRING, mode REQUIRED. |
| src/workflows/airqo_etl_utils/schema/latest_measurements.json | Added new field network of type STRING, mode NULLABLE. |
| src/workflows/airqo_etl_utils/schema/mobile_measurements.json | Added new field network of type STRING, mode NULLABLE. |

Possibly related PRs

Suggested reviewers

  • Baalmart
  • BenjaminSsempala

🎉 In the realm of data, changes unfold,
With timestamps refined and stories retold.
Networks now woven in schemas so bright,
Enhancing the flow, bringing data to light.
From devices to logs, all neatly aligned,
A tapestry of insights, beautifully designed! 🌟


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (5)
src/workflows/airqo_etl_utils/schema/bam_measurements.json (1)

7-11: Consider adding a description field for the network attribute.

The addition of the network field is consistent with the schema updates across other measurement files. However, unlike other fields (e.g., pm2_5, no2), this field lacks a description of its purpose and expected values.

Consider adding a description field to improve schema documentation:

  {
    "name": "network",
    "type": "STRING",
    "mode": "NULLABLE",
+   "description": "Identifies the network associated with the BAM measurement (e.g., 'airqo', 'iqair')."
  },
src/workflows/airqo_etl_utils/schema/mobile_measurements.json (1)

7-11: Schema changes are consistent but consider documenting the network values.

The addition of the network field maintains consistency with other measurement schemas. However, to ensure proper data validation and usage:

  1. Consider documenting the allowed/expected network values
  2. Consider adding data validation rules if specific network values are expected

Consider creating a centralized schema documentation that defines:

  • Valid network values across all schemas
  • Relationships between networks in different schemas
  • Data validation rules for network values
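
To make the suggestion concrete, here is a minimal sketch of such a validation rule; ALLOWED_NETWORKS and validate_network_values are hypothetical names, not part of this PR:

```python
import pandas as pd

# Hypothetical allow-list; the real values would come from the proposed
# centralized schema documentation (e.g. 'airqo', 'iqair').
ALLOWED_NETWORKS = {"airqo", "iqair"}

def validate_network_values(data: pd.DataFrame) -> pd.DataFrame:
    """Raise if a non-null 'network' value falls outside the allow-list."""
    invalid = data["network"].notna() & ~data["network"].isin(ALLOWED_NETWORKS)
    if invalid.any():
        raise ValueError(
            f"Unexpected network values: {sorted(data.loc[invalid, 'network'].unique())}"
        )
    return data
```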
src/workflows/airqo_etl_utils/data_validator.py (1)

Line range hint 761-780: Network-specific data processing looks good, but consider extracting the PM calculations.

The changes correctly handle:

  1. Timestamp conversion when outliers are not removed
  2. Network-specific mean calculations for PM2.5 and PM10

However, the PM calculations could be extracted into a separate method for better maintainability.

Consider extracting the PM calculations into a dedicated method:

-            is_airqo_network = data["network"] == "airqo"
-
-            pm2_5_mean = data.loc[is_airqo_network, ["s1_pm2_5", "s2_pm2_5"]].mean(
-                axis=1
-            )
-            pm10_mean = data.loc[is_airqo_network, ["s1_pm10", "s2_pm10"]].mean(axis=1)
-
-            data.loc[is_airqo_network, "pm2_5_raw_value"] = pm2_5_mean
-            data.loc[is_airqo_network, "pm2_5"] = pm2_5_mean
-            data.loc[is_airqo_network, "pm10_raw_value"] = pm10_mean
-            data.loc[is_airqo_network, "pm10"] = pm10_mean
+            data = self._calculate_network_pm_means(data)
+
+    @staticmethod
+    def _calculate_network_pm_means(data: pd.DataFrame) -> pd.DataFrame:
+        """Calculate PM2.5 and PM10 means for AirQo network data."""
+        is_airqo_network = data["network"] == "airqo"
+        
+        pm2_5_mean = data.loc[is_airqo_network, ["s1_pm2_5", "s2_pm2_5"]].mean(axis=1)
+        pm10_mean = data.loc[is_airqo_network, ["s1_pm10", "s2_pm10"]].mean(axis=1)
+        
+        data.loc[is_airqo_network, "pm2_5_raw_value"] = pm2_5_mean
+        data.loc[is_airqo_network, "pm2_5"] = pm2_5_mean
+        data.loc[is_airqo_network, "pm10_raw_value"] = pm10_mean
+        data.loc[is_airqo_network, "pm10"] = pm10_mean
+        
+        return data
src/workflows/airqo_etl_utils/schema/latest_measurements.json (1)

7-11: Schema change is consistent with related tables.

The network field addition maintains consistency with other measurement schemas, which is crucial for data warehouse design.

Consider documenting the relationship between tenant and network fields in the schema, as they appear to be related hierarchical identifiers.

src/workflows/airqo_etl_utils/schema/data_warehouse.json (1)

7-11: Consider adding field description for network context.

While the field addition is correct, adding a description would help clarify its purpose and expected values, similar to how other fields in this schema have descriptions.

Consider adding a description like this:

  {
    "name": "network",
    "type": "STRING",
-    "mode": "NULLABLE"
+    "mode": "NULLABLE",
+    "description": "Identifies the network to which the device or measurement belongs"
  }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 50c8a62 and 9b2d45b.

📒 Files selected for processing (8)
  • src/workflows/airqo_etl_utils/airqo_utils.py (2 hunks)
  • src/workflows/airqo_etl_utils/data_validator.py (1 hunks)
  • src/workflows/airqo_etl_utils/schema/airqo_mobile_measurements.json (1 hunks)
  • src/workflows/airqo_etl_utils/schema/bam_measurements.json (2 hunks)
  • src/workflows/airqo_etl_utils/schema/data_warehouse.json (1 hunks)
  • src/workflows/airqo_etl_utils/schema/devices.json (2 hunks)
  • src/workflows/airqo_etl_utils/schema/latest_measurements.json (1 hunks)
  • src/workflows/airqo_etl_utils/schema/mobile_measurements.json (2 hunks)
🔇 Additional comments (3)
src/workflows/airqo_etl_utils/data_validator.py (1)

75-78: Improved timestamp handling looks good!

The changes enhance timezone handling by:

  1. Including the '+' character in the regex pattern to support timezone offset notation
  2. Ensuring UTC conversion with utc=True
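
A minimal sketch of the two behaviors listed above, using made-up sample timestamps (the actual regex and conversion live in format_data_types):

```python
import pandas as pd

# With utc=True, both 'Z'-suffixed and '+HH:MM' offset notations are
# normalized onto a single tz-aware UTC datetime64 column; without it,
# mixed offsets would yield object dtype or raise in newer pandas.
timestamps = pd.Series(["2024-12-04T10:00:00Z", "2024-12-04T13:00:00+03:00"])
converted = pd.to_datetime(timestamps, utc=True)
print(converted.dt.tz)  # UTC; both rows represent 10:00:00+00:00
```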

Let's verify the timestamp formats in the codebase:

✅ Verification successful

Timestamp handling changes are correctly implemented and safe

The verification confirms that:

  1. The regex pattern change to include '+' is compatible with all timestamp formats in the codebase, which primarily use 'Z' suffix (UTC) format
  2. The utc=True parameter in pd.to_datetime() is consistently used across the codebase for timezone handling
  3. The changes maintain backward compatibility while improving timezone offset support
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for timestamp patterns to ensure compatibility
rg -g '*.{py,json}' '[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}[+-][0-9]{4}'

Length of output: 89


Script:

#!/bin/bash
# Let's search for different timestamp patterns in the codebase
# Search for ISO format timestamps with Z suffix
rg -g '*.{py,json}' '[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\.?[0-9]*Z'

# Search for timestamp string handling in Python files
rg -g '*.py' 'to_datetime.*utc'

# Look for datetime format patterns
rg -g '*.py' 'datetime.*format|strftime|strptime'

# Check the data_validator.py file for context around the timestamp handling
cat src/workflows/airqo_etl_utils/data_validator.py

Length of output: 35916

src/workflows/airqo_etl_utils/airqo_utils.py (1)

1043-1043: Network-specific device retrieval looks good!

The change ensures consistency by explicitly filtering for AirQo network devices.
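
For illustration, a small sketch of the filtering pattern (the device records here are made up; the actual retrieval goes through the devices API with the network passed explicitly):

```python
# Illustrative device records shaped like the updated devices.json schema,
# where each device carries a "network" field.
devices = [
    {"device_id": "aq_001", "network": "airqo"},
    {"device_id": "iq_001", "network": "iqair"},
]

# Keep only AirQo network devices, as the deployment-log extraction now does.
airqo_devices = [d for d in devices if d.get("network") == "airqo"]
print(airqo_devices)  # [{'device_id': 'aq_001', 'network': 'airqo'}]
```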

src/workflows/airqo_etl_utils/schema/data_warehouse.json (1)

2-337: Verify data consistency across related tables.

The schema changes look good overall, but we should ensure data consistency across related tables.

Let's check for related schema files that might need similar updates:

✅ Verification successful

The previous search revealed several measurement-related schema files. Let's examine the content of the most relevant ones to verify schema consistency and check for any ETL code that might be using these schemas. This will help us determine if the schema changes need to be propagated to other files.


After examining the schema files, I notice some differences in field presence and structure between data_warehouse.json and the related measurement schemas. We should check how these schemas are used in the ETL code and any data transformation logic to ensure the changes are properly handled.


Schema changes are consistent with related tables

After analyzing the schema files and ETL code, I can confirm that the changes in data_warehouse.json are consistent with the schema structure used across related measurement tables. The BigQuery API validates schema consistency during data loading, and the schema changes align with the existing field types and modes in latest_measurements.json and other measurement tables.

Key points verified:

  • Schema validation is enforced through the validate_data() method in bigquery_api.py
  • Field types (STRING, FLOAT, TIMESTAMP) are consistent across measurement tables
  • All fields are consistently marked as NULLABLE
  • Common measurement fields (pm2_5, pm10, etc.) maintain consistent data types and descriptions
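
As a rough sketch of the kind of schema-driven check described above (the helper names are hypothetical stand-ins, not the actual validate_data() implementation in bigquery_api.py):

```python
import json
import pandas as pd

def load_schema_modes(path: str) -> dict:
    """Map each field in a BigQuery JSON schema file to its declared mode."""
    with open(path) as f:
        return {field["name"]: field["mode"] for field in json.load(f)}

def check_against_schema(data: pd.DataFrame, schema_path: str) -> None:
    """Hypothetical check: REQUIRED fields must exist and be non-null;
    columns absent from the schema are reported."""
    modes = load_schema_modes(schema_path)
    for name, mode in modes.items():
        if mode == "REQUIRED" and (name not in data.columns or data[name].isna().any()):
            raise ValueError(f"Required field '{name}' is missing or contains nulls")
    extra = set(data.columns) - set(modes)
    if extra:
        print(f"Columns not present in schema: {sorted(extra)}")
```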
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for other measurement-related schema files
fd -e json . | grep -i "measurements\|devices"

# Search for any hardcoded references to these table schemas
rg -l "data_warehouse.json\|latest_measurements.json\|airqo_mobile_measurements.json"

Length of output: 736


Script:

#!/bin/bash
# Let's check the content of these schema files to compare their structure
for file in src/workflows/airqo_etl_utils/schema/{latest_measurements,measurements,airqo_mobile_measurements}.json; do
    echo "=== $file ==="
    cat "$file"
    echo -e "\n\n"
done

# Also check for any ETL code that might be using these schemas
rg -l "schema.*json" src/workflows/airqo_etl_utils/

Length of output: 12508


Script:

#!/bin/bash
# Let's check how these schemas are being used in the ETL code
echo "=== Checking bigquery_api.py for schema usage ==="
cat src/workflows/airqo_etl_utils/bigquery_api.py

# Let's also check if there are any data transformation functions that might be affected
rg -l "transform.*data|process.*measurements" src/workflows/airqo_etl_utils/

Length of output: 35405

Comment on lines +7 to +11
  {
    "name": "network",
    "type": "STRING",
    "mode": "REQUIRED"
  },

💡 Codebase verification

Migration strategy needed for REQUIRED network field

Based on the codebase analysis, while there is code handling network-related operations, there's no explicit data migration strategy for the new REQUIRED network field. Critical observations:

  1. The API code (airqo_api.py) shows network is used in device retrieval operations
  2. The utils code (airqo_utils.py) shows network field is accessed with a fallback: device.get("network", None)
  3. No SQL migration scripts were found to update existing records
  4. No default network assignment logic exists for historical data

Action items needed:

  • Create a data migration script to populate the network field for existing device records
  • Update the device retrieval logic to handle the REQUIRED constraint
  • Remove the get("network", None) fallback, as it won't be needed once the field is REQUIRED
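
As a hedged sketch of the first action item (the table path and the 'airqo' default are assumptions, not taken from this PR):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Assumption: historical device records without a network belong to "airqo".
# Replace the table path with the real project.dataset.table before running.
backfill = """
UPDATE `my_project.my_dataset.devices`
SET network = 'airqo'
WHERE network IS NULL
"""
client.query(backfill).result()  # blocks until the DML job completes
```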
🔗 Analysis chain

Verify data migration strategy for REQUIRED network field.

While making the network field REQUIRED in the devices schema is logically correct (as every device should belong to a network), this change requires careful consideration:

  1. Existing devices in the database will need to have this field populated
  2. Default values may need to be established for historical data

Let's verify the potential impact on existing device records:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for existing device records that might be affected by the new REQUIRED field

# Search for any BigQuery schema migration scripts
fd -t f -e sql . | xargs rg -l "ALTER.*devices.*network"

# Search for any data migration utilities
rg -l "devices.*network" --type py

Length of output: 219


Script:

#!/bin/bash
# Let's examine the content of these files to check for any data migration logic

# Check airqo_utils.py for device-related functions
rg -A 5 "devices.*network" src/workflows/airqo_etl_utils/airqo_utils.py

# Check airqo_api.py for network field handling
rg -A 5 "devices.*network" src/workflows/airqo_etl_utils/airqo_api.py

# Look for any default network values or migration logic
ast-grep --pattern 'network = $_' 

# Search for any BigQuery update statements
rg -i "update.*devices.*set.*network"

Length of output: 2317

Comment on lines +7 to +11
"name": "network",
"type": "STRING",
"mode": "NULLABLE"
},

💡 Codebase verification

Schema update is safe, but ETL code needs updating

The ETL pipeline uses dynamic column population through populate_missing_columns, which safely handles new fields. However, there are a few places that need attention:

  • Update data_columns list in airqo_utils.py to include the new 'network' field for comprehensive data processing
  • Review column mappings in AirQoDataUtils.Device_Field_Mapping to ensure proper handling of the network field
  • Consider adding network field population logic in data transformation sections where tenant is being set
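
A minimal sketch of the dynamic-column pattern referred to above (the helper mirrors the populate_missing_columns name, but this body is an assumption, not the repository's implementation):

```python
import numpy as np
import pandas as pd

def populate_missing_columns(data: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Ensure every expected column exists, filling absent ones with NaN
    so schema-aligned BigQuery loads do not fail on missing fields."""
    for col in cols:
        if col not in data.columns:
            data[col] = np.nan
    return data

# With 'network' added to the expected columns, rows from sources that do
# not report it still load with a NULL network value.
df = pd.DataFrame({"device_id": ["aq_001"], "pm2_5": [12.3]})
df = populate_missing_columns(df, ["device_id", "pm2_5", "network"])
```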
🔗 Analysis chain

Network field addition looks good, but consider data migration impact.

The addition of the network field is well-structured and consistent with other schema files. The placement after the tenant field is logical from a data organization perspective.

Let's verify if there are any existing ETL jobs that might need updating:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for ETL jobs that might need updating for the new network field
rg -l "airqo_mobile_measurements" | grep -i "etl\|pipeline"

Length of output: 190


Script:

#!/bin/bash
# Check the content of ETL files for mobile measurements processing
echo "=== Checking airqo_utils.py ==="
rg -A 5 "airqo_mobile_measurements" src/workflows/airqo_etl_utils/airqo_utils.py

echo -e "\n=== Checking bigquery_api.py ==="
rg -A 5 "airqo_mobile_measurements" src/workflows/airqo_etl_utils/bigquery_api.py

echo -e "\n=== Checking config.py ==="
rg -A 5 "airqo_mobile_measurements" src/workflows/airqo_etl_utils/config.py

Length of output: 1496


Script:

#!/bin/bash
# Let's check for any data transformation or processing logic that might need updating
echo "=== Checking for data processing logic in airqo_utils.py ==="
ast-grep --pattern 'def $_(self, $$$) {
  $$$
  airqo_mobile_measurements
  $$$
}' src/workflows/airqo_etl_utils/airqo_utils.py

# Also check for any direct column references
echo -e "\n=== Checking for column mappings or transformations ==="
rg -B2 -A2 "columns\s*=|schema\s*=" src/workflows/airqo_etl_utils/airqo_utils.py

Length of output: 3135

@Baalmart merged commit 369cafd into airqo-platform:staging on Dec 4, 2024 (46 checks passed).
@Baalmart mentioned this pull request on Dec 4, 2024.