
Update fix/clean up #4058

Merged
5 commits merged into airqo-platform:staging on Dec 12, 2024

Conversation

Contributor

@NicholasTurner23 commented Dec 12, 2024

Description


  1. Uses new tables to store and update sites and devices data in BigQuery.
  2. Updates analytics to use the new devices and sites tables.
  3. Updates the data cleaning method to drop empty columns (a minimal sketch of this step follows below).
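
A minimal sketch of the "drop empty columns" step, assuming a pandas DataFrame; the actual simple_data_cleaning implementation in events.py may differ:

import pandas as pd

def drop_empty_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Remove columns in which every value is missing (NaN/None).
    return df.dropna(axis=1, how="all")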

Related Issues

  • JIRA cards:
    • OPS-324

Summary by CodeRabbit

  • New Features

    • Added new configuration variables for BigQuery device and site tables.
    • Introduced new attributes for managing BigQuery tables in the API.
  • Bug Fixes

    • Enhanced error handling in data cleaning methods to raise errors for missing required columns (a minimal sketch follows this summary).
  • Schema Updates

    • Updated JSON schemas for devices and sites to enforce required fields and modified data types.
  • Documentation

    • Improved clarity in configuration and schema mappings to ensure consistency with new variables.
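
As an illustration of the error handling mentioned under Bug Fixes, a check of this kind might look like the following; the column names and the helper name are assumptions, not the project's actual code:

import pandas as pd

REQUIRED_COLUMNS = ["device_id", "timestamp"]  # example columns, assumed

def check_required_columns(df: pd.DataFrame) -> None:
    # Raise a descriptive error when any required column is absent.
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")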

Contributor

coderabbitai bot commented Dec 12, 2024

📝 Walkthrough

The pull request introduces significant updates across several files, primarily focusing on the EventsModel class and related configurations. Key changes include the addition of new class-level constants and instance variables for BigQuery table references, modifications to properties and methods to utilize these new references, and enhancements to data cleaning logic. Additionally, new environment variables are added for configuration management, and JSON schemas for devices and sites are updated to enforce stricter requirements. The overall structure and functionality of the affected classes remain intact.

Changes

  • src/analytics/api/models/events.py: Added constants BIGQUERY_SITES_SITES and BIGQUERY_DEVICES_DEVICES. Updated the constructor and methods to use the new table references. Modified data cleaning logic in simple_data_cleaning.
  • src/analytics/config.py: Added environment variables BIGQUERY_DEVICES_DEVICES and BIGQUERY_SITES_SITES.
  • src/workflows/airqo_etl_utils/bigquery_api.py: Added attributes self.sites_sites_table and self.devices_devices_table to manage the new BigQuery tables (a minimal sketch of this pattern follows the list).
  • src/workflows/airqo_etl_utils/config.py: Added environment variables BIGQUERY_DEVICES_DEVICES_TABLE and BIGQUERY_SITES_SITES_TABLE. Updated SCHEMA_FILE_MAPPING for the new variables.
  • src/workflows/airqo_etl_utils/meta_data_utils.py: Modified extract_sites_from_api to remove the location field and set last_updated to the current UTC time.
  • src/workflows/airqo_etl_utils/schema/devices.json: Changed device_id and last_updated fields from NULLABLE to REQUIRED.
  • src/workflows/airqo_etl_utils/schema/sites.json: Removed tenant and location fields. Changed last_updated type from DATE to TIMESTAMP.
  • src/workflows/dags/meta_data.py: Renamed big_query_api.sites_table to big_query_api.sites_sites_table and big_query_api.devices_table to big_query_api.devices_devices_table.
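
To make the bigquery_api.py attribute pattern concrete, here is a minimal, illustrative sketch; the attribute and variable names come from the change summary, and reading directly from the environment is a stand-in for the project's configuration object:

import os

class BigQueryApi:
    def __init__(self):
        # New table references added alongside the existing ones.
        self.sites_sites_table = os.getenv("BIGQUERY_SITES_SITES_TABLE")
        self.devices_devices_table = os.getenv("BIGQUERY_DEVICES_DEVICES_TABLE")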


Suggested reviewers

  • Baalmart
  • BenjaminSsempala
  • Psalmz777

🎉 In the land of code, where changes flow,
New constants arise, like stars in a show.
With tables updated and schemas refined,
The data now dances, perfectly aligned.
So let’s raise a cheer for this marvelous feat,
In the realm of BigQuery, our work is complete! 🎊



Contributor

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (4)
src/analytics/config.py (2)

35-37: Document new environment variables in deployment guides

The addition of BIGQUERY_DEVICES_DEVICES and BIGQUERY_SITES_SITES configurations supports the schema changes, but requires documentation updates:

  1. Update deployment documentation to include these new variables
  2. Consider adding comments explaining the purpose of these tables

Consider creating a configuration documentation file that explains:

  • Purpose of each table
  • Relationships between tables
  • Required environment variables

35-38: Consider grouping related BigQuery configurations

The BigQuery configurations are growing in number. Consider organizing related configurations together for better maintainability.

class Config:
    # ... other configs ...
    
    # Device-related BigQuery tables
-   BIGQUERY_DEVICES = env_var("BIGQUERY_DEVICES")
-   BIGQUERY_DEVICES_DEVICES = env_var("BIGQUERY_DEVICES_DEVICES")
+   BIGQUERY_TABLES_DEVICE = {
+       "devices": env_var("BIGQUERY_DEVICES"),
+       "devices_devices": env_var("BIGQUERY_DEVICES_DEVICES"),
+   }
    
    # Site-related BigQuery tables
-   BIGQUERY_SITES = env_var("BIGQUERY_SITES")
-   BIGQUERY_SITES_SITES = env_var("BIGQUERY_SITES_SITES")
-   BIGQUERY_AIRQLOUDS_SITES = env_var("BIGQUERY_AIRQLOUDS_SITES")
+   BIGQUERY_TABLES_SITE = {
+       "sites": env_var("BIGQUERY_SITES"),
+       "sites_sites": env_var("BIGQUERY_SITES_SITES"),
+       "airqlouds_sites": env_var("BIGQUERY_AIRQLOUDS_SITES"),
+   }
src/workflows/airqo_etl_utils/meta_data_utils.py (1)

149-149: LGTM! Consider moving timestamp logic to a utility function

The addition of the last_updated timestamp using UTC is correct and well-placed. However, since this pattern might be used in other places (like in the extract_devices_from_api method), consider extracting it to a utility function.

Consider creating a utility function:

@staticmethod
def add_last_updated(df: pd.DataFrame) -> pd.DataFrame:
    # Requires `from datetime import datetime, timezone` and pandas imported as pd.
    df["last_updated"] = datetime.now(timezone.utc)
    return df
src/workflows/airqo_etl_utils/bigquery_api.py (1)

Line range hint 545-545: Remove debug file write operation.

The line dataframe.to_csv("raw_data50.csv") appears to be debug code that was accidentally left in. Writing to files in production can cause disk space and performance issues.

-        dataframe.to_csv("raw_data50.csv")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b2cde0 and c50c928.

📒 Files selected for processing (8)
  • src/analytics/api/models/events.py (9 hunks)
  • src/analytics/config.py (1 hunks)
  • src/workflows/airqo_etl_utils/bigquery_api.py (2 hunks)
  • src/workflows/airqo_etl_utils/config.py (2 hunks)
  • src/workflows/airqo_etl_utils/meta_data_utils.py (1 hunks)
  • src/workflows/airqo_etl_utils/schema/devices.json (2 hunks)
  • src/workflows/airqo_etl_utils/schema/sites.json (1 hunks)
  • src/workflows/dags/meta_data.py (2 hunks)
🔇 Additional comments (9)
src/workflows/airqo_etl_utils/schema/devices.json (1)

25-25: Schema changes require careful migration planning

The changes to make device_id and last_updated required fields improve data integrity by ensuring essential tracking information is always present. However, this introduces stricter requirements that need careful consideration:

  1. Existing records might need updating to comply with the new schema
  2. Data ingestion processes must be updated to always provide these fields

Also applies to: 55-55

✅ Verification successful

Let me search for more specific patterns to understand the data ingestion and validation mechanisms in place. The previous search might have been too specific, so I'm broadening the scope while maintaining focus on device-related operations.


Schema change appears safe to proceed

Based on the codebase analysis, the schema changes to make device_id and last_updated required fields appear safe to implement because:

  1. Existing code already treats device_id as a critical field:

    • Device validation logic is in place across multiple services
    • The BigQuery connector already validates device_id presence (device_id is not None and device_id.strip() != "")
  2. Data ingestion processes handle these fields appropriately:

    • The Kafka-BigQuery connector includes device_id in its transformations
    • ETL utilities consistently include device-related fields in their schemas
    • Validation mechanisms are present in the device-registry service

The change aligns with existing practices and won't disrupt current operations since the fields are already being treated as essential in the codebase.
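
For reference, the presence check quoted above can be expressed as a small standalone helper; the function name here is hypothetical, and in the codebase the check is embedded in the connector logic:

def has_valid_device_id(device_id) -> bool:
    # Mirrors: device_id is not None and device_id.strip() != ""
    return device_id is not None and device_id.strip() != ""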

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for any existing records that might violate the new schema

# Search for relevant data ingestion code
rg -A 3 "INSERT.*devices.*table" 

# Look for potential data validation or transformation logic
ast-grep --pattern 'def $_(self, $$$):
  $$$
  device_id
  $$$'

Length of output: 100


Script:

#!/bin/bash
# Let's try a broader search for device-related ingestion patterns
rg -l "device.*INSERT|INSERT.*device" 

# Search for BigQuery table operations
rg -A 3 "bigquery.*device" 

# Look for data transformation or ETL files
fd -t f "etl|transform|ingest" 

# Check for any validation logic in Python files
rg -A 3 "device_id.*None|device_id.*required" 

# Search for any device-related schema validation
rg -A 3 "schema.*device|validate.*device"

Length of output: 11287

src/workflows/airqo_etl_utils/schema/sites.json (1)

69-69: Verify data preservation strategy for schema changes

The change from DATE to TIMESTAMP for last_updated provides better temporal precision, which aligns well with real-time data tracking needs. However, there are some considerations:

  1. The increased precision requires proper timezone handling
  2. Existing DATE values will need conversion to TIMESTAMP
✅ Verification successful

Timezone handling is properly implemented for TIMESTAMP data

The codebase shows comprehensive timezone handling across multiple components:

  • ETL workflows consistently use UTC timezone (datetime.now(timezone.utc)) for data processing
  • BigQuery operations handle timezone conversion (pd.to_datetime(..., utc=True))
  • The last_updated field is consistently populated with UTC timestamps in data processing flows

The change from DATE to TIMESTAMP is well-supported by the existing timezone-aware infrastructure, ensuring data consistency and proper temporal tracking.
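
A minimal sketch of converting an existing DATE-typed column to a UTC timestamp with pandas, mirroring the pd.to_datetime(..., utc=True) pattern noted above; the frame and column values are illustrative:

import pandas as pd

sites = pd.DataFrame({"last_updated": ["2024-12-01", "2024-12-10"]})
sites["last_updated"] = pd.to_datetime(sites["last_updated"], utc=True)
print(sites.dtypes)  # last_updated: datetime64[ns, UTC]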

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Look for timezone handling in the codebase
rg -A 3 "datetime.*utc|timezone"

# Search for date/time conversion logic
ast-grep --pattern 'def $_(self, $$$):
  $$$
  last_updated
  $$$'

Length of output: 54290

src/workflows/airqo_etl_utils/config.py (1)

60-63: LGTM! Verify environment variables in deployment

The new BigQuery table configurations and their schema mappings are correctly implemented. Ensure these environment variables are set in all deployment environments.

Run the following script to verify the environment variables:

Also applies to: 376-381
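
The verification script itself is collapsed in this excerpt; a minimal, illustrative check of the two new variables (names taken from the config change above) could look like this:

import os

required = ["BIGQUERY_DEVICES_DEVICES_TABLE", "BIGQUERY_SITES_SITES_TABLE"]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {missing}")
print("All required BigQuery table variables are set.")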

src/workflows/airqo_etl_utils/bigquery_api.py (2)

49-49: LGTM! New table references properly integrated.

The addition of sites_sites_table and devices_devices_table is well-structured and consistent with the existing pattern of table reference declarations. These additions align with the PR objective of introducing new tables for improved data management.

Also applies to: 58-58


Line range hint 558-577: Excellent docstring improvements!

The updated docstring is comprehensive and well-structured, providing clear documentation of:

  • Method purpose and behavior
  • Data cleaning steps
  • Parameters and return values
  • Exception handling
src/analytics/api/models/events.py (4)

30-30: LGTM! Constants properly defined.

The new constants BIGQUERY_SITES_SITES and BIGQUERY_DEVICES_DEVICES are well-defined and consistent with the existing naming pattern. They properly reference the configuration values using f-strings.

Also applies to: 32-32
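
An illustrative sketch of what such f-string constants might look like; the backtick quoting and the use of environment variables here are assumptions, and the real definitions in events.py reference the analytics Config object:

import os

BIGQUERY_SITES_SITES = f"`{os.getenv('BIGQUERY_SITES_SITES', '')}`"
BIGQUERY_DEVICES_DEVICES = f"`{os.getenv('BIGQUERY_DEVICES_DEVICES', '')}`"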


56-56: LGTM! Instance variables properly initialized.

The new instance variables are correctly initialized using their corresponding constants, maintaining consistency with the existing initialization pattern.

Also applies to: 59-59


67-68: LGTM! Query properties updated consistently.

The query properties have been properly updated to use the new table references while maintaining the existing query structure and functionality.

Also applies to: 74-74, 79-79


99-100: LGTM! Query methods updated consistently.

The query methods have been properly updated to use the new table references while preserving the existing query logic and functionality.

Also applies to: 117-118, 134-135

@@ -46,7 +46,7 @@ def load_sites(data: pd.DataFrame):
     big_query_api = BigQueryApi()
     big_query_api.update_sites_and_devices(
         dataframe=data,
-        table=big_query_api.sites_table,
+        table=big_query_api.sites_sites_table,

⚠️ Potential issue

Inconsistent table references across DAGs

While the first DAG has been updated to use the new table references (sites_sites_table and devices_devices_table), the second DAG "Update-BigQuery-Sites-Grids-And-Devices" still uses the old references (sites_table and devices_table). This inconsistency could lead to data synchronization issues.

Apply this diff to update the second DAG:

    def load_sites(data: pd.DataFrame):
        from airqo_etl_utils.bigquery_api import BigQueryApi

        big_query_api = BigQueryApi()
        big_query_api.update_sites_and_devices(
            dataframe=data,
-           table=big_query_api.sites_table,
+           table=big_query_api.sites_sites_table,
            component="sites",
        )

    def load_devices(data: pd.DataFrame):
        from airqo_etl_utils.bigquery_api import BigQueryApi

        big_query_api = BigQueryApi()
        big_query_api.update_sites_and_devices(
            dataframe=data,
-           table=big_query_api.devices_table,
+           table=big_query_api.devices_devices_table,
            component="devices",
        )

Also applies to: 78-78

@Baalmart merged commit 1396ff4 into airqo-platform:staging on Dec 12, 2024, with 45 of 46 checks passed.
@Baalmart mentioned this pull request on Dec 12, 2024 (1 task).
@coderabbitai bot mentioned this pull request on Dec 13, 2024 (2 tasks).