
Update fix/clean up #4058

Merged
5 commits merged into airqo-platform:staging on Dec 12, 2024

Conversation

Contributor

@NicholasTurner23 commented Dec 12, 2024

Description


  1. Uses new tables to store and update sites and devices data in BigQuery.
  2. Updates analytics to use the new devices and sites tables.
  3. Updates the data cleaning method to drop empty columns (a minimal sketch of this step follows below).
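
A minimal sketch of the "drop empty columns" step, assuming a pandas DataFrame; the actual simple_data_cleaning implementation in events.py may differ:

import pandas as pd

def drop_empty_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Remove columns in which every value is missing (NaN/None).
    return df.dropna(axis=1, how="all")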

Related Issues

  • JIRA cards:
    • OPS-324

Summary by CodeRabbit

  • New Features

    • Added new configuration variables for BigQuery device and site tables.
    • Introduced new attributes for managing BigQuery tables in the API.
  • Bug Fixes

    • Enhanced error handling in data cleaning methods to raise errors for missing required columns (a minimal sketch follows this summary).
  • Schema Updates

    • Updated JSON schemas for devices and sites to enforce required fields and modified data types.
  • Documentation

    • Improved clarity in configuration and schema mappings to ensure consistency with new variables.
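
As an illustration of the error handling mentioned under Bug Fixes, a check of this kind might look like the following; the column names and the helper name are assumptions, not the project's actual code:

import pandas as pd

REQUIRED_COLUMNS = ["device_id", "timestamp"]  # example columns, assumed

def check_required_columns(df: pd.DataFrame) -> None:
    # Raise a descriptive error when any required column is absent.
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")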

Contributor

coderabbitai bot commented Dec 12, 2024

📝 Walkthrough

The pull request introduces significant updates across several files, primarily focusing on the EventsModel class and related configurations. Key changes include the addition of new class-level constants and instance variables for BigQuery table references, modifications to properties and methods to utilize these new references, and enhancements to data cleaning logic. Additionally, new environment variables are added for configuration management, and JSON schemas for devices and sites are updated to enforce stricter requirements. The overall structure and functionality of the affected classes remain intact.

Changes

  • src/analytics/api/models/events.py: Added constants BIGQUERY_SITES_SITES and BIGQUERY_DEVICES_DEVICES. Updated the constructor and methods to use the new table references. Modified data cleaning logic in simple_data_cleaning.
  • src/analytics/config.py: Added environment variables BIGQUERY_DEVICES_DEVICES and BIGQUERY_SITES_SITES.
  • src/workflows/airqo_etl_utils/bigquery_api.py: Added attributes self.sites_sites_table and self.devices_devices_table to manage the new BigQuery tables (a minimal sketch of this pattern follows the list).
  • src/workflows/airqo_etl_utils/config.py: Added environment variables BIGQUERY_DEVICES_DEVICES_TABLE and BIGQUERY_SITES_SITES_TABLE. Updated SCHEMA_FILE_MAPPING for the new variables.
  • src/workflows/airqo_etl_utils/meta_data_utils.py: Modified extract_sites_from_api to remove the location field and set last_updated to the current UTC time.
  • src/workflows/airqo_etl_utils/schema/devices.json: Changed device_id and last_updated fields from NULLABLE to REQUIRED.
  • src/workflows/airqo_etl_utils/schema/sites.json: Removed tenant and location fields. Changed last_updated type from DATE to TIMESTAMP.
  • src/workflows/dags/meta_data.py: Renamed big_query_api.sites_table to big_query_api.sites_sites_table and big_query_api.devices_table to big_query_api.devices_devices_table.
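
To make the bigquery_api.py attribute pattern concrete, here is a minimal, illustrative sketch; the attribute and variable names come from the change summary, and reading directly from the environment is a stand-in for the project's configuration object:

import os

class BigQueryApi:
    def __init__(self):
        # New table references added alongside the existing ones.
        self.sites_sites_table = os.getenv("BIGQUERY_SITES_SITES_TABLE")
        self.devices_devices_table = os.getenv("BIGQUERY_DEVICES_DEVICES_TABLE")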


Suggested reviewers

  • Baalmart
  • BenjaminSsempala
  • Psalmz777

🎉 In the land of code, where changes flow,
New constants arise, like stars in a show.
With tables updated and schemas refined,
The data now dances, perfectly aligned.
So let’s raise a cheer for this marvelous feat,
In the realm of BigQuery, our work is complete! 🎊



Contributor

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (4)
src/analytics/config.py (2)

35-37: Document new environment variables in deployment guides

The addition of BIGQUERY_DEVICES_DEVICES and BIGQUERY_SITES_SITES configurations supports the schema changes, but requires documentation updates:

  1. Update deployment documentation to include these new variables
  2. Consider adding comments explaining the purpose of these tables

Consider creating a configuration documentation file that explains:

  • Purpose of each table
  • Relationships between tables
  • Required environment variables

35-38: Consider grouping related BigQuery configurations

The BigQuery configurations are growing in number. Consider organizing related configurations together for better maintainability.

class Config:
    # ... other configs ...
    
    # Device-related BigQuery tables
-   BIGQUERY_DEVICES = env_var("BIGQUERY_DEVICES")
-   BIGQUERY_DEVICES_DEVICES = env_var("BIGQUERY_DEVICES_DEVICES")
+   BIGQUERY_TABLES_DEVICE = {
+       "devices": env_var("BIGQUERY_DEVICES"),
+       "devices_devices": env_var("BIGQUERY_DEVICES_DEVICES"),
+   }
    
    # Site-related BigQuery tables
-   BIGQUERY_SITES = env_var("BIGQUERY_SITES")
-   BIGQUERY_SITES_SITES = env_var("BIGQUERY_SITES_SITES")
-   BIGQUERY_AIRQLOUDS_SITES = env_var("BIGQUERY_AIRQLOUDS_SITES")
+   BIGQUERY_TABLES_SITE = {
+       "sites": env_var("BIGQUERY_SITES"),
+       "sites_sites": env_var("BIGQUERY_SITES_SITES"),
+       "airqlouds_sites": env_var("BIGQUERY_AIRQLOUDS_SITES"),
+   }
src/workflows/airqo_etl_utils/meta_data_utils.py (1)

149-149: LGTM! Consider moving timestamp logic to a utility function

The addition of the last_updated timestamp using UTC is correct and well-placed. However, since this pattern might be used in other places (like in the extract_devices_from_api method), consider extracting it to a utility function.

Consider creating a utility function:

@staticmethod
def add_last_updated(df: pd.DataFrame) -> pd.DataFrame:
    # Requires `from datetime import datetime, timezone` and pandas imported as pd.
    df["last_updated"] = datetime.now(timezone.utc)
    return df
src/workflows/airqo_etl_utils/bigquery_api.py (1)

Line range hint 545-545: Remove debug file write operation.

The line dataframe.to_csv("raw_data50.csv") appears to be debug code that was accidentally left in. Writing to files in production can cause disk space and performance issues.

-        dataframe.to_csv("raw_data50.csv")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b2cde0 and c50c928.

📒 Files selected for processing (8)
  • src/analytics/api/models/events.py (9 hunks)
  • src/analytics/config.py (1 hunks)
  • src/workflows/airqo_etl_utils/bigquery_api.py (2 hunks)
  • src/workflows/airqo_etl_utils/config.py (2 hunks)
  • src/workflows/airqo_etl_utils/meta_data_utils.py (1 hunks)
  • src/workflows/airqo_etl_utils/schema/devices.json (2 hunks)
  • src/workflows/airqo_etl_utils/schema/sites.json (1 hunks)
  • src/workflows/dags/meta_data.py (2 hunks)
🔇 Additional comments (9)
src/workflows/airqo_etl_utils/schema/devices.json (1)

25-25: Schema changes require careful migration planning

The changes to make device_id and last_updated required fields improve data integrity by ensuring essential tracking information is always present. However, this introduces stricter requirements that need careful consideration:

  1. Existing records might need updating to comply with the new schema
  2. Data ingestion processes must be updated to always provide these fields

Also applies to: 55-55

✅ Verification successful

Let me search for more specific patterns to understand the data ingestion and validation mechanisms in place. The previous search might have been too specific, so I'm broadening the scope while maintaining focus on device-related operations.


Schema change appears safe to proceed

Based on the codebase analysis, the schema changes to make device_id and last_updated required fields appear safe to implement because:

  1. Existing code already treats device_id as a critical field:

    • Device validation logic is in place across multiple services
    • The BigQuery connector already validates device_id presence (device_id is not None and device_id.strip() != "")
  2. Data ingestion processes handle these fields appropriately:

    • The Kafka-BigQuery connector includes device_id in its transformations
    • ETL utilities consistently include device-related fields in their schemas
    • Validation mechanisms are present in the device-registry service

The change aligns with existing practices and won't disrupt current operations since the fields are already being treated as essential in the codebase.
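
For reference, the presence check quoted above can be expressed as a small standalone helper; the function name here is hypothetical, and in the codebase the check is embedded in the connector logic:

def has_valid_device_id(device_id) -> bool:
    # Mirrors: device_id is not None and device_id.strip() != ""
    return device_id is not None and device_id.strip() != ""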

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for any existing records that might violate the new schema

# Search for relevant data ingestion code
rg -A 3 "INSERT.*devices.*table" 

# Look for potential data validation or transformation logic
ast-grep --pattern 'def $_(self, $$$):
  $$$
  device_id
  $$$'

Length of output: 100


Script:

#!/bin/bash
# Let's try a broader search for device-related ingestion patterns
rg -l "device.*INSERT|INSERT.*device" 

# Search for BigQuery table operations
rg -A 3 "bigquery.*device" 

# Look for data transformation or ETL files
fd -t f "etl|transform|ingest" 

# Check for any validation logic in Python files
rg -A 3 "device_id.*None|device_id.*required" 

# Search for any device-related schema validation
rg -A 3 "schema.*device|validate.*device"

Length of output: 11287

src/workflows/airqo_etl_utils/schema/sites.json (1)

69-69: Verify data preservation strategy for schema changes

The change from DATE to TIMESTAMP for last_updated provides better temporal precision, which aligns well with real-time data tracking needs. However, there are some considerations:

  1. The increased precision requires proper timezone handling
  2. Existing DATE values will need conversion to TIMESTAMP
✅ Verification successful

Timezone handling is properly implemented for TIMESTAMP data

The codebase shows comprehensive timezone handling across multiple components:

  • ETL workflows consistently use UTC timezone (datetime.now(timezone.utc)) for data processing
  • BigQuery operations handle timezone conversion (pd.to_datetime(..., utc=True))
  • The last_updated field is consistently populated with UTC timestamps in data processing flows

The change from DATE to TIMESTAMP is well-supported by the existing timezone-aware infrastructure, ensuring data consistency and proper temporal tracking.
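
A minimal sketch of converting an existing DATE-typed column to a UTC timestamp with pandas, mirroring the pd.to_datetime(..., utc=True) pattern noted above; the frame and column values are illustrative:

import pandas as pd

sites = pd.DataFrame({"last_updated": ["2024-12-01", "2024-12-10"]})
sites["last_updated"] = pd.to_datetime(sites["last_updated"], utc=True)
print(sites.dtypes)  # last_updated: datetime64[ns, UTC]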

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Look for timezone handling in the codebase
rg -A 3 "datetime.*utc|timezone"

# Search for date/time conversion logic
ast-grep --pattern 'def $_(self, $$$):
  $$$
  last_updated
  $$$'

Length of output: 54290

src/workflows/airqo_etl_utils/config.py (1)

60-63: LGTM! Verify environment variables in deployment

The new BigQuery table configurations and their schema mappings are correctly implemented. Ensure these environment variables are set in all deployment environments.

Run the following script to verify the environment variables:

Also applies to: 376-381
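
The verification script itself is collapsed in this excerpt; a minimal, illustrative check of the two new variables (names taken from the config change above) could look like this:

import os

required = ["BIGQUERY_DEVICES_DEVICES_TABLE", "BIGQUERY_SITES_SITES_TABLE"]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {missing}")
print("All required BigQuery table variables are set.")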

src/workflows/airqo_etl_utils/bigquery_api.py (2)

49-49: LGTM! New table references properly integrated.

The addition of sites_sites_table and devices_devices_table is well-structured and consistent with the existing pattern of table reference declarations. These additions align with the PR objective of introducing new tables for improved data management.

Also applies to: 58-58


Line range hint 558-577: Excellent docstring improvements!

The updated docstring is comprehensive and well-structured, providing clear documentation of:

  • Method purpose and behavior
  • Data cleaning steps
  • Parameters and return values
  • Exception handling
src/analytics/api/models/events.py (4)

30-30: LGTM! Constants properly defined.

The new constants BIGQUERY_SITES_SITES and BIGQUERY_DEVICES_DEVICES are well-defined and consistent with the existing naming pattern. They properly reference the configuration values using f-strings.

Also applies to: 32-32
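
An illustrative sketch of what such f-string constants might look like; the backtick quoting and the use of environment variables here are assumptions, and the real definitions in events.py reference the analytics Config object:

import os

BIGQUERY_SITES_SITES = f"`{os.getenv('BIGQUERY_SITES_SITES', '')}`"
BIGQUERY_DEVICES_DEVICES = f"`{os.getenv('BIGQUERY_DEVICES_DEVICES', '')}`"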


56-56: LGTM! Instance variables properly initialized.

The new instance variables are correctly initialized using their corresponding constants, maintaining consistency with the existing initialization pattern.

Also applies to: 59-59


67-68: LGTM! Query properties updated consistently.

The query properties have been properly updated to use the new table references while maintaining the existing query structure and functionality.

Also applies to: 74-74, 79-79


99-100: LGTM! Query methods updated consistently.

The query methods have been properly updated to use the new table references while preserving the existing query logic and functionality.

Also applies to: 117-118, 134-135

@@ -46,7 +46,7 @@ def load_sites(data: pd.DataFrame):
     big_query_api = BigQueryApi()
     big_query_api.update_sites_and_devices(
         dataframe=data,
-        table=big_query_api.sites_table,
+        table=big_query_api.sites_sites_table,

⚠️ Potential issue

Inconsistent table references across DAGs

While the first DAG has been updated to use the new table references (sites_sites_table and devices_devices_table), the second DAG "Update-BigQuery-Sites-Grids-And-Devices" still uses the old references (sites_table and devices_table). This inconsistency could lead to data synchronization issues.

Apply this diff to update the second DAG:

    def load_sites(data: pd.DataFrame):
        from airqo_etl_utils.bigquery_api import BigQueryApi

        big_query_api = BigQueryApi()
        big_query_api.update_sites_and_devices(
            dataframe=data,
-           table=big_query_api.sites_table,
+           table=big_query_api.sites_sites_table,
            component="sites",
        )

    def load_devices(data: pd.DataFrame):
        from airqo_etl_utils.bigquery_api import BigQueryApi

        big_query_api = BigQueryApi()
        big_query_api.update_sites_and_devices(
            dataframe=data,
-           table=big_query_api.devices_table,
+           table=big_query_api.devices_devices_table,
            component="devices",
        )

Also applies to: 78-78

@Baalmart merged commit 1396ff4 into airqo-platform:staging on Dec 12, 2024, with 45 of 46 checks passed.
@Baalmart mentioned this pull request on Dec 12, 2024 (1 task).
@coderabbitai bot mentioned this pull request on Dec 13, 2024 (2 tasks).