Update fix/clean up #4058
Conversation
Updates from airqo staging
📝 Walkthrough

The pull request introduces significant updates across several files.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (4)
src/analytics/config.py (2)
35-37: Document new environment variables in deployment guides

The addition of the BIGQUERY_DEVICES_DEVICES and BIGQUERY_SITES_SITES configurations supports the schema changes, but requires documentation updates:
- Update deployment documentation to include these new variables
- Consider adding comments explaining the purpose of these tables
Consider creating a configuration documentation file (one possible shape is sketched after this list) that explains:
- Purpose of each table
- Relationships between tables
- Required environment variables
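One possible shape for that documentation, sketched as commented configuration entries; the table purposes and the use of os.getenv in place of the project's env_var helper are assumptions, not taken from the codebase:

import os


class Config:
    # Devices registry table (purpose assumed from the name; confirm in the deployment docs).
    BIGQUERY_DEVICES = os.getenv("BIGQUERY_DEVICES")
    # New devices table introduced alongside the schema changes in this PR.
    BIGQUERY_DEVICES_DEVICES = os.getenv("BIGQUERY_DEVICES_DEVICES")

    # Sites metadata table and its new counterpart; both must be set in every environment.
    BIGQUERY_SITES = os.getenv("BIGQUERY_SITES")
    BIGQUERY_SITES_SITES = os.getenv("BIGQUERY_SITES_SITES")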
35-38: Consider grouping related BigQuery configurations

The BigQuery configurations are growing in number. Consider organizing related configurations together for better maintainability.

class Config:
    # ... other configs ...

    # Device-related BigQuery tables
-   BIGQUERY_DEVICES = env_var("BIGQUERY_DEVICES")
-   BIGQUERY_DEVICES_DEVICES = env_var("BIGQUERY_DEVICES_DEVICES")
+   BIGQUERY_TABLES_DEVICE = {
+       "devices": env_var("BIGQUERY_DEVICES"),
+       "devices_devices": env_var("BIGQUERY_DEVICES_DEVICES"),
+   }

    # Site-related BigQuery tables
-   BIGQUERY_SITES = env_var("BIGQUERY_SITES")
-   BIGQUERY_SITES_SITES = env_var("BIGQUERY_SITES_SITES")
-   BIGQUERY_AIRQLOUDS_SITES = env_var("BIGQUERY_AIRQLOUDS_SITES")
+   BIGQUERY_TABLES_SITE = {
+       "sites": env_var("BIGQUERY_SITES"),
+       "sites_sites": env_var("BIGQUERY_SITES_SITES"),
+       "airqlouds_sites": env_var("BIGQUERY_AIRQLOUDS_SITES"),
+   }

src/workflows/airqo_etl_utils/meta_data_utils.py (1)
149-149: LGTM! Consider moving timestamp logic to a utility function

The addition of the last_updated timestamp using UTC is correct and well-placed. However, since this pattern might be used in other places (like in the extract_devices_from_api method), consider extracting it to a utility function.

Consider creating a utility function:

@staticmethod
def add_last_updated(df: pd.DataFrame) -> pd.DataFrame:
    df["last_updated"] = datetime.now(timezone.utc)
    return df
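For context, a self-contained sketch of how such a helper behaves; the containing class name and the sample frame are illustrative, not taken from the codebase:

from datetime import datetime, timezone

import pandas as pd


class MetaDataHelpers:  # illustrative container; the real class in meta_data_utils.py may differ
    @staticmethod
    def add_last_updated(df: pd.DataFrame) -> pd.DataFrame:
        # Stamp every row with the same timezone-aware UTC timestamp.
        df["last_updated"] = datetime.now(timezone.utc)
        return df


# Example: stamping a small devices frame before loading it into BigQuery.
devices = pd.DataFrame({"device_id": ["aq_01", "aq_02"]})
devices = MetaDataHelpers.add_last_updated(devices)
print(devices)  # both rows carry the same UTC last_updated value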
src/workflows/airqo_etl_utils/bigquery_api.py (1)

Line range hint 545-545: Remove debug file write operation.

The line dataframe.to_csv("raw_data50.csv") appears to be debug code that was accidentally left in. Writing to files in production can cause disk space and performance issues.

- dataframe.to_csv("raw_data50.csv")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
src/analytics/api/models/events.py (9 hunks)
src/analytics/config.py (1 hunks)
src/workflows/airqo_etl_utils/bigquery_api.py (2 hunks)
src/workflows/airqo_etl_utils/config.py (2 hunks)
src/workflows/airqo_etl_utils/meta_data_utils.py (1 hunks)
src/workflows/airqo_etl_utils/schema/devices.json (2 hunks)
src/workflows/airqo_etl_utils/schema/sites.json (1 hunks)
src/workflows/dags/meta_data.py (2 hunks)
🔇 Additional comments (9)
src/workflows/airqo_etl_utils/schema/devices.json (1)
25-25: Schema changes require careful migration planning

The changes to make device_id and last_updated required fields improve data integrity by ensuring essential tracking information is always present. However, this introduces stricter requirements that need careful consideration:
- Existing records might need updating to comply with the new schema
- Data ingestion processes must be updated to always provide these fields
Also applies to: 55-55
✅ Verification successful
Let me search for more specific patterns to understand the data ingestion and validation mechanisms in place. The previous search might have been too specific, so I'm broadening the scope while maintaining focus on device-related operations.
Schema change appears safe to proceed
Based on the codebase analysis, the schema changes to make device_id and last_updated required fields appear safe to implement because:

- Existing code already treats device_id as a critical field:
  - Device validation logic is in place across multiple services
  - The BigQuery connector already validates device_id presence (device_id is not None and device_id.strip() != "")
- Data ingestion processes handle these fields appropriately:
  - The Kafka-BigQuery connector includes device_id in its transformations
  - ETL utilities consistently include device-related fields in their schemas
  - Validation mechanisms are present in the device-registry service
The change aligns with existing practices and won't disrupt current operations since the fields are already being treated as essential in the codebase.
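For illustration only, a minimal pre-load check of the kind the analysis describes could look like this; the function name, the exact required columns, and the backfill policy are assumptions rather than code from this repository:

import pandas as pd

REQUIRED_DEVICE_FIELDS = ["device_id", "last_updated"]  # assumed to mirror devices.json


def validate_required_device_fields(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical pre-load check; not code from this PR.
    missing = [col for col in REQUIRED_DEVICE_FIELDS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    # Keep only rows with a non-empty device_id, mirroring the existing connector check
    # (device_id is not None and device_id.strip() != "").
    mask = df["device_id"].notna() & (df["device_id"].astype(str).str.strip() != "")
    df = df.loc[mask].copy()

    # Backfill last_updated so legacy rows satisfy the new REQUIRED constraint.
    df["last_updated"] = pd.to_datetime(df["last_updated"], utc=True)
    df["last_updated"] = df["last_updated"].fillna(pd.Timestamp.now(tz="UTC"))
    return df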
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for any existing records that might violate the new schema
# Search for relevant data ingestion code
rg -A 3 "INSERT.*devices.*table"
# Look for potential data validation or transformation logic
ast-grep --pattern 'def $_(self, $$$):
$$$
device_id
$$$'
Length of output: 100
Script:
#!/bin/bash
# Let's try a broader search for device-related ingestion patterns
rg -l "device.*INSERT|INSERT.*device"
# Search for BigQuery table operations
rg -A 3 "bigquery.*device"
# Look for data transformation or ETL files
fd -t f "etl|transform|ingest"
# Check for any validation logic in Python files
rg -A 3 "device_id.*None|device_id.*required"
# Search for any device-related schema validation
rg -A 3 "schema.*device|validate.*device"
Length of output: 11287
src/workflows/airqo_etl_utils/schema/sites.json (1)
69-69: Verify data preservation strategy for schema changes

The change from DATE to TIMESTAMP for last_updated provides better temporal precision, which aligns well with real-time data tracking needs. However, there are some considerations:
- The increased precision requires proper timezone handling
- Existing DATE values will need conversion to TIMESTAMP
✅ Verification successful
Timezone handling is properly implemented for TIMESTAMP data
The codebase shows comprehensive timezone handling across multiple components:
- ETL workflows consistently use UTC timezone (datetime.now(timezone.utc)) for data processing
- BigQuery operations handle timezone conversion (pd.to_datetime(..., utc=True))
- The last_updated field is consistently populated with UTC timestamps in data processing flows
The change from DATE to TIMESTAMP is well-supported by the existing timezone-aware infrastructure, ensuring data consistency and proper temporal tracking.
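As an illustration of the DATE-to-TIMESTAMP conversion mentioned above, a minimal pandas sketch; the column names and sample values are assumptions:

import pandas as pd

# A legacy frame whose last_updated column came from a DATE field (no time, no timezone).
sites = pd.DataFrame(
    {"site_id": ["site_a", "site_b"], "last_updated": ["2024-11-01", "2024-11-15"]}
)

# Promote the DATE values to timezone-aware UTC timestamps; midnight UTC is assumed
# for the missing time-of-day component.
sites["last_updated"] = pd.to_datetime(sites["last_updated"], utc=True)

print(sites["last_updated"].dtype)  # datetime64[ns, UTC]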
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Look for timezone handling in the codebase
rg -A 3 "datetime.*utc|timezone"
# Search for date/time conversion logic
ast-grep --pattern 'def $_(self, $$$):
$$$
last_updated
$$$'
Length of output: 54290
src/workflows/airqo_etl_utils/config.py (1)
60-63: LGTM! Verify environment variables in deployment
The new BigQuery table configurations and their schema mappings are correctly implemented. Ensure these environment variables are set in all deployment environments.
Run the following script to verify the environment variables:
Also applies to: 376-381
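A minimal sketch of such a check, assuming the variables are named exactly as introduced above (BIGQUERY_DEVICES_DEVICES and BIGQUERY_SITES_SITES):

import os
import sys

# Variable names assumed to match the additions in config.py.
REQUIRED_VARS = ["BIGQUERY_DEVICES_DEVICES", "BIGQUERY_SITES_SITES"]

missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
if missing:
    sys.exit(f"Missing environment variables: {', '.join(missing)}")
print("All required BigQuery table variables are set.")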
src/workflows/airqo_etl_utils/bigquery_api.py (2)
49-49: LGTM! New table references properly integrated.

The addition of sites_sites_table and devices_devices_table is well-structured and consistent with the existing pattern of table reference declarations. These additions align with the PR objective of introducing new tables for improved data management.
Also applies to: 58-58
Line range hint 558-577: Excellent docstring improvements!
The updated docstring is comprehensive and well-structured, providing clear documentation of:
- Method purpose and behavior
- Data cleaning steps
- Parameters and return values
- Exception handling
src/analytics/api/models/events.py (4)
30-30: LGTM! Constants properly defined.

The new constants BIGQUERY_SITES_SITES and BIGQUERY_DEVICES_DEVICES are well-defined and consistent with the existing naming pattern. They properly reference the configuration values using f-strings.
Also applies to: 32-32
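A rough sketch of that pattern; the Config attribute values, the backtick quoting, and the project/dataset names are assumptions, not lines from events.py:

class Config:  # stand-in for the real analytics configuration object
    BIGQUERY_SITES_SITES = "example-project.metadata.sites_sites"
    BIGQUERY_DEVICES_DEVICES = "example-project.metadata.devices_devices"


# Fully qualified table names wrapped in backticks so they can be interpolated into BigQuery SQL.
BIGQUERY_SITES_SITES = f"`{Config.BIGQUERY_SITES_SITES}`"
BIGQUERY_DEVICES_DEVICES = f"`{Config.BIGQUERY_DEVICES_DEVICES}`"

print(BIGQUERY_SITES_SITES)  # `example-project.metadata.sites_sites`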
56-56: LGTM! Instance variables properly initialized.
The new instance variables are correctly initialized using their corresponding constants, maintaining consistency with the existing initialization pattern.
Also applies to: 59-59
67-68: LGTM! Query properties updated consistently.
The query properties have been properly updated to use the new table references while maintaining the existing query structure and functionality.
Also applies to: 74-74, 79-79
99-100: LGTM! Query methods updated consistently.
The query methods have been properly updated to use the new table references while preserving the existing query logic and functionality.
Also applies to: 117-118, 134-135
@@ -46,7 +46,7 @@ def load_sites(data: pd.DataFrame):
     big_query_api = BigQueryApi()
     big_query_api.update_sites_and_devices(
         dataframe=data,
-        table=big_query_api.sites_table,
+        table=big_query_api.sites_sites_table,
Inconsistent table references across DAGs
While the first DAG has been updated to use the new table references (sites_sites_table and devices_devices_table), the second DAG "Update-BigQuery-Sites-Grids-And-Devices" still uses the old references (sites_table and devices_table). This inconsistency could lead to data synchronization issues.
Apply this diff to update the second DAG:
 def load_sites(data: pd.DataFrame):
     from airqo_etl_utils.bigquery_api import BigQueryApi

     big_query_api = BigQueryApi()
     big_query_api.update_sites_and_devices(
         dataframe=data,
-        table=big_query_api.sites_table,
+        table=big_query_api.sites_sites_table,
         component="sites",
     )


 def load_devices(data: pd.DataFrame):
     from airqo_etl_utils.bigquery_api import BigQueryApi

     big_query_api = BigQueryApi()
     big_query_api.update_sites_and_devices(
         dataframe=data,
-        table=big_query_api.devices_table,
+        table=big_query_api.devices_devices_table,
         component="devices",
     )
Also applies to: 78-78
Description
[Provide a brief description of the changes made in this PR]
Related Issues
Summary by CodeRabbit
New Features
Bug Fixes
Schema Updates
Documentation