
Update fix/optimize historical hourly measurements #3444

Conversation

@NicholasTurner23 (Contributor) commented on Sep 16, 2024

Description

This PR reduces resource utilization on the server by pushing most of the load to BigQuery, which is optimized for data querying and light aggregation.

This should resolve the Negsignal.SIGKILL errors, where the OS was killing the Celery workers due to resource exhaustion.
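
For illustration, a minimal before/after sketch of the idea, assuming a pandas DataFrame of raw readings; the column names (timestamp, device_id, pm2_5) and the table path are placeholders, not the actual schema:

import pandas as pd

def aggregate_on_worker(raw: pd.DataFrame) -> pd.DataFrame:
    # Before: the full raw dataset is pulled into worker memory and averaged
    # with pandas -- the load pattern that exhausted the Celery workers.
    return (
        raw.groupby([pd.Grouper(key="timestamp", freq="h"), "device_id"])
        .mean(numeric_only=True)
        .reset_index()
    )

# After: the same hourly averaging expressed as SQL, so BigQuery does the
# heavy lifting and the worker only receives already-averaged rows.
AGGREGATION_SQL = """
SELECT device_id,
       TIMESTAMP_TRUNC(timestamp, HOUR) AS hour,
       AVG(pm2_5) AS pm2_5
FROM `project.dataset.raw_measurements`
GROUP BY device_id, hour
"""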

Related Issues

  • JIRA cards:
    • OPS-285

Summary by CodeRabbit

  • New Features

    • Enhanced data extraction capabilities with a new dynamic_query parameter for more flexible querying.
    • Improved schema management through the introduction of a new schema mapping.
  • Bug Fixes

    • Corrected formatting in documentation for better clarity.
  • Refactor

    • Simplified data processing logic in the data extraction function.
    • Updated methods to handle lists of column data types for improved versatility.
  • Chores

    • Renamed schema mapping for better readability and maintainability.

@coderabbitai bot (Contributor) commented on Sep 16, 2024

Walkthrough

The pull request introduces significant modifications to the extract_aggregated_raw_data function, enhancing its flexibility by adding a dynamic_query parameter. It simplifies the internal logic by directly returning raw measurements from the BigQuery API, rather than performing data aggregation within the function. Additionally, the BigQueryApi class has been updated to improve schema management and error handling, while the Config class now uses constants for schema mappings. These changes collectively streamline data handling processes across the codebase.
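
As a rough sketch of the control flow this describes (the names query_data, dynamic_averaging_query, and compose_query come from this summary; the bodies and signatures below are assumptions, not the committed implementation):

def query_data(api, table: str, start: str, end: str, dynamic_query: bool = False):
    if dynamic_query:
        # Bypass the static query composition and let BigQuery do the averaging.
        query = api.dynamic_averaging_query(table, start, end)
    else:
        query = api.compose_query(table, start, end)
    # Return the (already light) result set as a DataFrame.
    return api.client.query(query).result().to_dataframe()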

Changes

Files and change summaries:

  • src/workflows/airqo_etl_utils/airqo_utils.py — Modified extract_aggregated_raw_data to include a dynamic_query parameter and simplified the internal logic to return raw measurements directly from the BigQuery API.
  • src/workflows/airqo_etl_utils/bigquery_api.py — Enhanced the BigQueryApi class with schema mapping from configuration, updated the validate_data and get_columns methods to accept lists of column types, added dynamic_query to query_data, and added a new dynamic_averaging_query method.
  • src/workflows/airqo_etl_utils/config.py — Renamed BIGQUERY_SCHEMA_MAPPING to SCHEMA_FILE_MAPPING, switching to constants for better readability.
  • src/workflows/airqo_etl_utils/data_validator.py — Adjusted the remove_outliers function to pass a list of column types to the get_columns method.
  • src/workflows/airqo_etl_utils/date.py — Corrected docstring formatting in get_dag_date_time_values.
  • src/workflows/dags/airqo_measurements.py — Added the dynamic_query parameter to the extract_aggregated_raw_data call in the extract_device_measurements function.

Suggested labels

python, ready for review, priority-medium

Poem

In code we trust, with changes bright,
Flexibility blooms, like stars at night.
From raw data's grasp, we now can see,
A dynamic dance of queries, oh so free!
With constants guiding, our paths align,
In the world of code, all will be fine! 🌟



@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits: files that changed from the base of the PR and between 5b4cb05 and a97dfdf.

Files selected for processing (6)
  • src/workflows/airqo_etl_utils/airqo_utils.py (1 hunks)
  • src/workflows/airqo_etl_utils/bigquery_api.py (6 hunks)
  • src/workflows/airqo_etl_utils/config.py (1 hunks)
  • src/workflows/airqo_etl_utils/data_validator.py (1 hunks)
  • src/workflows/airqo_etl_utils/date.py (1 hunks)
  • src/workflows/dags/airqo_measurements.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • src/workflows/airqo_etl_utils/date.py
Additional context used
Ruff
src/workflows/airqo_etl_utils/bigquery_api.py

168-168: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

Additional comments not posted (12)
src/workflows/airqo_etl_utils/data_validator.py (3)

85-85: LGTM!

The change to pass a list of column types to the get_columns method is consistent with the other calls in the function and does not affect the core logic.


88-88: Looks good!

Similar to the previous change, passing a list of column types to the get_columns method is consistent and does not affect the function's logic.


91-93: Nicely done!

The change to pass a list of column types to the get_columns method is consistent with the other calls in the function. Splitting the method call across multiple lines also improves readability without affecting the functionality.

src/workflows/airqo_etl_utils/config.py (1)

221-244: Excellent refactoring to improve code readability and maintainability!

The change from using string literals to defined constants for the dictionary keys is a great enhancement. It makes the codebase more clear and less error-prone.

The functionality remains intact, as the mappings still point to the same JSON schema files. Overall, this is a positive change that improves the quality of the codebase without altering the behavior.
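
A hedged illustration of the pattern (the constant name, table key, and schema file below are placeholders, not the real config values):

# Before: raw string keys, easy to mistype and hard to trace across modules.
BIGQUERY_SCHEMA_MAPPING = {
    "hourly_measurements": "measurements.json",
}

# After: a named constant as the key, so renames and lookups stay consistent.
HOURLY_MEASUREMENTS_TABLE = "hourly_measurements"

SCHEMA_FILE_MAPPING = {
    HOURLY_MEASUREMENTS_TABLE: "measurements.json",
}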

src/workflows/dags/airqo_measurements.py (1)

38-40: The addition of the dynamic_query parameter aligns with the PR objective of leveraging BigQuery for improved efficiency.

Please ensure that the dynamic_query functionality within the extract_aggregated_raw_data method is thoroughly tested and documented to avoid potential issues.

To verify the impact of this change, consider running the following script:

Verification successful

The dynamic_query parameter is well-implemented and aligns with the PR objective.

I've verified that the dynamic_query parameter is correctly implemented and documented in the codebase. Here are the key findings:

  • The parameter is defined with a default value of False, ensuring backward compatibility.
  • It's well-documented: "A boolean value to signal bypassing the automatic query composition to a more dynamic averaging approach."
  • The code uses conditional statements based on dynamic_query, indicating different behavior when enabled.

While we couldn't locate the exact extract_aggregated_raw_data function or AirQoDataUtils class, the usage of dynamic_query in the codebase suggests it's being utilized as intended.

Great job on implementing this feature! It should indeed contribute to improved efficiency by leveraging BigQuery's capabilities.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage and impact of the `dynamic_query` parameter.

# Test 1: Search for the usage of `dynamic_query` within `extract_aggregated_raw_data`. 
# Expect: Proper utilization of the parameter to enable flexible querying.
ast-grep --lang python --pattern $'def extract_aggregated_raw_data($$$):
  $$$
  if dynamic_query:
    $$$
  $$$'

# Test 2: Compare query performance with and without `dynamic_query`. 
# Expect: Improved performance with `dynamic_query` enabled.
echo "Manually test query performance and document the results."

# Test 3: Verify the documentation of the `dynamic_query` parameter.
# Expect: Clear explanation of the parameter's purpose and usage.
rg --type python $'dynamic_query'

Length of output: 303


Script:

#!/bin/bash

# Search for the extract_aggregated_raw_data function
echo "Searching for extract_aggregated_raw_data function:"
ast-grep --lang python --pattern 'def extract_aggregated_raw_data($$$):
  $$$'

# Search for dynamic_query in Python files
echo -e "\nSearching for dynamic_query in Python files:"
fd -e py -x rg 'dynamic_query' {}

# Search for AirQoDataUtils class
echo -e "\nSearching for AirQoDataUtils class:"
ast-grep --lang python --pattern 'class AirQoDataUtils($$$):
  $$$'

Length of output: 849

src/workflows/airqo_etl_utils/bigquery_api.py (6)

23-23: LGTM!

Initializing the schema_mapping attribute from the configuration is a good practice for maintainability and flexibility.


139-139: LGTM!

The changes to accept a list of ColumnDataType enums and retrieve columns based on the list of column types are logically correct and allow for more versatile column type filtering.

Also applies to: 145-145, 151-151
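
A minimal sketch of what filtering by a list of column types can look like; the ColumnDataType members and the schema shape here are assumptions, not the actual enum or API:

from enum import Enum
from typing import Dict, List

class ColumnDataType(Enum):
    NONE = "none"
    FLOAT = "FLOAT"
    TIMESTAMP = "TIMESTAMP"
    STRING = "STRING"

def get_columns(schema: List[Dict[str, str]], column_type: List[ColumnDataType]) -> List[str]:
    # [ColumnDataType.NONE] acts as "no filter": every column is returned.
    if ColumnDataType.NONE in column_type:
        return [field["name"] for field in schema]
    wanted = {c.value for c in column_type}
    return [field["name"] for field in schema if field["type"] in wanted]

# Example: keep only the float columns of a toy schema.
schema = [
    {"name": "timestamp", "type": "TIMESTAMP"},
    {"name": "pm2_5", "type": "FLOAT"},
]
print(get_columns(schema, [ColumnDataType.FLOAT]))  # ['pm2_5']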


166-168: LGTM!

The changes to accept a list of column types and retrieve columns based on the list are logically correct and allow for more flexible column retrieval. The improved error handling for invalid tables is also a good addition.



518-539: LGTM!

The compose_query method correctly composes a SQL query for BigQuery based on the query type and other parameters. The docstring provides a clear description of the method's parameters and return value.


617-621: LGTM!

The changes to the query_data method, including the new dynamic_query parameter and the usage of the dynamic_averaging_query method, allow for a more dynamic approach to querying data. The dynamic_averaging_query method constructs a SQL query to average numeric columns dynamically, which can be useful for certain use cases. The docstring provides a clear description of the method's parameters and return value.

Also applies to: 623-638, 639-653, 660-662


665-737: LGTM!

The dynamic_averaging_query method constructs a dynamic SQL query to select and average numeric columns based on the provided parameters, which can be useful for certain use cases. The validation for the time granularity parameter ensures that only valid values are accepted. The docstring provides a clear description of the method's parameters, return value, and includes an example usage, which enhances the method's usability and maintainability.
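
For a sense of the technique, a sketch under assumed parameter names (the committed method's exact signature lives in bigquery_api.py and may differ):

from typing import List

VALID_GRANULARITIES = {"HOUR", "DAY", "WEEK", "MONTH"}

def dynamic_averaging_query(
    table: str,
    numeric_columns: List[str],
    group_by: List[str],
    time_granularity: str = "HOUR",
) -> str:
    # Reject unsupported granularities up front, mirroring the validation
    # praised above.
    granularity = time_granularity.upper()
    if granularity not in VALID_GRANULARITIES:
        raise ValueError(f"Invalid time granularity: {time_granularity}")
    # Average every numeric column; group by the requested keys plus the
    # truncated timestamp.
    averages = ", ".join(f"AVG({col}) AS {col}" for col in numeric_columns)
    keys = ", ".join(group_by)
    return (
        f"SELECT {keys}, TIMESTAMP_TRUNC(timestamp, {granularity}) AS period, "
        f"{averages} FROM `{table}` GROUP BY {keys}, period"
    )

# Example
print(dynamic_averaging_query("project.dataset.raw", ["pm2_5", "pm10"], ["device_id"]))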

src/workflows/airqo_etl_utils/airqo_utils.py (1)

120-122: Verify the usage of the dynamic_query parameter.

The addition of the dynamic_query parameter suggests support for different querying scenarios, which is a good enhancement. However, the parameter is not being utilized within the updated function body.

Please ensure that the dynamic_query parameter is being used as intended. If it's not needed, consider removing it to avoid confusion.

Review context (src/workflows/airqo_etl_utils/bigquery_api.py):

-    self, table: str = "all", column_type: ColumnDataType = ColumnDataType.NONE
+    self,
+    table: str = "all",
+    column_type: List[ColumnDataType] = [ColumnDataType.NONE],
@coderabbitai bot (Contributor) commented on Sep 16, 2024

Use None instead of a mutable data structure for the default argument.

The static analysis tool suggests using None instead of a mutable data structure for the default argument and initializing it within the function.

Apply this diff to fix the issue:

-    def get_columns(
-        self,
-        table: str = "all",
-        column_type: List[ColumnDataType] = [ColumnDataType.NONE],
-    ) -> List[str]:
+    def get_columns(
+        self,
+        table: str = "all",
+        column_type: List[ColumnDataType] = None,
+    ) -> List[str]:
+        column_type = column_type or [ColumnDataType.NONE]

Do you want me to open a GitHub issue to track this change?
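
For readers unfamiliar with B006, a standalone demonstration of why a mutable default is risky (illustrative only, unrelated to the AirQo code):

def append_item(item, bucket=[]):  # the B006 pitfall: one list, created once
    bucket.append(item)
    return bucket

print(append_item(1))  # [1]
print(append_item(2))  # [1, 2] -- the "fresh" default still holds the 1

def append_item_fixed(item, bucket=None):
    bucket = [] if bucket is None else bucket  # new list per call
    bucket.append(item)
    return bucket

print(append_item_fixed(1))  # [1]
print(append_item_fixed(2))  # [2]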


@NicholasTurner23 (Contributor, Author) replied:

There is an implementation that requires a list just below.

@coderabbitai bot (Contributor) replied:

@NicholasTurner23, understood. Thanks for the clarification.

(_/)
(•.•)
(⊃🔍⊂)


Learnings added
Learnt from: NicholasTurner23
PR: airqo-platform/AirQo-api#3444
File: src/workflows/airqo_etl_utils/bigquery_api.py:168-168
Timestamp: 2024-09-16T19:16:23.374Z
Learning: In the `get_columns` method of `bigquery_api.py`, the default value for `column_type` must be a list due to implementation requirements.


@Baalmart (Contributor) left a comment