
Update fix/optimize historical hourly measurements #3444

Conversation

@NicholasTurner23 (Contributor) commented on Sep 16, 2024

Description

This PR reduces resource utilization on the server by pushing most of the load to BigQuery, which is optimized for data querying and light aggregation.

This should resolve the Negsignal.SIGKILL errors, where the OS was killing the Celery workers due to resource exhaustion.
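
For illustration, a minimal before/after sketch of the idea, assuming a pandas DataFrame of raw readings; the column names (timestamp, device_id, pm2_5) and the table path are placeholders, not the actual schema:

import pandas as pd

def aggregate_on_worker(raw: pd.DataFrame) -> pd.DataFrame:
    # Before: the full raw dataset is pulled into worker memory and averaged
    # with pandas -- the load pattern that exhausted the Celery workers.
    return (
        raw.groupby([pd.Grouper(key="timestamp", freq="h"), "device_id"])
        .mean(numeric_only=True)
        .reset_index()
    )

# After: the same hourly averaging expressed as SQL, so BigQuery does the
# heavy lifting and the worker only receives already-averaged rows.
AGGREGATION_SQL = """
SELECT device_id,
       TIMESTAMP_TRUNC(timestamp, HOUR) AS hour,
       AVG(pm2_5) AS pm2_5
FROM `project.dataset.raw_measurements`
GROUP BY device_id, hour
"""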

Related Issues

  • JIRA cards:
    • OPS-285

Summary by CodeRabbit

  • New Features

    • Enhanced data extraction capabilities with a new dynamic_query parameter for more flexible querying.
    • Improved schema management through the introduction of a new schema mapping.
  • Bug Fixes

    • Corrected formatting in documentation for better clarity.
  • Refactor

    • Simplified data processing logic in the data extraction function.
    • Updated methods to handle lists of column data types for improved versatility.
  • Chores

    • Renamed schema mapping for better readability and maintainability.

@coderabbitai bot (Contributor) commented on Sep 16, 2024

Walkthrough

The pull request introduces significant modifications to the extract_aggregated_raw_data function, enhancing its flexibility by adding a dynamic_query parameter. It simplifies the internal logic by directly returning raw measurements from the BigQuery API, rather than performing data aggregation within the function. Additionally, the BigQueryApi class has been updated to improve schema management and error handling, while the Config class now uses constants for schema mappings. These changes collectively streamline data handling processes across the codebase.
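
As a rough sketch of the control flow this describes (the names query_data, dynamic_averaging_query, and compose_query come from this summary; the bodies and signatures below are assumptions, not the committed implementation):

def query_data(api, table: str, start: str, end: str, dynamic_query: bool = False):
    if dynamic_query:
        # Bypass the static query composition and let BigQuery do the averaging.
        query = api.dynamic_averaging_query(table, start, end)
    else:
        query = api.compose_query(table, start, end)
    # Return the (already light) result set as a DataFrame.
    return api.client.query(query).result().to_dataframe()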

Changes

Files and change summaries:

  • src/workflows/airqo_etl_utils/airqo_utils.py — Modified extract_aggregated_raw_data to include a dynamic_query parameter and simplified the internal logic to return raw measurements directly from the BigQuery API.
  • src/workflows/airqo_etl_utils/bigquery_api.py — Enhanced the BigQueryApi class with schema mapping from configuration, updated the validate_data and get_columns methods to accept lists of column types, added dynamic_query to query_data, and added a new dynamic_averaging_query method.
  • src/workflows/airqo_etl_utils/config.py — Renamed BIGQUERY_SCHEMA_MAPPING to SCHEMA_FILE_MAPPING, switching to constants for better readability.
  • src/workflows/airqo_etl_utils/data_validator.py — Adjusted the remove_outliers function to pass a list of column types to the get_columns method.
  • src/workflows/airqo_etl_utils/date.py — Corrected docstring formatting in get_dag_date_time_values.
  • src/workflows/dags/airqo_measurements.py — Added the dynamic_query parameter to the extract_aggregated_raw_data call in the extract_device_measurements function.

Suggested labels

python, ready for review, priority-medium

Poem

In code we trust, with changes bright,
Flexibility blooms, like stars at night.
From raw data's grasp, we now can see,
A dynamic dance of queries, oh so free!
With constants guiding, our paths align,
In the world of code, all will be fine! 🌟



@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits: files that changed from the base of the PR and between 5b4cb05 and a97dfdf.

Files selected for processing (6)
  • src/workflows/airqo_etl_utils/airqo_utils.py (1 hunks)
  • src/workflows/airqo_etl_utils/bigquery_api.py (6 hunks)
  • src/workflows/airqo_etl_utils/config.py (1 hunks)
  • src/workflows/airqo_etl_utils/data_validator.py (1 hunks)
  • src/workflows/airqo_etl_utils/date.py (1 hunks)
  • src/workflows/dags/airqo_measurements.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • src/workflows/airqo_etl_utils/date.py
Additional context used
Ruff
src/workflows/airqo_etl_utils/bigquery_api.py

168-168: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

Additional comments not posted (12)
src/workflows/airqo_etl_utils/data_validator.py (3)

85-85: LGTM!

The change to pass a list of column types to the get_columns method is consistent with the other calls in the function and does not affect the core logic.


88-88: Looks good!

Similar to the previous change, passing a list of column types to the get_columns method is consistent and does not affect the function's logic.


91-93: Nicely done!

The change to pass a list of column types to the get_columns method is consistent with the other calls in the function. Splitting the method call across multiple lines also improves readability without affecting the functionality.

src/workflows/airqo_etl_utils/config.py (1)

221-244: Excellent refactoring to improve code readability and maintainability!

The change from using string literals to defined constants for the dictionary keys is a great enhancement. It makes the codebase more clear and less error-prone.

The functionality remains intact, as the mappings still point to the same JSON schema files. Overall, this is a positive change that improves the quality of the codebase without altering the behavior.
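
A hedged illustration of the pattern (the constant name, table key, and schema file below are placeholders, not the real config values):

# Before: raw string keys, easy to mistype and hard to trace across modules.
BIGQUERY_SCHEMA_MAPPING = {
    "hourly_measurements": "measurements.json",
}

# After: a named constant as the key, so renames and lookups stay consistent.
HOURLY_MEASUREMENTS_TABLE = "hourly_measurements"

SCHEMA_FILE_MAPPING = {
    HOURLY_MEASUREMENTS_TABLE: "measurements.json",
}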

src/workflows/dags/airqo_measurements.py (1)

38-40: The addition of the dynamic_query parameter aligns with the PR objective of leveraging BigQuery for improved efficiency.

Please ensure that the dynamic_query functionality within the extract_aggregated_raw_data method is thoroughly tested and documented to avoid potential issues.

To verify the impact of this change, consider running the following script:

Verification successful

The dynamic_query parameter is well-implemented and aligns with the PR objective.

I've verified that the dynamic_query parameter is correctly implemented and documented in the codebase. Here are the key findings:

  • The parameter is defined with a default value of False, ensuring backward compatibility.
  • It's well-documented: "A boolean value to signal bypassing the automatic query composition to a more dynamic averaging approach."
  • The code uses conditional statements based on dynamic_query, indicating different behavior when enabled.

While we couldn't locate the exact extract_aggregated_raw_data function or AirQoDataUtils class, the usage of dynamic_query in the codebase suggests it's being utilized as intended.

Great job on implementing this feature! It should indeed contribute to improved efficiency by leveraging BigQuery's capabilities.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage and impact of the `dynamic_query` parameter.

# Test 1: Search for the usage of `dynamic_query` within `extract_aggregated_raw_data`. 
# Expect: Proper utilization of the parameter to enable flexible querying.
ast-grep --lang python --pattern $'def extract_aggregated_raw_data($$$):
  $$$
  if dynamic_query:
    $$$
  $$$'

# Test 2: Compare query performance with and without `dynamic_query`. 
# Expect: Improved performance with `dynamic_query` enabled.
echo "Manually test query performance and document the results."

# Test 3: Verify the documentation of the `dynamic_query` parameter.
# Expect: Clear explanation of the parameter's purpose and usage.
rg --type python $'dynamic_query'

Length of output: 303


Script:

#!/bin/bash

# Search for the extract_aggregated_raw_data function
echo "Searching for extract_aggregated_raw_data function:"
ast-grep --lang python --pattern 'def extract_aggregated_raw_data($$$):
  $$$'

# Search for dynamic_query in Python files
echo -e "\nSearching for dynamic_query in Python files:"
fd -e py -x rg 'dynamic_query' {}

# Search for AirQoDataUtils class
echo -e "\nSearching for AirQoDataUtils class:"
ast-grep --lang python --pattern 'class AirQoDataUtils($$$):
  $$$'

Length of output: 849

src/workflows/airqo_etl_utils/bigquery_api.py (6)

23-23: LGTM!

Initializing the schema_mapping attribute from the configuration is a good practice for maintainability and flexibility.


139-139: LGTM!

The changes to accept a list of ColumnDataType enums and retrieve columns based on the list of column types are logically correct and allow for more versatile column type filtering.

Also applies to: 145-145, 151-151
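
A minimal sketch of what filtering by a list of column types can look like; the ColumnDataType members and the schema shape here are assumptions, not the actual enum or API:

from enum import Enum
from typing import Dict, List

class ColumnDataType(Enum):
    NONE = "none"
    FLOAT = "FLOAT"
    TIMESTAMP = "TIMESTAMP"
    STRING = "STRING"

def get_columns(schema: List[Dict[str, str]], column_type: List[ColumnDataType]) -> List[str]:
    # [ColumnDataType.NONE] acts as "no filter": every column is returned.
    if ColumnDataType.NONE in column_type:
        return [field["name"] for field in schema]
    wanted = {c.value for c in column_type}
    return [field["name"] for field in schema if field["type"] in wanted]

# Example: keep only the float columns of a toy schema.
schema = [
    {"name": "timestamp", "type": "TIMESTAMP"},
    {"name": "pm2_5", "type": "FLOAT"},
]
print(get_columns(schema, [ColumnDataType.FLOAT]))  # ['pm2_5']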


166-168: LGTM!

The changes to accept a list of column types and retrieve columns based on the list are logically correct and allow for more flexible column retrieval. The improved error handling for invalid tables is also a good addition.



518-539: LGTM!

The compose_query method correctly composes a SQL query for BigQuery based on the query type and other parameters. The docstring provides a clear description of the method's parameters and return value.


617-621: LGTM!

The changes to the query_data method, including the new dynamic_query parameter and the usage of the dynamic_averaging_query method, allow for a more dynamic approach to querying data. The dynamic_averaging_query method constructs a SQL query to average numeric columns dynamically, which can be useful for certain use cases. The docstring provides a clear description of the method's parameters and return value.

Also applies to: 623-638, 639-653, 660-662


665-737: LGTM!

The dynamic_averaging_query method constructs a dynamic SQL query to select and average numeric columns based on the provided parameters, which can be useful for certain use cases. The validation for the time granularity parameter ensures that only valid values are accepted. The docstring provides a clear description of the method's parameters, return value, and includes an example usage, which enhances the method's usability and maintainability.
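
For a sense of the technique, a sketch under assumed parameter names (the committed method's exact signature lives in bigquery_api.py and may differ):

from typing import List

VALID_GRANULARITIES = {"HOUR", "DAY", "WEEK", "MONTH"}

def dynamic_averaging_query(
    table: str,
    numeric_columns: List[str],
    group_by: List[str],
    time_granularity: str = "HOUR",
) -> str:
    # Reject unsupported granularities up front, mirroring the validation
    # praised above.
    granularity = time_granularity.upper()
    if granularity not in VALID_GRANULARITIES:
        raise ValueError(f"Invalid time granularity: {time_granularity}")
    # Average every numeric column; group by the requested keys plus the
    # truncated timestamp.
    averages = ", ".join(f"AVG({col}) AS {col}" for col in numeric_columns)
    keys = ", ".join(group_by)
    return (
        f"SELECT {keys}, TIMESTAMP_TRUNC(timestamp, {granularity}) AS period, "
        f"{averages} FROM `{table}` GROUP BY {keys}, period"
    )

# Example
print(dynamic_averaging_query("project.dataset.raw", ["pm2_5", "pm10"], ["device_id"]))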

src/workflows/airqo_etl_utils/airqo_utils.py (1)

120-122: Verify the usage of the dynamic_query parameter.

The addition of the dynamic_query parameter suggests support for different querying scenarios, which is a good enhancement. However, the parameter is not being utilized within the updated function body.

Please ensure that the dynamic_query parameter is being used as intended. If it's not needed, consider removing it to avoid confusion.

Review context (src/workflows/airqo_etl_utils/bigquery_api.py):

-    self, table: str = "all", column_type: ColumnDataType = ColumnDataType.NONE
+    self,
+    table: str = "all",
+    column_type: List[ColumnDataType] = [ColumnDataType.NONE],
@coderabbitai bot (Contributor) commented on Sep 16, 2024

Use None instead of a mutable data structure for the default argument.

The static analysis tool suggests using None instead of a mutable data structure for the default argument and initializing it within the function.

Apply this diff to fix the issue:

-    def get_columns(
-        self,
-        table: str = "all",
-        column_type: List[ColumnDataType] = [ColumnDataType.NONE],
-    ) -> List[str]:
+    def get_columns(
+        self,
+        table: str = "all",
+        column_type: List[ColumnDataType] = None,
+    ) -> List[str]:
+        column_type = column_type or [ColumnDataType.NONE]

Do you want me to open a GitHub issue to track this change?
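
For readers unfamiliar with B006, a standalone demonstration of why a mutable default is risky (illustrative only, unrelated to the AirQo code):

def append_item(item, bucket=[]):  # the B006 pitfall: one list, created once
    bucket.append(item)
    return bucket

print(append_item(1))  # [1]
print(append_item(2))  # [1, 2] -- the "fresh" default still holds the 1

def append_item_fixed(item, bucket=None):
    bucket = [] if bucket is None else bucket  # new list per call
    bucket.append(item)
    return bucket

print(append_item_fixed(1))  # [1]
print(append_item_fixed(2))  # [2]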


@NicholasTurner23 (Contributor, Author) replied:

There is an implementation that requires a list just below.

@coderabbitai bot (Contributor) replied:

@NicholasTurner23, understood. Thanks for the clarification.

(_/)
(•.•)
(⊃🔍⊂)


Learnings added
Learnt from: NicholasTurner23
PR: airqo-platform/AirQo-api#3444
File: src/workflows/airqo_etl_utils/bigquery_api.py:168-168
Timestamp: 2024-09-16T19:16:23.374Z
Learning: In the `get_columns` method of `bigquery_api.py`, the default value for `column_type` must be a list due to implementation requirements.


@Baalmart (Contributor) left a comment