Setup job to retrieve satellite data #3338
Conversation
📝 Walkthrough

This pull request introduces several modifications across multiple files.

Changes
Possibly related PRs
Suggested reviewers
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff            @@
##           staging    #3338   +/-   ##
========================================
  Coverage    27.04%   27.04%
========================================
  Files          146      146
  Lines        21339    21339
  Branches       273      273
========================================
  Hits          5772     5772
  Misses       15567    15567
```
Actionable comments posted: 7
🧹 Outside diff range and nitpick comments (14)
src/workflows/dev-requirements.txt (3)
Line range hint 1-1: Consider creating an issue to track the migration to TOML for dependency management.

The TODO comment at the top of the file suggests a potential move to a TOML-based script for dependency management, which could provide better structure and readability. To ensure this improvement isn't forgotten, it would be beneficial to create a dedicated issue to track the task.
Would you like me to create a GitHub issue to track the migration of dependency management to a TOML script?
32-34: Excellent addition of new dependencies, but consider pinning the version for `ee`.

The addition of `ee`, `great_expectations`, and `airflow-provider-great-expectations` aligns well with the PR objective of setting up a job to retrieve satellite data. It's great to see that you've pinned the versions for `great_expectations` and `airflow-provider-great-expectations`, as this ensures reproducibility.

However, the `ee` dependency doesn't have a version specified. To maintain consistency and prevent potential issues with future updates, it would be beneficial to pin this version as well:

```diff
-ee
+ee==x.y.z  # Replace x.y.z with the desired version
```
Line range hint 1-35: Consider standardizing the version pinning approach across all dependencies.

The `dev-requirements.txt` file contains a mix of pinned and unpinned versions, as well as different version specification operators (`~=` and `==`). While this isn't necessarily incorrect, standardizing the approach could improve maintainability and reproducibility.

A few suggestions to consider:

- Decide on a consistent version pinning strategy. You could either:
  a) use strict pinning (`==`) for all dependencies to ensure complete reproducibility, or
  b) use the compatible release operator (`~=`) for all dependencies to allow for patch updates.
- For unpinned dependencies, consider adding version specifications to prevent potential issues with major updates.
- If you decide to allow patch updates, you might want to use a tool like `pip-compile` to generate a fully pinned `requirements.txt` file from this `dev-requirements.txt` file.

Would you like me to provide an example of how to standardize the version pinning for a few dependencies?
src/workflows/airqo_etl_utils/setup.py (1)
37-37: Excellent addition of the Earth Engine API dependency.

The inclusion of `earthengine-api` in the `install_requires` list aligns well with the PR's objective of setting up a job to retrieve satellite data, and ensures that the tools for interacting with Google Earth Engine are available within the project environment.

A few observations and suggestions:

- The addition harmonizes well with the existing dependencies, maintaining the package's coherence.
- Consider specifying a version constraint for `earthengine-api` to ensure compatibility and reproducibility, for instance `"earthengine-api>=0.1.323,<0.2.0"`.
- It might be beneficial to update the package's description or long description to reflect this new capability, enhancing clarity for future users.

Would you like assistance in formulating an updated description that encompasses this new functionality?
src/workflows/airqo_etl_utils/constants.py (2)
319-322: Excellent update to the CityModel enum!

The addition of a `DEFAULT` value is a thoughtful improvement, providing a fallback option for cases where a specific city might not be explicitly defined. This enhancement contributes to more robust and flexible code.

One minor note: the members already follow PEP 8 conventions for enum members (uppercase names with lowercase string values), so no naming changes are needed:

```python
class CityModel(Enum):
    NAIROBI = "nairobi"
    KAMPALA = "kampala"
    MOMBASA = "mombasa"
    DEFAULT = "default"
```
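As a sketch of how the `DEFAULT` member could actually act as a fallback, Python's `Enum` exposes a `_missing_` hook that is called when a value lookup fails. This is a hypothetical illustration, not something the PR necessarily implements:

```python
from enum import Enum


class CityModel(Enum):
    NAIROBI = "nairobi"
    KAMPALA = "kampala"
    MOMBASA = "mombasa"
    DEFAULT = "default"

    @classmethod
    def _missing_(cls, value):
        # Any unrecognized city name resolves to the DEFAULT model
        # instead of raising ValueError.
        return cls.DEFAULT
```

With this hook, `CityModel("gulu")` returns `CityModel.DEFAULT` rather than raising.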
324-334: Great addition of satellite cities data!

The `satellite_cities` constant provides a clear and structured way to store city coordinates for satellite data processing. This addition will enhance the system's capabilities for multi-city analysis.

A few suggestions to consider:

- Consider using a named tuple or a small class instead of a dictionary for each city. This would provide better type hinting and make the code more robust:

```python
from collections import namedtuple

City = namedtuple("City", ["name", "coords"])

satellite_cities = [
    City("kampala", (32.6313083, 0.336219)),
    City("nairobi", (36.886487, -1.243396)),
    # ... other cities ...
]
```

- The TODO comment suggests this might be a temporary solution. Consider adding a ticket or issue number to track it, e.g., "TODO(#123): Refactor when the number of locations grows".
- The comment about coordinate order (lon, lat) is crucial. Consider making it even more prominent, perhaps in a docstring alongside the constant:

```python
satellite_cities = [
    # ... cities data ...
]
"""
List of cities with their coordinates for satellite data processing.
Note: Coordinates are in (longitude, latitude) order to match GEE requirements.
"""
```

These changes would further improve code clarity and maintainability.
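To make the named-tuple suggestion concrete, here is a small runnable sketch with a hypothetical lookup helper (`coords_for` is illustrative and not part of the codebase; only a subset of cities is shown):

```python
from collections import namedtuple

City = namedtuple("City", ["name", "coords"])

# Coordinates are (longitude, latitude), matching GEE conventions
satellite_cities = [
    City("kampala", (32.6313083, 0.336219)),
    City("nairobi", (36.886487, -1.243396)),
]


def coords_for(name):
    """Return the (lon, lat) pair for a known city, or None."""
    for city in satellite_cities:
        if city.name == name:
            return city.coords
    return None
```

A linear scan is fine while the list is small; the TODO about refactoring when locations grow would cover switching to a dict keyed by name.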
src/workflows/dags/satellite_data.py (1)
20-20: Reminder to address the TODO comment

There's a `# TODO` note indicating a need to break down the `fetch_data` task into smaller tasks. Refactoring could enhance modularity and improve readability, especially considering XCom's limitations with data types.

Would you like help in decomposing the `fetch_data` task into smaller, more manageable tasks?

src/workflows/airqo_etl_utils/message_broker_utils.py (5)
Line range hint 21-25: Use logging instead of print statements.

Replacing `print` calls with the `logging` module enhances flexibility and control over log outputs. It also aligns with the logging setup already initialized in the module.

Here is the suggested change:

```diff
 @classmethod
 def __on_success(cls, record_metadata):
-    print("\nSuccessfully sent message")
-    print(f"Topic : {record_metadata.topic}")
-    print(f"Partition : {record_metadata.partition}")
-    print(f"Offset : {record_metadata.offset}")
+    logger.info("Successfully sent message")
+    logger.info(f"Topic: {record_metadata.topic}")
+    logger.info(f"Partition: {record_metadata.partition}")
+    logger.info(f"Offset: {record_metadata.offset}")

 @classmethod
 def __on_error(cls, exception):
-    print("\nFailed to send message")
-    print(exception)
+    logger.error("Failed to send message")
+    logger.exception(exception)
```

Also applies to: 27-29
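A minimal, self-contained sketch of the logger the suggestion assumes, capturing output into a string buffer so the behavior can be verified (the logger name and format are hypothetical):

```python
import io
import logging

# Stand-in for the module-level logger the diff above refers to
logger = logging.getLogger("message_broker_utils_demo")
logger.setLevel(logging.INFO)

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)

logger.info("Successfully sent message")
logger.error("Failed to send message")

captured = stream.getvalue()
```

Unlike `print`, the handler and level can be reconfigured centrally (e.g., routed to Airflow's task logs) without touching the call sites.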
Line range hint 32-32: Question the necessity of writing data to 'message_broker_data.csv'.

The call to `data.to_csv("message_broker_data.csv", index=False)` saves the DataFrame to a CSV file each time data is sent. If this is intended for debugging, consider making it optional or removing it in production to prevent unnecessary I/O operations and potential overwriting of existing files.
Line range hint 40-44: Capture `data.info()` output correctly and use logging.

The `data.info()` method prints directly to stdout and returns `None`. Therefore, `print(data.info())` will print `None`. To capture and log this information, redirect the output to a buffer:

```python
import io

buffer = io.StringIO()
data.info(buf=buffer)
logger.info("DataFrame info:")
logger.info(buffer.getvalue())
logger.info("DataFrame description:")
logger.info(data.describe())
```
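The `buf=` behavior can be checked on a toy frame (the column names here are hypothetical):

```python
import io

import pandas as pd

data = pd.DataFrame({"pm2_5": [12.0, 35.4], "device": ["aq_01", "aq_02"]})

buffer = io.StringIO()
result = data.info(buf=buffer)  # writes the summary into the buffer
info_text = buffer.getvalue()   # the text print(data.info()) would lose
```

`result` is `None`, which is exactly why `print(data.info())` prints `None` while the real summary goes straight to stdout.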
Line range hint 40-44: Replace the remaining `print` calls with logging.

Using the `logging` module instead of `print` gives consistent, configurable output.

Apply this change:

```diff
-print("Dataframe info : ")
-print(data.info())
-print("Dataframe description : ")
-print(data.describe())
+logger.info("DataFrame info:")
+# Capture data.info() via a buffer as suggested in the previous comment
+logger.info("DataFrame description:")
+logger.info(data.describe())
```
Line range hint 55-57: Simplify the condition when setting `current_partition`.

The current condition handling `partition` can be simplified by checking whether `partition` is not `None`, which improves readability:

```diff
 current_partition = (
-    partition
-    if partition or partition == 0
+    partition
+    if partition is not None
     else self.__get_partition(current_partition=current_partition)
 )
```

src/workflows/airqo_etl_utils/satellite_utils.py (2)
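A tiny runnable illustration of why `is not None` is the right check (the helper name and fallback value are hypothetical): partition `0` must be kept as a valid partition, and only a genuinely missing value should route to the fallback, which is what both forms achieve but the `is not None` form states directly.

```python
def resolve_partition(partition, fallback=7):
    # `partition is not None` keeps 0 as a valid partition while
    # sending only a missing value to the fallback; it reads more
    # clearly than `partition or partition == 0`.
    return partition if partition is not None else fallback
```

Compare: with `if partition:` alone, partition `0` would incorrectly fall through to the fallback.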
30-30: Consider making the scale parameter configurable

The hardcoded `scale=1113.2` might not be appropriate for all use cases. As indicated by the TODO comment, it's advisable to review this value and consider making it a parameter or calculating it dynamically based on the area's resolution requirements.

Let me know if you'd like assistance in implementing a more flexible scaling mechanism.
136-139: Simplify column renaming to avoid redundancy

Column names are converted to lowercase twice. You can streamline this by combining the operations into one step:

```diff
-df_fixed.columns = df_fixed.columns.str.lower()
-df_fixed.columns = [
-    c.replace("/", "_").replace(" ", "_").lower() for c in df_fixed.columns
-]
+df_fixed.columns = [
+    c.replace("/", "_").replace(" ", "_").lower() for c in df_fixed.columns
+]
```

Alternatively, using pandas string methods:

```python
df_fixed.columns = df_fixed.columns.str.replace("/", "_").str.replace(" ", "_").str.lower()
```
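The chained string-method variant can be verified on a toy frame (the column names below are hypothetical, chosen to exercise both the `/` and space replacements; `regex=False` makes the intent explicit):

```python
import pandas as pd

df_fixed = pd.DataFrame(columns=["Ozone/O3 Column", "PM 2.5", "Timestamp"])

df_fixed.columns = (
    df_fixed.columns.str.replace("/", "_", regex=False)
    .str.replace(" ", "_", regex=False)
    .str.lower()
)
```

This yields `ozone_o3_column`, `pm_2.5`, and `timestamp` in a single pass, with no redundant lowercasing.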
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (11)
- src/workflows/airqo_etl_utils/airnow_utils.py (1 hunks)
- src/workflows/airqo_etl_utils/bigquery_api.py (1 hunks)
- src/workflows/airqo_etl_utils/constants.py (2 hunks)
- src/workflows/airqo_etl_utils/message_broker_utils.py (1 hunks)
- src/workflows/airqo_etl_utils/satellite_utils.py (1 hunks)
- src/workflows/airqo_etl_utils/setup.py (1 hunks)
- src/workflows/airqo_etl_utils/tests/test_airqo_utils.py (4 hunks)
- src/workflows/dags/airnow.py (0 hunks)
- src/workflows/dags/airqo_bam_measurements.py (0 hunks)
- src/workflows/dags/satellite_data.py (1 hunks)
- src/workflows/dev-requirements.txt (1 hunks)
💤 Files with no reviewable changes (2)
- src/workflows/dags/airnow.py
- src/workflows/dags/airqo_bam_measurements.py
✅ Files skipped from review due to trivial changes (2)
- src/workflows/airqo_etl_utils/airnow_utils.py
- src/workflows/airqo_etl_utils/bigquery_api.py
🧰 Additional context used
🔇 Additional comments (7)
src/workflows/dev-requirements.txt (1)
35-35: Retention of 'sqlalchemy-bigquery' is good, but let's verify the version.

It's great to see that you've retained the `sqlalchemy-bigquery` dependency with a specific version (1.11.0). This helps maintain consistency in the development environment.

However, it might be worth checking whether 1.11.0 is the latest version available. Newer versions often include important bug fixes, security patches, or performance improvements.

Let's verify if there's a newer version available. If there is, consider updating after thoroughly testing for compatibility with your current setup.
✅ Verification successful

Alternative method to verify the latest version of 'sqlalchemy-bigquery'.

The previous attempt to check the latest version of `sqlalchemy-bigquery` failed because `pip` is not available in the current environment. Let's use `curl` and `jq` to fetch and parse the latest version from PyPI.

Update 'sqlalchemy-bigquery' to the latest version.

Great job maintaining the `sqlalchemy-bigquery` dependency. I noticed that version 1.12.0 is now available, which includes important bug fixes and performance improvements. Consider updating to this version to enhance your project's stability and efficiency.

🏁 Scripts executed

The following scripts were executed for the analysis:

```shell
#!/bin/bash
# Description: Check for the latest version of sqlalchemy-bigquery
pip install pip-tools
pip-compile --upgrade-package sqlalchemy-bigquery --dry-run --no-header dev-requirements.txt | grep sqlalchemy-bigquery
```

Length of output: 295

```shell
#!/bin/bash
# Description: Fetch the latest version of sqlalchemy-bigquery from PyPI
curl -s https://pypi.org/pypi/sqlalchemy-bigquery/json | jq -r '.info.version'
```

Length of output: 85
src/workflows/airqo_etl_utils/tests/test_airqo_utils.py (5)
106-117: Excellent formatting improvement for s2_pm2_5 list

The reformatting of the `s2_pm2_5` list, with each value on a separate line, significantly enhances code readability. This change aligns well with Python best practices for handling long lists, making the code more maintainable without altering its functionality. Well done!
189-189: Proper datetime conversion for timestamp column

The explicit conversion of the timestamp column to datetime using `pd.to_datetime()` is a commendable practice. It ensures consistency in data types and enables efficient datetime operations in subsequent data processing steps. Great attention to detail!
190-192: Explicit type casting for device_number column

The explicit casting of the device_number column to int64 is a prudent decision. It ensures data type consistency and can prevent issues that might arise from implicit type conversions in subsequent operations.
231-236: Ensuring consistency in expected_dataframe processing

The parallel processing of the timestamp and device_number columns in the expected_dataframe mirrors the changes made to the input dataframe. This consistency is crucial for an accurate comparison in the `assert_frame_equal` test.
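The pattern of applying identical dtype coercions to both frames before comparison can be sketched as follows (column values are hypothetical; the key point is that `assert_frame_equal` is strict about dtypes, so both sides must be normalized the same way):

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def normalize(df):
    """Apply the same dtype coercions to input and expected frames."""
    df = df.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df["device_number"] = df["device_number"].astype("int64")
    return df


actual = normalize(
    pd.DataFrame({"timestamp": ["2024-01-01"], "device_number": ["42"]})
)
expected = normalize(
    pd.DataFrame({"timestamp": [pd.Timestamp("2024-01-01")], "device_number": [42]})
)
assert_frame_equal(actual, expected)
```

Without the shared `normalize` step, the string-typed columns on one side would fail the dtype check even though the values look equal.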
241-243: Improved readability in method call formatting

The reformatting of the `AirQoDataUtils.extract_aggregated_raw_data` method call, with parameters on separate lines, enhances code readability and aligns with PEP 8 guidelines for formatting long function calls.

src/workflows/airqo_etl_utils/message_broker_utils.py (1)
Line range hint 80-83: Verify reassigning `data["tenant"]` after renaming to `network`.

After renaming the 'tenant' column to 'network', `data["tenant"]` is reassigned. Please confirm whether this is intended, or whether you meant to assign `data["network"]`.

If the intention is to set a default tenant, consider the impact on data consistency.
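A tiny sketch of the pitfall being flagged (the data is hypothetical): assigning to the old column name after a rename silently creates a new column rather than updating the renamed one.

```python
import pandas as pd

data = pd.DataFrame({"tenant": ["airqo"], "pm2_5": [12.0]})
data = data.rename(columns={"tenant": "network"})

# Reassigning data["tenant"] here does NOT touch the "network"
# column; it creates a brand-new "tenant" column alongside it.
data["tenant"] = "airqo"
```

The frame ends up with both `network` and `tenant` columns, which is why the reassignment deserves a second look.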
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
WHAT DOES THIS PR DO?
WHAT ISSUES ARE RELATED TO THIS PR?
HOW DO I TEST OUT THIS PR?
ARE THERE ANY RELATED PRs?
Summary by CodeRabbit
New Features
Bug Fixes
Chores
Added `earthengine-api` and other relevant libraries.