Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug-1886021: Implement GCS storage classes for GCP #6572

Merged
merged 11 commits into from
May 8, 2024
Merged

bug-1886021: Implement GCS storage classes for GCP #6572

merged 11 commits into from
May 8, 2024

Conversation

relud
Copy link
Member

@relud relud commented Apr 3, 2024

requires #6561 and #6577

test plan

run CI tests locally

make build
make lint
make test

check output to see that both aws and gcp tests are executed

check aws mode works

echo CLOUD_PROVIDER=AWS > my.env
make runservices
make setup
make run

separate shell

make shell
./bin/socorro_aws_s3.sh ls --recursive telemetry-bucket
./socorro-cmd fetch_crashids --num=3 | tee crashids | ./bin/process_crashes.sh
sleep 10
./bin/socorro_aws_s3.sh ls --recursive dev-bucket
./bin/socorro_aws_s3.sh ls --recursive telemetry-bucket
cat crashids | xargs -i bash -c 'curl -s webapp:8000/api/CrashVerify/?crash_id={}; echo'

check gcp mode works

echo CLOUD_PROVIDER=GCP > my.env
make runservices
make setup
make run

separate shell

make shell
./socorro-cmd gcs list_objects telemetry-bucket
./socorro-cmd fetch_crashids --num=3 | tee crashids | ./bin/process_crashes.sh
sleep 10
./socorro-cmd gcs list_objects dev-bucket
./socorro-cmd gcs list_objects telemetry-bucket
cat crashids | xargs -i bash -c 'curl -s webapp:8000/api/CrashVerify/?crash_id={}; echo'

expectation:

before sleep: 3 raw crashes and dumps in dev-bucket, nothing in telemetry-bucket
after sleep: 3 processed crashes added to dev-bucket and telemetry-bucket, CrashVerify api returns true for all locations

@relud relud force-pushed the gcs-storage branch 10 times, most recently from 82931c8 to 11f76be Compare April 9, 2024 21:40
@relud relud force-pushed the gcs-storage branch 2 times, most recently from 551ca5f to 523bab1 Compare April 16, 2024 17:07
@relud relud marked this pull request as ready for review April 16, 2024 18:02
@relud relud requested a review from a team as a code owner April 16, 2024 18:02
return json.loads(processed_crash_as_string)

def get(self, **kwargs):
"""Return JSON data of a crash report, given its uuid."""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method is copied with no changes from socorro.external.boto.crash_data.SimplifiedCrashData to make this class is compatible with SimplifiedCrashData for use in webapp/crashstats/crashstats/models.py

This code duplication could be removed by converting SimplifiedCrashData to a mixin.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Return JSON data of a crash report, given its uuid."""
"""Return JSON data of a crash report, given its uuid."""
# FIXME(relud): This method is used by the webapp API middleware nonsense. It shouldn't
# exist here. We should move it to the webapp.

This get thing is bad architecture we've had for a while. It's explicitly a webapp API "middleware" thing for HTTP GET requests and we shouldn't have it in the socorro/ tree at all--it should be somewhere in webapp/.

I think we should go with what you've done here and fix it in the future.

However, we should mark it with a nasty comment because it's a nasty thing.

return json.loads(crash_report_as_str)

def get(self, **kwargs):
"""Return JSON data of a crash report, given its uuid."""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method is copied with no changes from socorro.external.boto.crash_data.TelemetryCrashData to make this class is compatible with TelemetryCrashData for use in webapp/crashstats/crashstats/models.py

This code duplication could be removed by converting TelemetryCrashData to a mixin.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like the get above, this shouldn't be in socorro/ at all. What you have is fine, but we should fix this some day. It might require removing the whole webapp API "middleware" nonsense.


:returns: generator of keys

:returns: generator of pages (lists) of object keys
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edited the return type here so it was easier to write a compatible gcs equivalent

bucket_or_name=self.bucket, prefix=prefix, page_size=page_size
),
page_size,
):
Copy link
Member Author

@relud relud Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more_itertools.chunked is used to pull a page worths of blobs at a time from the iterable returned by the sdk, and doing that efficiently requires a known page size. I didn't see a default page size for gcs, but for s3 i found it was 1000, so i used that here.

worth noting that webapp/crashstats/management/commands/verifyprocessed.py is the only place using this, and it chunks the page contents by 100 for sending to gcs, so page size should be a mulitple of 100.

@relud relud requested a review from willkg April 16, 2024 18:03
Copy link
Contributor

@willkg willkg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't finished the review, but figured I'd submit everything I had so far before disappearing for a few days.


# remove protocol from destination if present
destination = destination.split("://", 1)[-1]
bucket_name, _, prefix = destination.partition("/")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to remember that partition exists. It solves so many problems.

bin/gcs_cli.py Outdated Show resolved Hide resolved
bin/gcs_cli.py Outdated Show resolved Hide resolved
bin/gcs_cli.py Outdated Show resolved Hide resolved
bin/gcs_cli.py Outdated Show resolved Hide resolved
bin/gcs_cli.py Outdated Show resolved Hide resolved
./socorro-cmd gcs upload "${DATADIR}" "${CRASHSTORAGE_GCS_BUCKET}"
./socorro-cmd gcs list_objects "${CRASHSTORAGE_GCS_BUCKET}"
else
./bin/socorro_aws_s3.sh mb "s3://${CRASHSTORAGE_S3_BUCKET}/"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hated socorro_aws_s3.sh so much. It'll be great to see it gone.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-upping this sentiment. socorro_aws_s3.sh is for the birds.

from socorro.schemas import TELEMETRY_SOCORRO_CRASH_SCHEMA


class GcsCrashStorage(CrashStorageBase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably would have done GCSCrashStorage, but there's probably lots of variance in the code so there's nothing to be consistent with.

socorro/external/gcs/crashstorage.py Outdated Show resolved Hide resolved

# We don't know what type of dumps mapping we have. We do know,
# however, that by calling the memory_dump_mapping method, we will get
# a MemoryDumpMapping which is exactly what we need.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a side note, I have no idea what this means or why we have to do this with MemoryDumpMapping. It's possible I've removed all the bits that required us to do this.

@relud relud requested a review from willkg April 17, 2024 22:12
Copy link
Contributor

@willkg willkg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor changes--but generally this is good:

  • utility functions in boto probably shouldn't be imported into gcs code--that feels icky
  • build_keys should either get copied to the gcs code or moved to a neutral place
  • probably should add comments to the get methods that came from the crash_data classes
  • maybe rename upload_fileobj to upload and download_fileobj to download in the s3 helper
  • since we nixed the crash_data classes, we should update the corresponding tests
  • test function docstrings for fcs_cli are copied from the first function

Otherwise, this is good and everything I tested with CLOUD_PROVIDER=<unset> and CLOUD_PROVIDER=GCP worked fine as far as I could tell.

@gcs_group.command("upload")
@click.argument("source")
@click.argument("destination")
def upload(source, destination):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point, we'll want to implement a recursive download as well, but we don't have to do that in this PR. The recursive upload/download is the only reason I kept the aws cli awfulness around.

./socorro-cmd gcs upload "${DATADIR}" "${CRASHSTORAGE_GCS_BUCKET}"
./socorro-cmd gcs list_objects "${CRASHSTORAGE_GCS_BUCKET}"
else
./bin/socorro_aws_s3.sh mb "s3://${CRASHSTORAGE_S3_BUCKET}/"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-upping this sentiment. socorro_aws_s3.sh is for the birds.

socorro/external/gcs/crashstorage.py Outdated Show resolved Hide resolved
list_to_str,
str_to_list,
)
from socorro.lib import external_common, MissingArgumentError, BadArgumentError, libooid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some day, I'm going to get rid of the fact we call "crash id" by three different names depending on where in the codebase we are. Some day.

return json.loads(processed_crash_as_string)

def get(self, **kwargs):
"""Return JSON data of a crash report, given its uuid."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Return JSON data of a crash report, given its uuid."""
"""Return JSON data of a crash report, given its uuid."""
# FIXME(relud): This method is used by the webapp API middleware nonsense. It shouldn't
# exist here. We should move it to the webapp.

This get thing is bad architecture we've had for a while. It's explicitly a webapp API "middleware" thing for HTTP GET requests and we shouldn't have it in the socorro/ tree at all--it should be somewhere in webapp/.

I think we should go with what you've done here and fix it in the future.

However, we should mark it with a nasty comment because it's a nasty thing.

socorro/tests/external/gcs/test_crashstorage.py Outdated Show resolved Hide resolved
socorro/tests/external/gcs/test_crashstorage.py Outdated Show resolved Hide resolved
socorro/tests/test_gcs_cli.py Outdated Show resolved Hide resolved
@relud relud requested a review from willkg May 7, 2024 21:16
@relud
Copy link
Member Author

relud commented May 7, 2024

@willkg ready for another pass

Copy link
Contributor

@willkg willkg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! The test plan helped a ton. Thank you for that!

return datestamp


class JSONISOEncoder(json.JSONEncoder):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have to make this change in this PR, but this might be better in socorro/lib/libdatetime.py because there's a duplicate definition in webapp/crashstats/crashstats/utils.py.

We might even be able to replace JsonDTEncoder. They do slightly different things (one adds a T), but maybe we don't really need both?

Python 3.10.14 (main, May  6 2024, 10:26:19) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import datetime
>>> now = datetime.datetime.now()
>>> now.strftime("%Y-%m-%d %H:%M:%S.%f")
'2024-05-08 15:59:42.779732'
>>> now.isoformat()
'2024-05-08T15:59:42.779732'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@relud relud merged commit 6dc30d2 into main May 8, 2024
2 checks passed
@relud relud deleted the gcs-storage branch May 8, 2024 21:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants