-
Notifications
You must be signed in to change notification settings - Fork 224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bug-1886021: Implement GCS storage classes for GCP #6572
Conversation
82931c8
to
11f76be
Compare
551ca5f
to
523bab1
Compare
return json.loads(processed_crash_as_string) | ||
|
||
def get(self, **kwargs): | ||
"""Return JSON data of a crash report, given its uuid.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method is copied with no changes from socorro.external.boto.crash_data.SimplifiedCrashData
to make this class is compatible with SimplifiedCrashData
for use in webapp/crashstats/crashstats/models.py
This code duplication could be removed by converting SimplifiedCrashData
to a mixin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Return JSON data of a crash report, given its uuid.""" | |
"""Return JSON data of a crash report, given its uuid.""" | |
# FIXME(relud): This method is used by the webapp API middleware nonsense. It shouldn't | |
# exist here. We should move it to the webapp. |
This get
thing is bad architecture we've had for a while. It's explicitly a webapp API "middleware" thing for HTTP GET requests and we shouldn't have it in the socorro/
tree at all--it should be somewhere in webapp/
.
I think we should go with what you've done here and fix it in the future.
However, we should mark it with a nasty comment because it's a nasty thing.
return json.loads(crash_report_as_str) | ||
|
||
def get(self, **kwargs): | ||
"""Return JSON data of a crash report, given its uuid.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method is copied with no changes from socorro.external.boto.crash_data.TelemetryCrashData
to make this class is compatible with TelemetryCrashData
for use in webapp/crashstats/crashstats/models.py
This code duplication could be removed by converting TelemetryCrashData
to a mixin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like the get
above, this shouldn't be in socorro/
at all. What you have is fine, but we should fix this some day. It might require removing the whole webapp API "middleware" nonsense.
|
||
:returns: generator of keys | ||
|
||
:returns: generator of pages (lists) of object keys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
edited the return type here so it was easier to write a compatible gcs equivalent
bucket_or_name=self.bucket, prefix=prefix, page_size=page_size | ||
), | ||
page_size, | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more_itertools.chunked
is used to pull a page worths of blobs at a time from the iterable returned by the sdk, and doing that efficiently requires a known page size. I didn't see a default page size for gcs, but for s3 i found it was 1000, so i used that here.
worth noting that webapp/crashstats/management/commands/verifyprocessed.py
is the only place using this, and it chunks the page contents by 100 for sending to gcs, so page size should be a mulitple of 100.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't finished the review, but figured I'd submit everything I had so far before disappearing for a few days.
|
||
# remove protocol from destination if present | ||
destination = destination.split("://", 1)[-1] | ||
bucket_name, _, prefix = destination.partition("/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to remember that partition exists. It solves so many problems.
./socorro-cmd gcs upload "${DATADIR}" "${CRASHSTORAGE_GCS_BUCKET}" | ||
./socorro-cmd gcs list_objects "${CRASHSTORAGE_GCS_BUCKET}" | ||
else | ||
./bin/socorro_aws_s3.sh mb "s3://${CRASHSTORAGE_S3_BUCKET}/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hated socorro_aws_s3.sh
so much. It'll be great to see it gone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-upping this sentiment. socorro_aws_s3.sh
is for the birds.
from socorro.schemas import TELEMETRY_SOCORRO_CRASH_SCHEMA | ||
|
||
|
||
class GcsCrashStorage(CrashStorageBase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I probably would have done GCSCrashStorage
, but there's probably lots of variance in the code so there's nothing to be consistent with.
|
||
# We don't know what type of dumps mapping we have. We do know, | ||
# however, that by calling the memory_dump_mapping method, we will get | ||
# a MemoryDumpMapping which is exactly what we need. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a side note, I have no idea what this means or why we have to do this with MemoryDumpMapping
. It's possible I've removed all the bits that required us to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor changes--but generally this is good:
- utility functions in boto probably shouldn't be imported into gcs code--that feels icky
- build_keys should either get copied to the gcs code or moved to a neutral place
- probably should add comments to the
get
methods that came from the crash_data classes - maybe rename
upload_fileobj
toupload
anddownload_fileobj
todownload
in the s3 helper - since we nixed the crash_data classes, we should update the corresponding tests
- test function docstrings for fcs_cli are copied from the first function
Otherwise, this is good and everything I tested with CLOUD_PROVIDER=<unset>
and CLOUD_PROVIDER=GCP
worked fine as far as I could tell.
@gcs_group.command("upload") | ||
@click.argument("source") | ||
@click.argument("destination") | ||
def upload(source, destination): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point, we'll want to implement a recursive download as well, but we don't have to do that in this PR. The recursive upload/download is the only reason I kept the aws cli awfulness around.
./socorro-cmd gcs upload "${DATADIR}" "${CRASHSTORAGE_GCS_BUCKET}" | ||
./socorro-cmd gcs list_objects "${CRASHSTORAGE_GCS_BUCKET}" | ||
else | ||
./bin/socorro_aws_s3.sh mb "s3://${CRASHSTORAGE_S3_BUCKET}/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-upping this sentiment. socorro_aws_s3.sh
is for the birds.
list_to_str, | ||
str_to_list, | ||
) | ||
from socorro.lib import external_common, MissingArgumentError, BadArgumentError, libooid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some day, I'm going to get rid of the fact we call "crash id" by three different names depending on where in the codebase we are. Some day.
return json.loads(processed_crash_as_string) | ||
|
||
def get(self, **kwargs): | ||
"""Return JSON data of a crash report, given its uuid.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Return JSON data of a crash report, given its uuid.""" | |
"""Return JSON data of a crash report, given its uuid.""" | |
# FIXME(relud): This method is used by the webapp API middleware nonsense. It shouldn't | |
# exist here. We should move it to the webapp. |
This get
thing is bad architecture we've had for a while. It's explicitly a webapp API "middleware" thing for HTTP GET requests and we shouldn't have it in the socorro/
tree at all--it should be somewhere in webapp/
.
I think we should go with what you've done here and fix it in the future.
However, we should mark it with a nasty comment because it's a nasty thing.
webapp/crashstats/crashstats/management/commands/verifyprocessed.py
Outdated
Show resolved
Hide resolved
@willkg ready for another pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! The test plan helped a ton. Thank you for that!
return datestamp | ||
|
||
|
||
class JSONISOEncoder(json.JSONEncoder): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have to make this change in this PR, but this might be better in socorro/lib/libdatetime.py
because there's a duplicate definition in webapp/crashstats/crashstats/utils.py
.
We might even be able to replace JsonDTEncoder
. They do slightly different things (one adds a T
), but maybe we don't really need both?
Python 3.10.14 (main, May 6 2024, 10:26:19) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import datetime
>>> now = datetime.datetime.now()
>>> now.strftime("%Y-%m-%d %H:%M:%S.%f")
'2024-05-08 15:59:42.779732'
>>> now.isoformat()
'2024-05-08T15:59:42.779732'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filed https://bugzilla.mozilla.org/show_bug.cgi?id=1895802 to track this
requires #6561 and #6577
test plan
run CI tests locally
make build make lint make test
check output to see that both aws and gcp tests are executed
check aws mode works
separate shell
check gcp mode works
separate shell
expectation:
before sleep: 3 raw crashes and dumps in dev-bucket, nothing in telemetry-bucket
after sleep: 3 processed crashes added to dev-bucket and telemetry-bucket, CrashVerify api returns true for all locations