Remove single quotes in values of Ingestion Server's TSV files #4471
Conversation
Force-pushed from d0caa71 to 6e52ba2 (Compare)
I'm able to see the rows I updated getting cleaned, and all the expected logs in the ingestion server locally -- but I don't see the actual file in MinIO 🤔 No errors of any kind. Any ideas what could be going wrong?
    continue
update_field_expressions.append(f"{field} = '{clean_value}'")
Why was this moved? I think this was easier to parse when the comment was before the continue, as well.
The URL fields had single quotes added twice. As you can see above in the file (lines 117 and 119), I removed the quotes from the cleaned value so they won't appear in the file either (single quotes as the quoting character are problematic for later loading into the DB).
I'm also confused about this 😕
Where were the duplicate quotes coming from, exactly? And why does the clean_value for tags not need to be quoted but everything else does?
Maybe avoiding the continue would help the clarity of this block:
for field, clean_value in cleaned_data.items():
    if field != "tags":
        # Save cleaned values for later
        # (except for tags, which take up too much space)
        cleaned_values[field].append((identifier, clean_value))
    update_field_expressions.append(f"{field} = {clean_value}")
Never mind that I don't understand where the difference in the format of the string added to update_field_expressions is coming from; this version is a lot easier to understand, to me.
Where were the duplicate quotes coming from, exactly?
The cleaned values have quotes added in the return of the cleaning function:
openverse/ingestion_server/ingestion_server/cleanup.py
Lines 108 to 111 in d562f47
if tls_supported:
    return f"'https://{url}'"
else:
    return f"'http://{url}'"
And why does the clean_value for tags not need to be quoted but everything else does?
The psycopg2.extras.Json function already adapts the Python object for tags to the SQL json data type, so it doesn't need extra quote wrapping.
The difference in the string added to update_field_expressions is the quotes. I assume this was previously done that way to avoid this confusion, but it's necessary to have the cleaned values without these quotes in the file so they don't interfere with the copy upload to a DB table later.
An alternative would be to perform more string manipulation to remove the quotes before saving the rows in files, but this seemed unnecessary and more prone to error to me.
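To make the quoting issue concrete, here is a minimal, hypothetical sketch (the function name and URL are illustrative, not the module's actual API) of why re-wrapping an already-quoted cleaned value doubles the quotes in the SQL expression:

```python
def clean_url(url: str, tls_supported: bool = True) -> str:
    # Mirrors the cleanup function quoted above: the cleaned value
    # comes back already wrapped in single quotes.
    scheme = "https" if tls_supported else "http"
    return f"'{scheme}://{url}'"

cleaned = clean_url("example.com/image.jpg")

# Re-adding quotes around the cleaned value doubles them in the SQL literal:
doubled = f"url = '{cleaned}'"  # url = ''https://example.com/image.jpg''
# Relying on the quotes already present keeps the literal valid:
single = f"url = {cleaned}"     # url = 'https://example.com/image.jpg'

print(doubled)
print(single)
```

This is why removing the quotes at the source (so they appear neither doubled in SQL nor at all in the TSV files) is the simpler fix.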
Thanks for the explanation! That's fine with me. I don't think it's worth re-working this too much since these steps are going to be removed afterward anyway, but it would be good to expand that comment and move it before the continue to make this more clear, since multiple people found it confusing!
I agree with Staci, but I also don't want to block on it because of this code being deleted soon anyway 🤷
I updated the comment. If it's still not clear, it's open to suggestions. I didn't think it would bring so much confusion 😅
@stacimc That sounds like, potentially, you don't have the bucket created. You can manually create the bucket through MinIO.
Force-pushed from 9fae52d to f22023c (Compare)
Requesting changes to simplify the get_s3_resource function (or even just remove it), clarify further the changes regarding update_field_expressions, and clarify the loop in _upload_to_s3 (specifically the case where the file does not exist).
"""
Locally, it connects to a MinIO instance through its endpoint and test credentials.
On live environments, the connection is allowed via IAM roles.
"""
Suggestion to clarify this comment. I wasn't sure what "it" referred to (it might have been both MinIO and boto3?), and the reason for the difference between local and live wasn't clear (like, why MinIO matters).
Suggested change, from:

"""
Locally, it connects to a MinIO instance through its endpoint and test credentials.
On live environments, the connection is allowed via IAM roles.
"""

to:

"""
Retrieve a correctly configured boto3 S3 resource.
Locally, we use MinIO to emulate S3, so we must specify the endpoint and credentials.
On live environments, S3 itself is used, and authentication works via the instance profile.
"""
Although, overall, I do think this would be clearer if we didn't use default values here, and instead required setting them, and allowed them to be None when not configured in the environment. Then the code wouldn't need to worry about the implementation detail at all. I'll share a suggestion for that code in my next comment.
We use MinIO to emulate and test some AWS operations. In this case, it's the replacement for S3. Otherwise, we would need to use some live resources, which would complicate things a lot and likely incur costs.
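A rough sketch of the suggested approach (the environment variable names and MinIO endpoint here are assumptions for illustration, not the project's actual configuration): build the connection kwargs from the environment and let unset values be None, so the code doesn't branch on local vs. live at all:

```python
import os

def s3_connection_kwargs(env=os.environ):
    # Hypothetical helper: locally these variables point at MinIO with
    # test credentials; in live environments they are unset (None), and
    # boto3 then falls back to the instance profile (IAM role).
    return {
        "endpoint_url": env.get("AWS_S3_ENDPOINT"),
        "aws_access_key_id": env.get("AWS_ACCESS_KEY_ID"),
        "aws_secret_access_key": env.get("AWS_SECRET_ACCESS_KEY"),
    }

# Locally (MinIO emulating S3):
local = s3_connection_kwargs({"AWS_S3_ENDPOINT": "http://localhost:5010"})
# Live (no endpoint override, boto3 uses the IAM role):
live = s3_connection_kwargs({})
```

The resource would then be created with boto3.resource("s3", **kwargs), since boto3 treats None the same as an omitted argument.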
if not file_path.exists():
    continue
In what case is this file missing not an error condition (in other words, why can we just continue here rather than raise an exception)? The intention for this loop would be a lot safer and more explicit if, instead of looping through fields, we only looped through the specific ones we want to upload.
Once we start fixing the rows, we will reach the point where there is nothing to clean, so there is no file to upload in that case, and the tags field won't have a file here either.
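One way to make that expectation explicit, as a hedged sketch (the helper name and directory layout are hypothetical): collect only the per-field TSV files that actually exist, so a missing file reads as "nothing was cleaned for this field" rather than as a silent skip inside the loop:

```python
from pathlib import Path

def files_to_upload(base_dir, fields):
    # Hypothetical helper: a field has no TSV file when no rows needed
    # cleaning for it in this run (and "tags" never writes a file), so
    # missing files are expected rather than an error.
    paths = []
    for field in fields:
        file_path = Path(base_dir) / f"{field}.tsv"
        if file_path.exists():
            paths.append(file_path)
    return paths
```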
Force-pushed from f22023c to b4172e4 (Compare)
@stacimc @sarayourfriend I split the PR into two parts since many things were happening here and it's easier to explain by simplifying it. Here, I left the fix for the quotes in the TSV files, updated the testing instructions, and opened #4529 for the upload to S3. I hope I have answered all your questions. Let me know if anything needs further clarification.
@krysal I'm not sure what's happening, but I'm getting really strange behavior when testing this. Locally in my catalog I updated three different records, removing the protocol from the url on one, the creator_url on the second, and the foreign_landing_url on the third. When I ran the data refresh, only the foreign_landing_url actually got cleaned, and I still am not seeing the file in MinIO even though I see no errors in my ingestion server logs. Edit: I see that you moved the S3 uploading out of this PR, so that part makes sense.
I didn't have much time to look further into this, but is anyone else able to replicate my results?
Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR: @dhruvkb. Excluding weekend days, this PR was ready for review 3 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s). @krysal, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
I wasn't able to replicate Staci's results. I modified an image in the catalog to remove the scheme from all three URLs, and all three have separate files correctly adding the scheme.
I think there are ways to make the intention of the code clearer, but as I said, I am unwilling to block the PR given the code will be gone soon anyway. That said, I would be more firm in requesting clarification comments or reorganisation of that loop if this code were long-term, so if similar code will be introduced elsewhere, I would block on its clarification there. Just wanted to clarify that it's a matter of circumstance, not that I'm otherwise ignoring the issue of the clarity of that loop in this review.
Force-pushed from b4172e4 to 08ab020 (Compare)
@stacimc Did you try doing a full …?
@sarayourfriend Unrelated, but I was about to suggest using … The plain …
The code changes make sense based on the PR description. I followed the testing instructions and verified the correct output in /tmp.
@krysal I did recreate, but if three other people are unable to reproduce, I'm willing to believe something was going awry in my environment. No objections from me 👍
Fixes
Related to #3912 by @krysal
Description
This PR fixes several issues on the Ingestion Server and prepares it to upload the files.
Values were saved wrapped in single quotes ('), which isn't necessary and makes it quite complicated to save the data in a database table later.
Testing Instructions
Make some rows of the image table in the catalog "dirty" by removing the protocol in one or more of the URL fields (url, creator_url, or foreign_landing_url).
Run an image data refresh.
Verify the files are generated and that the content of each file is the identifier and values without quotes, e.g.:
While in our bucket, you can find the manually uploaded files containing rows like the following:
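As an illustration only (the identifier and URL below are made up, not actual bucket contents), rows written as tab-separated values with no quoting character have the expected shape:

```python
import csv
import io

# Write one cleaned row as TSV with no quoting character, matching the
# "identifier and values without quotes" expectation above.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", quoting=csv.QUOTE_NONE)
writer.writerow(
    ["00000000-0000-0000-0000-000000000000", "https://example.com/image.jpg"]
)
print(buf.getvalue())
```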