-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
"Note" non-fatal INTAKE behaviors #3510
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After accidentally observing an uploaded dataset that was marked "archiveonly" not indexed
I can't quite parse that...is it missing a word or two?
I stripped the rather prolific logging of INTAKE to just two, a "pre" log identifying the file and user, and a "post" log providing the summary file system status (adding the size of the tarball).
This seems generally good, although I for some reason I'm sad to lose the file system status from the "pre" message (I don't know why). However, I do have a concern that it won't be sufficiently easy to match the "post" message with its corresponding "pre" message, but perhaps that's just because I haven't seen it in action. Also, I'm wondering about the fact that the "post" message is positioned sort of in the middle of the request.
I debated logging the "notes", but didn't. [...] it's easy enough to do an Audit query on the dataset resource_id even when the client doesn't capture the "notes".
In this instance, I'm not sure that less is more. I agree that we shouldn't be logging every audit record, and I support avoiding overloading the log. However, I don't think it is "easy" to do an audit query, relative to looking at the log in Opensearch. I wonder if we should be splitting the difference: I think there are some notes which are worth logging (like the fact that the Server decided on its own not to index the result) while others can rest in the audit (like projected expiration dates).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems generally good, although I for some reason I'm sad to lose the file system status from the "pre" message (I don't know why). However, I do have a concern that it won't be sufficiently easy to match the "post" message with its corresponding "pre" message, but perhaps that's just because I haven't seen it in action. Also, I'm wondering about the fact that the "post" message is positioned sort of in the middle of the request.
The "post" message comes after everything but the final score keeping: we've settled all the files in their final places. But, yeah, it has sometimes bugged me that I didn't put it later: just not quite enough to move it. There are cases that'll back out after here, so in one sense the "final" disk report is redundant since we'll be freeing that space. At the time, though, I wanted to know that, and I didn't want yet another "mid" report. Logically, it might make more sense to move this to where I do the Sync
and the final audit log; but I'm reluctant to either add another log here or to leave it unlogged. Hmm...
I debated logging the "notes", but didn't. [...] it's easy enough to do an Audit query on the dataset resource_id even when the client doesn't capture the "notes".
In this instance, I'm not sure that less is more. I agree that we shouldn't be logging every audit record, and I support avoiding overloading the log. However, I don't think it is "easy" to do an audit query, relative to looking at the log in Opensearch. I wonder if we should be splitting the difference: I think there are some notes which are worth logging (like the fact that the Server decided on its own not to index the result) while others can rest in the audit (like projected expiration dates).
I guess I'll split the difference and add a warning
log on the no-metadata path to clearly mark it.
PBENCH-1231 After accidentally observing an uploaded dataset that was marked "archiveonly" not indexed (because the file name didn't match the result directory name), we decided that the intake process should make the reasons more obvious. I've handled this by adding "notes", which are recorded both in the successful JSON response payload and in the finalized audit record. While I was at it, I exposed our decisions regarding the canonical benchmarked workload name and the computed expiration date. While doing this, I stripped the rather prolific logging of `INTAKE` to just two, a "pre" log identifying the file and user, and a "post" log providing the summary file system status (adding the size of the tarball). I debated logging the "notes", but didn't. On one hand, it would make the info more easily visible; on another it restores verbosity, and the notes strings are going to be unwieldy in the log. I've also several times considered that we might log every audit record, which would include attributes... but might be hard to read in the linear log format. In any case, it's easy enough to do an Audit query on the dataset resource_id even when the client doesn't capture the "notes".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
PBENCH-1231
After accidentally observing an uploaded dataset that was marked "archiveonly" not indexed (because the file name didn't match the result directory name), we decided that the intake process should make the reasons more obvious.
I've handled this by adding "notes", which are recorded both in the successful JSON response payload and in the finalized audit record. While I was at it, I exposed our decisions regarding the canonical benchmarked workload name and the computed expiration date.
While doing this, I stripped the rather prolific logging of
INTAKE
to just two, a "pre" log identifying the file and user, and a "post" log providing the summary file system status (adding the size of the tarball).I debated logging the "notes", but didn't. On one hand, it would make the info more easily visible; on another it restores verbosity, and the notes strings are going to be unwieldy in the log. I've also several times considered that we might log every audit record, which would include attributes... but might be hard to read in the linear log format.
In any case, it's easy enough to do an Audit query on the dataset resource_id even when the client doesn't capture the "notes".