Log the full TableMetadata #458
polaris-service/src/test/java/org/apache/polaris/service/PolarisApplicationIntegrationTest.java
@@ -1254,6 +1255,9 @@ public void doRefresh() {
     public void doCommit(TableMetadata base, TableMetadata metadata) {
       LOGGER.debug(
           "doCommit for table {} with base {}, metadata {}", tableIdentifier, base, metadata);
+      LOGGER.info(
+          "doCommit full new metadata: {}",
+          PolarisObjectMapperUtil.serialize(getCurrentPolarisContext(), metadata));
This can be pretty big to log at INFO.
Yeah, I'm worried about both the performance hit of serializing such a large object and the amount of data logged. Some possible options:
- Move to DEBUG (still performs serialization, does nothing for people who run in INFO+ which should be common)
- Gate the entire thing behind a featureConfiguration
- Set a limit on the log output length in featureConfiguration (still performs serialization)
What are your thoughts? I'm leaning toward the 2nd one.
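For illustration, options 2 and 3 could be combined roughly as follows. This is only a sketch: the `logFullMetadata` flag, the `maxLogChars` limit, and the `GatedMetadataLog` class are hypothetical names, not existing Polaris configuration, and a plain `Supplier<String>` stands in for the real serializer.

```java
import java.util.Optional;
import java.util.function.Supplier;

public class GatedMetadataLog {
    /**
     * Returns the text to log, or empty when the (hypothetical) feature flag is off.
     * The serializer is invoked only when logging is enabled.
     */
    public static Optional<String> renderForLog(
            boolean logFullMetadata, int maxLogChars, Supplier<String> serializer) {
        if (!logFullMetadata) {
            return Optional.empty(); // option 2: no serialization cost at all
        }
        String json = serializer.get(); // option 3 still pays the serialization cost
        if (json.length() <= maxLogChars) {
            return Optional.of(json);
        }
        int dropped = json.length() - maxLogChars;
        return Optional.of(
            json.substring(0, maxLogChars) + "... [" + dropped + " chars truncated]");
    }

    public static void main(String[] args) {
        // Placeholder JSON, not real TableMetadata output.
        String fakeMetadataJson = "{\"format-version\":2,\"snapshots\":[]}";
        renderForLog(true, 20, () -> fakeMetadataJson)
            .ifPresent(text -> System.out.println("doCommit full new metadata: " + text));
    }
}
```

With the flag off, the serializer is never called, which addresses the performance concern; the length limit only bounds the log volume.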
> Move to DEBUG (still performs serialization, does nothing for people who run in INFO+ which should be common)
Good callout. Man, do I miss Scala.
In this case, perhaps we can do something like:
LOGGER
    .atInfo()
    .setMessage("doCommit full new metadata: {}")
    .addArgument(() -> PolarisObjectMapperUtil.serialize(getCurrentPolarisContext(), metadata))
    .log();
SLF4J 2.0 supports this through its fluent API, and in general it might be a useful wrapper for us.
Having said all of that, I'm less worried about the serde overhead, since we do so much metadata/property serde anyway, and more about emitting a huge blob into the logs, particularly the INFO logs, which should stay high-signal.
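To make the lazy-evaluation point concrete, here is a small self-contained sketch. It uses `java.util.logging` as a stand-in (its `Logger.fine(Supplier<String>)` overload behaves the same way in this respect) so that it runs without SLF4J on the classpath; `expensiveSerialize` and `LazyLogSketch` are hypothetical names, not Polaris code.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LazyLogSketch {
    // Counts how often the (stand-in) serializer actually runs.
    public static final AtomicInteger SERIALIZE_CALLS = new AtomicInteger();

    // Hypothetical stand-in for an expensive metadata serialization.
    static String expensiveSerialize(Object metadata) {
        SERIALIZE_CALLS.incrementAndGet();
        return String.valueOf(metadata);
    }

    /** Logs at FINE (debug) with a supplier and reports how many times serialization ran. */
    public static int callsWhenLoggerLevelIs(Level loggerLevel) {
        Logger logger = Logger.getLogger("LazyLogSketch");
        logger.setUseParentHandlers(false); // keep the sketch silent
        logger.setLevel(loggerLevel);
        SERIALIZE_CALLS.set(0);
        // The lambda is evaluated only if FINE is enabled for this logger.
        logger.fine(() -> "doCommit full new metadata: " + expensiveSerialize("metadata-json"));
        return SERIALIZE_CALLS.get();
    }

    public static void main(String[] args) {
        System.out.println("INFO level: " + callsWhenLoggerLevelIs(Level.INFO) + " serialization(s)");
        System.out.println("FINE level: " + callsWhenLoggerLevelIs(Level.FINE) + " serialization(s)");
    }
}
```

The supplier removes the serialization cost when the level is disabled, but it does nothing about log volume once the level is enabled.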
Oh that's nice! I updated it to DEBUG and made it use the supplier syntax.
I removed the ReportMetricsRequest logging as it seems like we already log it as a tag with
Thanks for the PR @andrew4699.
As mentioned in the other comment, this can be a really huge logged string in the range of multiple megabytes of JSON. I'm not sure whether that's a good idea. My concerns around this are:
- Increased cost when using a logging SaaS
- Huge log files
- Exposing user data in log files
I'd much prefer not merging this change, as unconditional logging, even at debug level, will produce a huge amount of log data.
Debug logging is often used to investigate service issues, but excessive debug logging is often not really helpful.
I would consider this probably a "trace" level event. I agree with @snazy's comments here. I would hesitate to log these whole entities as well for security reasons.
@RussellSpitzer @snazy Thank you for the feedback. I'm new to the Iceberg space and appreciate the context you have on these objects. My hope is to make it easier to see the "unstructured" data that gets self-reported by the query engines. Would you feel more comfortable if this was scoped down to printing the IDs and summaries of the last 5 snapshots?
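As a rough sketch of what that scoped-down output could look like: the `SnapshotInfo` class below is a hypothetical stand-in holding only an ID and a summary map, not the Iceberg `Snapshot` interface, and the output format is just one possibility.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SnapshotSummarySketch {
    /** Hypothetical stand-in for the snapshot fields of interest. */
    public static final class SnapshotInfo {
        final long id;
        final Map<String, String> summary;

        public SnapshotInfo(long id, Map<String, String> summary) {
            this.id = id;
            this.summary = summary;
        }
    }

    /** Renders only the last `limit` snapshots (assumed oldest-first) as "id=summary" pairs. */
    public static String describeLast(List<SnapshotInfo> snapshots, int limit) {
        int from = Math.max(0, snapshots.size() - limit);
        return snapshots.subList(from, snapshots.size()).stream()
            .map(s -> s.id + "=" + s.summary)
            .collect(Collectors.joining(", ", "[", "]"));
    }

    public static void main(String[] args) {
        List<SnapshotInfo> snapshots = List.of(
            new SnapshotInfo(1L, Map.of("operation", "append")),
            new SnapshotInfo(2L, Map.of("operation", "overwrite")));
        System.out.println("doCommit recent snapshots: " + describeLast(snapshots, 5));
    }
}
```

This bounds the log size by the snapshot count rather than by the full metadata, though summaries can still carry engine-reported properties that may need review before logging.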
What's the use case for logging it at all, that cannot be done from an Iceberg client or
I have no problem with the functionality, but I think it should probably be part of an eventing API. We may want to keep the complete history of the table somewhere (possibly not in the Catalog itself).
This is intended to be server-side so Polaris knows more about its callers.
Yes, I think that would be a useful API and this change could also help the project move in that direction. In some sense these 2 changes can make the value proposition clearer by first providing low-friction access to this unstructured, self-reported data.
Description
Adds more logs. This is useful for determining the types of workloads happening on Polaris.