
Log the full TableMetadata #458

Open
wants to merge 5 commits into base: main

Conversation

andrew4699
Contributor

@andrew4699 andrew4699 commented Nov 19, 2024

Description

Adds more logging. This is useful for determining the types of workloads running on Polaris.

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

  • Added log assertions

Checklist:

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • If adding new functionality, I have discussed my implementation with the community using the linked GitHub issue

@@ -1254,6 +1255,9 @@ public void doRefresh() {
 public void doCommit(TableMetadata base, TableMetadata metadata) {
   LOGGER.debug(
       "doCommit for table {} with base {}, metadata {}", tableIdentifier, base, metadata);
+  LOGGER.info(
+      "doCommit full new metadata: {}",
+      PolarisObjectMapperUtil.serialize(getCurrentPolarisContext(), metadata));
Contributor

This can be pretty big to log at INFO.

Contributor Author

Yeah, I'm worried about both the performance hit of serializing such a large object and the amount of data logged. Some possible options:

  • Move to DEBUG (still performs serialization; does nothing for people who run at INFO+, which should be common)
  • Gate the entire thing behind a featureConfiguration
  • Set a limit on the log output length in featureConfiguration (still performs serialization)

What are your thoughts? I'm leaning toward the second one.
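A minimal sketch of how options 2 and 3 could combine. The flag names here are hypothetical, not Polaris's actual featureConfiguration keys, and the serialization call is stood in by a plain Supplier:

```java
import java.util.function.Supplier;

public class MetadataLogSketch {
    // Hypothetical flags standing in for a Polaris featureConfiguration;
    // the real mechanism and key names would differ.
    static final boolean LOG_FULL_METADATA = true; // option 2: gate the log entirely
    static final int MAX_LOG_LENGTH = 64;          // option 3: cap the logged length

    // Serialization only runs when the gate is open, and the result is truncated
    // so a multi-megabyte metadata blob cannot flood the logs.
    static String renderForLog(Supplier<String> serialize) {
        if (!LOG_FULL_METADATA) {
            return null; // caller skips the log line entirely
        }
        String json = serialize.get();
        return json.length() <= MAX_LOG_LENGTH
                ? json
                : json.substring(0, MAX_LOG_LENGTH) + "... (truncated)";
    }

    public static void main(String[] args) {
        // Simulate a large serialized TableMetadata.
        String out = renderForLog(() -> "x".repeat(200));
        System.out.println(out);
    }
}
```

Note that the truncation variant still pays the full serialization cost; only the gate avoids it entirely.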

Contributor

> Move to DEBUG (still performs serialization, does nothing for people who run in INFO+ which should be common)

Good callout. Man, do I miss Scala.

In this case, perhaps we can do something like:

      LOGGER.info(
          "doCommit full new metadata: {}",
          () -> PolarisObjectMapperUtil.serialize(getCurrentPolarisContext(), metadata));

SLF4J 2.0 supports this, and in general it might be a useful wrapper for us.

Having said all of that, I'm less worried about the serde overhead, since we do so much metadata/property serde anyway, and more worried about emitting a huge blob into the logs, particularly the INFO logs, which should stay high-signal.
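For what it's worth, the lazy evaluation in SLF4J 2.0 lives in the fluent API (`LoggingEventBuilder.addArgument(Supplier)`); as far as I can tell, the classic `info(String, Object...)` overload shown above would not defer the call. The JDK's own `java.util.logging` has the same idea built in, which makes for a self-contained sketch of the deferral being discussed:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LazyLogDemo {
    static final Logger LOGGER = Logger.getLogger(LazyLogDemo.class.getName());
    static final AtomicInteger SERIALIZATIONS = new AtomicInteger();

    // Stand-in for an expensive call like PolarisObjectMapperUtil.serialize.
    static String expensiveSerialize() {
        SERIALIZATIONS.incrementAndGet();
        return "{\"huge\": \"metadata blob\"}";
    }

    public static void main(String[] args) {
        LOGGER.setLevel(Level.INFO); // FINE (debug-level) is disabled

        // Supplier overload: the message is only built if the level is enabled,
        // so the expensive serialization is skipped entirely here.
        Supplier<String> msg = () -> "doCommit full new metadata: " + expensiveSerialize();
        LOGGER.log(Level.FINE, msg);

        System.out.println(SERIALIZATIONS.get()); // prints 0: serialization never ran
    }
}
```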

Contributor Author

Oh, that's nice! I updated it to DEBUG and made it use the supplier syntax.

@andrew4699
Contributor Author

I removed the ReportMetricsRequest logging, since it seems we already log it as a tag with "Invoking CatalogApi with params".

@andrew4699 andrew4699 changed the title Log more things: TableMetadata and ReportMetricsRequest Log the full TableMetadata Nov 20, 2024
Member

@snazy snazy left a comment

Thanks for the PR @andrew4699.

As mentioned in the other comment, this can be a really huge logged string, in the range of multiple megabytes of JSON. I'm not sure whether that's a good idea. My concerns around this are:

  • Increased cost when using a logging SaaS
  • Huge log files
  • Exposing user data in log files

I'd be much in favor of not merging this change, as unconditional logging, even at debug level, will produce a huge amount of log data.

Debug logging is often used to investigate service issues, but excessive debug logging is often not really helpful.

@RussellSpitzer
Member

I would consider this probably a "trace"-level event. I agree with @snazy's comments here. I would also hesitate to log these whole entities for security reasons.

@andrew4699
Contributor Author

andrew4699 commented Nov 21, 2024

@RussellSpitzer @snazy Thank you for the feedback. I'm new to the Iceberg space and appreciate the context you have on these objects. My hope is to make it easier to see the "unstructured" data that gets self-reported by the query engines. Would you feel more comfortable if this was scoped down to printing the last 5 snapshot IDs & summaries?
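A sketch of what that scoped-down form could look like, using a hypothetical SnapshotInfo record rather than Iceberg's actual Snapshot class (whose summary is a map of engine-reported properties, not a single string):

```java
import java.util.List;
import java.util.stream.Collectors;

public class SnapshotLogSketch {
    // Hypothetical stand-in for Iceberg's Snapshot; the real class exposes a
    // snapshot ID and a summary map self-reported by the query engine.
    record SnapshotInfo(long id, String summary) {}

    // Render only the last n snapshots instead of the full metadata blob.
    static String lastSnapshots(List<SnapshotInfo> snapshots, int n) {
        int from = Math.max(0, snapshots.size() - n);
        return snapshots.subList(from, snapshots.size()).stream()
                .map(s -> s.id() + "=" + s.summary())
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        List<SnapshotInfo> snaps = List.of(
                new SnapshotInfo(1, "append"),
                new SnapshotInfo(2, "overwrite"),
                new SnapshotInfo(3, "append"));
        System.out.println(lastSnapshots(snaps, 2)); // 2=overwrite, 3=append
    }
}
```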

@snazy
Member

snazy commented Nov 22, 2024

> @RussellSpitzer @snazy Thank you for the feedback. I'm new to the Iceberg space and appreciate the context you have on these objects. My hope is to make it easier to see the "unstructured" data that gets self-reported by the query engines. Would you feel more comfortable if this was scoped down to printing the last 5 Snapshot ID & summaries?

What's the use case for logging it at all, that cannot be done from an Iceberg client or curl?

@RussellSpitzer
Member

> > @RussellSpitzer @snazy Thank you for the feedback. I'm new to the Iceberg space and appreciate the context you have on these objects. My hope is to make it easier to see the "unstructured" data that gets self-reported by the query engines. Would you feel more comfortable if this was scoped down to printing the last 5 Snapshot ID & summaries?
>
> What's the use case for logging it at all, that cannot be done from an Iceberg client or curl?

I have no problem with the functionality, but I think it probably should be part of an eventing API. We may want to keep the complete history of the table somewhere (possibly not in the Catalog itself).

@andrew4699
Contributor Author

andrew4699 commented Nov 22, 2024

> What's the use case for logging it at all, that cannot be done from an Iceberg client or curl?

This is intended to be server-side so Polaris knows more about its callers.

> I have no problem with the functionality but I think it probably should be part of an eventing api. We may want to keep the complete history of the table somewhere (possibly not in the Catalog itself)

Yes, I think that would be a useful API, and this change could also help the project move in that direction. In some sense these two changes can make the value proposition clearer by first providing low-friction access to this unstructured, self-reported data.
