Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements to the logging logic #3137

Closed
wants to merge 0 commits into from
Closed

Improvements to the logging logic #3137

wants to merge 0 commits into from

Conversation

apyrgio
Copy link
Contributor

@apyrgio apyrgio commented Feb 6, 2024

This PR makes the following improvements to the logging logic of the fitz module:

  1. Removes some commented-out print() / log() call, which are no longer necessary.
  2. Logs by default to stderr.
  3. Replaces any print() / sys.stdout.write() / sys.stderr.write() / PySys_Write* calls with just our log() helper.
  4. Fixes a bug that unexpectedly made sys.stdout point to sys.stderr.

Fixes #3135

Copy link
Contributor

github-actions bot commented Feb 6, 2024

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

@apyrgio
Copy link
Contributor Author

apyrgio commented Feb 6, 2024

I have read the CLA Document and I hereby sign the CLA

github-actions bot added a commit that referenced this pull request Feb 6, 2024
@apyrgio
Copy link
Contributor Author

apyrgio commented Feb 7, 2024

I see that the last error that the CI test fails with is:

  *** Error in `/tmp/tmp.mJ3LG94vml/venv/bin/python': corrupted double-linked list: 0x0000000003d911e0 ***

This looks unrelated to my PR, but it's worth checking it locally as well. Thanks for your patience, I'll see that the tests pass locally and I'll update this PR.

In the meantime, it would help if you could validate the original assumption of this PR, that logging/warnings should be logged only to stderr. I'm asking because there may be historic reasons why this is not the case, and it's good to know before diving in.

@julian-smith-artifex-com
Copy link
Collaborator

Thanks for this.

I see that the last error that the CI test fails with is:

  *** Error in `/tmp/tmp.mJ3LG94vml/venv/bin/python': corrupted double-linked list: 0x0000000003d911e0 ***

This looks unrelated to my PR, but it's worth checking it locally as well. Thanks for your patience, I'll see that the tests pass locally and I'll update this PR.

Apologies for this, it looks like there's a recent regression in MuPDF master that is breaking PyMuPDF tests. We're investigating things at the moment.

In the meantime, it would help if you could validate the original assumption of this PR, that logging/warnings should be logged only to stderr. I'm asking because there may be historic reasons why this is not the case, and it's good to know before diving in.

There are some subtleties in the way we output diagnostics with print() and log(). Basically print() is for user messages, and log() is for debugging and internal development. This is why log() always prefixes with file:line:fnname, which is useful for development, but generally not appropriate for user messages. [Having said that, removing the handful of calls to PySys_Write*() would make sense.]

So i think we should retain the current distinction between print() and log().

Some other comments:

  • Regarding white space in blank lines, personally i like to retain them, but i suspect i'm in a tiny minority here!
  • Commented-out log() invocations and print() calls can be useful when developing and investigating bugs, as they are already set up ready to show useful info. However we should probably not have commented-out print() calls - if they could be useful, they should be changed to commented-out calls to log().
  • PyMuPDF is not used in pipelines so i think the stdout/stderr distinction doesn't actually matter. Personally in this case i prefer to use stdout for everything, to simplify sending all output to a file (directly or with tee). And maybe more importantly it avoids buffering/flushing problems that can result in unexpected ordering in the output; this can be really confusing when debugging.

@apyrgio
Copy link
Contributor Author

apyrgio commented Feb 8, 2024

Apologies for this, it looks like there's a recent regression in MuPDF master that is breaking PyMuPDF tests. We're investigating things at the moment.

Cool, thanks for letting me know 🙂

There are some subtleties in the way we output diagnostics with print() and log(). Basically print() is for user messages, and log() is for debugging and internal development. [...]
So i think we should retain the current distinction between print() and log().

Got it, I had the same suspicion as well.

PyMuPDF is not used in pipelines so i think the stdout/stderr distinction doesn't actually matter.

Oh, is this a conscious decision of the project, or is it because there was never a use case for it? In our case (Dangerzone), we do use PyMuPDF as part of a pipeline. We convert a document to pixels in one container, print those pixels to stdout, and then we reconstruct the document in a second container from these pixels. We do this for PDF sanitization.

Personally in this case i prefer to use stdout for everything, to simplify sending all output to a file (directly or with tee). And maybe more importantly it avoids buffering/flushing problems that can result in unexpected ordering in the output; this can be really confusing when debugging.

I definitely agree with having a single stream for these messages, and I've experienced the buffering problems you describe when this is not the case. That's why in this PR, I tried to switch everything to one stream, stderr, which was used already in some cases.

However, I'm not particularly hung up on the way maintainers prefer to print debug logs. The actual reason I'm sending this PR is because a certain print statement is currently interleaved with our own messages:

log('using mupdf.python_buffer_data()')

So, I can close this PR and send a less opinionated one, that simply:

  • Comments out that particular log line
  • Fixes the unexpected stdout -> stderr redirection (70b069b)

Would this be more preferable?

deeplow added a commit to freedomofpress/dangerzone that referenced this pull request Feb 12, 2024
PyMuPDF 1.23.9 made the swapped the new fitz implementation (fitz_new)
with the fitz module. In the new module there are prints in the code
that interfere with our stderror for sending JSON from the container.
Pinning the version seems to have no adverse consequences, since
fitz_old hasn't had significant changes and it gives breething room for
the print-related issue to be tackled in PR [2].

Fixes temporarily #700

[1]: #700 (comment)
[2]: pymupdf/PyMuPDF#3137
deeplow added a commit to freedomofpress/dangerzone that referenced this pull request Feb 12, 2024
PyMuPDF 1.23.9 made the swapped the new fitz implementation (fitz_new)
with the fitz module. In the new module there are prints in the code
that interfere with our stderror for sending JSON from the container.
Pinning the version seems to have no adverse consequences [1], since
fitz_old hasn't had significant changes and it gives breething room for
the print-related issue to be tackled in PR [2].

Fixes temporarily #700

[1]: #700 (comment)
[2]: pymupdf/PyMuPDF#3137
deeplow added a commit to freedomofpress/dangerzone that referenced this pull request Feb 12, 2024
PyMuPDF 1.23.9 swapped the new fitz implementation (fitz_new)
with the fitz module. In the new module there are prints in the code
that interfere with our stdout for sending JSON from the container.
Pinning the version seems to have no adverse consequences [1], since
fitz_old hasn't had significant changes and it gives breathing room for
the print-related issue to be tackled in PR [2].

Fixes temporarily #700

[1]: #700 (comment)
[2]: pymupdf/PyMuPDF#3137
@apyrgio
Copy link
Contributor Author

apyrgio commented Mar 4, 2024

So, I can close this PR and send a less opinionated one, that simply:

  • Comments out that particular log line
  • Fixes the unexpected stdout -> stderr redirection (70b069b)

Would this be more preferable?

Kind reminder about the above. Would you prefer to close this PR and send a smaller one, with the above points fixed?

@julian-smith-artifex-com
Copy link
Collaborator

Apologies for the delay in replying, and thanks for the reminder and your understanding.

I think your suggestion of a simplified PR would be great at this stage.

I'll discuss internally about your use of PyMuPDF in a pipeline. If you're doing it, it needs to be supported. Perhaps we could control where logs/diagnostics go using environment variables, as i can't see a single solution that will address the different concerns that we've been discussing here.

Thanks.

apyrgio added a commit to freedomofpress/dangerzone that referenced this pull request Mar 5, 2024
PyMuPDF has some hardcoded log messages that print to stdout [1]. We don't
have a way to silence them, because they don't use the Python logging
infrastructure.

What we can do here is silence a particular call that's been creating
debug messages. For a long term solution, we have sent a PR to the
PyMuPDF team, and we will follow up there [2].

Fixes #700

[1]: #700
[2]: pymupdf/PyMuPDF#3137
@apyrgio apyrgio closed this Mar 7, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Mar 7, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Do not log warnings to stdout
2 participants