Improvements to the logging logic #3137

apyrgio · 2024-02-06T21:17:43Z

This PR makes the following improvements to the logging logic of the fitz module:

Removes some commented-out print() / log() call, which are no longer necessary.
Logs by default to stderr.
Replaces any print() / sys.stdout.write() / sys.stderr.write() / PySys_Write* calls with just our log() helper.
Fixes a bug that unexpectedly made sys.stdout point to sys.stderr.

Fixes #3135

github-actions · 2024-02-06T21:18:00Z

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

apyrgio · 2024-02-06T21:18:51Z

I have read the CLA Document and I hereby sign the CLA

apyrgio · 2024-02-07T14:33:25Z

I see that the last error that the CI test fails with is:

  *** Error in `/tmp/tmp.mJ3LG94vml/venv/bin/python': corrupted double-linked list: 0x0000000003d911e0 ***

This looks unrelated to my PR, but it's worth checking it locally as well. Thanks for your patience, I'll see that the tests pass locally and I'll update this PR.

In the meantime, it would help if you could validate the original assumption of this PR, that logging/warnings should be logged only to stderr. I'm asking because there may be historic reasons why this is not the case, and it's good to know before diving in.

julian-smith-artifex-com · 2024-02-07T22:08:27Z

Thanks for this.

I see that the last error that the CI test fails with is:
  *** Error in `/tmp/tmp.mJ3LG94vml/venv/bin/python': corrupted double-linked list: 0x0000000003d911e0 ***
This looks unrelated to my PR, but it's worth checking it locally as well. Thanks for your patience, I'll see that the tests pass locally and I'll update this PR.

Apologies for this, it looks like there's a recent regression in MuPDF master that is breaking PyMuPDF tests. We're investigating things at the moment.

In the meantime, it would help if you could validate the original assumption of this PR, that logging/warnings should be logged only to stderr. I'm asking because there may be historic reasons why this is not the case, and it's good to know before diving in.

There are some subtleties in the way we output diagnostics with print() and log(). Basically print() is for user messages, and log() is for debugging and internal development. This is why log() always prefixes with file:line:fnname, which is useful for development, but generally not appropriate for user messages. [Having said that, removing the handful of calls to PySys_Write*() would make sense.]

So i think we should retain the current distinction between print() and log().

Some other comments:

Regarding white space in blank lines, personally i like to retain them, but i suspect i'm in a tiny minority here!
Commented-out log() invocations and print() calls can be useful when developing and investigating bugs, as they are already set up ready to show useful info. However we should probably not have commented-out print() calls - if they could be useful, they should be changed to commented-out calls to log().
PyMuPDF is not used in pipelines so i think the stdout/stderr distinction doesn't actually matter. Personally in this case i prefer to use stdout for everything, to simplify sending all output to a file (directly or with tee). And maybe more importantly it avoids buffering/flushing problems that can result in unexpected ordering in the output; this can be really confusing when debugging.

apyrgio · 2024-02-08T20:34:39Z

Apologies for this, it looks like there's a recent regression in MuPDF master that is breaking PyMuPDF tests. We're investigating things at the moment.

Cool, thanks for letting me know 🙂

There are some subtleties in the way we output diagnostics with print() and log(). Basically print() is for user messages, and log() is for debugging and internal development. [...]
So i think we should retain the current distinction between print() and log().

Got it, I had the same suspicion as well.

PyMuPDF is not used in pipelines so i think the stdout/stderr distinction doesn't actually matter.

Oh, is this a conscious decision of the project, or is it because there was never a use case for it? In our case (Dangerzone), we do use PyMuPDF as part of a pipeline. We convert a document to pixels in one container, print those pixels to stdout, and then we reconstruct the document in a second container from these pixels. We do this for PDF sanitization.

Personally in this case i prefer to use stdout for everything, to simplify sending all output to a file (directly or with tee). And maybe more importantly it avoids buffering/flushing problems that can result in unexpected ordering in the output; this can be really confusing when debugging.

I definitely agree with having a single stream for these messages, and I've experienced the buffering problems you describe when this is not the case. That's why in this PR, I tried to switch everything to one stream, stderr, which was used already in some cases.

However, I'm not particularly hung up on the way maintainers prefer to print debug logs. The actual reason I'm sending this PR is because a certain print statement is currently interleaved with our own messages:

PyMuPDF/src/__init__.py

Line 9636 in d20c4e7

log('using mupdf.python_buffer_data()')

So, I can close this PR and send a less opinionated one, that simply:

Comments out that particular log line
Fixes the unexpected stdout -> stderr redirection (70b069b)

Would this be more preferable?

PyMuPDF 1.23.9 made the swapped the new fitz implementation (fitz_new) with the fitz module. In the new module there are prints in the code that interfere with our stderror for sending JSON from the container. Pinning the version seems to have no adverse consequences, since fitz_old hasn't had significant changes and it gives breething room for the print-related issue to be tackled in PR [2]. Fixes temporarily #700 [1]: #700 (comment) [2]: pymupdf/PyMuPDF#3137

PyMuPDF 1.23.9 made the swapped the new fitz implementation (fitz_new) with the fitz module. In the new module there are prints in the code that interfere with our stderror for sending JSON from the container. Pinning the version seems to have no adverse consequences [1], since fitz_old hasn't had significant changes and it gives breething room for the print-related issue to be tackled in PR [2]. Fixes temporarily #700 [1]: #700 (comment) [2]: pymupdf/PyMuPDF#3137

PyMuPDF 1.23.9 swapped the new fitz implementation (fitz_new) with the fitz module. In the new module there are prints in the code that interfere with our stdout for sending JSON from the container. Pinning the version seems to have no adverse consequences [1], since fitz_old hasn't had significant changes and it gives breathing room for the print-related issue to be tackled in PR [2]. Fixes temporarily #700 [1]: #700 (comment) [2]: pymupdf/PyMuPDF#3137

apyrgio · 2024-03-04T13:36:22Z

So, I can close this PR and send a less opinionated one, that simply:

Comments out that particular log line

Fixes the unexpected stdout -> stderr redirection (70b069b)

Would this be more preferable?

Kind reminder about the above. Would you prefer to close this PR and send a smaller one, with the above points fixed?

julian-smith-artifex-com · 2024-03-04T14:31:46Z

Apologies for the delay in replying, and thanks for the reminder and your understanding.

I think your suggestion of a simplified PR would be great at this stage.

I'll discuss internally about your use of PyMuPDF in a pipeline. If you're doing it, it needs to be supported. Perhaps we could control where logs/diagnostics go using environment variables, as i can't see a single solution that will address the different concerns that we've been discussing here.

Thanks.

PyMuPDF has some hardcoded log messages that print to stdout [1]. We don't have a way to silence them, because they don't use the Python logging infrastructure. What we can do here is silence a particular call that's been creating debug messages. For a long term solution, we have sent a PR to the PyMuPDF team, and we will follow up there [2]. Fixes #700 [1]: #700 [2]: pymupdf/PyMuPDF#3137

github-actions bot added a commit that referenced this pull request Feb 6, 2024

@apyrgio has signed the CLA from Pull Request #3137

d20c4e7

deeplow mentioned this pull request Feb 12, 2024

Temporarily pin PyMuPDF==1.23.8 in container freedomofpress/dangerzone#703

Merged

apyrgio closed this Mar 7, 2024

apyrgio force-pushed the main branch from 1c47475 to 4ddcaa3 Compare March 7, 2024 11:53

github-actions bot locked and limited conversation to collaborators Mar 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improvements to the logging logic #3137

Improvements to the logging logic #3137

apyrgio commented Feb 6, 2024

github-actions bot commented Feb 6, 2024 •

edited

Loading

apyrgio commented Feb 6, 2024

apyrgio commented Feb 7, 2024

julian-smith-artifex-com commented Feb 7, 2024

apyrgio commented Feb 8, 2024

apyrgio commented Mar 4, 2024

julian-smith-artifex-com commented Mar 4, 2024

Improvements to the logging logic #3137

Improvements to the logging logic #3137

Conversation

apyrgio commented Feb 6, 2024

github-actions bot commented Feb 6, 2024 • edited Loading

apyrgio commented Feb 6, 2024

apyrgio commented Feb 7, 2024

julian-smith-artifex-com commented Feb 7, 2024

apyrgio commented Feb 8, 2024

apyrgio commented Mar 4, 2024

julian-smith-artifex-com commented Mar 4, 2024

github-actions bot commented Feb 6, 2024 •

edited

Loading