Inverted colors when extracting CMYK image #2931

Open
AnzhiZhang opened this issue Nov 1, 2024 · 1 comment
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF workflow-images From a users perspective, image handling is the affected feature/workflow

Comments

@AnzhiZhang

When an image is read through page.images, its colors come out wrong. However, when the image is replaced, pypdf calls the same function to read it again, and that time the image is in the correct color space. I explain more in the Issue Analysis section below.

[Two screenshots: the original image and the output image]

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-11-10.0.22631-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('cryptography', '43.0.0'), PIL=11.0.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfWriter

def replace(filename):
    writer = PdfWriter(clone_from=filename)

    for page in writer.pages:
        for img in page.images:
            img.replace(img.image)

    filename = filename.replace(".pdf", "_out.pdf")
    with open(filename, "wb") as f:
        writer.write(f)

replace("example.pdf")

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

example.pdf

I am personally fine with adding it to the tests. However, it is modified from http://paper.people.com.cn/rmrb/images/2024-10/28/03/rmrb2024102803.pdf and may have copyright issues. It would be better to create a new PDF file with a CMYK image, if that reproduces the issue.

Traceback

This is the complete traceback I see:

***/python.exe ***/test.py

Process finished with exit code 0

Issue Analysis

page.images calls the PageObject._get_image() function in _page.py. img.replace() also ends up calling the same _get_image() function, twice, through reader.pages[0].images[0] inside ImageFile.replace().

pypdf/pypdf/_page.py

Lines 632 to 669 in 98aa974

def _get_image(
    self,
    id: Union[str, List[str], Tuple[str]],
    obj: Optional[DictionaryObject] = None,
) -> ImageFile:
    if obj is None:
        obj = cast(DictionaryObject, self)
    if isinstance(id, tuple):
        id = list(id)
    if isinstance(id, List) and len(id) == 1:
        id = id[0]
    try:
        xobjs = cast(
            DictionaryObject, cast(DictionaryObject, obj[PG.RESOURCES])[RES.XOBJECT]
        )
    except KeyError:
        if not (id[0] == "~" and id[-1] == "~"):
            raise
    if isinstance(id, str):
        if id[0] == "~" and id[-1] == "~":
            if self.inline_images is None:
                self.inline_images = self._get_inline_images()
            if self.inline_images is None:  # pragma: no cover
                raise KeyError("No inline image can be found")
            return self.inline_images[id]
        imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))
        extension, byte_stream = imgd[:2]
        f = ImageFile(
            name=f"{id[1:]}{extension}",
            data=byte_stream,
            image=imgd[2],
            indirect_reference=xobjs[id].indirect_reference,
        )
        return f
    else:  # in a sub object
        ids = id[1:]
        return self._get_image(ids, cast(DictionaryObject, xobjs[id[0]]))

pypdf/pypdf/_page.py

Lines 398 to 401 in 98aa974

assert reader.pages[0].images[0].indirect_reference is not None
self.indirect_reference.pdf._objects[self.indirect_reference.idnum - 1] = (
    reader.pages[0].images[0].indirect_reference.get_object()
)

By editing the _get_image() function:

a = cast(DictionaryObject, xobjs[id])
print(a.get("/Decode"))
imgd = _xobj_to_image(a)

Here is the new output:

***/python.exe ***/test.py
[0, 1, 0, 1, 0, 1, 0, 1]
[1, 0, 1, 0, 1, 0, 1, 0]
[1, 0, 1, 0, 1, 0, 1, 0]

Process finished with exit code 0

One Decode value is printed when reading page.images, and two more are printed when replacing. This is the cause of the issue: the Decode array is wrong when the image is read.
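The /Decode entries can also be listed without patching pypdf. The following is only a rough sketch: collect_decode_arrays is a hypothetical helper (not part of pypdf) that walks any mapping shaped like a page dictionary, and it assumes the image XObjects sit directly under /Resources → /XObject rather than inside nested form XObjects.

```python
from typing import Any, Dict, Mapping


def collect_decode_arrays(page: Mapping[str, Any]) -> Dict[str, Any]:
    """Return {XObject name: /Decode value, or None if absent} for image XObjects.

    Hypothetical helper: only looks at XObjects stored directly in the page
    resources and does not descend into form XObjects.
    """
    result: Dict[str, Any] = {}
    resources = page["/Resources"] if "/Resources" in page else {}
    xobjects = resources["/XObject"] if "/XObject" in resources else {}
    for name in xobjects:
        xobj = xobjects[name]
        # Only image XObjects carry a meaningful /Decode entry here.
        if "/Subtype" in xobj and xobj["/Subtype"] == "/Image":
            result[name] = xobj["/Decode"] if "/Decode" in xobj else None
    return result


# Possible usage with pypdf (assumes example.pdf is in the working directory):
# from pypdf import PdfReader
# for page in PdfReader("example.pdf").pages:
#     print(collect_decode_arrays(page))
```

Indexing with [] instead of .get() is deliberate: the helper then works the same way on plain dicts and on dictionary-like PDF objects.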

imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))

Now I would like to draw your attention to the _xobj_to_image() function in filters.py:

img = _apply_decode(img, x_object_obj, lfilters, color_space, invert_color)

An incorrect Decode array results in an image with the wrong color space.
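To illustrate why the two Decode arrays printed above give opposite colors: a Decode array linearly maps each raw sample from the range [0, 2^bits - 1] onto [Dmin, Dmax] for its component, so [1, 0, 1, 0, 1, 0, 1, 0] inverts every CMYK channel compared to [0, 1, 0, 1, 0, 1, 0, 1]. The sketch below is a pure-Python illustration of that mapping; apply_decode is a hypothetical name and is not pypdf's _apply_decode().

```python
from typing import List, Sequence


def apply_decode(
    samples: Sequence[int], decode: Sequence[float], bits: int = 8
) -> List[int]:
    """Map interleaved raw samples through a PDF-style /Decode array.

    Illustrative only: returns values rescaled back to the 0..2**bits - 1 range.
    """
    n_components = len(decode) // 2
    max_value = (1 << bits) - 1
    out = []
    for i, sample in enumerate(samples):
        component = i % n_components
        d_min, d_max = decode[2 * component], decode[2 * component + 1]
        # Linear interpolation of the raw sample onto [d_min, d_max].
        value = d_min + sample * (d_max - d_min) / max_value
        out.append(round(value * max_value))
    return out


pixel = [0, 64, 128, 255]  # one CMYK pixel, 8 bits per component
print(apply_decode(pixel, [0, 1, 0, 1, 0, 1, 0, 1]))  # [0, 64, 128, 255]
print(apply_decode(pixel, [1, 0, 1, 0, 1, 0, 1, 0]))  # [255, 191, 127, 0]
```

If the wrong one of these two arrays is applied during extraction, every channel is flipped, which matches the inverted colors described in this report.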

@stefan6419846 stefan6419846 changed the title CMYK Image Decode Error Inverted colors when extracting CMYK image Nov 1, 2024
@stefan6419846 stefan6419846 added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF workflow-images From a users perspective, image handling is the affected feature/workflow labels Nov 1, 2024
@stefan6419846
Collaborator

Thanks for the report. There is no real need to do any replacements here. The following code is sufficient:

>>> from pypdf import PdfReader
>>> reader = PdfReader('example.pdf')
>>> for page in reader.pages:
...   for image in page.images:
...     image.image.save(image.name)
... 
>>>

Doing some quick tests, it seems that neither MuPDF nor poppler (through pdfimages) is able to extract the image correctly at the moment either.
