Inverted colors when extracting CMYK image #2931

AnzhiZhang · 2024-11-01T18:08:31Z

When images are read through page.images, their colors come out wrong. However, when an image is replaced, pypdf calls the same function to read it again, and this time the image is in the correct color space. I explain this in more detail in the Issue Analysis section below.

[Image comparison: origin | output]

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-11-10.0.22631-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('cryptography', '43.0.0'), PIL=11.0.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfWriter

def replace(filename):
    writer = PdfWriter(clone_from=filename)

    for page in writer.pages:
        for img in page.images:
            img.replace(img.image)

    filename = filename.replace(".pdf", "_out.pdf")
    with open(filename, "wb") as f:
        writer.write(f)

replace("example.pdf")

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

example.pdf

I am personally fine with adding it to the tests. However, it is modified from http://paper.people.com.cn/rmrb/images/2024-10/28/03/rmrb2024102803.pdf and may have copyright issues. It would be better to create a new PDF file with a CMYK image, if that reproduces the issue.

Traceback

This is the complete traceback I see:

***/python.exe ***/test.py

Process finished with exit code 0

Issue Analysis

page.images calls the PageObject._get_image() function in _page.py. img.replace() also ends up calling the same _get_image() function, twice, because ImageFile.replace() accesses reader.pages[0].images[0] twice.

pypdf/pypdf/_page.py

Lines 632 to 669 in 98aa974

    
    def _get_image(
        self,
        id: Union[str, List[str], Tuple[str]],
        obj: Optional[DictionaryObject] = None,
    ) -> ImageFile:
        if obj is None:
            obj = cast(DictionaryObject, self)
        if isinstance(id, tuple):
            id = list(id)
        if isinstance(id, List) and len(id) == 1:
            id = id[0]
        try:
            xobjs = cast(
                DictionaryObject, cast(DictionaryObject, obj[PG.RESOURCES])[RES.XOBJECT]
            )
        except KeyError:
            if not (id[0] == "~" and id[-1] == "~"):
                raise
        if isinstance(id, str):
            if id[0] == "~" and id[-1] == "~":
                if self.inline_images is None:
                    self.inline_images = self._get_inline_images()
                if self.inline_images is None:  # pragma: no cover
                    raise KeyError("No inline image can be found")
                return self.inline_images[id]
            imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))
            extension, byte_stream = imgd[:2]
            f = ImageFile(
                name=f"{id[1:]}{extension}",
                data=byte_stream,
                image=imgd[2],
                indirect_reference=xobjs[id].indirect_reference,
            )
            return f
        else:  # in a sub object
            ids = id[1:]
            return self._get_image(ids, cast(DictionaryObject, xobjs[id[0]]))

pypdf/pypdf/_page.py

Lines 398 to 401 in 98aa974

    
        assert reader.pages[0].images[0].indirect_reference is not None
        self.indirect_reference.pdf._objects[self.indirect_reference.idnum - 1] = (
            reader.pages[0].images[0].indirect_reference.get_object()
        )

By editing the _get_image() function:

a = cast(DictionaryObject, xobjs[id])
print(a.get("/Decode"))
imgd = _xobj_to_image(a)

Here is the new output:

***/python.exe ***/test.py
[0, 1, 0, 1, 0, 1, 0, 1]
[1, 0, 1, 0, 1, 0, 1, 0]
[1, 0, 1, 0, 1, 0, 1, 0]

Process finished with exit code 0

The first /Decode line is printed when reading page.images, and the other two are printed during the replacement. This is the cause of the issue: the image's /Decode array is wrong when the image is first read.
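The same /Decode values can probably be collected without editing the library. Below is a minimal sketch, assuming the resources behave like nested dicts, which pypdf's dictionary objects do. The helper names resolve and collect_decode_arrays are hypothetical and not part of pypdf, and the sketch does not guard against form XObjects that reference each other in a cycle.

```python
def resolve(obj):
    """Follow an indirect reference if the object offers get_object()."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def collect_decode_arrays(resources, prefix=""):
    """Return (name, color_space, decode) for each image XObject found.

    Hypothetical helper: walks /XObject entries and descends into form
    XObjects, which can carry their own /Resources dictionary.
    """
    found = []
    xobjects = resolve(resources.get("/XObject", {}))
    for name, raw in xobjects.items():
        xobj = resolve(raw)
        subtype = xobj.get("/Subtype")
        if subtype == "/Image":
            found.append(
                (
                    prefix + name,
                    resolve(xobj.get("/ColorSpace")),
                    resolve(xobj.get("/Decode")),  # None if the key is absent
                )
            )
        elif subtype == "/Form":
            nested = resolve(xobj.get("/Resources", {}))
            found.extend(collect_decode_arrays(nested, prefix + name))
    return found
```

With pypdf this would presumably be called as collect_decode_arrays(page["/Resources"]) for each page of a PdfReader. Comparing its result before and after the replacement should show whether the stored /Decode array itself changes.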

pypdf/pypdf/_page.py

Line 658 in 98aa974

imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))

Now I would like to draw your attention to the _xobj_to_image() function in filters.py:

pypdf/pypdf/filters.py

Line 793 in 98aa974

img = _apply_decode(img, x_object_obj, lfilters, color_space, invert_color)

An incorrect /Decode array results in an image with the wrong colors.
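To illustrate why the two /Decode arrays printed above give opposite results, here is a minimal sketch of the linear mapping the PDF specification defines for /Decode: each sample is interpolated between Dmin and Dmax. The function apply_decode is a hypothetical stand-alone helper, not pypdf's _apply_decode, and it assumes 8 bits per component.

```python
def apply_decode(samples, decode, bits_per_component=8):
    """Map the raw samples of one pixel through a /Decode array.

    decode holds one [Dmin, Dmax] pair per color component, so a CMYK
    pixel needs eight numbers. Results are in the range Dmin..Dmax.
    """
    if len(decode) != 2 * len(samples):
        raise ValueError("decode needs two entries per color component")
    max_sample = (1 << bits_per_component) - 1
    decoded = []
    for i, sample in enumerate(samples):
        d_min, d_max = decode[2 * i], decode[2 * i + 1]
        # Linear interpolation: 0 maps to Dmin, max_sample maps to Dmax.
        decoded.append(d_min + sample * (d_max - d_min) / max_sample)
    return decoded


identity = [0, 1, 0, 1, 0, 1, 0, 1]
inverted = [1, 0, 1, 0, 1, 0, 1, 0]
pure_cyan = (255, 0, 0, 0)  # raw CMYK samples of one pixel

print(apply_decode(pure_cyan, identity))  # [1.0, 0.0, 0.0, 0.0]
print(apply_decode(pure_cyan, inverted))  # [0.0, 1.0, 1.0, 1.0]
```

With the inverted array every component is complemented, so applying the wrong array of the two would yield a color-inverted image.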


stefan6419846 · 2024-11-01T18:27:36Z

Thanks for the report. There is no real need to do any replacements here. The following code is sufficient:

>>> from pypdf import PdfReader
>>> reader = PdfReader('example.pdf')
>>> for page in reader.pages:
...   for image in page.images:
...     image.image.save(image.name)
... 
>>>

From some quick tests, it seems that neither MuPDF nor poppler (through pdfimages) is able to extract the image correctly at the moment either.

stefan6419846 changed the title ~~CMYK Image Decode Error~~ Inverted colors when extracting CMYK image Nov 1, 2024

stefan6419846 added the is-bug (from a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF) and workflow-images (from a user's perspective, image handling is the affected feature/workflow) labels Nov 1, 2024
