MAINT: Unnecessary character mapping process #2888

ssjkamei · 2024-10-04T01:13:18Z

This fixes a problem introduced by the change in #2882.

The string length of a character was checked after conversion by the cmap. However, a cmap conversion can produce a string that is longer than one character, so the length cannot be measured accurately at that point.

This matters, for example, when deciding whether to measure the distance from the ligature or from the base character corresponding to the ligature when fixing #1351.
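To illustrate the length problem described above, here is a minimal sketch. It is not pypdf's actual code: the `cmap` dictionary, the one-character codes, and both helper functions are hypothetical and exist only to show why counting after the mapping can overcount.

```python
# Hypothetical character map: one-character font code -> Unicode text.
# Real maps are built from the font's encoding and ToUnicode data;
# the codes and values here are invented for illustration.
cmap = {
    "\x01": "A",
    "\x02": "fi",   # a ligature glyph that expands to two characters
    "\x03": "ffi",  # a ligature glyph that expands to three characters
}


def count_after_mapping(raw: str) -> int:
    """Count characters after applying the map (overcounts ligatures)."""
    return len("".join(cmap.get(code, code) for code in raw))


def count_before_mapping(raw: str) -> int:
    """Count character codes before mapping: one per code in the input."""
    return len(raw)


raw = "\x01\x02\x03"  # three codes, i.e. three glyphs
mapped = "".join(cmap.get(code, code) for code in raw)

print(mapped)                     # text after mapping
print(count_before_mapping(raw))  # number of codes
print(count_after_mapping(raw))   # number of mapped characters
```

In this sketch the three input codes become six characters after mapping, so any check that assumes one character per glyph has to run on the unmapped codes. The sketch assumes single-character codes; fonts with multi-byte codes would need a different way of splitting the input.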

The change in handle_tj is needed because the function otherwise fails Ruff's check:
Error: PLR0915 Too many statements (nnn > 176)

The following code is used only to get the character code for a space.
However, I think it would be better to split out the part that obtains the character code.
Style changes will be considered in another PR.

_, space_code = parse_encoding(cmap[3], space_code)
_, space_code, _ = parse_to_unicode(cmap[3], space_code)


codecov · 2024-10-04T01:29:33Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.36%. Comparing base (e825ac0) to head (b25b28f).
Report is 1 commit behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2888   +/-   ##
=======================================
  Coverage   96.35%   96.36%           
=======================================
  Files          52       52           
  Lines        8735     8738    +3     
  Branches     1723     1727    +4     
=======================================
+ Hits         8417     8420    +3     
  Misses        186      186           
  Partials      132      132


ssjkamei · 2024-10-04T03:31:34Z

I have addressed the issues flagged by the code checks.
Also, I'm sorry: it seems I made a mistake, and the past commit history is displayed.

pypdf/_cmap.py

stefan6419846

Thanks. I could not see any apparent issue with this and all the tests still pass without having to change anything. Thus I am going to merge this for now.

stefan6419846 · 2024-10-04T09:25:04Z

Also, I'm sorry, it seems that I made a mistake and the past commit history is displayed.

Your previous PR was from your main branch, while this PR used a separate branch, but all branches usually originate from main. Thus your previous commits were included here verbosely.

I recommend resetting your main branch and syncing it with upstream. If this is too complex, consider deleting and re-creating your fork, provided no work would be lost in the process, and always use dedicated branches for further PRs.

@hpierre001

## What's new

### New Features (ENH)
- Add `layout_mode_font_height_weight` argument to `PageObject.extract_text()` (#2920) by @hpierre001

### Bug Fixes (BUG)
- Fix font specificier for FreeText annotation (#2893) by @ssjkamei
- Line breaks are not generated due to incorrect calculation of text leading (#2890) by @ssjkamei
- Improve handling of spaces in text extraction (#2882) by @ssjkamei

### Robustness (ROB)
- Soft failure for flate encode image mode 1 with wrong LUT size (#2900) by @stefan6419846

### Documentation (DOC)
- Use latest package versions (#2907) by @stefan6419846
- Correct example of reading FileAttachment annotation (#2906) by @j-t-1

### Developer Experience (DEV)
- Update pinned requirements (#2918) by @stefan6419846
- Make make_release.py compatible with Windows environment (#2894) by @pubpub-zz

### Maintenance (MAINT)
- Remove references to outdated Python versions (#2919) by @stefan6419846
- Generalize the method of obtaining space_code (#2891) by @ssjkamei
- Unnecessary character mapping process (#2888) by @ssjkamei
- New LZW decoding implementation (#2887) by @MartinThoma

### Testing (TST)
- Add LzwCodec for encoding (#2883) by @MartinThoma

### Code Style (STY)
- Capitalize error messages (#2903) by @j-t-1
- Modify error messages in PdfWriter (#2902) by @j-t-1

[Full Changelog](5.0.1...5.1.0)

ssjkamei and others added 30 commits September 24, 2024 13:07

BUG: Missing spaces in extract_text() method (py-pdf#1328)

5400f5a

Revert "BUG: Missing spaces in extract_text() method (py-pdf#1328)"

aac0436

This reverts commit 5400f5a.

BUG: Missing spaces in extract_text() method (py-pdf#1328)

64b1c92

BUG: Missing spaces in extract_text() method (py-pdf#1328) add test

70e9b38

Revert "BUG: Missing spaces in extract_text() method (py-pdf#1328)"

65224e1

This reverts commit 5400f5a. BUG: Missing spaces in extract_text() method (py-pdf#1328) BUG: Missing spaces in extract_text() method (py-pdf#1328) add test

Merge branch 'main' of https://github.com/ssjkamei/pypdf

788d56d

BUG: Missing spaces in extract_text() method (py-pdf#1328) Convert fo…

f6dcb43

…nt size comparison to ratio

Correction to new file URL.

fd1c489

Co-authored-by: Stefan <[email protected]>

BUG: Missing spaces in extract_text() method (py-pdf#1328) calculatio…

2873b9e

…n efficiency

BUG: Missing spaces in extract_text() method (py-pdf#1328) Simplify t…

7597704

…he assertion process

Merge branch 'py-pdf:main' into main

4a2afe9

BUG: Issue in text extraction (spaces) (py-pdf#1153)

fb4de41

BUG: Issue in text extraction (spaces) (py-pdf#1153) add test

373eaec

style: Correcting code style issues

066f594

Text position return support

d406e23

Add code for CIDFont

d338e18

Added horizontal CIDFont calculation code

f7c4236

Style: Correcting code style issues

a32fbc9

Integrate font width calculation and space width calculation

a237f2d

Font width map and space width acquisition process separation

e159e4d

Revert to original adjustment space width

a19a8f4

Supports diagonal travel distance

6dbda50

Font size defaults to twice the space

34efe52

Get the default space width from the argument

52aa7ac

fix self-made bugs

7a028bb

Style: Correcting code style issues

f02fa23

Style: Correcting code style issues

980d831

fix self-made bugs

5e6a0dd

Style: Correcting code style issues

8078ac1

Compliant with PDF1.7 specifications

b842cee

ssjkamei and others added 15 commits October 1, 2024 22:35

Convert character map keys from int(ord) to str

b13b97f

Style: Correcting code style issues

ef73315

Update pypdf/_cmap.py

f884160

Co-authored-by: Stefan <[email protected]>

Update pypdf/_text_extraction/__init__.py

20a6883

Co-authored-by: Stefan <[email protected]>

Exception code omitted

e6132fa

Style: Correcting code style issues

9a82eb8

Style: Correcting code style issues

d4f1835

fix self-made bugs

96fcf7c

fix self-made bugs

780a632

Insufficient height consideration for front and rear fonts

ce11d0d

style: Correcting code style issues

03eb1cb

Merge branch 'py-pdf:main' into main

3463446

MAINT: Unnecessary character mapping process

661b8b5

Delete debugging code

65c368f

Style: Addition of type

c3cb26b

ssjkamei added 7 commits October 4, 2024 10:31

Style: Correcting code style issues

707adc1

Style: Correcting code style issues

85dbf2c

Style: Correcting code style issues

00009f2

Coverage check fix

f869427

Self-made bugs

7779678

Style: Correcting code style issues

bb09c13

Style: Made easy to understand.

b25b28f

stefan6419846 reviewed Oct 4, 2024


stefan6419846 approved these changes Oct 4, 2024


stefan6419846 merged commit abb62ac into py-pdf:main Oct 4, 2024
17 checks passed

ssjkamei deleted the MAINT--No-unnecessary-character-mapping-process branch October 4, 2024 09:46
