Utilize consistent internal nomenclature for outlines/bookmarks #1098
Comments
I agree that we should use consistent names / avoid synonyms. I am also still confused about the outlines/bookmarks topic. If they really are the same I would also be in favor of dropping "outlines" and going with "bookmarks". |
I am a bit confused by your code.
>>> from PyPDF2 import PdfReader
>>> reader = PdfReader("overlay.pdf")
>>> reader.bookmarks
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'PdfReader' object has no attribute 'bookmarks'
>>> reader.outlines
[]
|
Here is a list of what we might touch in this context: PdfReader:
PdfWriter:
There is some overlap in |
Does anybody know how outlines/bookmarks and named destinations relate? |
Reduce code duplication See #1098
Sorry for the confusion, the code block was to serve as an example of the proposed behavior. I edited the heading above to try and reflect that.
An important reason to keep the term "outline" within the code base is that it is part of the lexicon of the PDF specification. See PDF 1.7 specifications section 12.3.3 (page 367, index 375). For example, in the If we had to choose a single term, I'd argue that it must be "outline" due to the document specification. Anecdotally, I think most people refer to them as "Bookmarks" but, as noted in the above blog post, there are issues with that. Either way, the code base really needs some sort of consistency regarding the name, which is the main problem I think should be addressed. Perhaps my initial proposal isn't very Pythonic (There should be one-- and preferably only one --obvious way to do it). Rather, the best solution would probably be to deprecate all terms using the word
A destination defines a view within the document and is composed of the page, location on the page, and magnification. Destinations can be associated with outlines, annotations, or actions. Named destinations provide a convenient means of referring to a specific destination. The name/destination key/value dictionary is stored in the document catalog (see PDF spec 1.7 section 7.7.2, page 72/index 80 for a visual; see section 7.7.4, page 80/index 88 for more discussion) and allows one to simply reference the name when wanting to reuse the destination. An outline object can contain the destination information under the key
Alternatively, an outline object may not contain a Both the I hope that provides some clarification. |
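To make the relationship above concrete, here is a minimal sketch in plain Python. It is not PyPDF2 code: `Destination`, `OutlineItem`, `resolve`, and the field names are hypothetical and only model the idea that an outline item can either carry a destination directly or refer to one by name through a name-to-destination dictionary.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class Destination:
    """A view within the document: page, location on the page, magnification."""
    page_index: int
    left: float
    top: float
    zoom: Optional[float] = None  # None stands for "no explicit magnification"


@dataclass
class OutlineItem:
    """Hypothetical outline item: holds a destination or only the name of one."""
    title: str
    target: Union[Destination, str]


def resolve(item: OutlineItem, named: Dict[str, Destination]) -> Destination:
    """Return the concrete destination that an outline item points at."""
    if isinstance(item.target, Destination):
        return item.target
    # The item only stores a name, so look it up in the shared dictionary.
    return named[item.target]  # raises KeyError if the name is unknown


# Illustrative data only; the names and coordinates are made up.
named_destinations = {"intro": Destination(page_index=0, left=0.0, top=792.0)}
direct = OutlineItem("Appendix", Destination(page_index=9, left=0.0, top=792.0, zoom=1.5))
by_name = OutlineItem("Introduction", "intro")
```

Under this model, two outline items (or an annotation and an outline item) can share one destination simply by using the same name.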
Thank you! That helped a lot
I was thinking about this for the
The users of PyPDF2 are mostly not people who know the PDF standard at all. They very likely don't care what is done in the PDF standard. There are also libraries / business applications built on top of PyPDF2. We need to communicate changes clearly and give a good heads-up before they start to be breaking. But if we think we can improve the situation for our users, we can use whatever name we want. |
However, I do value the opinion of contributors as my view might not represent the community well. Especially @pubpub-zz @MasterOdin , what do you think about renaming |
Oh, I just read https://pspdfkit.com/blog/2019/understanding-pdf-outline/ .... so there is actually a difference between outlines and bookmarks? |
I think that they're the same thing, it's just that Adobe calls them bookmarks, but the PDF spec calls them outlines, and it's just kind of on the vendor to use whatever terminology they want. I agree with @mtd91429 that internally we should use the term "outline", as that's what's in the PDF spec. I like having one way of doing things, and so would vote to expose only outline-named functions, while documenting in the function docs that outlines and bookmarks are synonymous. For reference, pikepdf uses outline: https://pikepdf.readthedocs.io/en/latest/topics/outlines.html |
Outlines are called bookmarks in Adobe, as Odin is saying, but there is also the named_destination, which is considered a bookmark as well. |
I think the only additional component would be the As far as nomenclature goes, I propose the following terminology for clarity. An "outline" is a collection of "outline items". There is only one outline in the document and it is composed of multiple outline items. For example:
In this, "Chapter 1" is an I recognize that the plurality issue adds another layer of complexity to this rectification. I think that a single letter (s) is more confusing as the distinguisher between the singular and plural forms than the above proposal. I propose the following nomenclature changes (I currently don't have an opinion on public/private designation): For PdfReader
For PdfWriter
For PdfMerger
For generic.py:
|
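The proposed terminology (one outline per document, made up of outline items that may nest) can be sketched as a small tree. This is only an illustration under assumed names: the `Outline` and `OutlineItem` classes, the `walk` helper, and the section titles are hypothetical and do not reflect how PyPDF2 stores the outline.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class OutlineItem:
    """One entry in the outline; it may have nested child items."""
    title: str
    children: List["OutlineItem"] = field(default_factory=list)


@dataclass
class Outline:
    """The single outline of a document: a (possibly empty) list of items."""
    items: List[OutlineItem] = field(default_factory=list)

    def walk(self) -> Iterator[Tuple[int, OutlineItem]]:
        """Yield (depth, item) pairs in reading order, parents before children."""
        def _walk(items: List[OutlineItem], depth: int) -> Iterator[Tuple[int, OutlineItem]]:
            for item in items:
                yield depth, item
                yield from _walk(item.children, depth + 1)

        yield from _walk(self.items, 0)


# Made-up content: "Chapter 1" is an outline item, and so is each section.
outline = Outline([
    OutlineItem("Chapter 1", [OutlineItem("Section 1.1"), OutlineItem("Section 1.2")]),
    OutlineItem("Chapter 2"),
])
titles = [(depth, item.title) for depth, item in outline.walk()]
```

Naming the container and its elements differently avoids relying on a trailing "s" to tell the whole tree apart from one entry in it.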
Overall, I like your changes. Especially the clear nomenclature. In addition, I like to have symmetry between PdfReader and PdfWriter. I would like to be able to do something like this:
However, the semantics of this are not completely clear. For "add", as a user, I would expect an "append" behavior. So if I do this twice, I have double the outline items. In contrast, we might want to allow |
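The "append" expectation described above can be shown with a toy writer. Everything here is hypothetical: `AppendingWriter` is not PyPDF2's `PdfWriter`, and `replace_outline` is only an illustrative contrast, since the thread does not show which alternative was being considered.

```python
from typing import List


class AppendingWriter:
    """Hypothetical stand-in for a writer object; not PyPDF2's PdfWriter."""

    def __init__(self) -> None:
        self.outline: List[str] = []

    def add_outline(self, items: List[str]) -> None:
        # "add" appends: existing outline items are kept.
        self.outline.extend(items)

    def replace_outline(self, items: List[str]) -> None:
        # Illustrative contrast: discard what was there and start over.
        self.outline = list(items)


writer = AppendingWriter()
writer.add_outline(["Chapter 1", "Chapter 2"])
writer.add_outline(["Chapter 1", "Chapter 2"])  # same call again
doubled = list(writer.outline)  # four items: the two chapters, twice

writer.replace_outline(["Chapter 1", "Chapter 2"])
replaced = list(writer.outline)  # two items
```

Whichever behavior is chosen, the method name should make it obvious which of the two a caller gets.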
@mtd91429 What do you think about making |
Some time ago, I worked on various improvements to PyPDF4. At that time I introduced some changes to bookmarks, especially a new parameter to specify where to add the outline. You may be interested in having a look at https://github.com/pubpub-zz/PyPPDF4.sav |
Yes, this becomes a much more difficult problem to consider, and I'm currently not sure what the best way forward is: as the two of you point out, the challenge is to specify where to add the outline item(s) onto a tree. I'll look at your work, @pubpub-zz, and at how the writer object works in more detail, then lay out some proposals.
I can understand why it may make sense to have that as a private method, but I'm not sure. Making a once-public method private should not be done lightly, given the unintended consequences for the end users' experience. |
An update on functions (largely for my own purposes): the function I think the |
I created a PR attempting to address this issue (#1156). In there, I did not change the public/private designation for methods, as I felt that should be a separate commit + PR for each method (rather than being buried within this large change). |
This PR makes sure PyPDF2 uses a consistent nomenclature for the outline:

* **Outline**: A document has exactly one outline (also called "table of contents", in short toc). That outline might be empty.
* **Outline Item**: An element within an outline. This is also called a "bookmark" by some PDF viewers.

This means that some names will be deprecated to ensure consistency:

## PdfReader

* `outlines` ➔ `outline`
* `_build_outline()` ➔ `_build_outline_item()`

## PdfWriter

* Keep `get_outline_root()`
* `add_bookmark_dict()` ➔ `add_outline()`
* `add_bookmark()` ➔ `add_outline_item()`

## PdfMerger

* `find_bookmark()` ➔ `find_outline_item()`
* `_write_bookmarks()` ➔ `_write_outline()`
* `_write_bookmark_on_page()` ➔ `_write_outline_item_on_page()`
* `_associate_bookmarks_to_pages()` ➔ `_associate_outline_items_to_pages()`
* Keep `_trim_outline()`

## generic.py

* `Bookmark` ➔ `OutlineItem`

Closes #1048
Closes #1098
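One common way to deprecate an old name while keeping existing callers working is a thin alias that warns and forwards to the new method. The sketch below uses only the standard `warnings` module. `RenamedWriter` is a hypothetical toy class, not PyPDF2's `PdfWriter`, and the signatures and warning text are assumptions rather than what the PR actually does.

```python
import warnings
from typing import List


class RenamedWriter:
    """Hypothetical toy class illustrating a rename with a deprecated alias."""

    def __init__(self) -> None:
        self.outline_items: List[str] = []

    def add_outline_item(self, title: str) -> int:
        """New name: append an outline item and return its index."""
        self.outline_items.append(title)
        return len(self.outline_items) - 1

    def add_bookmark(self, title: str) -> int:
        """Old name: still works, but tells the caller to switch."""
        warnings.warn(
            "add_bookmark is deprecated; use add_outline_item instead",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return self.add_outline_item(title)
```

With an alias like this, code written against the old name keeps running, and users get the heads-up before the old name is eventually removed.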
Explanation
The PDF Reference uses the term "Outline" but recognizes "Bookmarks" as a synonymous term. From PDF Reference version 1.6 page 554 (section 8.2.2):
There is inconsistency within PyPDF2 regarding the nomenclature. In the `PdfReader` object, these objects are referred to as `outlines`; however, within the `PdfWriter` object, they are referred to as both bookmarks (`add_bookmark`, `add_bookmark_dict`, `add_bookmark_destination`) and outlines (`get_outline_root`).
Most PDF reference software refers to these objects as "Bookmarks". For example, a screenshot from Adobe Acrobat:
I propose that PyPDF2 be modified such that all user-facing aspects of the code have redundant and synonymous functions using both terms (outline and bookmark), but that all internal nomenclature adopts a single and consistent term which performs the manipulations. Specifically, I think internally it should be `outline`.
Proposed Code Example