Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

semanticate: Work on the XML elements tree #753

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

gvtulder
Copy link
Contributor

Rough example (see #752) of how the semanticate rules could be applied to the content only by walking recursively through the XML tree.
Not sure if this is a good idea, so this is a fairly quick proof-of-concept.

The tricky bit is that the regular expressions might introduce new XML elements. In this version, this works by applying the regex to text string (which doesn't contain any tags), parsing the regex output to create any new XML elements, and finally inserting those in the document at the right place.

@acabal
Copy link
Member

acabal commented Sep 23, 2024

I understand the reasoning for this PR, but I think regex is acceptable for this use case 95% of the time. It's much simpler to understand and - importantly - maintain, and with full XML parsing there will likely be complex edge cases that we will have to custom code around. Also any errors introduced by a regex are easily corrected using git diff - it never results in big catastrophes.

If you want to play around with it further go ahead, but I think you'll find that this will become very unwieldy very fast. If you're able to put together a complete solution that is 100% correct vs. our current implementation's 95% correct, and it doesn't look like a maintenance hellscape, then we can consider it.

@alerque
Copy link

alerque commented Sep 23, 2024

Sure 95% of the time a regex will get you there, but your existing regexes are needing to be stuffed with lookaheads and other features to compensate for the lack of context. Flatly XML cannot be parsed with regexes. On the other hand walking XML trees using something built for the job is pretty straight forward and should be much easier to maintain and contribute to in the long run.

@acabal
Copy link
Member

acabal commented Sep 23, 2024

I understand. We do not need lookaheads in the general case (or, strictly speaking, in this case). That is because the occasions in which the regexes without lookaheads are wrong, given our narrow use cases, are rare, and undoing the mistake one of these regex makes is simple - an error is never catastrophic. Remember that these tools are not general purpose XML processors, they are for the very narrow use case of ebooks in the English language that are made with our style guide.

On the other hand, parsing an entire syntax tree and dealing with the hundreds of edge cases of the various English language constructs we must examine is extremely complex, error prone, and most importantly, difficult to maintain. It will also likely be quite slow because the tree must be walked and updated for almost every check requiring even the most basic of logic.

I understand the desire for intellectual purity and formal correctness, but using regex is easier to develop and maintain as a practical matter. Of course I'm happy to be proved wrong!

@gvtulder
Copy link
Contributor Author

I agree with the point that regexes are simpler, and I don't know where the sweet spot lies. However, a more XML-based approach could still use a lot of regexes, the main difference is choosing where they are applied.

In my current implementation, the regex rules are very similar to what they are in the current implementation. I copy-pasted them from the old function and made only some minor changes:

tools/se/formatting.py

Lines 176 to 180 in b4e9357

sub(r"(?<!(?:\.|\B))(P\.(?:P\.)?S\.(?:S\.)?\B)", r"""<abbr epub:type="z3998:initialism">\1</abbr>""")
sub(r"(?<!(?:\.|\B))inst\.", r"""<abbr xml:lang="la">inst.</abbr>""") # `inst.` is short for `instante mense` but it is not italicized
sub(r"(?<!(?:\.|\B))([Ii])\.e\.", r"""<abbr epub:type="z3998:initialism">\1.e.</abbr>""")
sub(r"(?<!(?:\.|\B))([Ee])\.g\.", r"""<abbr epub:type="z3998:initialism">\1.g.</abbr>""")
sub(r"(?<!(?:\.|\B))\bN\.?B\.\B", r"""<abbr epub:type="z3998:initialism">N.B.</abbr>""")

So the representation of the rules doesn't have to change much. The difference is that these regexes are no longer applied to the entire raw XML document at once, but only to one bit of text at a time. This avoids changes to attribute values and makes it easier to say things like: "do not apply these rules inside <abbr> tags".

One thing that does become a bit more difficult is when the regexes look at XML tags. For example, when deciding to add class="eoc" to <abbr> tags at the end of a paragraph <p>: (original regexes in the comments)

tools/se/formatting.py

Lines 228 to 239 in b4e9357

# sub(r"<abbr>etc\.</abbr>([”’]?(?:</p>|\s+[“‘]?[\p{Uppercase_Letter}]))", r"""<abbr class="eoc">etc.</abbr>\1""")
# sub(r"""<abbr( epub:type="[^"]+")?>([^<]+\.)</abbr>([”’]?</p>)""", r"""<abbr class="eoc"\1>\2</abbr>\3""")
if el.text == "etc." and not el.children:
if el.tail and regex.match(r"[”’]?\s+[“‘]?[\p{Uppercase_Letter}]", el.tail):
eoc = True
if el.text.endswith(".") and not el.children:
if el.parent.tag == "p" and el.next is None and (not el.tail or el.tail in "”’"):
eoc = True
if eoc and not "eoc" in (el.get_attr("class") or ""):
el.add_attr_value("class", "eoc")

Or a simpler example, checking we're inside a <p> and using the $ marker in the regex:

tools/se/formatting.py

Lines 192 to 194 in b4e9357

# keep a period after TV that terminates a clause
if el.tag == "p":
sub(r"(?<!(?:\.|\B))T\.?V\.([”’]?)$", r"""<abbr epub:type="z3998:initialism">TV</abbr>.\1""")

Secondly, it also becomes more difficult to apply regexes that match text on both sides of a tag, but there currently aren't any rules that need this.

I'm not convinced this is necessarily better, but it's good to know that it's not a binary choice between regex and syntax tree parsing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants