37 CFR 1 § 1.445 - 2012 annual edition introduces unusual tables #368
Comments
@gregoryfoster noted that this gets even more hairy in the 2013 version: https://www.gpo.gov/fdsys/pkg/CFR-2013-title37-vol1/xml/CFR-2013-title37-vol1-part1.xml#seqnum1.445 I think the ideal outcome in the 2013 version would be depths like:
That's possible to implement now by modifying the XML, but it'll be a slog. One potential alternative:
we could
The output would be a non-indented section. The major downside is that having a non-indented section will likely confuse diffs and will certainly cause issues when applying changes (final rules). Flattening the paragraphs could get a user over a hump, but may leave them more confused when things don't work correctly later.
Thanks for documenting this, @cmc333333. Technically speaking, would you say the XML is accurate but the placement of markers within
<P>(1) A transmittal fee (see 35 U.S.C. 361(d) and PCT Rule 14) consisting of:</P>
<GPOTABLE CDEF="s30,8" COLS="2" OPTS="L0,tp0,p1,8/9,g1,t1">
<ROW>
<ENT I="01">(i) A basic portion</ENT>
<ENT>$240.00</ENT>
</ROW>
</GPOTABLE>
The 2013 annual edition expands on this precedent of tables containing markers (but not on every row!). Assuming that we need to modify the
I'd argue that the XML isn't accurate -- in this case, I think it should be:
<P>(1) A transmittal fee (see 35 U.S.C. 361(d) and PCT Rule 14) consisting of:</P>
<P>(i)</P>
<GPOTABLE CDEF="s30,8" COLS="2" OPTS="L0,tp0,p1,8/9,g1,t1">
<ROW>
<ENT I="01">A basic portion</ENT>
<ENT>$240.00</ENT>
</ROW>
</GPOTABLE>
I see a few options for how to proceed:
So, 1) will work, but is pretty ham-fisted. 2) may work, but would require some very careful consideration. 3) may work, but seems like the wrong direction to me.
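The corrected XML above moves the marker out of the table cell and into its own <P> directly before the table. A minimal sketch of automating that rewrite follows, assuming only the standard library's xml.etree.ElementTree. The function name `hoist_first_marker` and the marker regex are hypothetical and are not part of the parser. It only handles a marker in the first cell of the first row, so it would not cover the 2013 tables where markers appear on several rows.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a marker is a short alphanumeric token in parentheses, e.g. "(i)" or "(12)".
MARKER = re.compile(r"^\((?P<marker>[A-Za-z0-9]+)\)\s*")


def hoist_first_marker(parent):
    """Move a marker at the start of a table's first cell into a new <P>
    inserted immediately before that table. Mutates `parent` in place."""
    # Walk the children in reverse so inserting does not shift pending indexes.
    for index, child in reversed(list(enumerate(parent))):
        if child.tag != "GPOTABLE":
            continue
        first_cell = child.find("./ROW/ENT")
        if first_cell is None or not first_cell.text:
            continue
        match = MARKER.match(first_cell.text)
        if match is None:
            continue
        # Strip the marker from the cell and emit it as its own paragraph.
        first_cell.text = first_cell.text[match.end():]
        marker_p = ET.Element("P")
        marker_p.text = "({})".format(match.group("marker"))
        parent.insert(index, marker_p)
```

Applied to the 2012 fragment quoted earlier, this would yield the `<P>(i)</P>` followed by a table whose first cell reads "A basic portion". Whether this belongs in a preprocessing step or a local override of the file is the open question in this thread.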
I'd be up for modifying the source XML, but further research makes me think that may not be the right path. There are earlier precedents in 37 CFR 1 of usage of
I'm guessing that no earlier tables caused an issue for the parser because their rows contain regulation markers at the same depth or deeper. These nodes would be flagged as
I'm tentatively interpreting this to mean that other examples of tables containing regulation markers are not being captured by the parser, but instead interpreted as flattened
Is there another path? One which accepts that
You can think of the nodes as rendering like a bunch of
rendered as: <ol>
<li><p>(a) ....</p><ol>
<li><p>(1) ....</p><ol>
<li><p>(i) ....</p><ol>
<li><table /></li>
</ol></li>
<li><p>(ii) ....</p></li>
</ol></li>
</ol></li>
</ol>
I think the first step is to think about what the ideal markup would be and try to work back to what that'd require in terms of parsing. In this scenario, I think the ideal markup looks like the above, which leads me to support modifying the XML to match. Here are some more thoughts off the top of my head:
Option 4: Hypothetically, we could allow some special logic around tables so that nodes within a table rendered as rows (or something), though that'd be a pretty heavy rework. Consider how the current XML would need to be parsed:
to become
<ol>
<li><p>(a) ....</p><ol>
<li><p>(1) ....</p><ol>
<li><table>
<tr><td>(i) ... </td><td>...</td></tr>
<tr><td> ... </td><td>...</td></tr>
</table></li>
<li><p>(ii) ....</p></li>
</ol></li>
</ol></li>
</ol>
I'm not sure how we'd handle additional paragraph depths -- we'd want
Option 5: When we go to derive the depths, we could do some sort of deep inspection of the contents of the table and virtually "expand" the table to include all of the markers it contains.
This would get over the depth-derivation hump, but wouldn't be a complete solution. Consider what happens when we want to reference 1-445-a-1-i (e.g. in a citation): that node doesn't exist in the tree. We could add the same "virtual" searching logic, but we'll quickly be re-implementing node trees in this "virtual" space, which seems to defeat the point.
Option 6: We could try to split the single table into multiple nodes. This is very similar to the preprocessing logic proposed in option 2 and would carry the same risks -- the difference is where the logic is placed (is it a preprocessing step, or is it just part of how
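Options 5 and 6 both depend on knowing which table rows start a new marked paragraph. The sketch below shows one way that inspection might look, using only the standard library. `group_rows_by_marker` and the marker regex are hypothetical names for illustration, not existing parser code. The groups could feed either the "virtual" marker list of Option 5 or the table split of Option 6.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a marker is a short alphanumeric token in parentheses, e.g. "(i)".
MARKER = re.compile(r"^\(([A-Za-z0-9]+)\)")


def group_rows_by_marker(table):
    """Return a list of (marker, rows) pairs. A row whose first cell starts
    with a marker opens a new group; unmarked rows join the preceding group.
    Rows before any marker are grouped under None."""
    groups = []
    for row in table.iter("ROW"):
        first_cell = row.find("ENT")
        text = "".join(first_cell.itertext()).strip() if first_cell is not None else ""
        match = MARKER.match(text)
        if match or not groups:
            groups.append((match.group(1) if match else None, [row]))
        else:
            groups[-1][1].append(row)
    return groups
```

For Option 5, `[marker for marker, _ in groups if marker]` would give the markers to add during depth derivation. For Option 6, each group would become its own node. Neither addresses the citation problem described above unless real nodes are created.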
Thanks for engaging on this one, @cmc333333. After some thought this week, I think we've surfaced two separate issues:
I decided to try fixing the second issue to see how many instances we'd encounter. I fixed the first identified anti-pattern (2012 annual edition 37 CFR 1 § 1.445) by locally overriding the file with the ideal XML you outlined in an earlier comment on this issue. The parser then identified a similar issue in the 2013 annual edition 37 CFR 1 § 1.17 (see marker
I suggest we keep this issue open, but change its focus to the big issue of handling
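One way to estimate how many instances exist without fixing them one at a time would be to scan an annual edition file for table rows that begin with a marker. This is a rough sketch only. It assumes each section is a SECTION element with a SECTNO child, which should be checked against the actual file, and `sections_with_table_markers` is a hypothetical helper, not part of eregs.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a marker is a short alphanumeric token in parentheses, e.g. "(i)".
MARKER = re.compile(r"^\([A-Za-z0-9]+\)")


def sections_with_table_markers(root):
    """Map each section number to the number of table rows in that section
    whose first cell starts with a paragraph marker."""
    counts = {}
    for section in root.iter("SECTION"):
        sectno = (section.findtext("SECTNO") or "(unknown)").strip()
        hits = 0
        for row in section.iter("ROW"):
            first_cell = row.find("ENT")
            if first_cell is None:
                continue
            if MARKER.match("".join(first_cell.itertext()).strip()):
                hits += 1
        if hits:
            counts[sectno] = hits
    return counts
```

Running it on a locally downloaded part file would look like `sections_with_table_markers(ET.parse(path).getroot())`, where `path` points at the XML. Expect some false positives, since a parenthesized token at the start of a cell is not always a paragraph marker.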
I've opened a dialogue with the USPTO's Office of Patent Legal Administration to get the errors we've been surfacing in the GPO's XML versions of 37 CFR 1 fixed. I've been directed to the USPTO's online version of their regulations, the Manual of Patent Examining Procedure (MPEP), Appendix R, which appears to be the primary source for the agency (only the current revision, in HTML and PDF). It appears that some errors like the current issue are cropping up during a transformation of the MPEP source documents into the XML format deployed by GPO. If you look at 37 CFR 1 § 1.445 in the MPEP, you'll see the tables appear offset and do not include the regulation markers. Here's the GPO's current version (you'll have to find/scroll as anchors are useless in the current version - unless you know how to URL encode a thin space character?). Does 18F have any insight into the transformation of source documents between the USPTO and GPO?
Hey @gregoryfoster, I doubt that the MPEP is the source for the GPO XML. From what we've seen, agencies send over Word docs to indicate amendments to regulations -- they don't send over whole regulations unless the whole part is being replaced. It's possible that the transform runs the other way: the MPEP is downstream from the GPO and is then modified further by hand. This is the route CFPB has taken -- the "original" content from the GPO has been transformed (with a mix of automation and manual work) into a new document, which is then maintained separately. Of course, having this separate document means that their regulations may not always match the GPO (which carries more legal weight, if I understand correctly).
Dev environment: current Master [ 0c650cd ] + PR #367 [ a389c4b ].
Running:
eregs --debug annual_editions 37 1
results in an error when processing § 1.445 of the 2012 annual edition: https://www.gpo.gov/fdsys/pkg/CFR-2012-title37-vol1/xml/CFR-2012-title37-vol1-part1.xml#seqnum1.445
a.1.i is parsed as:
and the section formerly titled a.1.iii is parsed as:
In both cases, the table formatting appears to be confusing some aspect of the parser, resulting in labeling both sections MARKERLESS. The absence of the a.1.i marker raises an error at regparser/tree/depth/heuristics.py:47.