jlward edited this page Apr 24, 2013 · 1 revision
  • Text that has ever been styled bold/italic/underlined will always have the b/i/u tags, even if the styling is no longer applied. You need to check whether the tag's val is false before treating the text as styled.
  • Hyperlinks and ins/del tags are all basically the same (except for the href). The tricky thing is that hyperlink/ins/del tags have their own runs of text; make sure to collect all of them when constructing the text (not sure whether this is a bug in pydocx).
  • There are two types of images: drawings and picts. After some pre-processing you can treat them the same.
  • Image dimensions are measured in EMUs (English Metric Units). There are 9525 EMUs per pixel.
  • Font sizes are hard
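  • Rowspans are not the same in OOXML as they are in HTML. Where HTML sets a rowspan on one cell, OOXML marks every cell in the span with a vMerge tag: the first cell has val "restart" and the cells below it continue the merge.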
  • If two list items have the same numId, they are part of the same list (both a curse and a blessing).
  • Headers (h1, h2, etc.) look a lot like lists. You need to check the style to see whether it is considered a header (the check is case-insensitive).
  • Images stored in the document and then resized are not resized in the ZIP file.
  • Anchor hrefs and image sources can be found in ‘word/_rels/document.xml.rels’. They are looked up by an Id that is mapped to a Target.
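
A few of the notes above (the val check on b/i/u tags, the EMU conversion, and the Id-to-Target lookup in the rels file) can be sketched in Python as follows. This is a minimal illustration using only the standard library, not pydocx's actual code: the helper names `emu_to_pixels`, `is_style_on`, and `relationship_targets` are hypothetical, and the set of values treated as "off" is an assumption based on the usual OOXML on/off spellings plus "none" for underline.

```python
import xml.etree.ElementTree as ET

EMUS_PER_PIXEL = 9525  # figure quoted in the notes above

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Assumption: these val spellings all mean "not styled".
OFF_VALUES = {"false", "0", "off", "none"}


def emu_to_pixels(emus):
    """Convert an EMU measurement to whole pixels."""
    return int(round(emus / EMUS_PER_PIXEL))


def is_style_on(run_properties, tag):
    """Return True if <w:tag> is present in rPr and its val is not false."""
    element = run_properties.find("{%s}%s" % (W_NS, tag))
    if element is None:
        return False
    val = element.get("{%s}val" % W_NS)
    # A tag with no val attribute counts as switched on.
    return val is None or val.lower() not in OFF_VALUES


def relationship_targets(rels_xml):
    """Map each Relationship Id to its Target."""
    root = ET.fromstring(rels_xml)
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.findall("{%s}Relationship" % REL_NS)
    }


# Made-up sample data, only to exercise the helpers.
sample_rpr = ET.fromstring(
    '<w:rPr xmlns:w="%s"><w:b w:val="false"/><w:i/></w:rPr>' % W_NS
)
sample_rels = (
    '<Relationships xmlns="%s">'
    '<Relationship Id="rId1" Target="media/image1.png"/>'
    '<Relationship Id="rId2" Target="http://example.com/"/>'
    "</Relationships>" % REL_NS
)

print(is_style_on(sample_rpr, "b"))   # b tag present, but val is false
print(is_style_on(sample_rpr, "i"))   # i tag present with no val
print(emu_to_pixels(952500))
print(relationship_targets(sample_rels)["rId1"])
```

In a real document the rels XML would be read from ‘word/_rels/document.xml.rels’ inside the ZIP file, for example with `zipfile.ZipFile(path).read(...)`, and the resulting dictionary used to resolve the Ids found on hyperlinks and images.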