Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docx reader bug: 1 row table might be parsed as 1 header row with empty body #3285

Closed
ickc opened this issue Dec 7, 2016 · 10 comments
Closed

Comments

@ickc
Copy link
Contributor

ickc commented Dec 7, 2016

It started in pandoc-discuss but I finally got some solid MWE on this:

With 2-table-1.docx,

pandoc -t native 2-table-1.docx
[Table [] [AlignDefault,AlignDefault] [0.0,0.0]
 [[OrderedList (1,Decimal,OneParen)
   [[Para [Str "Some",Space,Str "other",Space,Str "thing"]]
   ,[Para [Str "Something"]]]]
 ,[Plain [Strong [Emph [Str "testing"]]]]]
 []]

Then from this native, write to markdown and read back to native,

pandoc -t native 2-table-1.docx | pandoc -f native -t markdown | pandoc -f markdown -t native
[Table [] [AlignLeft] [0.5277777777777778]
 [[]]
 [[[Plain [Str "1)",Space,Str "Some",Space,Str "other",Space,Str "thing",Space,Strong [Emph [Str "testing"]]]]]
 ,[[Plain [Str "2)",Space,Str "Something"]]]]
,HorizontalRule]

I believe the problem is caused by the last row []] in the 1st native. The pandoc docx reader somehow treated the 1 row table as a header row with empty body.

By the way, as a general rule, is it safe to assert that the align-list, width-list, header-list, and each of the row-list are all having the same length?

@jgm
Copy link
Owner

jgm commented Dec 7, 2016

Perhaps @jkr can tell us more about the heuristics the docx reader uses to determine what is a table header.

By the way, as a general rule, is it safe to assert that the align-list, width-list, header-list, and each of the row-list are all having the same length?

Yes. If we used GADTs I suppose we could enforce this on the type level.

But probably the table AST will have to change anyway to make room for colspans and such.

@jkr
Copy link
Collaborator

jkr commented Dec 8, 2016

Table headers are determined by whether or not the "w:firstRow" attr in "w:tblLook" is set to true. It can also be set by bitmasks, because why not. See here:

http://stackoverflow.com/questions/23134215/ooxml-how-is-a-table-header-heading-encoded

@jkr
Copy link
Collaborator

jkr commented Dec 8, 2016

In the case of the submitted docx file, w:firstRow is true, even though there's only one row. Fix seems like a one-liner: special-case the one-row table when we're figuring out the header in Docx.hs.

Only argument against this would be that a header-only table is legal if that's what you really want. But I think it's worth allowing for the fact that docx forces you into doing things that no one would really want. So I'll put in the fix unless you have any objections.

@jgm
Copy link
Owner

jgm commented Dec 8, 2016 via email

@jkr jkr closed this as completed in 8ced8cb Dec 8, 2016
@ickc
Copy link
Contributor Author

ickc commented Dec 13, 2016

With commit 8ced8cb in pandoc 1.19.1, the output is now

[Table [] [AlignDefault,AlignDefault] [0.0,0.0]
 []
 [[[OrderedList (1,Decimal,OneParen)
    [[Para [Str "Some",Space,Str "other",Space,Str "thing"]]
    ,[Para [Str "Something"]]]]
  ,[Plain [Strong [Emph [Str "testing"]]]]]]]

1 minor problem I have is the header now is an empty list [], not a list of n empty list [[],[]]. Normally, other readers like the markdown reader would take the later approach.

This behavior is not cometic only but may causes problem when using filters, e.g. in panflute.

@sergiocorreia
Copy link

We could introduce flexibility and accept empty lists in headers, alignments and widths.

The cost would be a larger potential for unforeseen bugs down the road. Let me know if anything changes on the pandoc side; if empty lists stay, then I'll just push and update that accepts it as a valid input

@jgm
Copy link
Owner

jgm commented Dec 13, 2016 via email

@ickc
Copy link
Contributor Author

ickc commented Dec 13, 2016

By the way, I notice the widths output by the docx reader is always a list of 0.0s. I didn't test it thoroughly, but I worry that it could break something, e.g. I found that the LaTeX writer would not use minipage if the widths are 0.0s.

@jkr
Copy link
Collaborator

jkr commented Dec 13, 2016

Just pushed a fix for the header row issue. The cell widths have actually never been implemented -- I'll take a look at that later today.

@ickc
Copy link
Contributor Author

ickc commented Dec 14, 2016

@jkr, would the docx reader emits table cells containing block elements? If not, the width might not be critical.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants