-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Crazy ideas: table structure #86
Comments
I'm not very familiar with the use of multi-body tables, but the examples I've seen do normally have constant-width row heads. So Having the span types be a wrapper around At the moment all the writers assume that they may encounter any possible The alternative is to give the Other thoughts:
|
On that topic, we could get rid of the double-normalization that currently happens by simply removing that behaviour from the |
From a filter perspective, I like a lot the power of the new tables but also dread a bit the complexity for those implementing the filters. For instance, it feels a bit inconsistent that, within a TableBody, you use the RowHeadColumns integers to indicate headers on the left but have to create a separate list of rows to indicate headers at the top. When walking over a tree, the structure of a table is also quite different from those of other elements. With most elements, you just traverse the children nodes directly, and each element only has one type of children. However, with tables, you need to traverse the main tree (with multiple TableBody elements) and also need to traverse the head and the foot of the table. Maybe it would be simpler to say that a Table can contain three types of elements, TableHead, TableBody, and TableFoot, with the constraint that TableHead must be the first element and TableFoot must be the last element? With these two changes (adding a HeadRows to TableBody, and moving TableHead TableBody and TableFoot to the same list), a Table would be much easier to iterate, as every table would just have three types of children, which in turn will have lists of Rows, which in turn just have Cells. (I'm not necessarily saying we should go that way, just thinking aloud in line with the title of this issue :) ) |
I'll think out loud for a moment, with the caveat that I don't know much about the characteristics in Haskell programming and am thinking more in the perspective of pandoc filter writer, basically mainly from my pet project pantable written in Python. First, the new table AST is a great thing to happen. It is long overdue. But at the same time it is a very disruptive change. I think any filter involved in manipulating pandoc tables are broken by this. In the case of pantable, I took advantage of this change to rewrite it completely as its new features really worth the efforts. But what I'm saying is may be people writing filter for pandoc table are still playing catch-up to this new AST, a yet another unnecessary change may be unwelcomed. Eg the proposed RowHeadColumns would make it becoming more restrictive. Not that the restriction is that restricting in practice, but it is still a regression. (But I understand the point about head and foot.) About the grid data structure, I think the current AST structure that's more HTML like is fine. I think having a grid structure as an alternative would be helpful. But I think replacing it by that would be an unnecessary change. In fact, in the work in progress pantable, the internal structure is exactly a grid and it basically provide a way you can jump losslessly between it (grid) and panflute AST (basically pandoc like AST). So what I'm saying is probably you should have both, and lazily convert between both on demand within Haskell. (This is speaking from some OOP thinking if that makes sense... Eg if the reader already lay things on a grid, then it has a grid representation already, if the writer requires a grid, then no conversion, else convert and perhaps normalized once here...) For JSON filter, may be it is good to have one single representation only for consistency, and let the filter framework do our own grid and normalization (which is what I'm doing in pantable). Or may be you can allow both and let the filter user choose (Eg an arg from pandoc to control this.) |
Just catching up on this thread. I suppose that since we're already in a period of churn with the table types, further changes might be less disruptive now than if we waited til everything worked again. Especially if the changes made it easier to process tables. I'm curious what the grid representation of tables would look like in a Haskell type? The suggestion to treat row headers like column headers (using an Int) seems a great one. @despresc - are there reasons not to do it this way? |
Do you mean indicating the number of rows in the head of a |
I'm not sure how much easier it will make things. We should think about that before making a change such as this (which would require a lot of changes throughout the code base). Maybe @sergiocorreia could elaborate on that. But it's hard to see a good rationale for using two lists for row headers but using an Int for column headers. |
Yes! Some arguments in favor are:
This then reduces the mental overhead of manipulating tables. For instance, converting a
On the other hand, this might make it a bit harder to create filters that apply to table body heads. Right now, in panflute this would involve: def action(e, doc):
if isinstance(e, TableRow) and isinstance(e.parent, TableBody) and e.location=='head':
pass # do something With the change, the code would be: def action(e, doc):
if isinstance(e, TableRow) and isinstance(e.parent, TableBody) and (e.index < e.parent.num_head_rows):
pass # do something |
Just trying to talk about the few things that is difficult to handle in the current AST, with the mindset of implementing a one-one correspondence of the pandoc table AST to a grid structure in pantable. (Below I used a terminology to distinguishing things and it would be great if there's an official terminology associated with them. Whatever is in double-quote is my made-up terminology. ICA is (Id, class, attr). I called the current AST "tree-like representation" and the proposed one "grid-like representation".)
From my experience of implementing a grid representation of the pandoc table AST in Python, while the above things are annoying, it becomes easy once the grid structure is set in-place. I hold the view that there's no best representation here. (Especially with span, because without span and tree representation is the grid representation.) The default tree representation in pandoc is more suitable if the input/output formats are also tree like, such as HTML and LaTeX. (My guess is that the HTML and LaTeX writers first supporting table span is not a coincidence, but because it is a more natural translation.) That's why I recommend you have 2 representations, a grid one and a tree one. For some intput/output, if the grid one is easier to work with, put it in a grid representation. The grid representation and the tree representation should be one-one correspondance so there's no problem in translating back and forth. And for performance, I think it is debatable if a grid-only approach would be faster. I would expect for some input-output pairs such as the HTML the tree representation is more natural, and hence fastest to work with them alone. Also I am not sure translating the document into a grid structure is (significantly) faster than put it in a tree than a grid. About RowHeadColumns:
|
Just made aware of this discussion by an allusion in pandoc-discuss... Don't know Haskell, but here's a few comments... and other crazy ideas! sergiocorreia:
Been working/playing on and off on a Lua port of docutils' grid_table parser in the past few weeks for a Lua filter, I can echo this impression. In fact, I'd go a bit further and say they could be viewed of the same TableBody kind, which can have a (sub)head and body. The TableHead would be special in being the first with a (sub)head but no body (and no rhc); the TableFoot could be special in being the last, with a (sub)head with a single empty row (not nil), and a body (and no rhc). [In other words (trying to be clearer), if a Table has no TableHead nor TableFoot, you can put enough information in the first TableBody and last TableBody so that a writer -- or intermediate filter -- could know they can be used as head and foot.] Of course it's a bit of a hack to think this way, and perhaps there's a need for a more definite identity (and restrictions) for head and foot, but it was easier for intermediate solutions before translating to pandoc internal structure. tarleb:
Since all cells now have a ICA, wouldn't it be sufficient that pandoc write the "coordinates" in cells' attributes in the current scheme? |
I think we actually should discuss this on pandoc-discuss first, especially change like this is not about adding a feature, but how best to represent table. Signal-to-noise might be lower in pandoc-discuss, but since this change is going to affect a lot of people with no obvious gain it might be better discussed there first. At the very least we can have a community-wise consensus on the last change that is needed regarding the table.
In pantable, by putting the cells on a grid, I don't distinguish the head, "head of body", "body of body", foot in the table itself. I keep a list of int like this:
Going back and forth between table and cells will be difficult here. To find a cell you need to walk the whole table in this case. e.g. normalizing tables probably need to do this. So a grid representation is still advantageous in some operations. One of the premise in this issue is that "Going from a grid structure to a list of rows and cells seems much simpler than the other way around." But I think you probably just pushed "going from 'tree-representation' to 'grid-representation'" earlier in the stage of reader. In cases that the reading format is also "tree-like", such as HTML/LaTeX, may be there's no gain? |
ook, was just surprised about the news. I'll keep the 'noise' to a minimum 😉.
True, was not thinking about going "back and forth", only one way. |
I don't mean that, I mean the topic of this issue might better be discussed in pandoc-discuss first, I don't mean your comment in particular. I meant if we discussed there the signal to noise might be lower but it probably can get more attention from people that would be affected by this kind of change most. It's a bit unusual, but AST change as big as this is unusual. And if we have to change it a 2nd time then we better discuss it more and nailed it. |
This is roughly what I imagine: -- | A grid cell contains either a real table cell, or is the
-- continuation of a column or row-spanning cell. In the latter case,
-- the index of the continued cell is provided.
data GridCell
= ContentCell Attr Alignment RowSpan ColSpan [Block]
| ContinuationCell RowIndex ColIndex
newtype RowIndex = RowIndex Int deriving (Eq, Ix, Ord)
newtype ColIndex = ColIndex Int deriving (Eq, Ix, Ord)
-- | Cells are placed on a grid. Row attributes are stored in a separate array.
data CellGrid = CellGrid
{ cellArray :: Array (RowIndex,ColIndex) GridCell
, rowAttrs :: Array RowIndex Attr
}
data GridTable = GridTable
{ tableAttr :: Attr
, tableCaption :: Caption
, tableColSpecs :: Array ColIndex ColSpec
, tableHead :: CellGrid
, tableBodies :: [CellGrid]
, tableFoot :: CellGrid
, tableRowHeads :: Int
} Or the table could be one large grid, and the different parts would be given as row index ranges: data GridTable = GridTable
{ tableAttr :: Attr
, tableCaption :: Caption
, tableColSpecs :: Array ColIndex ColSpec
, tableGrid :: Array (RowIndex, ColIndex) GridCell
, tableHead :: RowRange
, tableBodies :: [(RowRange, RowRange)] -- (body header, real body)
, tableFoot :: RowRange
, tableRowHeads :: ColRange
} |
FWIW, the grid representation would be really useful to have for docx tables, where rowspans are handled by adding "merge cells" in subsequent rows. |
Can you remind me: is the idea to change the main table type to a grid representation, or to add a GridTable type and conversion functions from/to the Table we have now? (In the latter case, these wouldn't necessarily need to go in pandoc-types; they could go in pandoc itself.) |
The original idea was to update the main table type, but it's probably too late for that now. Maybe it's just my way of thinking, but I realized that I find grids the most natural way to think about tables. I may try to write something akin to a GridTable for the docx writer, or maybe modify the AnnotatedTable, and we can see where that leads us. |
I wouldn't rule out changing the table type again, if the grid model is much easier to deal with. But that's going to be a lot of work. Maybe a good first step would be to implement the type in an auxiliary model in pandoc, together with conversion functions. |
Grid model has its own problems but I agree it seems easier to work with. I agree with @jgm in the direction of
|
I have got to admit that I lost a good bit of enthusiasm for grids after using them for the docx writer. I still think they are the right data structure for that specific case, but I forgot just how wrong it feels to use unsafe operations like array indexing in Haskell. We'd have to provide a number of good helper functions to safely use arrays throughout the code base. I currently tend towards using grids as intermediate structures only. |
We should completely avoid unsafe partial functions like |
Smaller point that came to mind while I was implementing column width reading in the docx reader: would it be better to have the column widths as |
I wish I'd used Rational rather than Double from the beginning. That is indeed a better choice. But maybe not worth fixing now, until the next major revision of pandoc-types. |
I agree rational is a good choice. But what would that becomes in JSON AST? A float or specially formatted string? [pantable (pandoc table ast in Python) allows rational ratio optionally. Also the auto width calculated from say a markdown table is already rational.] |
I suppose the JSON representation could just be a string, yes. That might make it a bit more difficult for consumers of the AST other than pandoc. |
Here are a few thoughts I had while implementing table features in pandoc readers and writers. None of the suggestions have been thought through, so take the below with a grain of salt.
Promote RowHeadColumns from TableBody to the full table. The number of row heads will typically be constant for all table bodies; it is also relevant for, and should also apply to, the table head and foot.
Use grid-based data structure. Going from a grid structure to a list of rows and cells seems much simpler than the other way around. Most writers need to compute the table grid, so it might as well be the main structure. This could provide additional type-level guarantees and make it easier to access cells column-wise, e.g. when checking for the most frequent cell alignment in a column.
A possible data structure would be
Array
from packagearray
. It is already a transitive dependency of pandoc-types (through deepseq) and described in the Haskell 2010 report.HTML limits rowspan to a max value of 65534 (= 2¹⁶ - 2), and it would be reasonable to adopt this limit. RowSpan could be then be a newtype wrapper for
Word16
.The
colspan
attribute is limited to a max value of 1000 in HTML. Like for rowspan, this seems like a reasonable limit. ColSpan could also wrapWord16
.Cc: @despresc
The text was updated successfully, but these errors were encountered: