-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
XLSX support with ExcelJS #248
Conversation
General code clean up
Should |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review and suggested changes: #249
Here's a few more questions I'd have:
- develop ExcelWorkbook in README.md?
- discussion on https://observablehq.com/d/25d78559efffa7fb#comment-c830048f97493242
- can we have {text: xxx} and no hyperlink property?
- escaping html (in links)
- add a few large files to test?
- name: xlsx; ExcelWorkbook, or Spreadsheet?
- confusion on the rows’ numbers (they start at 1 in the traditional string specifier, at 0 in array specifier)
src/xlsx.js
Outdated
} | ||
if (headerRow) for (let c = c0; c <= c1; c++) name(c); | ||
|
||
const output = new Array(r1 - r0 + 1).fill({}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is filling the output with a shared empty object for all rows, whereas the rows with values are reassigned below to new objects. Do we want to use an empty object to represent rows without values, rather than undefined? If we do want to use an empty object, I think we’ll still want a distinct object for each row, rather than sharing the object across rows. That could be done by moving the output[r - r0] = {}
below before the continue
rather that using array.fill here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to use sparseness/undefined initially but Inputs.table
had trouble with it, throwing an error trying to get a field from it. re-using the same object was a pre-optimization from a memory usage standpoint. I also toyed with using a Symbol("empty")
to make it even more explicit.
I slightly want to auto-filter these rows out of the return value since I feel like in usage that would be one of the first things I'd always end up writing anyway in the notebook, but that seemed like possibly surprising behavior at the same time?
* document xlsx (minimalist, we'll work on the notebook first) * fix coverage reporter (avoids a crash on my computer; solution found at tapjs/tapjs#624) * unknown sheet name * simplify rows naming * NN is always called on string (cell specifier such as "AA99") * test name * more range specifiers
Prettier + use default/base tap reporter
Co-authored-by: Mike Bostock <[email protected]>
Co-authored-by: Mike Bostock <[email protected]>
To add an idea while talking to @mbostock: in the future we could offer another option |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really great. Scoped to just what's needed.
I particularly like the hidden column "#" (row reference) which helps to refine the range interactively.
One last question: I wonder if we should clean up the cells we use as headers? I've tested this on xlsx files found at random on the web, and for example in
https://www.un.org/securitycouncil/sites/www.un.org.securitycouncil/files/cross-cutting_poc.xlsx one header is .........Formal agenda item
(with many spaces, represented here with dots) and another one also has a linefeed:
PP/OP/
PRST
it's usable (just a bit awkward), but if we want to do this clean-up, better earlier than later.
Yeah I think we could trim them? Unsure about weird spacing inside them though? Or awkward punctuation too... |
my preference would be to replace all \n \r by spaces, and trim
|
What about combine multiple whitespace characters inside the string into a single space? |
Does d3-dsv do any type of trimming? |
no… ok, then 8-) |
yeah ok, I think the default response is to encourage people to fix sources upstream and clean up those things in the xlsx files, then everyone benefits. I noticed a few values which could use trimming too and was tempted to do it in stdlib, so it's a slippery slope. |
Can add later, but an idea re: the "#" field: I've been tempted to add a |
Why do you want to remove line breaks? Personally I've used them in headers quite a few times to keep long headers more readable/organized. (Sorry if this was already discussed.) Edit: oops, look like this idea was already dropped in a later reply. |
Alternative to #215
Handles dates correctly, using UTC. Based largely on @Fil's previous exploration.
The main goal of this PR is to give people a way to extract data out of xlsx files quickly, efficiently, and correctly. To do that, we assume people should be able to visually recognize and find the data they want to extract, leveraging any previous familiarity with the xlsx file. We don't want to spend much effort on preserving styling or presentation (widths, fonts, value formatting, merged cells, frozen panes) or features used during building a spreadsheet like formula definitions (only results are extracted). At the same time, we might decide some presentation features are worth preserving if we think it would help people trust and recognize the extracted contents of the spreadsheet (number formatting I'd guess could fall into this).
Also, we should make it easy to maintain extracting new or updated data from an updated or mutable file, which implies the importance of the unbounded range feature ("12:" to mean extract starting at row 12 to the end).
The extracted data should work well with the rest of the downstream toolchains in Observable. So, plain JavaScript values (
NaN
) over descriptive objects ({ error: "#DIV/0" }
).The
Workbook
API hopefully will be reusable to represent a Google Sheet just as well in the future.The assumed, frequent workflow of this API is that a user would pass all of the data first to
Inputs.table
or similar for exploratory recognition, then filter it down using therange
andheaders
options described below.Workbook.sheet(name, { range, headers }): Record<string, any>[]
Returns an array of objects representing the contents of cells in a specific sheet of the workbook. An example return value may look something like:
Empty cells are skipped: objects will not include fields or values for them. Empty rows are not skipped assuming they'll aid in data recognition. Values are coerced to their JavaScript types: numbers, strings, Date objects. Formula results are included, but formula definitions ignored. Row numbers from the source sheet are included to assist with range specification and recognition.
name: string | number is a sheet name to get data for. If it's a string, it must match a name in
Workbook.sheetNames
. You can pass a zero-indexed number to get the corresponding sheet in order ofWorkbook.sheetNames
. For example,sheet(0)
is the first sheet.range: string specifies a single rectangular range of cells to extract from the sheet as an Excel-based representation of a range, e.g.
"B4:L123"
to mean from cell B4 in the top left to L123 in the bottom right, inclusive. By default if no range is specified, the entire sheet is extracted.Similar to Excel, the row or column part of the start or end may be omitted to mean the entire row or column, e.g.
"4:123"
to mean rows 4 through 123 inclusive. Extending the standard syntax, you may omit a start or end specifier to mean "A1" or last column and last row, e.g."4:"
to mean row 4 to the end of the sheet.Union
"A1:B3,D1:G3"
and intersection"A1:C3 B2:D4"
specifiers are not supported.headers: boolean will treat the first extracted row as column headers and use their values as field names for returned objects. The default is false. If a value doesn't exist in the header row for a value, column names (A-ZZ) will be used instead. Underscores (_) are appended if field names are repeated.
With
{ range: "2:", headers: true }
, the above penguins data would be: