XLSX support with ExcelJS #248

visnup · 2021-09-03T23:29:35Z

Alternative to #215

Handles dates correctly, using UTC. Based largely on @Fil's previous exploration.

Test suite
Coerce formula errors to NaN
Escape the hyperlink case properly
Require ":" in string ranges (error if missing) to choose the meaning of "A" as either "A:" or "A:A", explicitly
Row numbers

The main goal of this PR is to give people a way to extract data out of xlsx files quickly, efficiently, and correctly. To do that, we assume people should be able to visually recognize and find the data they want to extract, leveraging any previous familiarity with the xlsx file. We don't want to spend much effort on preserving styling or presentation (widths, fonts, value formatting, merged cells, frozen panes) or features used during building a spreadsheet like formula definitions (only results are extracted). At the same time, we might decide some presentation features are worth preserving if we think it would help people trust and recognize the extracted contents of the spreadsheet (number formatting I'd guess could fall into this).

Also, we should make it easy to keep extracting new or updated data as a file changes, which is why the unbounded range feature matters ("12:" means extract from row 12 to the end).

The extracted data should work well with the rest of the downstream toolchains in Observable. So, plain JavaScript values (NaN) over descriptive objects ({ error: "#DIV/0" }).
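As a rough illustration of that preference, here is a minimal sketch. The helper name `toPlainValue` is hypothetical, and the `{ error }` input shape is only modeled on the descriptive object mentioned above; it is not a claim about what any parser emits.

```javascript
// Hypothetical helper: collapse a descriptive cell value into a plain JavaScript value.
// Assumption: formula errors arrive as objects with an `error` property.
function toPlainValue(value) {
  const isErrorObject =
    value !== null &&
    typeof value === "object" &&
    !(value instanceof Date) &&
    "error" in value;
  return isErrorObject ? NaN : value; // formula errors become NaN
}

const cells = [39.1, { error: "#DIV/0!" }, "Torgersen"];
const plain = cells.map(toPlainValue);

// Downstream numeric code can then skip bad values with Number.isNaN.
const numbers = plain.filter((v) => typeof v === "number" && !Number.isNaN(v));
```

The upside is that downstream code never needs to know about a custom error shape; the cost is that the specific error code is lost.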

The Workbook API hopefully will be reusable to represent a Google Sheet just as well in the future.

The assumed, frequent workflow of this API is that a user would pass all of the data first to Inputs.table or similar for exploratory recognition, then filter it down using the range and headers options described below.

Workbook.sheet(name, { range, headers }): Record<string, any>[]

Returns an array of objects representing the contents of cells in a specific sheet of the workbook. An example return value may look something like:

[ { A: "Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network." },
  { A: "Region", B: "Island", C: "Date Egg", D: "Culmen Length (mm)" },
  { A: "Anvers", B: "Torgersen", C: 2007-11-11, D: 39.1 },
  ...
]

Empty cells are skipped: objects will not include fields or values for them. Empty rows are not skipped, on the assumption that they aid in data recognition. Values are coerced to their JavaScript types: numbers, strings, Date objects. Formula results are included, but formula definitions are ignored. Row numbers from the source sheet are included to assist with range specification and recognition.
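To make those rules concrete, here is a minimal sketch that builds row objects from a grid of cell values. The function `toRows` is hypothetical, and storing the row number in a non-enumerable "#" property is an assumption of this sketch rather than a description of the merged code.

```javascript
// Hypothetical sketch: turn a grid (array of arrays) into row objects.
// Empty cells are skipped, empty rows are kept as empty objects.
function toRows(grid, firstRowNumber = 1) {
  return grid.map((cells, i) => {
    const row = {};
    // Assumption: the source row number is non-enumerable, so it stays out of the data fields.
    Object.defineProperty(row, "#", { value: firstRowNumber + i });
    cells.forEach((value, c) => {
      if (value === undefined || value === null) return; // skip empty cells
      row[String.fromCharCode(65 + c)] = value; // single-letter columns A..Z only
    });
    return row;
  });
}

const extracted = toRows([
  ["Region", "Island"],
  [], // an empty row is preserved
  ["Anvers", undefined, 39.1], // the empty cell in column B is skipped
]);
```

Keeping empty rows means array positions still line up with the sheet, which is what makes the row numbers useful when narrowing a range.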

name: string | number is a sheet name to get data for. If it's a string, it must match a name in Workbook.sheetNames. You can pass a zero-indexed number to get the corresponding sheet in order of Workbook.sheetNames. For example, sheet(0) is the first sheet.
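A minimal sketch of that lookup, assuming a plain array of sheet names. The function name `resolveSheetName` and the exact error messages are hypothetical.

```javascript
// Hypothetical sketch: resolve the `name` argument against a list of sheet names.
function resolveSheetName(sheetNames, name) {
  if (typeof name === "number") {
    // Numbers are treated as zero-based positions in sheetNames.
    if (!Number.isInteger(name) || name < 0 || name >= sheetNames.length) {
      throw new Error(`Sheet index out of range: ${name}`);
    }
    return sheetNames[name];
  }
  // Strings must match an existing sheet name exactly.
  if (!sheetNames.includes(name)) throw new Error(`Sheet not found: ${name}`);
  return name;
}

const sheetNames = ["Penguins", "Notes"];
const first = resolveSheetName(sheetNames, 0); // "Penguins"
const byName = resolveSheetName(sheetNames, "Notes"); // "Notes"
```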

range: string specifies a single rectangular range of cells to extract from the sheet as an Excel-based representation of a range, e.g. "B4:L123" to mean from cell B4 in the top left to L123 in the bottom right, inclusive. By default if no range is specified, the entire sheet is extracted.

Similar to Excel, the row or column part of the start or end may be omitted to mean the entire row or column, e.g. "4:123" to mean rows 4 through 123 inclusive. Extending the standard syntax, you may omit a start or end specifier to mean "A1" or last column and last row, e.g. "4:" to mean row 4 to the end of the sheet.
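The range syntax above can be sketched as a small parser. This is only an illustration: the names `columnIndex`, `parseCell` and `parseRange` are hypothetical, and returning zero-based `[[c0, r0], [c1, r1]]` pairs is an assumption of the sketch.

```javascript
// Convert column letters to a zero-based index: "A" -> 0, "Z" -> 25, "AA" -> 26.
function columnIndex(letters) {
  let n = 0;
  for (const ch of letters) n = n * 26 + (ch.charCodeAt(0) - 64);
  return n - 1;
}

// Parse one side of a range such as "B4", "4", "B" or "".
// Omitted parts come back as undefined.
function parseCell(part) {
  const match = /^([A-Z]*)(\d*)$/.exec(part);
  if (!match) throw new Error(`Malformed cell reference: ${part}`);
  const [, letters, digits] = match;
  return [
    letters ? columnIndex(letters) : undefined,
    digits ? Number(digits) - 1 : undefined,
  ];
}

// Parse a full specifier, filling omitted parts from the sheet dimensions.
function parseRange(specifier, { columnCount, rowCount }) {
  const parts = String(specifier).split(":");
  if (parts.length !== 2) {
    throw new Error(`Range must contain exactly one ":": ${specifier}`);
  }
  const [[c0 = 0, r0 = 0], [c1 = columnCount - 1, r1 = rowCount - 1]] =
    parts.map(parseCell);
  return [[c0, r0], [c1, r1]];
}

const dims = { columnCount: 20, rowCount: 200 };
parseRange("B4:L123", dims); // [[1, 3], [11, 122]]
parseRange("4:123", dims);   // [[0, 3], [19, 122]]
parseRange("4:", dims);      // [[0, 3], [19, 199]]
```

Requiring the colon means a bare "A" is rejected rather than guessed at, which matches the checklist item about choosing between "A:" and "A:A" explicitly.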

Union "A1:B3,D1:G3" and intersection "A1:C3 B2:D4" specifiers are not supported.

headers: boolean will treat the first extracted row as column headers and use their values as field names for returned objects. The default is false. If the header row has no value for a column, the column name (A-ZZ) is used instead. Underscores (_) are appended if field names are repeated.
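The naming rules can be sketched as follows. The functions `columnName` and `fieldNames` are hypothetical, and treating an empty string as a missing header is an assumption.

```javascript
// Convert a zero-based column index to letters: 0 -> "A", 25 -> "Z", 26 -> "AA".
function columnName(index) {
  let name = "";
  for (let n = index + 1; n > 0; n = Math.floor((n - 1) / 26)) {
    name = String.fromCharCode(65 + ((n - 1) % 26)) + name;
  }
  return name;
}

// Hypothetical sketch: derive one unique field name per column from a header row.
function fieldNames(headerRow, columnCount) {
  const seen = new Set();
  const names = [];
  for (let c = 0; c < columnCount; c++) {
    const value = headerRow[c];
    // Fall back to the column letters when the header cell is empty.
    let name = value == null || value === "" ? columnName(c) : String(value);
    // Append underscores until the name is unique.
    while (seen.has(name)) name += "_";
    seen.add(name);
    names.push(name);
  }
  return names;
}

fieldNames(["Region", "Island", undefined, "Region"], 4);
// ["Region", "Island", "C", "Region_"]
```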

With { range: "2:", headers: true }, the above penguins data would be:

[ { Region: "Anvers", Island: "Torgersen", "Date Egg": 2007-11-11, "Culmen Length (mm)": 39.1 },
  ...
]

visnup · 2021-09-05T05:51:36Z

Should fileAttachment.xlsx() optionally take the same arguments as ExcelWorkbook.sheet(name, options) as a kind of shorthand? And if you pass them, it calls .sheet for you and returns that value? It would make the case of extracting a specific sheet and range a one-liner…

Fil

Code review and suggested changes: #249

Here are a few more questions I'd have:

develop ExcelWorkbook in README.md?
discussion on https://observablehq.com/d/25d78559efffa7fb#comment-c830048f97493242
can we have {text: xxx} and no hyperlink property?
escaping html (in links)
add a few large files to test?
name: xlsx; ExcelWorkbook, or Spreadsheet?
confusion on the rows’ numbers (they start at 1 in the traditional string specifier, at 0 in array specifier)

mbostock · 2021-09-06T15:17:38Z

src/xlsx.js

+  }
+  if (headerRow) for (let c = c0; c <= c1; c++) name(c);
+
+  const output = new Array(r1 - r0 + 1).fill({});


This is filling the output with a shared empty object for all rows, whereas the rows with values are reassigned below to new objects. Do we want to use an empty object to represent rows without values, rather than undefined? If we do want to use an empty object, I think we’ll still want a distinct object for each row, rather than sharing the object across rows. That could be done by moving the output[r - r0] = {} below before the continue rather than using array.fill here.

I tried to use sparseness/undefined initially, but Inputs.table had trouble with it, throwing an error when trying to get a field from it. Reusing the same object was a premature optimization for memory usage. I also toyed with using a Symbol("empty") to make it even more explicit.

I slightly want to auto-filter these rows out of the return value since I feel like in usage that would be one of the first things I'd always end up writing anyway in the notebook, but that seemed like possibly surprising behavior at the same time?
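For readers following this thread, the aliasing concern and the filtering idea can be shown with plain arrays. This is a standalone sketch, not code from the PR; the sample rows are made up.

```javascript
// Array.prototype.fill puts the same object in every slot,
// so a write through one row is visible through all of them.
const shared = new Array(3).fill({});
shared[0].A = "Anvers";
console.log(shared[2].A); // "Anvers"

// Array.from with a mapping function creates a distinct object per slot.
const distinct = Array.from({ length: 3 }, () => ({}));
distinct[0].A = "Anvers";
console.log(distinct[2].A); // undefined

// Dropping rows without any fields, as a notebook might do after extraction.
const rows = [{ A: "Region" }, {}, { A: "Anvers" }];
const nonEmpty = rows.filter((row) => Object.keys(row).length > 0);
```

Leaving the filter to the caller keeps array positions aligned with sheet rows, at the cost of one extra line in the notebook.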

* document xlsx (minimalist, we'll work on the notebook first)
* fix coverage reporter (avoids a crash on my computer; solution found at tapjs/tapjs#624)
* unknown sheet name
* simplify rows naming
* NN is always called on string (cell specifier such as "AA99")
* test name
* more range specifiers

Prettier + use default/base tap reporter

Co-authored-by: Mike Bostock <[email protected]>

visnup · 2021-09-14T22:24:06Z

@mbostock @Fil ok, this is ready for re-review. should be pretty close or final?

visnup · 2021-09-15T00:30:25Z

To add an idea while talking to @mbostock: in the future we could offer another option raw: true that would return cell values as Objects, but with valueOf and toString methods which would coerce them back to the more primitive values we're returning here. That would give people an intermediate method of getting at more of the stored information without having to use ExcelJS or SheetJS directly.
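A minimal sketch of what such a wrapper might look like. The class name `RawCell` and the `formula` detail are hypothetical; nothing like this exists in the PR.

```javascript
// Hypothetical wrapper: keeps extra cell details while coercing back to the primitive value.
class RawCell {
  constructor(value, details = {}) {
    Object.assign(this, details); // e.g. formula, number format
    this.value = value;
  }
  valueOf() {
    return this.value; // used by arithmetic and comparisons
  }
  toString() {
    return String(this.value); // used by template literals and String()
  }
}

const cell = new RawCell(42, { formula: "SUM(A1:A3)" });
const sum = cell + 1;      // 43, via valueOf
const label = `${cell}`;   // "42", via toString
const big = cell > 40;     // true, via valueOf
```

One caveat with this approach: strict equality (`cell === 42`) is still false, so code that compares values would need explicit coercion.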

Fil

Really great. Scoped to just what's needed.

I particularly like the hidden column "#" (row reference) which helps to refine the range interactively.

One last question: I wonder if we should clean up the cells we use as headers? I've tested this on xlsx files found at random on the web, and for example in
https://www.un.org/securitycouncil/sites/www.un.org.securitycouncil/files/cross-cutting_poc.xlsx one header is .........Formal agenda item (with many spaces, represented here with dots) and another one also has a linefeed:

 PP/OP/
PRST

it's usable (just a bit awkward), but if we want to do this clean-up, better earlier than later.

visnup · 2021-09-15T14:31:08Z

One last question: I wonder if we should clean up the cells we use as headers?

Yeah I think we could trim them? Unsure about weird spacing inside them though? Or awkward punctuation too...

Fil · 2021-09-15T14:41:51Z

my preference would be to replace all \n \r by spaces, and trim

visnup · 2021-09-15T14:53:52Z

my preference would be to replace all \n \r by spaces, and trim

What about combine multiple whitespace characters inside the string into a single space?

visnup · 2021-09-15T14:55:47Z

Does d3-dsv do any type of trimming?

Fil · 2021-09-15T15:11:04Z

no… ok, then 8-)

visnup · 2021-09-15T15:34:28Z

no… ok, then 8-)

yeah ok, I think the default response is to encourage people to fix sources upstream and clean up those things in the xlsx files, then everyone benefits. I noticed a few values which could use trimming too and was tempted to do it in stdlib, so it's a slippery slope.
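For files that can't be fixed upstream, the cleanup discussed above could be done by the caller after extraction. A minimal sketch, with hypothetical helpers `cleanHeader` and `renameFields`:

```javascript
// Collapse every run of whitespace (including \r and \n) to one space, then trim.
function cleanHeader(name) {
  return String(name).replace(/\s+/g, " ").trim();
}

// Return new row objects with each enumerable field renamed.
// Caveats: non-enumerable properties are not copied, and two fields
// that clean to the same name would collide (the later one wins).
function renameFields(rows, rename) {
  return rows.map((row) =>
    Object.fromEntries(Object.entries(row).map(([key, value]) => [rename(key), value]))
  );
}

const messy = [{ "         Formal agenda item": "x", " PP/OP/\nPRST": "PP" }];
const cleaned = renameFields(messy, cleanHeader);
// [{ "Formal agenda item": "x", "PP/OP/ PRST": "PP" }]
```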

visnup · 2021-09-15T15:35:59Z

Can add later, but an idea re: the "#" field: I've been tempted to add a widths property on the returned array for Inputs.table to use. We could do it based on the actual widths in the spreadsheet too.

mootari · 2021-09-15T16:05:29Z

my preference would be to replace all \n \r by spaces, and trim

Why do you want to remove line breaks? Personally I've used them in headers quite a few times to keep long headers more readable/organized. (Sorry if this was already discussed.)

Edit: oops, looks like this idea was already dropped in a later reply.

XLSX support with ExcelJS

a8f7998

Prettier

38fceab

visnup requested review from mbostock and sydneypalumbo September 3, 2021 23:53

mbostock reviewed Sep 4, 2021

visnup added 2 commits September 4, 2021 14:29

Change range option to nested arrays

9226446

General code clean up

Tests and bug fixes

77159f0

visnup marked this pull request as ready for review September 5, 2021 05:26

Fil reviewed Sep 5, 2021

Fil reviewed Sep 5, 2021

Respect header row order when resolving conflicts

d8904d0

Fil reviewed Sep 6, 2021

mbostock reviewed Sep 6, 2021

mbostock reviewed Sep 6, 2021

mbostock reviewed Sep 6, 2021

Fil and others added 12 commits September 7, 2021 10:58

Column only range test case

fd177b0

sheetNames is enumerable

f6ddcff

One more test to check for empty columns

9b9eab6

Prettier + use default/base tap reporter

Add Node 16 to the test matrix

a845086

Revert reporter to classic for Node 16

f30b626

Don't fail matrix quickly in actions

e421983

More coverage.

e8b0153

Example of .xlsx in README

e7c82d4

Remove Excel from Workbook naming

1440400

Fix dates

d444ebe

Fix for sharedFormula

410f4c9

visnup and others added 7 commits September 13, 2021 16:19

Update README.md

fcb6eb7

Co-authored-by: Mike Bostock <[email protected]>

Apply suggestions from code review

162d55e

Co-authored-by: Mike Bostock <[email protected]>

Simplify hyperlinks

9c9e91b

Prettier

1a0345e

Pass options through

5daef26

Rename helper functions for clarity, range tests

92c4af1

Simpler

6f13d59

visnup mentioned this pull request Sep 14, 2021

Fil/xlsx #251

Closed

visnup requested a review from mbostock September 14, 2021 17:53

visnup added 2 commits September 14, 2021 10:55

Consistent comment format

c52b73f

Consistent regexes

5b21a79

visnup force-pushed the visnup/xlsx branch from 2c7fd82 to 5b21a79 Compare September 14, 2021 17:59

Fix hyperlinks for certain cases

0a59d0c

Fil reviewed Sep 15, 2021

Fil approved these changes Sep 15, 2021

visnup merged commit db58b85 into main Sep 15, 2021

visnup deleted the visnup/xlsx branch September 15, 2021 16:42

visnup mentioned this pull request Sep 15, 2021

Update Inputs to 0.9.2. #253

Merged

This was referenced Sep 19, 2021

update recommended libraries #257

Merged

XLSX file attachments #215

Closed
