Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XLSX support with ExcelJS #248

Merged
merged 35 commits into from
Sep 15, 2021
Merged

XLSX support with ExcelJS #248

merged 35 commits into from
Sep 15, 2021

Conversation

visnup
Copy link
Member

@visnup visnup commented Sep 3, 2021

Alternative to #215

Handles dates correctly, using UTC. Based largely on @Fil's previous exploration.

  • Test suite
  • Coerce formula errors to NaN
  • Escape the hyperlink case properly
  • Require ":" in string ranges (error if missing) to choose the meaning of "A" as either "A:" or "A:A", explicitly
  • Row numbers

Screen Shot 2021-09-10 at 3 45 19 PM

The main goal of this PR is to give people a way to extract data out of xlsx files quickly, efficiently, and correctly. To do that, we assume people should be able to visually recognize and find the data they want to extract, leveraging any previous familiarity with the xlsx file. We don't want to spend much effort on preserving styling or presentation (widths, fonts, value formatting, merged cells, frozen panes) or features used during building a spreadsheet like formula definitions (only results are extracted). At the same time, we might decide some presentation features are worth preserving if we think it would help people trust and recognize the extracted contents of the spreadsheet (number formatting I'd guess could fall into this).

Also, we should make it easy to maintain extracting new or updated data from an updated or mutable file, which implies the importance of the unbounded range feature ("12:" to mean extract starting at row 12 to the end).

The extracted data should work well with the rest of the downstream toolchains in Observable. So, plain JavaScript values (NaN) over descriptive objects ({ error: "#DIV/0" }).

The Workbook API hopefully will be reusable to represent a Google Sheet just as well in the future.

The assumed, frequent workflow of this API is that a user would pass all of the data first to Inputs.table or similar for exploratory recognition, then filter it down using the range and headers options described below.

Workbook.sheet(name, { range, headers }): Record<string, any>[]

Returns an array of objects representing the contents of cells in a specific sheet of the workbook. An example return value may look something like:

[ { A: "Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network." },
  { A: "Region", B: "Island", C: "Date Egg", D: "Culmen Length (mm)" },
  { A: "Anvers", B: "Torgersen", C: 2007-11-11, D: 39.1 },
  ...
]

Empty cells are skipped: objects will not include fields or values for them. Empty rows are not skipped assuming they'll aid in data recognition. Values are coerced to their JavaScript types: numbers, strings, Date objects. Formula results are included, but formula definitions ignored. Row numbers from the source sheet are included to assist with range specification and recognition.

name: string | number is a sheet name to get data for. If it's a string, it must match a name in Workbook.sheetNames. You can pass a zero-indexed number to get the corresponding sheet in order of Workbook.sheetNames. For example, sheet(0) is the first sheet.

range: string specifies a single rectangular range of cells to extract from the sheet as an Excel-based representation of a range, e.g. "B4:L123" to mean from cell B4 in the top left to L123 in the bottom right, inclusive. By default if no range is specified, the entire sheet is extracted.

Similar to Excel, the row or column part of the start or end may be omitted to mean the entire row or column, e.g. "4:123" to mean rows 4 through 123 inclusive. Extending the standard syntax, you may omit a start or end specifier to mean "A1" or last column and last row, e.g. "4:" to mean row 4 to the end of the sheet.

Union "A1:B3,D1:G3" and intersection "A1:C3 B2:D4" specifiers are not supported.

headers: boolean will treat the first extracted row as column headers and use their values as field names for returned objects. The default is false. If a value doesn't exist in the header row for a value, column names (A-ZZ) will be used instead. Underscores (_) are appended if field names are repeated.

With { range: "2:", headers: true }, the above penguins data would be:

[ { Region: "Anvers", Island: "Torgersen", "Date Egg": 2007-11-11, "Culmen Length (mm)": 39.1 },
  ...
]

src/xlsx.js Outdated Show resolved Hide resolved
src/fileAttachment.js Outdated Show resolved Hide resolved
@visnup visnup marked this pull request as ready for review September 5, 2021 05:26
@visnup
Copy link
Member Author

visnup commented Sep 5, 2021

Should fileAttachment.xslx() optionally take the same arguments as ExcelWorkbook.sheet(name, options) as a kind of shorthand? And if you pass them, it calls .sheet for you and returns that value? It would make the case of extracting a specific sheet and range a one-liner…

test/xlsx-test.js Outdated Show resolved Hide resolved
src/xlsx.js Outdated Show resolved Hide resolved
Copy link
Contributor

@Fil Fil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review and suggested changes: #249

Here's a few more questions I'd have:

  • develop ExcelWorkbook in README.md?
  • discussion on https://observablehq.com/d/25d78559efffa7fb#comment-c830048f97493242
  • can we have {text: xxx} and no hyperlink property?
  • escaping html (in links)
  • add a few large files to test?
  • name: xlsx; ExcelWorkbook, or Spreadsheet?
  • confusion on the rows’ numbers (they start at 1 in the traditional string specifier, at 0 in array specifier)

src/xlsx.js Outdated Show resolved Hide resolved
src/xlsx.js Outdated Show resolved Hide resolved
src/xlsx.js Outdated
}
if (headerRow) for (let c = c0; c <= c1; c++) name(c);

const output = new Array(r1 - r0 + 1).fill({});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is filling the output with a shared empty object for all rows, whereas the rows with values are reassigned below to new objects. Do we want to use an empty object to represent rows without values, rather than undefined? If we do want to use an empty object, I think we’ll still want a distinct object for each row, rather than sharing the object across rows. That could be done by moving the output[r - r0] = {} below before the continue rather that using array.fill here.

Copy link
Member Author

@visnup visnup Sep 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to use sparseness/undefined initially but Inputs.table had trouble with it, throwing an error trying to get a field from it. re-using the same object was a pre-optimization from a memory usage standpoint. I also toyed with using a Symbol("empty") to make it even more explicit.

I slightly want to auto-filter these rows out of the return value since I feel like in usage that would be one of the first things I'd always end up writing anyway in the notebook, but that seemed like possibly surprising behavior at the same time?

src/xlsx.js Outdated Show resolved Hide resolved
Fil and others added 12 commits September 7, 2021 10:58
* document xlsx (minimalist, we'll work on the notebook first)

* fix coverage reporter

(avoids a crash on my computer; solution found at tapjs/tapjs#624)

* unknown sheet name

* simplify rows naming

* NN is always called on string (cell specifier such as "AA99")

* test name

* more range specifiers
Prettier + use default/base tap reporter
@visnup visnup mentioned this pull request Sep 14, 2021
@visnup
Copy link
Member Author

visnup commented Sep 14, 2021

@mbostock @Fil ok, this is ready for re-review. should be pretty close or final?

@visnup
Copy link
Member Author

visnup commented Sep 15, 2021

To add an idea while talking to @mbostock: in the future we could offer another option raw: true that would return cell values as Objects, but with valueOf and toString methods which would coerce them back to the more primitive values we're returning here. That would give people an intermediate method of getting at more of the stored information without having to use ExcelJS or SheetJS directly.

Copy link
Contributor

@Fil Fil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really great. Scoped to just what's needed.

I particularly like the hidden column "#" (row reference) which helps to refine the range interactively.

One last question: I wonder if we should clean up the cells we use as headers? I've tested this on xlsx files found at random on the web, and for example in
https://www.un.org/securitycouncil/sites/www.un.org.securitycouncil/files/cross-cutting_poc.xlsx one header is .........Formal agenda item (with many spaces, represented here with dots) and another one also has a linefeed:

 PP/OP/
PRST

it's usable (just a bit awkward), but if we want to do this clean-up, better earlier than later.

@visnup
Copy link
Member Author

visnup commented Sep 15, 2021

One last question: I wonder if we should clean up the cells we use as headers?

Yeah I think we could trim them? Unsure about weird spacing inside them though? Or awkward punctuation too...

@Fil
Copy link
Contributor

Fil commented Sep 15, 2021 via email

@visnup
Copy link
Member Author

visnup commented Sep 15, 2021

my preference would be to replace all \n \r by spaces, and trim

What about combine multiple whitespace characters inside the string into a single space?

@visnup
Copy link
Member Author

visnup commented Sep 15, 2021

Does d3-dsv do any type of trimming?

@Fil
Copy link
Contributor

Fil commented Sep 15, 2021

no… ok, then 8-)

@visnup
Copy link
Member Author

visnup commented Sep 15, 2021

no… ok, then 8-)

yeah ok, I think the default response is to encourage people to fix sources upstream and clean up those things in the xlsx files, then everyone benefits. I noticed a few values which could use trimming too and was tempted to do it in stdlib, so it's a slippery slope.

@visnup
Copy link
Member Author

visnup commented Sep 15, 2021

Can add later, but an idea re: the "#" field: I've been tempted to add a widths property on the returned array for Inputs.table to use. We could do it based on the actual widths in the spreadsheet too.

@mootari
Copy link
Member

mootari commented Sep 15, 2021

my preference would be to replace all \n \r by spaces, and trim

Why do you want to remove line breaks? Personally I've used them in headers quite a few times to keep long headers more readable/organized. (Sorry if this was already discussed.)

Edit: oops, look like this idea was already dropped in a later reply.

@visnup visnup merged commit db58b85 into main Sep 15, 2021
@visnup visnup deleted the visnup/xlsx branch September 15, 2021 16:42
@visnup visnup mentioned this pull request Sep 15, 2021
This was referenced Sep 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants