feat(TableParser): Add getCleanMatrix() method + test #113

Merged on Mar 25, 2022 (1 commit)

@adrienjoly (Owner) commented Mar 25, 2022

Problem

As seen in https://github.com/adrienjoly/npm-pdfreader-example/blob/master/parseTable.js, it's complicated to render a table that was parsed from a PDF file using TableParser.

The existing getMatrix() method returns a 3-dimensional matrix instead of a 2-dimensional one, because a single cell (one row and column) can hold more than one text item (e.g. when a word is split into 2 items, for some reason).

Proposed solution

Add a getCleanMatrix() method that returns a 2-dimensional matrix, with one string per cell, that can be passed directly to console.table().
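To make the shape difference concrete, here is a minimal sketch of collapsing a 3-dimensional matrix into a 2-dimensional one. It uses hand-written data, not actual parser output, and `flattenMatrix` is a hypothetical helper written for illustration. It is not the code added by this PR, and the `{ text }` item shape is an assumption based on the example of use.

```javascript
// Hypothetical 3-dimensional matrix: rows -> columns -> text items.
// Hand-written for illustration; not actual TableParser output.
const matrix3d = [
  [[{ text: "Version" }], [{ text: "LTS" }]],
  // The first cell holds a word that was split into 2 items.
  [[{ text: "Node.js " }, { text: "17.1.0" }], []],
];

// flattenMatrix is a hypothetical helper, not part of pdfreader:
// it joins the text of all items that share a cell into one string.
function flattenMatrix(matrix, collisionSeparator = "") {
  return matrix.map((row) =>
    row.map((items) => items.map((item) => item.text).join(collisionSeparator))
  );
}

const matrix2d = flattenMatrix(matrix3d);
console.table(matrix2d); // each cell is now a plain string
```

With an empty separator, the two items in the first cell of the second row are joined back into a single string, which is the role `collisionSeparator` appears to play in the example of use.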

Example of use

    // assumes the package is installed from npm as "pdfreader" (CommonJS)
    const lib = require("pdfreader");
    const { PdfReader } = lib;

    // the thresholds were determined manually, based on the horizontal position (x) of the column headers
    const colThresholds = [6.8, 9.5, 13.3, 16.7, 18.4, 28, 32, 36, Infinity];

    // returns the index of the first column whose threshold is greater than the item's x position
    const columnQuantitizer = (item) => {
      return colThresholds.findIndex(
        (colThreshold) => parseFloat(item.x) < colThreshold
      );
    };

    const table = new lib.TableParser();
    new PdfReader().parseFileItems("./test/sample-table.pdf", (err, item) => {
      if (err) console.error(err);
      else if (!item) {
        // no more items: the whole file was parsed, so print the table
        console.table(table.getCleanMatrix({ collisionSeparator: "" })); // 👈
      } else if (item.text) {
        table.processItem(item, columnQuantitizer(item));
      }
    });
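The column lookup can be tried on its own, without a PDF file. The sketch below reuses the thresholds from the example; `columnIndexFor` is a hypothetical name, and the x positions are made up for illustration rather than taken from the sample PDF.

```javascript
// Same thresholds as in the example of use: each value is the exclusive
// upper bound of a column, and Infinity catches everything to the right.
const colThresholds = [6.8, 9.5, 13.3, 16.7, 18.4, 28, 32, 36, Infinity];

// columnIndexFor is a hypothetical standalone variant of columnQuantitizer:
// it returns the index of the first threshold strictly greater than x.
const columnIndexFor = (x) =>
  colThresholds.findIndex((colThreshold) => parseFloat(x) < colThreshold);

// Made-up x positions, chosen to show the boundary behaviour.
const samples = ["1.5", "6.8", "9.49", "40"].map(columnIndexFor);
console.log(samples); // [ 0, 1, 1, 8 ]
```

Because the comparison is strict, an item whose x equals a threshold exactly (6.8 here) lands in the next column, and the trailing Infinity ensures `findIndex` never returns -1 for a numeric x.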

Result

As displayed with console.table(cleanMatrix):

┌─────────┬───────────────────┬───────────┬──────────────┬──────────────┬─────────┬──────────────────────────┬─────────────┬─────────────┬────────┐
│ (index) │         0         │     1     │      2       │      3       │    4    │            5             │      6      │      7      │   8    │
├─────────┼───────────────────┼───────────┼──────────────┼──────────────┼─────────┼──────────────────────────┼─────────────┼─────────────┼────────┤
│    0    │     'Version'     │   'LTS'   │    'Date'    │     'V8'     │  'npm'  │ 'NODE_MODULE_VERSION[1]' │             │             │        │
│    1    │ 'Node.js 17.1.0'  │           │ '2021-11-09' │ '9.5.172.25' │ '8.1.2' │          '102'           │ 'Downloads' │ 'Changelog' │ 'Docs' │
│    2    │ 'Node.js 17.0.1'  │           │ '2021-10-20' │ '9.5.172.21' │ '8.1.0' │          '102'           │ 'Downloads' │ 'Changelog' │ 'Docs' │
│    3    │ 'Node.js 17.0.0'  │           │ '2021-10-19' │ '9.5.172.21' │ '8.1.0' │          '102'           │ 'Downloads' │ 'Changelog' │ 'Docs' │
│    4    │ 'Node.js 16.14.2' │ 'Gallium' │ '2022-03-17' │ '9.4.146.24' │ '8.5.0' │           '93'           │ 'Downloads' │ 'Changelog' │ 'Docs' │
│    5    │ 'Node.js 16.14.1' │ 'Gallium' │ '2022-03-16' │ '9.4.146.24' │ '8.5.0' │           '93'           │ 'Downloads' │ 'Changelog' │ 'Docs' │
│    6    │ 'Node.js 16.14.0' │ 'Gallium' │ '2022-02-08' │ '9.4.146.24' │ '8.3.1' │           '93'           │ 'Downloads' │ 'Changelog' │ 'Docs' │
│    7    │ 'Node.js 16.13.2' │ 'Gallium' │ '2022-01-10' │ '9.4.146.24' │ '8.1.2' │           '93'           │ 'Downloads' │ 'Changelog' │ 'Docs' │
│    8    │ 'Node.js 16.13.1' │ 'Gallium' │ '2021-12-01' │ '9.4.146.24' │ '8.1.2' │           '93'           │ 'Downloads' │ 'Changelog' │ 'Docs' │
│    9    │ 'Node.js 16.13.0' │ 'Gallium' │ '2021-10-26' │ '9.4.146.19' │ '8.1.0' │           '93'           │ 'Downloads' │ 'Changelog' │ 'Docs' │
│   10    │ 'Node.js 16.12.0' │           │ '2021-10-20' │ '9.4.146.19' │ '8.1.0' │           '93'           │ 'Downloads' │ 'Changelog' │ 'Docs' │
└─────────┴───────────────────┴───────────┴──────────────┴──────────────┴─────────┴──────────────────────────┴─────────────┴─────────────┴────────┘

@adrienjoly self-assigned this on Mar 25, 2022
@adrienjoly changed the title from "feat(TableParser): Add getCleanMatrix() method" to "feat(TableParser): Add getCleanMatrix() method + test" on Mar 25, 2022
@adrienjoly marked this pull request as ready for review on March 25, 2022, 18:11
@adrienjoly merged commit 281eb70 into master on Mar 25, 2022
@adrienjoly deleted the feat/tableparser-getcleanmatrix branch on March 25, 2022, 18:12
@github-actions

🎉 This PR is included in version 1.4.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
