Commit

Add improved docs
wooorm committed Nov 11, 2022
1 parent 54baf82 commit 4d1626d
Showing 1 changed file with 123 additions and 45 deletions.
168 changes: 123 additions & 45 deletions readme.md
@@ -6,28 +6,71 @@
[![Size][size-badge]][size]
[![Chat][chat-badge]][chat]

A Latin-script language parser for [**retext**][retext] producing **[nlcst][]**
nodes.
A natural language parser, for Latin-script languages, that produces [nlcst][].

## Contents

* [What is this?](#what-is-this)
* [When should I use this?](#when-should-i-use-this)
* [Install](#install)
* [Use](#use)
* [API](#api)
* [`ParseLatin()`](#parselatin)
* [Algorithm](#algorithm)
* [Types](#types)
* [Compatibility](#compatibility)
* [Related](#related)
* [Contribute](#contribute)
* [Security](#security)
* [License](#license)

## What is this?

This package exposes a parser that takes Latin-script natural language and
produces a syntax tree.

## When should I use this?

If you want to handle natural language as syntax trees manually, use this.

Alternatively, you can use the retext plugin [`retext-latin`][retext-latin],
which wraps this project to also parse natural language at a higher-level
(easier) abstraction.

Whether Old-English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ ānum
penninge”), Icelandic (“Hvað er að frétta”), French (“Où sont les toilettes?”),
`parse-latin` does a good job at tokenizing it.
this project does a good job at tokenizing it.

Note also that `parse-latin` does a decent job at tokenizing Latin-like scripts,
Cyrillic (“Добро пожаловать!”), Georgian (“როგორა ხარ?”), Armenian (“Շատ հաճելի
է”), and such.
For English and Dutch, you can instead use [`parse-english`][parse-english] and
[`parse-dutch`][parse-dutch].

## Install
This also works to some extent for scripts that resemble Latin, such as Cyrillic
(“Добро пожаловать!”), Georgian (“როგორა ხარ?”), and Armenian (“Շատ հաճելի է”).

This package is ESM only: Node 12+ is needed to use it and it must be `import`ed
instead of `require`d.
## Install

[npm][]:
This package is [ESM only][esm].
In Node.js (version 14.14+, 16.0+), install with [npm][]:

```sh
npm install parse-latin
```

In Deno with [`esm.sh`][esmsh]:

```js
import {ParseLatin} from 'https://esm.sh/parse-latin@5'
```

In browsers with [`esm.sh`][esmsh]:

```html
<script type="module">
import {ParseLatin} from 'https://esm.sh/parse-latin@5?bundle'
</script>
```

## Use

```js
@@ -39,7 +82,7 @@ const tree = new ParseLatin().parse('A simple sentence.')
console.log(inspect(tree))
```

Which, when inspecting, yields:
Yields:

```txt
RootNode[1] (1:1-1:19, 0-18)
…
```

@@ -58,58 +101,79 @@ RootNode[1] (1:1-1:19, 0-18)

## API

This package exports the following identifiers: `ParseLatin`.
This package exports the identifier `ParseLatin`.
There is no default export.

### `ParseLatin(value)`

Exposes the functionality needed to tokenize natural Latin-script languages into
a syntax tree.
If `value` is passed here, it’s not needed to give it to `#parse()`.
### `ParseLatin()`

#### `ParseLatin#tokenize(value)`
Create a new parser.

Tokenize `value` (`string`) into letters and numbers (words), white space, and
everything else (punctuation).
The returned nodes are a flat list without paragraphs or sentences.
#### `ParseLatin#parse(value)`

###### Returns
Turn natural language into a syntax tree.

[`Array.<Node>`][nlcst] — Nodes.
##### Parameters

#### `ParseLatin#parse(value)`
###### `value`

Tokenize `value` (`string`) into an [NLCST][] tree.
The returned node is a `RootNode` with in it paragraphs and sentences.
Value to parse (`string`).

###### Returns
##### Returns

[`Node`][nlcst] — Root node.
[`RootNode`][root].
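
As a rough illustration of what can be done with the returned tree, the sketch below walks a hand-built tree in the nlcst shape (nodes with `type`, and either `children` or `value`; positional info is omitted). The helpers `collect` and `serialize` are hypothetical names made up for this example and are not part of `parse-latin`.

```js
// A hand-built stand-in for a parsed tree, following the nlcst node shape.
// In real use, the tree would come from `new ParseLatin().parse(value)`.
const tree = {
  type: 'RootNode',
  children: [
    {
      type: 'ParagraphNode',
      children: [
        {
          type: 'SentenceNode',
          children: [
            {type: 'WordNode', children: [{type: 'TextNode', value: 'A'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'simple'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'sentence'}]},
            {type: 'PunctuationNode', value: '.'}
          ]
        }
      ]
    }
  ]
}

// Hypothetical helper: collect all nodes of a given type, depth first.
function collect(node, type, result = []) {
  if (node.type === type) result.push(node)
  for (const child of node.children || []) collect(child, type, result)
  return result
}

// Hypothetical helper: turn a node back into text by joining literal values.
function serialize(node) {
  return 'value' in node ? node.value : node.children.map(serialize).join('')
}

const words = collect(tree, 'WordNode').map(serialize)
console.log(words) // [ 'A', 'simple', 'sentence' ]
console.log(serialize(tree)) // A simple sentence.
```

For real projects, existing tree utilities are likely a better fit than hand-written walkers like these.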

## Algorithm

> Note: The easiest way to see **how parse-latin tokenizes and parses**, is by
> using the [online parser demo][demo], which
> shows the syntax tree corresponding to the typed text.
> 👉 **Note**:
> The easiest way to see how `parse-latin` parses is to use the
> [online parser demo][demo], which shows the syntax tree corresponding to
> the typed text.
`parse-latin` splits text into white space, word, and punctuation tokens.
`parse-latin` starts out with a pretty easy definition, one that most other
tokenizers use:
`parse-latin` splits text into white space, punctuation, symbol, and word
tokens:

* A “word” is one or more letter or number characters
* A “white space” is one or more white space characters
* A “punctuation” is one or more of anything else
* “word” is one or more Unicode letters or numbers
* “white space” is one or more Unicode white space characters
* “punctuation” is one or more Unicode punctuation characters
* “symbol” is one or more of anything else
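
The four token classes can be approximated with Unicode property escapes. The sketch below is not the package's actual implementation: the function name `tokenize`, the token shape, and the choice to count combining marks (`\p{M}`) as part of words are all assumptions made for illustration.

```js
// Hypothetical classifier, checked in order; anything unmatched is a symbol.
// Sticky (`y`) expressions only match at `lastIndex`.
const classes = [
  ['word', /[\p{L}\p{M}\p{N}]+/uy],
  ['whiteSpace', /\p{White_Space}+/uy],
  ['punctuation', /\p{P}+/uy]
]

function tokenize(value) {
  const tokens = []
  let index = 0

  while (index < value.length) {
    let type = 'symbol'
    let match = null

    for (const [name, expression] of classes) {
      expression.lastIndex = index
      match = expression.exec(value)
      if (match) {
        type = name
        break
      }
    }

    // No class matched: consume a single code point as a symbol.
    const text = match
      ? match[0]
      : String.fromCodePoint(value.codePointAt(index))
    const previous = tokens[tokens.length - 1]

    // Merge adjacent symbols so that "one or more" holds for symbols too.
    if (type === 'symbol' && previous && previous.type === 'symbol') {
      previous.value += text
    } else {
      tokens.push({type, value: text})
    }

    index += text.length
  }

  return tokens
}

console.log(tokenize('A simple $ sentence.'))
```

This produces only a flat list of tokens; sentences and paragraphs are not formed at this stage.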

Then, it manipulates and merges those tokens into a ([nlcst][]) syntax tree,
adding sentences and paragraphs where needed.
Then, it manipulates and merges those tokens into a syntax tree, adding
sentences and paragraphs where needed.

* Some punctuation marks are part of the word they occur in, such as
* some punctuation marks are part of the word they occur in, such as
`non-profit`, `she’s`, `G.I.`, `11:00`, `N/A`, `&c`, `nineteenth- and…`
* Some full-stops do not mark a sentence end, such as `1.`, `e.g.`, `id.`
* Although full-stops, question marks, and exclamation marks (sometimes) end a
* some periods do not mark a sentence end, such as `1.`, `e.g.`, `id.`
* although periods, question marks, and exclamation marks (sometimes) end a
sentence, that end might not occur directly after the mark, such as `.)`,
`."`
* And many more exceptions
* …and many more exceptions
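
Two of the exceptions above can be sketched as follows: periods after digits or known abbreviations do not end a sentence, and closing marks directly after a terminal mark are pulled into the sentence. This is a deliberately small approximation working on plain strings rather than tokens; `splitSentences` and the abbreviation list are hypothetical and are not taken from `parse-latin`.

```js
// Hypothetical abbreviation list (without the final period), for illustration.
const abbreviations = new Set(['e.g', 'i.e', 'id', 'etc'])

function splitSentences(value) {
  const sentences = []
  // One or more terminal marks, then any closing quotes or brackets,
  // followed by white space or the end of the input.
  const boundary = /[.?!]+[)\]"'’”]*(?=\s+|$)/g
  let start = 0
  let match

  while ((match = boundary.exec(value))) {
    const end = match.index + match[0].length
    // The run of non-white-space characters directly before the mark.
    const before = value.slice(start, match.index).split(/\s+/).pop()
    const isPeriod = match[0][0] === '.'

    // `1.` and `e.g.` style periods do not end the sentence.
    if (
      isPeriod &&
      (abbreviations.has(before.toLowerCase()) || /^\d+$/.test(before))
    ) {
      continue
    }

    sentences.push(value.slice(start, end).trim())
    start = end
  }

  // Keep trailing text that has no terminal mark.
  const rest = value.slice(start).trim()
  if (rest) sentences.push(rest)

  return sentences
}

console.log(splitSentences('See e.g. the docs. (It works.) Done!'))
// [ 'See e.g. the docs.', '(It works.)', 'Done!' ]
```

A fixed abbreviation list is the weak point of this approach, which is one reason a parser needs the many further exceptions mentioned above.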

## Types

This package is fully typed with [TypeScript][].
It exports no additional types.

## Compatibility

This package is at least compatible with all maintained versions of Node.js.
As of now, that is Node.js 14.14+ and 16.0+.
It also works in Deno and modern browsers.

## Related

* [`parse-english`](https://github.com/wooorm/parse-english)
— English (natural language) parser
* [`parse-dutch`](https://github.com/wooorm/parse-dutch)
— Dutch (natural language) parser

## Contribute

Yes please!
See [How to Contribute to Open Source][contribute].

## Security

This package is safe.

## License

@@ -141,10 +205,24 @@ adding sentences and paragraphs where needed.

[demo]: https://wooorm.com/parse-latin/

[esm]: https://gist.github.com/sindresorhus/a39789f98801d908bbc7ff3ecc99d99c

[esmsh]: https://esm.sh

[typescript]: https://www.typescriptlang.org

[contribute]: https://opensource.guide/how-to-contribute/

[license]: license

[author]: https://wooorm.com

[retext]: https://github.com/retextjs/retext

[nlcst]: https://github.com/syntax-tree/nlcst

[root]: https://github.com/syntax-tree/nlcst#root

[retext-latin]: https://github.com/retextjs/retext/tree/main/packages/retext-latin

[parse-english]: https://github.com/wooorm/parse-english

[parse-dutch]: https://github.com/wooorm/parse-dutch
