Commit
Refactor prose
wooorm committed Jan 23, 2020
1 parent a74d895 commit e51562d
Showing 1 changed file with 38 additions and 38 deletions.
76 changes: 38 additions & 38 deletions readme.md
@@ -6,28 +6,28 @@
[![Size][size-badge]][size]
[![Chat][chat-badge]][chat]

A Latin script language parser for [**retext**][retext] producing
[**NLCST**][nlcst] nodes.
A Latin-script language parser for [**retext**][retext] producing **[nlcst][]**
nodes.

Whether Old-English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ
ānum penninge”), Icelandic (“Hvað er að frétta”), French (“Où sont
les toilettes?”), `parse-latin` does a good job at tokenising it.
Whether Old-English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ ānum
penninge”), Icelandic (“Hvað er að frétta”), French (“Où sont les toilettes?”),
`parse-latin` does a good job at tokenizing it.

Note also that `parse-latin` does a decent job at tokenising
Latin-like scripts, Cyrillic (“Добро пожаловать!”), Georgian (“როგორა
ხარ?”), Armenian (“Շատ հաճելի է”), and such.
Note also that `parse-latin` does a decent job at tokenizing Latin-like scripts,
Cyrillic (“Добро пожаловать!”), Georgian (“როგორა ხარ?”), Armenian (“Շատ հաճելի
է”), and such.

## Installation
## Install

[npm][]:

```bash
```sh
npm install parse-latin
```

## Usage
## Use

```javascript
```js
var inspect = require('unist-util-inspect')
var Latin = require('parse-latin')

@@ -57,54 +57,52 @@ RootNode[1] (1:1-1:19, 0-18)

### `ParseLatin(value)`

Exposes the functionality needed to tokenise natural Latin-script
languages into a syntax tree.
Exposes the functionality needed to tokenize natural Latin-script languages into
a syntax tree.
If `value` is passed here, it’s not needed to give it to `#parse()`.

#### `ParseLatin#tokenize(value)`

Tokenise `value` (`string`) into letters and numbers (words), white space, and
everything else (punctuation). The returned nodes are a flat list without
paragraphs or sentences.
Tokenize `value` (`string`) into letters and numbers (words), white space, and
everything else (punctuation).
The returned nodes are a flat list without paragraphs or sentences.

###### Returns

[`Array.<NLCSTNode>`][nlcst] — Nodes.
[`Array.<Node>`][nlcst] — Nodes.

#### `ParseLatin#parse(value)`

Tokenise `value` (`string`) into an [NLCST][nlcst] tree. The returned node is
a `RootNode` with in it paragraphs and sentences.
Tokenize `value` (`string`) into an [NLCST][] tree.
The returned node is a `RootNode` with in it paragraphs and sentences.

###### Returns

[`NLCSTNode`][nlcst] — Root node.
[`Node`][nlcst] — Root node.

## Algorithm

> Note: The easiest way to see **how parse-latin tokenizes and parses**,
> is by using the [online parser demo](https://wooorm.github.io/parse-latin),
> which shows the syntax tree corresponding to the typed text.
> Note: The easiest way to see **how parse-latin tokenizes and parses**, is by
> using the [online parser demo][demo], which
> shows the syntax tree corresponding to the typed text.
`parse-latin` splits text into white space, word, and punctuation
tokens. `parse-latin` starts out with a pretty easy definition,
one that most other tokenisers use:
`parse-latin` splits text into white space, word, and punctuation tokens.
`parse-latin` starts out with a pretty easy definition, one that most other
tokenizers use:

* A “word” is one or more letter or number characters
* A “white space” is one or more white space characters
* A “punctuation” is one or more of anything else

Then, it manipulates and merges those tokens into an [NLCST][]
syntax tree, adding sentences and paragraphs where needed.

* Some punctuation marks are part of the word they occur in, e.g.,
`non-profit`, `she’s`, `G.I.`, `11:00`, `N/A`, `&c`,
`nineteenth- and...`
* Some full-stops do not mark a sentence end, e.g., `1.`, `e.g.`,
`id.`
* Although full-stops, question marks, and exclamation marks
(sometimes) end a sentence, that end might not occur directly
after the mark, e.g., `.)`, `."`
Then, it manipulates and merges those tokens into a ([nlcst][]) syntax tree,
adding sentences and paragraphs where needed.

* Some punctuation marks are part of the word they occur in, such as
`non-profit`, `she’s`, `G.I.`, `11:00`, `N/A`, `&c`, `nineteenth- and…`
* Some full-stops do not mark a sentence end, such as `1.`, `e.g.`, `id.`
* Although full-stops, question marks, and exclamation marks (sometimes) end a
sentence, that end might not occur directly after the mark, such as `.)`,
`."`
* And many more exceptions
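
The “pretty easy definition” listed above (words, white space, punctuation) can be sketched in a few lines. This is only an illustration of that starting point, not `parse-latin`’s actual implementation: `naiveTokenize` and its `type` names are hypothetical, and using Unicode property escapes is an assumption about how “letter or number characters” might be matched.

```js
// Hypothetical helper for illustration; it is not part of parse-latin’s API.
function naiveTokenize(value) {
  // One or more letters/numbers, or one or more white space characters,
  // or one or more of anything else. Together the three alternatives
  // cover every character, so no input is skipped.
  var expression = /[\p{L}\p{N}]+|\s+|[^\p{L}\p{N}\s]+/gu
  var tokens = []
  var match

  while ((match = expression.exec(value)) !== null) {
    var text = match[0]
    var type

    if (/^[\p{L}\p{N}]/u.test(text)) {
      type = 'word'
    } else if (/^\s/.test(text)) {
      type = 'whiteSpace'
    } else {
      type = 'punctuation'
    }

    tokens.push({type: type, value: text})
  }

  return tokens
}
```

With this naive definition, `naiveTokenize('she’s here.')` splits `she’s` into three tokens (`she`, `’`, `s`), which is the kind of case the exceptions listed above exist to merge back into a single word.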

## License
@@ -135,6 +133,8 @@ syntax tree, adding sentences and paragraphs where needed.

[npm]: https://docs.npmjs.com/cli/install

[demo]: https://wooorm.com/parse-latin/

[license]: license

[author]: https://wooorm.com