Tokenize Regex as Parameter #268

jongaull-nimbly · 2019-09-17T17:52:24Z

This PR adds support for a tokenizer parameter, which gives the user more control over what constitutes a "token". If the tokenizer parameter is not set, the default regex is used.

kpdecker · 2020-08-16T19:42:07Z

What's the use case here?

jongaull-nimbly · 2020-08-18T17:12:15Z

The default regex is /(\s+|[()[\]{}'"]|\b)/ and the one I am using is /(\s+|[()[\]{}'"_]|\b)/.

It's been a while since I worked on this, but it looks like I wanted to add _ as a tokenizing character. I could also see a use case here for using , for diffing CSV data or maybe . for diffing file names.
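
To make the difference between the two regexes concrete, here is a minimal sketch that applies each one with plain `String.prototype.split`. It is an illustration only and does not call jsdiff; the `tokenize` helper and the constant names are hypothetical, and dropping the empty strings that `split` can produce is an assumption of this sketch.

```javascript
// Illustration only: plain String.prototype.split, not jsdiff's internals.
// The two regexes are the ones quoted in the comment above.
const DEFAULT_TOKENIZER = /(\s+|[()[\]{}'"]|\b)/;
const UNDERSCORE_TOKENIZER = /(\s+|[()[\]{}'"_]|\b)/;

// Hypothetical helper: split on the regex, keeping the separators
// (the capture group makes split() return them) and dropping empty strings.
function tokenize(text, regex) {
  return text.split(regex).filter(Boolean);
}

const sample = 'foo_bar baz';
const defaultTokens = tokenize(sample, DEFAULT_TOKENIZER);
const underscoreTokens = tokenize(sample, UNDERSCORE_TOKENIZER);

console.log(defaultTokens);    // [ 'foo_bar', ' ', 'baz' ]
console.log(underscoreTokens); // [ 'foo', '_', 'bar', ' ', 'baz' ]
```

Because `_` counts as a word character for `\b`, the default regex keeps `foo_bar` as one token, while the modified regex splits it into `foo`, `_`, and `bar`.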

SkySor44 · 2021-12-01T21:37:48Z

I second this functionality. I have a use case for diffing two HTML strings, and this would enable me to adjust the tokenizer to meet my needs.

ExplodingCabbage · 2023-12-15T18:30:24Z

README.md

-    * `options` : An object with options. Currently, only `context` is supported and describes how many lines of context should be included.
+    * `options` : An object with options.
+        * `context` : describes how many lines of context should be included.
+        * `tokenizer` : Overrides the default regex used to split text into words. supported by `diffWords` and `diffWordsWithSpace`


This is in the wrong place; you've documented it as a parameter of createTwoFilesPatch instead of diffWords.

ExplodingCabbage · 2023-12-18T13:10:16Z

Worth thinking about before I merge this - for Chinese and Japanese support, we might need tokenization logic too complicated to be encompassed in a regex, either built in to jsdiff or as something you can plug in yourself: #328 (comment). I want to carefully think through what I ultimately want the API to look like before merging this PR and make sure it's not gonna commit us to an API that's fundamentally incompatible with supporting Chinese and Japanese.
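
As a rough sketch of what a plug-in tokenizer that goes beyond a regex might look like, the snippet below builds a tokenizer function on top of the standard `Intl.Segmenter` API. Everything about how it would be wired into jsdiff is an assumption: `segmenterTokenizer` is a hypothetical name, and a function-valued tokenizer option is not something this PR implements.

```javascript
// Hypothetical sketch: a function-based tokenizer, NOT an existing jsdiff option.
// Intl.Segmenter is a standard ECMAScript Intl API (available in Node 16+).
function segmenterTokenizer(locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  // Return a function mapping a string to an array of token strings,
  // whitespace and punctuation segments included.
  return (text) => Array.from(segmenter.segment(text), (s) => s.segment);
}

const tokenizeEn = segmenterTokenizer('en');
console.log(tokenizeEn('hello world')); // [ 'hello', ' ', 'world' ]

// For Chinese, the exact segmentation depends on the runtime's ICU data,
// so the token boundaries are not shown here.
const zhText = '我喜欢编程';
const zhTokens = segmenterTokenizer('zh')(zhText);
console.log(zhTokens.join('') === zhText); // true: segments cover the whole input
```

A regex-only `tokenizer` option could not express this kind of dictionary-based segmentation, which is the API concern raised above.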

jongaull-nimbly added 2 commits September 17, 2019 10:43

Added support for custom tokenizer for diffWords

5ad75b3

Updated README for new tokenizer parameter

2e79dfe

ExplodingCabbage reviewed Dec 15, 2023

ExplodingCabbage added the non-breaking-change label Dec 18, 2023

ExplodingCabbage added the diffWords behaviour label Jan 9, 2024
