Persian (Farsi) text preprocessing (normalization, number locale, punctuation, whitespace, stop words and more)
npm install --save persian-preprocess
const persianPreProcess = require('persian-preprocess');
Parameter | Type | Required | Description |
---|---|---|---|
text | String | Yes | Text to process |
debug | Boolean | Yes | Debug system status |
const text = 'text to process';
const debug = true;
const processedText = persianPreProcess(text, debug)
.normalize()
.number()
.lowercase()
.punctuation()
.remove()
.stopword()
.emoticon()
.whitespace();
- The code above only lists all preprocessing methods in chain order. Some of the methods require parameters, so it will not run correctly as written. For a complete working example, see the Full Sample
Normalization Methods | Description |
---|---|
normalize | Normalization process |
number | Change numbers locale |
lowercase | Lowercase all characters |
punctuation | Remove punctuation |
remove | Remove selected characters |
stopword | Remove stop words |
emoticon | Remove emoticons |
whitespace | Remove duplicate whitespaces |
Result Methods | Description |
---|---|
toString | Get processed text |
toArray | Get list of all words |
toUnique | Get list of unique words |
getDebug | Get pre process debug data |
Normalization process
Parameter | Type | Required | Description |
---|---|---|---|
config | Object | No | Normalization config (table below) |
Configuration | Type | Description | Sample Characters |
---|---|---|---|
persian | boolean | Normalize persian characters | ﭐ ݓ ك ﻱ |
english | boolean | Normalize english characters | ᗩ ℳ Ѡ ⓡ ⒵ |
arabic | boolean | Normalize arabic characters | ﷲ ﷺ |
number | boolean | Normalize number characters | ٥ ⑩ |
math | boolean | Normalize math characters | ¼ ⅞ |
html | boolean | Normalize html characters | < |
punctuation | boolean | Normalize punctuation characters | ʕ ʔ ℅ ٪ |
special | boolean | Normalize special characters | ᴁ lj st |
- Every configuration set defaults to true, so normalization is applied to all sets by default
- Setting a configuration value to false skips normalization for that set
// No configuration (using default value)
const processedText = persianPreProcess(text, debug).normalize();
// Default configuration (Same result with the code above)
const processedText = persianPreProcess(text, debug).normalize({
persian: true,
english: true,
arabic: true,
number: true,
math: true,
html: true,
punctuation: true,
special: true
});
// Ignore HTML characters normalization
const processedText = persianPreProcess(text, debug).normalize({
html: false
});
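To illustrate what a normalization set does, here is a minimal sketch that maps look-alike characters to a canonical form. `SAMPLE_MAP` and `normalizeSketch` are hypothetical names used only for this illustration. The two mappings are inferred from the Full Sample output in this document, and the library's own tables are presumably much larger.

```javascript
// Hypothetical illustration, not the library's implementation.
// Both mappings are inferred from the Full Sample output in this README.
const SAMPLE_MAP = new Map([
  ['\u0643', '\u06A9'], // Arabic kaf (ك) -> Persian kaf (ک)
  ['\uFDFC', 'ریال'],   // rial sign (﷼) -> spelled-out word
]);

function normalizeSketch(text, map = SAMPLE_MAP) {
  let out = '';
  // Iterate by code point so characters outside the BMP stay intact.
  for (const ch of text) {
    out += map.has(ch) ? map.get(ch) : ch;
  }
  return out;
}
```

Characters that are not in the map pass through unchanged, which mirrors how a set disabled with false would leave its characters alone.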
Change numbers locale
Parameter | Type | Required | Description |
---|---|---|---|
language | Enum: 'persian', 'english' | Yes | Numeric characters locale |
// 0, 1 ... 9
const processedText = persianPreProcess(text, debug).number('english');
// ۰, ۱, ... ۹
const processedText = persianPreProcess(text, debug).number('persian');
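The locale change can be pictured as a digit-for-digit substitution. The sketch below is an assumption about the general technique, not the library's code. `convertDigitsSketch` is a hypothetical helper that handles only ASCII and Persian digits.

```javascript
// Hypothetical illustration of digit locale conversion.
const PERSIAN_DIGITS = '۰۱۲۳۴۵۶۷۸۹';
const ENGLISH_DIGITS = '0123456789';

function convertDigitsSketch(text, language) {
  if (language !== 'persian' && language !== 'english') {
    throw new Error("language must be 'persian' or 'english'");
  }
  const from = language === 'persian' ? ENGLISH_DIGITS : PERSIAN_DIGITS;
  const to = language === 'persian' ? PERSIAN_DIGITS : ENGLISH_DIGITS;
  // Digits already in the target locale are not found in `from` and stay as-is.
  return text.replace(/[0-9۰-۹]/g, (digit) => {
    const index = from.indexOf(digit);
    return index === -1 ? digit : to[index];
  });
}
```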
Lowercase all characters
// 'Hello' => 'hello'
const processedText = persianPreProcess(text, debug).lowercase();
Remove punctuation
Parameter | Type | Required | Description |
---|---|---|---|
config | Object | No | Punctuation removal config (table below) |
Configuration | Type | Description | Sample Characters |
---|---|---|---|
basic | boolean or null | Basic punctuations | ' " \ / , ( \| ) |
mark | boolean or null | Special punctuations | \r \n \t \0 |
diacritic | boolean or null | Arabic diacritics | ٌ ٍ ً ّ |
unicode | boolean or null | Unicode punctuations | ZERO WIDTH NON-JOINER |
- Every configuration set defaults to true, which replaces each matched punctuation character with a space character
- Setting a value to null removes the punctuation without inserting a replacement character
- Setting a value to false skips punctuation removal for that set
// No configuration (using default value)
const processedText = persianPreProcess(text, debug).punctuation();
// Default configuration (Same result with the code above)
const processedText = persianPreProcess(text, debug).punctuation({
basic: true,
mark: true,
diacritic: true,
unicode: true
});
// Ignore UNICODE punctuation removal
const processedText = persianPreProcess(text, debug).punctuation({
unicode: false
});
/**
* Using NULL as configuration value for basic punctuations
*
* Using true (default value): 'in-line' > 'in line'
* Using null : 'in-line' > 'inline'
*/
const processedText = persianPreProcess(text, debug).punctuation({
basic: null
});
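The three-state setting (true, null, false) can be sketched as a small lookup that decides the replacement string. This is only an illustration of the semantics described above. `replacementFor` and `removeBasicSketch` are hypothetical names, and the character class covers only the sample characters from the table plus the hyphen from the 'in-line' example.

```javascript
// Hypothetical illustration of the true / null / false semantics.
function replacementFor(setting) {
  if (setting === false) return undefined; // skip this set entirely
  if (setting === null) return '';         // remove without a replacement
  return ' ';                              // true (default): replace with a space
}

function removeBasicSketch(text, setting = true) {
  const replacement = replacementFor(setting);
  if (replacement === undefined) return text;
  // Sample "basic" characters: ' " \ / , ( | ) and the hyphen.
  return text.replace(/['"\\\/,(|)\-]/g, replacement);
}
```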
Remove selected characters
Parameter | Type | Required | Description |
---|---|---|---|
config | Object | No | Character removal config (table below) |
Configuration | Type | Description | Sample Characters |
---|---|---|---|
number | boolean or null | Numeric characters | 0 9 ۰ ۹ |
persian | boolean or null | Persian characters | آ ا ی |
english | boolean or null | English characters | A Z a z |
length | number | Words with specific length | |
- For the number, persian and english configurations:
  - The default value is false, so character removal is skipped by default
  - Setting the value to true removes all characters in the set and replaces them with a space character
  - Setting the value to null removes the characters without inserting a replacement character
- Setting the length configuration removes all words whose length is equal to or less than the given value
/**
* No configuration (using default value)
* Calling the method with no configuration won't make any changes to the text
*/
const processedText = persianPreProcess(text, debug).remove();
// Removing all characters
const processedText = persianPreProcess(text, debug).remove({
number: true,
persian: true,
english: true
});
/**
* Using NULL as configuration value for english characters
*
* Using true : 'in-line' > 'in line'
* Using null : 'in-line' > 'inline'
*/
const processedText = persianPreProcess(text, debug).remove({
english: null
});
/**
* Using length configuration
* 'this is a text' > 'this text'
*/
const processedText = persianPreProcess(text, debug).remove({
length: 2
});
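The length rule can be sketched as a simple word filter. `removeShortWordsSketch` is a hypothetical helper that assumes words are separated by single spaces. It is not the library's implementation and may treat whitespace differently.

```javascript
// Hypothetical illustration of the length rule:
// drop every word whose length is equal to or less than maxLength.
function removeShortWordsSketch(text, maxLength) {
  return text
    .split(' ')
    // Spread into code points so characters outside the BMP count as one.
    .filter((word) => [...word].length > maxLength)
    .join(' ');
}
```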
Remove stop words
Parameter | Type | Required | Description |
---|---|---|---|
config | Object | No | Stopword removal config (table below) |
Configuration | Type | Description | Sample Words |
---|---|---|---|
persian | boolean | Persian stopwords | در با به |
english | boolean | English stopwords | in at on |
custom | string[] | List of custom words | |
- For the persian and english configurations:
  - The default value is false, so stopword removal is skipped by default
  - Setting the value to true removes all stopwords in the set
- Setting the custom configuration removes all words in the given list
/**
* No configuration (using default value)
* Calling the method with no configuration won't make any changes to the text
*/
const processedText = persianPreProcess(text, debug).stopword();
// Removing all stopwords
const processedText = persianPreProcess(text, debug).stopword({
persian: true,
english: true
});
/**
* Using custom list configuration
* 'this is a text' > ' is text'
*/
const processedText = persianPreProcess(text, debug).stopword({
custom: ['this', 'a']
});
Remove emoticons
Parameter | Required | Description |
---|---|---|
replace | No | Value of this parameter can only be NULL |
- By default (calling the method with no parameter), each emoticon is replaced with a space character
- Setting replace to null removes the emoticons without inserting a replacement character
// No configuration (using default value)
const processedText = persianPreProcess(text, debug).emoticon();
/**
* Using NULL as replace parameter value
*
* No parameter : 'I💓U' > 'I U'
* Using null : 'I💓U' > 'IU'
*/
const processedText = persianPreProcess(text, debug).emoticon(null);
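One common way to match emoji in JavaScript is with Unicode property escapes. The sketch below uses that technique as an assumption about how such a step could work. The library may rely on its own character list instead. `removeEmojiSketch` is a hypothetical name.

```javascript
// Hypothetical illustration using Unicode property escapes (ES2018+).
// Emoji_Modifier is included so skin-tone modifiers are matched as well.
const EMOJI_PATTERN = /[\p{Extended_Pictographic}\p{Emoji_Modifier}]/gu;

function removeEmojiSketch(text, replace) {
  // Passing null removes emoji outright; otherwise each one becomes a space.
  const replacement = replace === null ? '' : ' ';
  return text.replace(EMOJI_PATTERN, replacement);
}
```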
Remove duplicate whitespaces
// 'this   is  a text  ' => 'this is a text '
const processedText = persianPreProcess(text, debug).whitespace();
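Collapsing duplicate whitespace can be sketched with a single regular expression. `collapseSpacesSketch` is a hypothetical helper. It assumes only runs of space characters are collapsed and line breaks are left alone, which matches the multi-line result shown in the Full Sample but is not confirmed library behavior.

```javascript
// Hypothetical illustration: collapse runs of two or more spaces into one.
// Newlines are intentionally left untouched.
function collapseSpacesSketch(text) {
  return text.replace(/ {2,}/g, ' ');
}
```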
Get processed text
/**
* text : '❶ text and ❶ number'
* result : '1 text and 1 number'
*/
const stringValue = processedText.toString();
Get list of all words
/**
* text : '❶ text and ❶ number'
* result : ['1', 'text', 'and', '1', 'number']
*/
const arrayList = processedText.toArray();
Get list of unique words
/**
* text : '❶ text and ❶ number'
* result : ['1', 'text', 'and', 'number']
*/
const uniqueList = processedText.toUnique();
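The relationship between the word list and the unique word list can be sketched with a split followed by a `Set`. `toArraySketch` and `toUniqueSketch` are hypothetical stand-ins, assuming words are separated by whitespace. They are not the library's methods.

```javascript
// Hypothetical illustration of the two list results.
function toArraySketch(text) {
  // Split on any whitespace and drop empty entries from leading/trailing gaps.
  return text.split(/\s+/).filter(Boolean);
}

function toUniqueSketch(text) {
  // A Set keeps the first occurrence of each word in insertion order.
  return [...new Set(toArraySketch(text))];
}
```

Applied to '1 text and 1 number', these return the same lists as the two examples above.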
Get pre process debug data
// See Full Sample
const debugInfo = processedText.getDebug();
const text = `
استفاده از حرف ك عربی و کاراکتر خاص ﷼ و عدد عربی ٦
انگلیسی: using ß character and ⅜ and < ⒄℅
حط دوم انگلیسی: and special character: NJ
شکلک: 😃 👦🏿 🚩 👱🏽 🍉 🏒 🚍 🥬
انتهای متن
`;
const persianPreProcess = require('persian-preprocess');
const processedText = persianPreProcess(text, true)
// Normalize
.normalize()
// Change number locale to persian
.number('persian')
// Lowercase all characters
.lowercase()
// Remove all punctuation except marks (e.g. \n)
.punctuation({
mark: false
})
// Remove all numeric characters (without replacing them with a space character)
.remove({
number: null,
})
// Remove persian, english and two custom stop words
.stopword({
persian: true,
english: true,
custom: ['حرف', 'حط']
})
// Remove emoticons
.emoticon()
// Remove duplicate whitespaces
.whitespace();
const result = processedText.toString();
/**
استفاده عربی کاراکتر خاص ریال عدد عربی
انگلیسی using b character
انگلیسی special character nj
شکلک
انتهای متن
*/
console.log(processedText.getDebug());
/**
* Setting the debug parameter to true for persianPreProcess
* activates the debug system; the debug data will look like:
*/
{
TOTAL: { duration: 0.054, change: -96, length: 212 },
normilize: {
duration: 0.018,
change: 4,
length: 216,
match: [
'ك', 'ß', '﷼',
'٦', '⒄', '⅜',
'<', '℅', 'NJ'
]
},
number: { duration: 0.001, change: 0, length: 216, match: [] },
lowercase: { duration: 0, change: 0, length: 216 },
punctuation: {
duration: 0,
change: 0,
length: 216,
match: [ ':', '/', '<', '%' ]
},
remove: {
duration: 0,
change: -5,
length: 211,
match: [ '۶', '۳', '۸', '۱', '۷' ]
},
stopword: {
duration: 0.028,
change: -6,
length: 205,
match: [
'and', ' دوم ',
' از ', ' ک ',
' و ', ' حرف ',
' حط '
]
},
emoticon: {
duration: 0.002,
change: -10,
length: 195,
match: [
'😃', '👦', '🏿',
'🚩', '👱', '🏽',
'🍉', '🏒', '🚍',
'🥬'
]
},
whitespace: { duration: 0.001, change: -79, length: 116 }
}
Name | Description |
---|---|
duration | Process time in milliseconds |
change | Number of characters added to or removed from the text |
length | Text length after the process |
match | List of characters/words matched by the process |
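The fields fit together arithmetically. As a sanity check, the sketch below sums the per-step change values from the sample debug data above and compares them with TOTAL. `summarizeDebugSketch` is a hypothetical helper, and the numbers are copied from that sample. In that sample, TOTAL.length (212) plus TOTAL.change (-96) equals the final step length (116), which suggests TOTAL.length is the input length, though the source does not state this.

```javascript
// Numbers copied from the sample debug output in this README
// (key spelling kept exactly as printed there).
const debugSample = {
  TOTAL: { change: -96, length: 212 },
  normilize: { change: 4, length: 216 },
  number: { change: 0, length: 216 },
  lowercase: { change: 0, length: 216 },
  punctuation: { change: 0, length: 216 },
  remove: { change: -5, length: 211 },
  stopword: { change: -6, length: 205 },
  emoticon: { change: -10, length: 195 },
  whitespace: { change: -79, length: 116 },
};

// Hypothetical helper: add up step changes and read the last step's length.
function summarizeDebugSketch(debug) {
  const steps = Object.entries(debug).filter(([name]) => name !== 'TOTAL');
  const totalChange = steps.reduce((sum, [, step]) => sum + step.change, 0);
  const finalLength = steps.length > 0
    ? steps[steps.length - 1][1].length
    : debug.TOTAL.length;
  return { totalChange, finalLength };
}

const summary = summarizeDebugSketch(debugSample);
console.log(summary.totalChange === debugSample.TOTAL.change); // true
```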
git clone https://github.com/webilix/persian-preprocess.git
npm install
npm test