A phonetic matching library. Includes text utilities to do string comparisons on phonemes (the sound of the string), as opposed to characters.
Docs can be found at: https://microsoft.github.io/PhoneticMatching/
Supported APIs:
- C++
- Node.js (>=8.11.2)
- C# .NET Core (>=2.1)
Supported languages:
- English
Pre-built binaries are currently offered for the following targets, to save the trouble of compiling the source locally:
- node-v{72,67,64,59,57}-{win32,linux,darwin}-{x64}
(Run node -p "process.versions.modules" to see which Node ABI is in use.)
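To check whether a pre-built binary covers the current environment, the pattern above can be compared against what Node reports at runtime. The following is a minimal sketch: binaryLabel and isOffered are hypothetical helpers written for illustration (they are not part of this library), and the offered values are copied from the list above.

```javascript
// Values copied from the pre-built binary pattern listed above.
const offered = {
  abis: ["72", "67", "64", "59", "57"],
  platforms: ["win32", "linux", "darwin"],
  archs: ["x64"],
};

// Hypothetical helper: formats a label in the same shape as the pattern above.
function binaryLabel(abi, platform, arch) {
  return `node-v${abi}-${platform}-${arch}`;
}

// Hypothetical helper: true when the combination appears in the offered list.
function isOffered(abi, platform, arch) {
  return (
    offered.abis.includes(String(abi)) &&
    offered.platforms.includes(platform) &&
    offered.archs.includes(arch)
  );
}

// process.versions.modules is the Node ABI version of the running process.
const abi = process.versions.modules;
console.log(binaryLabel(abi, process.platform, process.arch));
console.log(isOffered(abi, process.platform, process.arch) ? "pre-built" : "build from source");
```

If the combination is not offered, expect the native dependencies to be compiled locally.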
This repository consists of TypeScript and native dependencies built with node-gyp. See package.json for the various scripts used in the development process.
When building for the first time, remember to run npm install
This repository uses git submodules. If paths are outdated or nonexistent, run git submodule update --init --recursive
To install from npm:
npm install phoneticmatching
See the typings for more details.
Classes prefixed with En make certain assumptions that are specific to the English language.
import { EnPronouncer, EnPhoneticDistance, FuzzyMatcher, AcceleratedFuzzyMatcher, EnHybridDistance, StringDistance } from "phoneticmatching";
Speech: The namespace containing the type interfaces of the library objects.
EnPronouncer: Pronounces a string, as a General English speaker, into its IPA string or array of Phones format.
matchers module:
- FuzzyMatcher: Main use case for this library. Returns matches against a list of targets for a given query. The comparisons are not remembered, so it is better suited to one-off use cases.
- AcceleratedFuzzyMatcher: Same interface as FuzzyMatcher, but the list of targets is precomputed, so it is beneficial for multiple queries at the cost of a higher initialization time.
- EnContactMatcher: A domain specialization of the AcceleratedFuzzyMatcher for English speakers searching over a list of names. Does additional preprocessing and sets up the distance function for you.
- EnPlaceMatcher: A domain specialization of the AcceleratedFuzzyMatcher for English speakers searching over a list of places. Does additional preprocessing and sets up the distance function for you.
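The trade-off between one-off matching and precomputed targets can be illustrated with a small, self-contained sketch. It does not use this library or reflect its internals: mismatchDistance, normalize, nearestOneOff, and PrecomputedMatcher are hypothetical names, and normalize merely stands in for whatever per-target work a real matcher would do.

```javascript
// Hypothetical toy distance: counts position-wise character mismatches.
function mismatchDistance(a, b) {
  const length = Math.max(a.length, b.length);
  let mismatches = 0;
  for (let i = 0; i < length; i++) {
    if (a[i] !== b[i]) mismatches++;
  }
  return mismatches;
}

// Hypothetical per-target work (a stand-in for a costly step such as pronouncing).
const normalize = (text) => text.toLowerCase().replace(/[^a-z]/g, "");

// One-off style: every query repeats the per-target work.
function nearestOneOff(targets, query, distance) {
  const key = normalize(query);
  let best = null;
  for (const element of targets) {
    const d = distance(normalize(element), key);
    if (best === null || d < best.distance) best = { element, distance: d };
  }
  return best;
}

// Precomputed style: per-target work happens once, in the constructor.
class PrecomputedMatcher {
  constructor(targets, distance) {
    this.distance = distance;
    this.entries = targets.map((element) => ({ element, key: normalize(element) }));
  }

  nearest(query) {
    const key = normalize(query);
    let best = null;
    for (const entry of this.entries) {
      const d = this.distance(entry.key, key);
      if (best === null || d < best.distance) best = { element: entry.element, distance: d };
    }
    return best;
  }
}

const fruits = ["Apple", "Banana", "Blueberry"];
const precomputed = new PrecomputedMatcher(fruits, mismatchDistance);
console.log(precomputed.nearest("blueberri"));
console.log(nearestOneOff(fruits, "blueberri", mismatchDistance));
```

Both calls return the same match; the difference is only where the per-target cost is paid, which is why the precomputed form suits repeated queries.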
distance module:
- EnPhoneticDistance: Returns a metric distance score between two English pronunciations.
- StringDistance: Returns a metric distance score between two strings (edit distance).
- EnHybridDistance: Returns a metric distance score based on a combination of the two above distance metrics (English pronunciations and strings).
- DistanceInput: Input object for EnHybridDistance. Holds the text and the pronunciation of that text.
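For readers unfamiliar with edit distance, here is a minimal sketch of the classic Levenshtein algorithm, plus one possible way two scores could be combined. Neither function is taken from this library: editDistance and hybridDistance are hypothetical names, and the weighted sum with a phoneticWeight parameter is an assumption, since the exact combination used by EnHybridDistance is not described here.

```javascript
// Levenshtein edit distance, keeping only the previous row of the table.
function editDistance(a, b) {
  // Row 0: turning the empty prefix of `a` into b[0..j) takes j insertions.
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    // Column 0: turning a[0..i) into the empty string takes i deletions.
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitution = previous[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      current[j] = Math.min(previous[j] + 1, current[j - 1] + 1, substitution);
    }
    previous = current;
  }
  return previous[b.length];
}

// Assumed combination: a weighted sum, with phoneticWeight between 0 and 1.
function hybridDistance(phoneticScore, stringScore, phoneticWeight) {
  return phoneticWeight * phoneticScore + (1 - phoneticWeight) * stringScore;
}

console.log(editDistance("kitten", "sitting"));
console.log(hybridDistance(1, 3, 0.5));
```

A weight of 1 reduces the hybrid score to the phonetic score alone, and a weight of 0 to the string score alone.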
nlp module:
- EnPreProcessor: English Pre-processor.
- EnPlacesPreProcessor: English Pre-processor with specific rules for places.
- SplittingTokenizer: Tokenizing base-class that will split on the given RegExp.
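As a rough illustration of splitting on a RegExp, the sketch below tokenizes text with the built-in String.prototype.split. SimpleSplittingTokenizer is a hypothetical stand-in written for this example; it is not the library's SplittingTokenizer, and its method name and return type are assumptions.

```javascript
// Hypothetical stand-in: splits text on a caller-supplied RegExp.
class SimpleSplittingTokenizer {
  constructor(pattern) {
    this.pattern = pattern;
  }

  // Returns the non-empty pieces between matches of the pattern.
  tokenize(text) {
    return text.split(this.pattern).filter((token) => token.length > 0);
  }
}

const whitespaceTokenizer = new SimpleSplittingTokenizer(/\s+/);
console.log(whitespaceTokenizer.tokenize("  123 Main   Street "));
```

Note that String.prototype.split also returns the text matched by any capture groups in the pattern, so a pattern without capture groups keeps the output to tokens only.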
Here are some examples of how to import modules and classes:
import { EnContactMatcher, EnPlaceMatcher } from "phoneticmatching";
import * as Matchers from "phoneticmatching/lib/matchers";
JavaScript
// Import core functionality from the library.
const { EnPhoneticDistance, FuzzyMatcher } = require("phoneticmatching");
// A distance metric over pronunciations.
const metric = new EnPhoneticDistance();
// The target list to match against.
const targets = [
    "Apple",
    "Banana",
    "Blackberry",
    "Blueberry",
    "Grapefruit",
    "Pineapple",
    "Raspberry",
    "Strawberry",
];
// Create the fuzzy matcher.
const matcher = new FuzzyMatcher(targets, metric);
// Find the nearest match.
const result = matcher.nearest("blu airy");
/* The result should be:
 * {
 *     // The object from the targets list.
 *     element: 'Blueberry',
 *     // The distance score from the distance function.
 *     distance: 0.041666666666666664
 * }
 */
console.log(result);
C#
using System;
// Import core functionality from the library.
using Microsoft.PhoneticMatching.Matchers.FuzzyMatcher.Normalized;
public class Program
{
    public static void Main(string[] args)
    {
        // The target list to match against.
        string[] targets =
        {
            "Apple",
            "Banana",
            "Blackberry",
            "Blueberry",
            "Grapefruit",
            "Pineapple",
            "Raspberry",
            "Strawberry",
        };

        // Create the fuzzy matcher.
        var matcher = new EnPhoneticFuzzyMatcher<string>(targets);

        // Find the nearest match.
        var result = matcher.FindNearest("blu airy");

        /* The result should be:
         * {
         *     // The object from the targets list.
         *     element: 'Blueberry',
         *     // The distance score from the distance function.
         *     distance: 0.0416666666666667
         * }
         */
        Console.WriteLine("element : [{0}] - distance : [{1}]", result.Element, result.Distance);
    }
}
npm run tsc
# X is the parallelization number, usually set to the number of cores of the machine.
# This cleans and rebuilds everything.
JOBS=X npm run rebuild
# For incremental builds.
JOBS=X npm run build
# Requires native dependencies built, but TypeScript transpiling not required.
npm test
# Generate the doc files from the docstrings.
npm run build-docs
# Builds everything, TypeScript & native & docs, as a release build.
npm run release
Note that the .js library code and the native dependencies are deployed separately. npm registries are used for the .js code, while node-pre-gyp is used for the prebuilt dependencies, falling back to building on the client.
# Pushes the package to npmjs.com, or to a private registry if a .npmrc exists.
npm publish
# Packages a ./build/stage/{version}/maluubaspeech-{node_abi}-{platform}-{arch}.tar.gz.
# See package.json:binary.host on where to put it.
npm run package
A .NET Core NuGet package is published for this project. Because the package is published by Microsoft, it must follow the guidance at https://aka.ms/nuget, and both the package content and the package itself must be signed with an official Microsoft certificate. To ease the signing and publishing process, ESRP signing is integrated into the Azure DevOps build tasks. To publish a new version of the package, create a release for the latest build (Pipelines->Releases->PublishNuget->Create a release).
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
Security issues and bugs should be reported privately, via email, to the Microsoft Security Response Center (MSRC) at [email protected]. You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Further information, including the MSRC PGP key, can be found in the Security TechCenter.
Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.
See sources for licenses of dependencies.