Indexing phone numbers and SIP addresses in Lucene is complicated. Most people use n-gram tokenizers. We did that for a while with n-gram min=3 and max=35, but the result was often hundreds of tokens per SIP address. Working at a call-center-focused company, we quickly figured out how wasteful that is on the storage front: 6/7ths of our indexes were wasted on useless SIP address tokens.
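To get a feel for the numbers, the token count of an n-gram tokenizer can be estimated directly. This is a minimal sketch, assuming a plain character n-gram tokenizer that emits every substring of each length between min and max; the function name `ngram_count` is ours and is not part of Lucene or this plugin.

```python
def ngram_count(length: int, min_gram: int = 3, max_gram: int = 35) -> int:
    """Count the character n-grams emitted for a string of the given length.

    Assumes one token per substring of each size from min_gram to max_gram.
    """
    total = 0
    for n in range(min_gram, min(max_gram, length) + 1):
        total += length - n + 1  # number of substrings of size n
    return total


# A 40-character SIP address with min=3 and max=35:
print(ngram_count(40))  # 726
```

By comparison, the `sip:` example further down yields a few dozen tokens.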
It's a hard problem to regex your way out of. An international phone number often includes a country code, but that can be 1, 2, or 3+ digits. A lot of people have requested that Elasticsearch integrate Google's libphonenumber library into a custom Lucene analyzer. It hasn't happened yet, so here's a plugin that attempts to do just that.
Note: This is a young project. We'll improve as time goes on, but use at your own risk.
mvn package
./bin/plugin --url file:///....elasticsearch-phone/target/releases/elasticsearch-phone-1.0.0.zip --install elasticsearch-phone
This project provides three analyzers that are intended for different contexts.
- The `phone` analyzer supports SIP URIs and other phone numbers and is intended to be used when indexing. It strips common prefixes such as `sip:` and `tel:` (and indexes those as separate tokens) and tokenizes the phone number with various prefix lengths.
- The `phone-email` analyzer extends the `phone` analyzer with additional tokenization for email addresses (e.g. generating tokens for the user part and the domain part of an email address).
- The `phone-search` analyzer is intended to be used as a `search_analyzer` with one of the other two analyzers used for indexing. It does minimal tokenization: if a term starts with `sip:` or `tel:`, it strips this part and generates a token for it. The analyzer also strips a leading `+` from phone numbers.
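The search-side behaviour described in the last bullet is small enough to sketch. This is an illustration of the description only, not the plugin's Java code; the function name `search_tokens` is hypothetical, and case-sensitive prefix matching is an assumption.

```python
from typing import List


def search_tokens(term: str) -> List[str]:
    """Mimic the described phone-search analysis of a single term."""
    tokens: List[str] = []
    for prefix in ("sip:", "tel:"):
        if term.startswith(prefix):  # assumption: prefixes match case-sensitively
            tokens.append(prefix)    # the stripped prefix becomes its own token
            term = term[len(prefix):]
            break
    if term.startswith("+"):
        term = term[1:]              # drop a leading plus sign
    if term:
        tokens.append(term)
    return tokens


print(search_tokens("tel:8177"))       # ['tel:', '8177']
print(search_tokens("+441344840400"))  # ['441344840400']
```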
Provide a telephone or SIP address prefixed by `tel:` or `sip:` with no spaces or symbols.
Your index template will need to specify the analyzer for the field. For example:
"field": {
"type": "string",
"analyzer": "phone",
"search_analyzer": "phone-search"
}
Sample allowed inputs (see PhoneTokenizerIntegrationTest and PhoneSearchIntegrationTest for more):
- tel:+441344840400
- tel:+498362930830
- sip:abc@autosbcpc
- sip:+13119310462;[email protected]:8060
Input (with country code): sip:+13169410766;ext=2233@<host>:8060
Tokens:
sip:+13169410766;ext=2233@<host>:8060
sip:
13169410766;ext=2233@<host>:8060
13169410766;ext=2233
1
2233
3169410766
3
13
31
131
316
1316
3169
13169
31694
131694
316941
1316941
3169410
13169410
31694107
131694107
316941076
1316941076
13169410766
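The numeric tokens in the list above follow a regular pattern: the country code, the national number, every proper prefix of the national number both with and without the country code, and finally the full number. The sketch below reproduces that pattern. It assumes the split into country code and national number is already known (the plugin relies on Google's phone number library for that) and leaves out the extension token; `number_tokens` is a hypothetical name.

```python
from typing import List


def number_tokens(country_code: str, national: str) -> List[str]:
    """Rebuild the numeric tokens shown above for a number with a country code."""
    tokens = [country_code, national]
    for i in range(1, len(national)):
        tokens.append(national[:i])                 # prefix without country code
        tokens.append(country_code + national[:i])  # same prefix with country code
    tokens.append(country_code + national)          # the full number
    return tokens


for token in number_tokens("1", "3169410766"):
    print(token)
```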
Input (without a country code): tel:8177148350
Tokens:
tel:8177148350
tel:
8177148350
8
81
817
8177
81771
817714
8177148
81771483
817714835
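When no country code is recognised, the output above is simply the full input, the prefix, the bare number, and every proper prefix of the number. A minimal sketch that reproduces this list, assuming the input always carries a `tel:` or `sip:` prefix (`tel_tokens` is a hypothetical name, not the plugin's API):

```python
from typing import List


def tel_tokens(value: str) -> List[str]:
    """Rebuild the token list shown above for a number without a country code."""
    scheme, _, number = value.partition(":")
    tokens = [value, scheme + ":", number]
    # Every proper prefix of the number, shortest first.
    tokens += [number[:i] for i in range(1, len(number))]
    return tokens


print(tel_tokens("tel:8177148350"))
```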
Input: user.name@domain.com
Tokens:
user.name@domain.com
user.name
user *
name *
domain.com *
domain *
com *
Tokens marked with `*` are only generated by the `phone-email` analyzer.
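The email tokens can be reproduced by splitting on `@` and then on `.`. This is a sketch of the output shown above, not of the plugin's implementation; `email_tokens` and `split_parts` are hypothetical names.

```python
from typing import List


def split_parts(part: str) -> List[str]:
    """Return the dot-separated pieces of part, or nothing if it has no dot."""
    return part.split(".") if "." in part else []


def email_tokens(address: str) -> List[str]:
    """Rebuild the token list shown above for an email address."""
    local, _, domain = address.partition("@")
    tokens = [address, local]
    tokens += split_parts(local)   # user, name
    tokens.append(domain)          # domain.com
    tokens += split_parts(domain)  # domain, com
    return tokens


print(email_tokens("user.name@domain.com"))
```

Going by the list above, only the first two tokens would come out of the `phone` analyzer; the rest are the ones marked with an asterisk.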
Term queries are not analyzed, so they only return exact matches against the indexed tokens (no normalization such as lowercasing is applied).
"query": {
"term" : { "field" : "8177" }
}
"query": {
"term" : { "field" : "domain" }
}
Match queries use the configured analyzer (or `search_analyzer`). In this example, the query will be translated to a boolean `and` of two term queries (for `tel:` and `8177`).
"query": {
"match" : {
"field" : {
"query" : "tel:8177",
"operator" : "and"
}
}
}
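Conceptually, the match query above behaves like a `bool` query with one `term` clause per search token. The sketch below builds that equivalent query body as a plain dictionary. The helper name `match_and_as_bool` is hypothetical, and the token list is the one the `phone-search` analyzer is described as producing for `tel:8177`.

```python
import json
from typing import Any, Dict, List


def match_and_as_bool(field: str, tokens: List[str]) -> Dict[str, Any]:
    """Build a bool query requiring every token, one term clause each."""
    return {"bool": {"must": [{"term": {field: token}} for token in tokens]}}


query = {"query": match_and_as_bool("field", ["tel:", "8177"])}
print(json.dumps(query, indent=2))
```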