Tiny Search API service built on top of FastAPI that allows querying data from Elasticsearch, with a rate limiter.
Simply start the app via docker compose:

```shell
docker-compose up
```
You can go to http://localhost:8000/docs/ and try the API from there.
To import the data, you have to install the dependencies locally:

```shell
poetry install
python3 import.py
```
In this section, I will outline the thoughts and design decisions made during implementation.
I started by designing the mapping. Looking at the data, I identified:

- `id`, `username` - these fields are unique identifiers, so they are likely to be selected by exact queries and have no stemming requirements
- `post`, `user_bio` - these are text fields, so we need full-text search features such as stemming and stop-word filtering
- `post_like_count`, `post_diamond_count` - integer values, so we can run range queries on them
I decided to proceed with the `standard` tokenizer: it is the recommended one and demonstrated good results compared to the default settings of other tokenizers.
Filters included:

- `stop` - removes stop words, using the English dictionary by default
- `lowercase` - converts all characters to lowercase, as case doesn't matter for search
- `porter_stem` - English language stemmer
```json
{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "post": {
        "type": "text",
        "analyzer": "my_analyzer"
      },
      "username": {
        "type": "keyword"
      },
      "user_bio": {
        "type": "text",
        "analyzer": "my_analyzer"
      },
      "post_like_count": {
        "type": "integer"
      },
      "post_diamond_count": {
        "type": "integer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "stop",
            "lowercase",
            "porter_stem"
          ]
        }
      }
    }
  }
}
```
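To sanity-check the analyzer, you can run a sample string through Elasticsearch's `_analyze` API. A minimal sketch, assuming the index is called `posts` (not necessarily the project's real index name):

```python
# Quick check of my_analyzer via the _analyze API.
# The index name "posts" is an assumption; adjust to the real one.
import httpx

resp = httpx.post(
    "http://localhost:9200/posts/_analyze",
    json={"analyzer": "my_analyzer", "text": "Searching for the best results"},
)
print([t["token"] for t in resp.json()["tokens"]])
# Stop words ("for", "the") are dropped and terms are stemmed,
# so this should print something like ["search", "best", "result"]
```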
Search features:

- Stemming in English
- Multi-field search (see the query sketch below)
- Index optimizations (`id`/`username` are not stemmed)
- Typo support (couldn't make it work with multi-field queries)
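For illustration, a single multi-field query can combine full-text (stemmed) matching on the text fields with exact matching on the keyword identifiers. A sketch under the same assumed index name, not the project's actual query:

```python
# Hypothetical multi-field query: the text fields are analyzed/stemmed,
# while the keyword fields ("username", "id") only match exactly.
import httpx

body = {
    "query": {
        "multi_match": {
            "query": "running fast",
            "fields": ["post", "user_bio", "username", "id"],
        }
    }
}
resp = httpx.post("http://localhost:9200/posts/_search", json=body)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["username"], hit["_score"])
```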
I chose FastAPI as a modern framework with async support, Swagger generation, etc.
Sadly, popular Elasticsearch libraries did not support all the required features (async, PIT, scroll), so I decided to write `src/es_helper.py` for this to save a bit of time.
To provide a consistent view for API clients while paginating, I decided to use the Point in Time (PIT) API, which lets a client scroll through the data in the state it was in when the client made its first request, even if documents have been updated since. The PIT id is implemented as a query parameter and is returned after each request.
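A minimal sketch of how PIT pagination works against the raw Elasticsearch REST API (illustrative only, not the project's `es_helper` code; the index name `posts` and the sort fields are assumptions):

```python
# Illustrative PIT + search_after pagination over the Elasticsearch REST API.
import httpx

ES_URL = "http://localhost:9200"

async def paginate(query: dict, page_size: int = 10):
    async with httpx.AsyncClient(base_url=ES_URL) as client:
        # Open a point in time: subsequent searches see a stable snapshot.
        resp = await client.post("/posts/_pit", params={"keep_alive": "1m"})
        pit_id = resp.json()["id"]
        search_after = None
        while True:
            body = {
                "size": page_size,
                "query": query,
                # Note: no index in the URL when searching with a PIT.
                "pit": {"id": pit_id, "keep_alive": "1m"},
                "sort": [{"post_like_count": "desc"}, {"id": "asc"}],
            }
            if search_after:
                body["search_after"] = search_after
            resp = await client.post("/_search", json=body)
            data = resp.json()
            hits = data["hits"]["hits"]
            if not hits:
                break
            yield [h["_source"] for h in hits]
            # Each response may return a refreshed PIT id; reuse it.
            pit_id = data.get("pit_id", pit_id)
            search_after = hits[-1]["sort"]
```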
To avoid json/dict hell, all data passed around the project is wrapped in Pydantic models, including Elasticsearch responses.
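For example, a document and a page of results could be modeled like this (a sketch; the class names are mine, only the field names come from the mapping above):

```python
# Hypothetical Pydantic models mirroring the index mapping.
from pydantic import BaseModel

class Post(BaseModel):
    id: str
    username: str
    post: str
    user_bio: str
    post_like_count: int
    post_diamond_count: int

class SearchPage(BaseModel):
    pit_id: str          # returned to the client to continue scrolling
    results: list[Post]
```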
The Dockerfile utilizes a multi-stage build with caching support, so the final image is slimmer and repeated builds are faster.
Rate limiting is implemented with the `slowapi` library, which is adapted from the popular Flask-Limiter implementation. It uses a Redis backend so that the Python containers can be scaled without losing global rate limiting.
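A minimal sketch of the wiring (the `10/minute` limit and the Redis URL are assumptions, not the project's actual values):

```python
# Illustrative slowapi setup with a Redis backend.
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address, storage_uri="redis://redis:6379")
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.get("/search")
@limiter.limit("10/minute")  # per-client limit, shared across containers via Redis
async def search(request: Request, q: str):
    ...
```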
Swagger UI is available at http://localhost:8000/docs.