Use Dask to fetch data from Elasticsearch in parallel by sending a request to each shard separately.
The library imitates the functionality of the Elasticsearch-Hadoop plugin for Spark: dask-elk
performs a parallel read across all shards of the target indices.
To achieve this it uses Elasticsearch's scrolling mechanism.
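Conceptually, the read pattern looks like the sketch below: drain each shard independently with repeated scroll requests, run the per-shard reads in parallel, and concatenate the resulting partitions. This is only an illustration of the idea, not dask-elk's actual implementation; the `fetch_shard` helper and the in-memory `SHARD_PAGES` scroll batches are hypothetical stand-ins for real shard/scroll responses.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for two shards, each holding its scroll "pages"
SHARD_PAGES = {
    0: [[{"id": 1}, {"id": 2}], [{"id": 3}]],  # shard 0: two scroll batches
    1: [[{"id": 4}], []],                      # shard 1: one batch, then empty
}

def fetch_shard(shard_id):
    """Drain one shard with repeated scroll requests until a batch comes back empty."""
    docs, page = [], 0
    pages = SHARD_PAGES[shard_id]
    while page < len(pages) and pages[page]:
        docs.extend(pages[page])
        page += 1
    return docs

# Read all shards in parallel, then concatenate the per-shard partitions
with ThreadPoolExecutor() as pool:
    partitions = list(pool.map(fetch_shard, SHARD_PAGES))
rows = [doc for part in partitions for doc in part]
print(len(rows))  # 4 documents across both shards
```

In dask-elk each per-shard read becomes one partition of the resulting Dask DataFrame, so the parallelism maps naturally onto the index's shard layout.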
To use the library and read from an index:

```python
from dask_elk.client import DaskElasticClient

# First create a client
client = DaskElasticClient()  # localhost Elasticsearch

index = 'my-index'
df = client.read(index=index, doc_type='_doc')
```
You can also pass a query to push down to Elasticsearch, so that any filtering is done on the Elasticsearch side. Note that because dask-elk uses the scroll mechanism, aggregations are not supported:
```python
from dask_elk.client import DaskElasticClient

# First create a client
client = DaskElasticClient()  # localhost Elasticsearch

query = {
    "query": {
        "term": {"user": "kimchy"}
    }
}

index = 'my-index'
df = client.read(query=query, index=index, doc_type='_doc')
```
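For reference, the `term` query above is an exact match on the `user` field. The sketch below shows the equivalent client-side filter over hypothetical documents; pushing the query down means Elasticsearch applies this predicate server-side, so only matching documents cross the network.

```python
# Hypothetical documents standing in for an index's contents
docs = [
    {"user": "kimchy", "msg": "hello"},
    {"user": "dadoonet", "msg": "bonjour"},
    {"user": "kimchy", "msg": "again"},
]

# What the pushed-down term query selects, expressed as a client-side filter
matches = [d for d in docs if d["user"] == "kimchy"]
print(len(matches))  # 2
```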
Read the documentation here.