Multiple shards #2
Usually, people split the inverted index by documents. This makes each posting list shorter. Also, since we don't split by terms, we don't need to split a query and send its terms to multiple computers.
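To make the document-based split concrete, here is a minimal sketch in Go; the `Shard` and `Posting` types and the modulo routing rule are illustrative assumptions, not code from this repository. Each shard holds complete posting lists, but only for its own documents, so the whole query can be sent unmodified to every shard.

```go
package main

import "fmt"

// Posting records one occurrence of a term in a document.
// (Illustrative type, not from this repository.)
type Posting struct {
	DocID int
	Freq  int
}

// Shard is a document-partitioned piece of the inverted index:
// full posting lists, but only for this shard's documents.
type Shard struct {
	index map[string][]Posting
}

func NewShard() *Shard { return &Shard{index: make(map[string][]Posting)} }

// Add indexes one document's terms into this shard.
func (s *Shard) Add(docID int, terms []string) {
	for _, t := range terms {
		s.index[t] = append(s.index[t], Posting{DocID: docID, Freq: 1})
	}
}

// Search returns this shard's postings for a term.
func (s *Shard) Search(term string) []Posting { return s.index[term] }

func main() {
	// Two shards; documents routed by docID % 2 (an assumed rule).
	shards := []*Shard{NewShard(), NewShard()}
	docs := map[int][]string{
		0: {"inverted", "index"},
		1: {"index", "shard"},
	}
	for id, terms := range docs {
		shards[id%len(shards)].Add(id, terms)
	}
	// The whole query goes to every shard; no splitting by term.
	for i, s := range shards {
		fmt.Println("shard", i, "->", s.Search("index"))
	}
}
```

Because each shard's posting lists are self-contained, adding a shard never requires rewriting queries, only merging more result lists.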
Is this proposal good?
Some questions here.
I thought about this for a while, and I agree with all you proposed. A few cents, hopefully complementary:

About 2. I am afraid that the search master process (or root) has no way to create processes on remote computers; usually we'd have to use SSH, MPI, or Kubernetes to start remote processes. So there are generally two ways to let the root know about the index servers (or leaves):
It seems that the second approach (leaves registering themselves with etcd) also addresses your question 2, as etcd is effectively immortal -- it is a Raft-based consensus service that doesn't really crash. An interesting detail of this approach is that when a leaf registers itself with etcd, it actually writes a key-value pair, which can have a TTL (time-to-live) attached. The leaf can start a goroutine that updates the key-value pair and its TTL periodically (sketched below). Without this periodic update, etcd removes the registration after the TTL expires, which indicates the death of the leaf.

About 4. and 5. I know that weak-and can do TF-IDF based retrieval, though I have not done this in my previous work. It is great that you thought about this!

About 7. I understand your point as: a new document can be added to any leaf, as long as all leaves make sure that their posting lists are sorted by document ID. In the future, we might want to implement more sophisticated logic -- add each new document to the index server consuming the least memory, or even start a new leaf if all existing leaves are running out of their memory quota. But that is for the future.

About Question 1. Does "file" here mean files on disk? I think the answer is yes -- at least, each index server should checkpoint its ForwardIndex periodically to disk files or to AWS S3 (which is actually HDFS, I think).

About Question 2. I think etcd plus checkpointing looks like a solution -- the root checkpoints its status into etcd, leaves register themselves with etcd, and leaves checkpoint their indexes into S3/HDFS files.

About Question 3. I don't feel that the root needs a queue of queries. But for each query, I think the root would need to maintain a heap to merge the result lists from all leaves (see the second sketch below).
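Here is a sketch of the leaf-registration-with-TTL pattern described above, using the current etcd v3 Go client (this 2016 thread used github.com/coreos/etcd's older v2 API, where the TTL attaches directly to the key). The key path `/leaves/leaf-1` and the addresses are made up for illustration. The leaf writes its address under a leased key, and a goroutine refreshes the lease periodically; if the leaf dies, refreshes stop and etcd expires the key.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // assumed etcd address
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Register this leaf under a 10-second lease (the TTL).
	lease, err := cli.Grant(context.Background(), 10)
	if err != nil {
		log.Fatal(err)
	}
	if _, err = cli.Put(context.Background(), "/leaves/leaf-1", "10.0.0.5:8080",
		clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}

	// The goroutine that keeps the registration alive. If this process
	// dies, refreshes stop and etcd deletes the key after the TTL,
	// which the root can take as the death of the leaf.
	go func() {
		ticker := time.NewTicker(3 * time.Second)
		defer ticker.Stop()
		for range ticker.C {
			if _, err := cli.KeepAliveOnce(context.Background(), lease.ID); err != nil {
				log.Printf("lease refresh failed: %v", err)
				return
			}
		}
	}()

	select {} // stands in for the leaf's real serving loop
}
```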
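And a sketch of the per-query merge heap mentioned under Question 3, built on the standard container/heap package; the `Result` type and the descending-score order are illustrative assumptions.

```go
package main

import (
	"container/heap"
	"fmt"
)

// Result is one hit returned by a leaf (illustrative type).
type Result struct {
	DocID int
	Score float64
}

// head tracks the front of one leaf's sorted result list.
type head struct {
	res  Result
	leaf int // which leaf's list this entry came from
	pos  int // position within that leaf's list
}

type mergeHeap []head

func (h mergeHeap) Len() int            { return len(h) }
func (h mergeHeap) Less(i, j int) bool  { return h[i].res.Score > h[j].res.Score }
func (h mergeHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *mergeHeap) Push(x interface{}) { *h = append(*h, x.(head)) }
func (h *mergeHeap) Pop() interface{} {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// merge combines k score-sorted lists into one, highest score first,
// in O(n log k) -- the root only ever holds one head per leaf.
func merge(lists [][]Result) []Result {
	h := &mergeHeap{}
	for i, l := range lists {
		if len(l) > 0 {
			heap.Push(h, head{res: l[0], leaf: i, pos: 0})
		}
	}
	var out []Result
	for h.Len() > 0 {
		top := heap.Pop(h).(head)
		out = append(out, top.res)
		if next := top.pos + 1; next < len(lists[top.leaf]) {
			heap.Push(h, head{res: lists[top.leaf][next], leaf: top.leaf, pos: next})
		}
	}
	return out
}

func main() {
	lists := [][]Result{
		{{DocID: 3, Score: 0.9}, {DocID: 7, Score: 0.4}},
		{{DocID: 5, Score: 0.8}, {DocID: 2, Score: 0.1}},
	}
	fmt.Println(merge(lists))
}
```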
When I "go get github.com/wangkuiyi/etcd", I got the following error: @brjg Please review my fix https://github.com/wangkuiyi/etcd/pull/1 @brjg I merged it. I expect that you can remove the previous checkout of github.com/wangkuiyi/etcd, and go get it now. |
That is because github.com/coreos/etcd has some recent changes; I updated github.com/wangkuiyi/etcd accordingly.
Currently, the backend server runs as a single process that maintains the index data structure in memory. This should work for small-scale problems. But for applications with many documents to index and search, the index data structure might be so big that we need to partition it into multiple shards and start multiple backend server instances, each maintaining one shard. Also, how can the client make a single RPC call to add documents to the correct shard (or to all shards)?
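One way the client could keep "add document" a single RPC call, sketched here with the standard net/rpc package; the `Indexer.AddDocument` service method, the request type, and the addresses are hypothetical, not from this codebase. The client hashes the document ID to pick the owning shard deterministically, so every document has exactly one home and the call still goes to a single backend instance.

```go
package main

import (
	"hash/fnv"
	"log"
	"net/rpc"
)

// AddDocumentArgs is a made-up request type for illustration.
type AddDocumentArgs struct {
	DocID string
	Text  string
}

// shardFor picks the shard deterministically from the document ID,
// so the same document always lands on the same backend instance.
func shardFor(docID string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(docID))
	return int(h.Sum32()) % numShards
}

func main() {
	shards := []string{"10.0.0.5:8080", "10.0.0.6:8080"} // assumed addresses
	doc := AddDocumentArgs{DocID: "doc-42", Text: "hello shards"}

	addr := shards[shardFor(doc.DocID, len(shards))]
	client, err := rpc.Dial("tcp", addr)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// A single RPC to the one shard that owns this document.
	var reply bool
	if err := client.Call("Indexer.AddDocument", doc, &reply); err != nil {
		log.Fatal(err)
	}
}
```

The same routing table could come from etcd (as discussed above) instead of being hard-coded, so the client learns about new or dead leaves automatically.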