
Multiple shards #2

wangkuiyi opened this issue Jan 28, 2016 · 5 comments

wangkuiyi (Contributor) commented Jan 28, 2016

Currently, the backend server runs as a single process that maintains the index data structure in memory. This should work for small-scale problems. But for applications with many documents to index and search, the index data structure may grow so large that we need to partition it into multiple shards and start multiple backend server instances -- each maintaining one shard. This also raises the question of how the client can make a single RPC call to add documents to the correct shard (or to all relevant shards).
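One common way to route a document to a shard is to hash its ID. A minimal sketch in Go (the function name `shardFor` and the shard count are illustrative, not part of any existing code here):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor deterministically maps a document ID to one of n shards,
// so every client agrees on which backend instance owns a document
// without consulting a central lookup table.
func shardFor(docID string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(docID))
	return int(h.Sum32() % uint32(n))
}

func main() {
	for _, id := range []string{"doc-1", "doc-2", "doc-3"} {
		fmt.Printf("%s -> shard %d\n", id, shardFor(id, 4))
	}
}
```

A trade-off to keep in mind: plain modulo hashing reshuffles most documents when the shard count changes, which is why systems that resize often prefer consistent hashing.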

wangkuiyi changed the title to Multiple shards and replica, then to Multiple shards, on Jan 28, 2016
wangkuiyi (Contributor, Author) commented:

Usually, people split the inverted index by documents. This makes each posting list shorter. Also, because we don't split by terms, we don't need to split a query and send its terms to multiple computers.


brjg commented Jan 29, 2016

Is this proposal good?

  1. launch a master server
  2. the master launches n index servers; each of them indexes one document shard
  3. the master server accepts queries; each query is sent to all index servers
  4. the term frequency on each index is calculated as usual
  5. the idf is calculated using the global number of documents
  6. the results returned by the index servers are merged and sorted by score
  7. when adding a document, it is assigned to a random index shard, and the corresponding index server is called to add it
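Step 3 above (fan out each query to all index servers, then merge) could look like the following sketch in Go. The `Hit` and `searchShard` types are hypothetical stand-ins for the real RPC interface:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Hit is one scored result returned by an index server.
type Hit struct {
	DocID string
	Score float64
}

// searchShard stands in for an RPC call to one index server.
type searchShard func(query string) []Hit

// fanOut sends the query to every index server concurrently and
// returns the merged results, sorted by descending score.
func fanOut(query string, shards []searchShard) []Hit {
	var mu sync.Mutex
	var merged []Hit
	var wg sync.WaitGroup
	for _, s := range shards {
		wg.Add(1)
		go func(s searchShard) {
			defer wg.Done()
			hits := s(query)
			mu.Lock()
			merged = append(merged, hits...)
			mu.Unlock()
		}(s)
	}
	wg.Wait()
	sort.Slice(merged, func(i, j int) bool { return merged[i].Score > merged[j].Score })
	return merged
}

func main() {
	// Two fake index servers standing in for RPC endpoints.
	shards := []searchShard{
		func(q string) []Hit { return []Hit{{"d1", 0.9}, {"d3", 0.2}} },
		func(q string) []Hit { return []Hit{{"d2", 0.5}} },
	}
	for _, h := range fanOut("hello", shards) {
		fmt.Println(h.DocID, h.Score)
	}
}
```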

Some questions here:

  1. does the master maintain a file that tells where each document shard is located?
  2. should every index server report its status periodically? If one dies, the master needs to restart it. And what if the master dies?
  3. does the master need a queue to process queries?


yitopic commented Jan 30, 2016

I thought about this for a while, and I agree with everything you proposed. A few thoughts that I hope are complementary:

About 2.

I am afraid the search master process (or root) has no way to create processes on remote computers; usually we'd have to use SSH, MPI, or Kubernetes to start remote processes. So there are generally two ways to let the root know about the index servers (or leaves):

  1. We start the root before starting the leaves, and the leaves, when started, send their network addresses to the root via RPC. The leaves can learn the root's network address via command-line flags. This way, the root must be started and ready to accept RPC calls before we start the leaves. We can use https://godoc.org/github.com/wangkuiyi/healthz#OK to check whether the root has started and is ready for RPC calls.
  2. All leaves register their network addresses under a certain etcd directory known to the root, and the root watches changes to that directory so as to learn about newly joined index servers. This way, the root can be started or restarted before or after the leaves start or restart. Notably, the official etcd client package in Go was too complex for me to use, so I wrote a much simplified version: https://github.com/wangkuiyi/etcd

It seems that the second approach also addresses your question 2, as etcd is highly available (a Raft-based service that keeps running as long as a quorum of its members survives).

An interesting detail of approach 2 is that when a leaf registers itself with etcd, it actually writes a key-value pair, which can have a TTL (time-to-live) attached. The leaf can start a goroutine that updates the key-value pair and its TTL periodically. Without this periodic update, etcd removes the registration information after the TTL expires, which indicates the death of the leaf.

About 4. and 5.

I know that weak-and is able to do TF-IDF based retrieval, though I have not done this in my previous work. It is great that you are thinking about this!

About 7.

I understand your point as: a new document can be added to any leaf, as long as every leaf keeps all of its posting lists sorted by document ID.
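Keeping a posting list sorted on insertion could be sketched as follows (using plain int document IDs; `insertPosting` is an illustrative name, not existing code):

```go
package main

import (
	"fmt"
	"sort"
)

// insertPosting inserts docID into a posting list while keeping the
// list sorted by document ID, so lists from different leaves can
// later be intersected or merged in a single linear pass.
func insertPosting(list []int, docID int) []int {
	i := sort.SearchInts(list, docID) // binary search for the slot
	if i < len(list) && list[i] == docID {
		return list // already present
	}
	list = append(list, 0)     // grow by one
	copy(list[i+1:], list[i:]) // shift the tail right
	list[i] = docID
	return list
}

func main() {
	list := []int{}
	for _, id := range []int{42, 7, 19, 7} {
		list = insertPosting(list, id)
	}
	fmt.Println(list) // IDs stay sorted and deduplicated: [7 19 42]
}
```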

In the future, we might want to implement more sophisticated logic -- adding a new document to the index server that consumes the least memory, or even starting a new leaf if all existing leaves are running out of their memory quota. But that is for the future.

About Question 1.

Does "file" here mean files on disk? I think the answer is yes -- at least, each index server should checkpoint its ForwardIndex periodically to disk files or to AWS S3 (which is actually HDFS, I think).

About Question 2.

I think etcd plus checkpointing looks like a solution -- the root checkpoints its status into etcd, the leaves register themselves in etcd, and the leaves checkpoint their indexes into S3/HDFS files.

About Question 3.

I don't feel that the root needs a queue of queries. But for each query, I think the root would need to maintain a heap to merge the result lists from all leaves.
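Assuming each leaf returns its hits already sorted by descending score, the root can merge the k sorted lists with `container/heap` keyed on the score at the head of each list. A sketch (all names here are illustrative):

```go
package main

import (
	"container/heap"
	"fmt"
)

// Hit is one scored result from a leaf.
type Hit struct {
	DocID string
	Score float64
}

// cursor tracks the current read position in one leaf's sorted list.
type cursor struct {
	list []Hit
	pos  int
}

// mergeHeap is a max-heap of cursors ordered by their current score.
type mergeHeap []*cursor

func (h mergeHeap) Len() int { return len(h) }
func (h mergeHeap) Less(i, j int) bool {
	return h[i].list[h[i].pos].Score > h[j].list[h[j].pos].Score
}
func (h mergeHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *mergeHeap) Push(x interface{}) { *h = append(*h, x.(*cursor)) }
func (h *mergeHeap) Pop() interface{} {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// mergeResults merges per-leaf lists (each sorted by descending
// score) into one globally sorted list in O(total * log k) time.
func mergeResults(lists [][]Hit) []Hit {
	h := &mergeHeap{}
	for _, l := range lists {
		if len(l) > 0 {
			*h = append(*h, &cursor{list: l})
		}
	}
	heap.Init(h)
	var out []Hit
	for h.Len() > 0 {
		c := (*h)[0] // cursor with the highest current score
		out = append(out, c.list[c.pos])
		c.pos++
		if c.pos == len(c.list) {
			heap.Pop(h) // this leaf's list is exhausted
		} else {
			heap.Fix(h, 0) // re-order after advancing the cursor
		}
	}
	return out
}

func main() {
	merged := mergeResults([][]Hit{
		{{"a", 0.9}, {"c", 0.3}},
		{{"b", 0.7}, {"d", 0.1}},
	})
	for _, hit := range merged {
		fmt.Println(hit.DocID, hit.Score)
	}
}
```

In practice the root would stop early once it has the top-k hits, rather than draining every list.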


brjg commented Jan 31, 2016

When I "go get github.com/wangkuiyi/etcd", I got the following error:
../../wangkuiyi/etcd/etcd.go:33: not enough arguments in call to transport.NewTransport

@brjg Please review my fix: https://github.com/wangkuiyi/etcd/pull/1

@brjg I merged it. You should be able to remove the previous checkout of github.com/wangkuiyi/etcd and go get it again now.


yitopic commented Jan 31, 2016

That is because github.com/coreos/etcd has had some recent changes. I updated github.com/wangkuiyi/etcd to keep up with that change. Please review: https://github.com/wangkuiyi/etcd/pull/1

