rqlited does not load #260

Closed
omerkorenn opened this issue Jan 16, 2017 · 13 comments

@omerkorenn

Hi,
I preloaded a single rqlited node (which created a 1.5 GB raft.db file).
After a graceful shutdown of the process I tried to restart it, but the process exits with "set peer timeout expired" before all the data has been loaded into memory.

Please advise,

Thanks !!!

See the log here:
sudo ./rqlited --http 0.0.0.0:4001 ./node1

            _ _ _           _
           | (_) |         | |
  _ __ __ _| |_| |_ ___  __| |
 | '__/ _  | | | __/ _ \/ _  |   The lightweight, distributed
 | | | (_| | | | ||  __/ (_| |   relational database.
 |_|  \__, |_|_|\__\___|\__,_|
         | |
         |_|

[rqlited] 2017/01/16 16:22:49 rqlited starting, version v3.9.2, commit 46589e9, branch master
[store] 2017/01/16 16:22:49 SQLite in-memory database opened
[store] 2017/01/16 16:22:49 enabling single-node mode
[tcp] 2017/01/16 16:22:49 mux serving on 127.0.0.1:4002, advertising 127.0.0.1:4002
[cluster] 2017/01/16 16:22:49 service listening on 127.0.0.1:4002
2017/01/16 16:22:49 [INFO] raft: Node at 127.0.0.1:4002 [Follower] entering Follower state (Leader: "")
[rqlited] 2017/01/16 16:22:50 failed to set peer for 127.0.0.1:4002 to 0.0.0.0:4001: no leader available (retrying)
2017/01/16 16:22:50 [WARN] raft: Heartbeat timeout from "" reached, starting election
2017/01/16 16:22:50 [INFO] raft: Node at 127.0.0.1:4002 [Candidate] entering Candidate state
2017/01/16 16:22:50 [DEBUG] raft: Votes needed: 1
2017/01/16 16:22:50 [DEBUG] raft: Vote granted from 127.0.0.1:4002. Tally: 1
2017/01/16 16:22:50 [INFO] raft: Election won. Tally: 1
2017/01/16 16:22:50 [INFO] raft: Node at 127.0.0.1:4002 [Leader] entering Leader state
2017/01/16 16:22:50 [DEBUG] raft: Node 127.0.0.1:4002 updated peer set (2): [127.0.0.1:4002]
2017/01/16 16:22:50 [DEBUG] raft: Node 127.0.0.1:4002 updated peer set (2): [127.0.0.1:4002]
2017/01/16 16:22:50 [DEBUG] raft: Node 127.0.0.1:4002 updated peer set (2): [127.0.0.1:4002]
[cluster] 2017/01/16 16:23:01 received connection from 127.0.0.1:45257
[rqlited] 2017/01/16 16:23:11 failed to set peer for 127.0.0.1:4002 to 0.0.0.0:4001: timed out enqueuing operation (retrying)
[cluster] 2017/01/16 16:23:21 received connection from 127.0.0.1:45258
[rqlited] 2017/01/16 16:23:31 failed to set peer for 127.0.0.1:4002 to 0.0.0.0:4001: timed out enqueuing operation (retrying)
[rqlited] 2017/01/16 16:23:31 failed to set peer for localhost:4002 to 0.0.0.0:4001: set peer timeout expired

@otoolep
Member

otoolep commented Jan 17, 2017

Thanks for the report @omerkorenn -- I will take a look.

@otoolep
Member

otoolep commented Jan 18, 2017

@omerkorenn -- did you load your node via this technique?

https://github.com/rqlite/rqlite/blob/master/doc/RESTORE_FROM_SQLITE.md

@otoolep
Member

otoolep commented Jan 18, 2017

If you are loading from a SQLite dump, I would guess you're hitting the 10-second timeout at this line of code:

https://github.com/rqlite/rqlite/blob/master/store/store.go#L441

I can make this configurable, so you can increase it.

@otoolep
Member

otoolep commented Jan 18, 2017

@omerkorenn -- can you build top of tree, as per these instructions:

https://github.com/rqlite/rqlite/blob/master/CONTRIBUTING.md#building-rqlite

If so, please launch rqlite with a larger timeout like so:

rqlited -raftapplytimeout 30s 

30 seconds is a suggestion; you might need to go higher. I'd be interested in knowing what value you need. v3.9.2 has a 10-second timeout.

@otoolep
Member

otoolep commented Jan 21, 2017

OK, I'm going to assume this is solved.

@otoolep otoolep closed this as completed Jan 21, 2017
@omerkorenn
Author

Hi,

Sorry for the late response.

The timeout is indeed the problem.
It is hard to know how much time is needed to load the data.
It is really machine- and dataset-dependent; in any case, I aim to use a ~32GB dataset.

Do you think it is possible to overcome the problem by connecting to the raft after the data has been loaded, instead of in parallel?

@otoolep
Member

otoolep commented Jan 21, 2017

Did you try increasing the timeout at all?

@omerkorenn
Author

Yeah sure, it worked with 120s

@otoolep
Member

otoolep commented Jan 22, 2017

OK, 120s is kinda long. I don't follow your suggestion about "parallel". Can you explain more?

You could split the file you're loading into multiple smaller files, which will help you keep your timeout setting low.
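
One way to picture the splitting idea is to group the load into fixed-size batches, so that each batch is small enough to apply within the timeout. The Go sketch below is only an illustration under an assumption: the dump has already been parsed into individual SQL statements (splitting on raw lines is not safe, since a statement can span several lines). `chunkStatements` is a hypothetical helper, not part of rqlite.

```go
package main

// chunkStatements groups stmts into batches of at most size statements each,
// preserving order. It returns nil when size is not positive.
// Assumption: stmts already holds complete SQL statements, one per element.
func chunkStatements(stmts []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var batches [][]string
	for start := 0; start < len(stmts); start += size {
		end := start + size
		if end > len(stmts) {
			end = len(stmts)
		}
		batches = append(batches, stmts[start:end])
	}
	return batches
}
```

Each batch could then be written to its own file or sent as its own request; the right batch size depends on the machine and the data.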

@omerkorenn
Author

Thanks, I'll try this.

I saw that store.open(..)
https://github.com/rqlite/rqlite/blob/master/cmd/rqlited/main.go#L176
returns before all the data has been loaded into memory, and publishAPIAddr is called after this
https://github.com/rqlite/rqlite/blob/master/cmd/rqlited/main.go#L216
which causes the process to fail once the timeout expires.

So I thought maybe it is possible somehow to wait for the store to be ready before calling publishAPIAddr.

But I'm not sure about it.

@otoolep
Member

otoolep commented Jan 22, 2017

Interesting @omerkorenn -- you might be onto something there, though I would need to confirm that Open doesn't block until all Raft log messages have been applied.

Let me re-open to investigate.

@otoolep otoolep reopened this Jan 22, 2017
@otoolep
Member

otoolep commented Feb 3, 2017

@omerkorenn -- I have confirmed that NewRaft does return before all the log entries have been applied. I'm trying to see if there is a way to allow it to block instead.

otoolep added a commit that referenced this issue Feb 3, 2017
@otoolep
Member

otoolep commented Feb 3, 2017

@omerkorenn -- top of tree solves this problem correctly. It waits, by default, up to 120 seconds for the initial state to be applied. You no longer need to set -raftapplytimeout. If 120 seconds is not sufficiently long, you can increase it.

@otoolep otoolep closed this as completed Feb 3, 2017