Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Current implementation of last saved height is prone to miss height. #242

Open
sherpalden opened this issue Jun 4, 2024 · 11 comments
Open
Labels

Comments

@sherpalden
Copy link
Collaborator

func (r *Relayer) SaveChainsBlockHeight(ctx context.Context) {
    for _, chain := range r.chains {
        height, err := chain.Provider.QueryLatestHeight(ctx)
        if err != nil {
            r.log.Error("error occured when querying latest height", zap.Error(err))
            continue
        }
        if err := r.SaveBlockHeight(ctx, chain, height); err != nil {
            r.log.Error("error occured when saving block height", zap.Error(err))
            continue
        }
    }
}

Here,
lets suppose latest height is 100 and we are processing at 95. Assume the relayer crashed after saving the latest height of 101.
Now, next time we run the relayer, we start polling from 101. But actually we need to do it from 95.

@ibrizsabin
Copy link

ibrizsabin commented Jun 10, 2024

we need to save height only after processing is done for that height

@debendraoli
Copy link
Collaborator

debendraoli commented Jun 10, 2024

We use active listener using websocket. We don't save height everytime we process something. So saving height is periodic.
Detected events are immediately saved on disk, even it crashes it will go back to polling using the last saved height from disk.

Even subscription errors occurs we go back to polling and subscription at the same time programmatically.

@bcsainju
Copy link
Collaborator

I think this might cause an issue as the framework automatically saves height after x interval. This might work for chains with same implementation i.e. web sockets and fallback to rpc, this might not be true for all cases and might cause issue for other chains and the framework doesn't mandate the chains.

@debendraoli
Copy link
Collaborator

not having websocket, relying on poll would be matter of great concern in future.

This block interval saves does not mandate having websocket.

@debendraoli
Copy link
Collaborator

falling back to past height such as 5 mins ago would not cause an issue but missing might.

@sherpalden
Copy link
Collaborator Author

Suppose a scenario that a block(with height 100) is finalized in
a src chain(with block time of 500ms) at time 9:00:000 AM.
Now, suppose a periodic(every 5 minutes) block height saver got triggered at 9:00:500 AM
and saved height of 100 successfully in db. Now on the next trigger(at 9:05:500 AM)
chain will probably have the latest height of 700(100+300000/500).
Now lets say relayer crashed at 9:06:000 AM.
Now within these 5 minute interval relayer should be able to catch events, parse messages,
push them to the buffer, retrieve them from the buffer, and save each messages in the db of
all the blocks from 100-700. Can we gurantee that relayer would be able to do this in all
possible cases? Isn't there probability that relayer only saves messages till block 698 till
the point of crash?

Even if we can gurantee that relayer saves all the messages in db successfully in the above
case, next time we restart the relayer after being crashed, we will have to track polling
height as well which is not implemented currently. Lets say after 3 minutes of crash
relayer is restarted. Now, relayer has to poll events from height 700 to 1060(700+360).
At the time of polling what if we are only able to poll till 900 because of some issues like
failed to query events at height 900 or lets again say relayer crashed while polling at 900.
Shouldn't we be tracking this polling height as well?

Another scenario of subscription error you mentioned we handle that by polling. We might
get error at the time of polling as well, right? At that time we cannot just ignore that height
at which we are polling. We might have to keep track of that as well, right?

@ibrizsabin
Copy link

i think different people are talking about different things . it would be better if we had an architecture diagram of main loop. then it would be clearer for us all.

@debendraoli
Copy link
Collaborator

debendraoli commented Jun 10, 2024

I can answer them in great details but the problem we're facing in the past and then is not missed hight but processing of messages that are detected.

We had have never missed a single message until now in the relayer, every messages that are missed delivery are in the relayer but not processed, which we've identified the issue.

I understand this might be the matter of concern but not right now.

@pragyanshrestha-ibriz
Copy link
Collaborator

@bcsainju @debendraoli @sherpalden is this issue still relevant?

@bcsainju
Copy link
Collaborator

yes, this is still relevant but it's an edge case that might happen very rarely.

@debendraoli
Copy link
Collaborator

this is addressed here: #328

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants