Firestore: Expose sufficient state to manually restart a Listen operation #2513

pauloevpr · 2018-09-17T06:49:20Z

Edited 2018-09-21: This was originally a bug report, and the context is preserved for informational purposes. It's now a feature request to permit clients to perform recovery from long-lasting network outages.

I have found two issues which indicate that FirestoreChangeListener does not handle network outage properly:

If the network connection is restored after being down for a few seconds, the listener does not receive the updates applied during the outage. Future updates are received normally though.
If the network connection is restored after being down for a few minutes, the Listener fails completely and future updates are no longer received. FirestoreChangeListener.ListenerTask.Status is set to TaskStatus.Faulted.

I was expecting the API to handle network outage automatically. I am not sure if that is a bug or a feature request.

The text was updated successfully, but these errors were encountered:

jskeet · 2018-09-17T07:55:48Z

I'll certainly treat it as a bug for now, but I'd like more information about each case.

Caveat for all of this: while I implemented the API, I'm not on the Firestore team so some subtleties may be outside my knowledge. I'll let the Firestore team know about this so they can check my assertions.

Case 1

I'd expect updates to be visible, but potentially bundled up. For example, I think this is a reasonable sequence of events updating a single value in a single document:

Value set to 1
Callback fired with value 1
Network dies
Value set to 2
Value set to 3
Network recovers
Callback fired with value 3
Value set to 4 (a significant time after the network recovery)
Callback fired with value 4

Note the absence of "callback fired with value 2". In other words, I'd expect this to be a way of keeping up to date with the contents of a document, not to observe the sequence of changes in the document.

Could you clarify whether you're seeing that behaviour, or whether you're not seeing the "callback fired with value 3" item?

Case 2

I'd expect the status to change to faulted after some timeout (I don't know exactly how long) and then not recover. If you're seeing it only change to faulted when the network connection recovers, that's a bug - but it's not a bug for the fault to be permanent once the task status has become faulted.

pauloevpr · 2018-09-18T02:36:22Z

Thank you for replying.

This is exactly the behavior I get for each case:

Case 1

Value set to 1
Callback fired with value 1
Network dies
Value set to 2
Value set to 3
Network recovers
Callback is not fired
Value set to 4
Callback fired with value 4
Value set to (a significant time after the network recovery)
Callback fired with value 5

Case 2

It happens just like you described. The status changes to faulted after some timeout and then it does not recover. The problem is, I would expect auto recovering in this case, which does not happen. I think that at least a callback to notify the application should be implemented, so the listener can be manually restored by the application.

jskeet · 2018-09-18T06:28:52Z

Okay, let's deal with case 2 first. This is out of scope for this library. The expectation is for this library to be used for server-side applications, so there's relatively little connection-handling and offline support, which the client-side SDKs would provide. There is a callback of sorts that the connection has failed, in that the task changes state to Faulted - you can await that or add a continuation if you want to retry at that point.

For case 1, do you have an example timing log? If the time between the network recovering and the value being set to 4 is very short, that list looks okay - but if you have to change the value in order to see the update, that's a different matter. Any more information you have would be useful for investigating this - obviously it's somewhat tricky to set up for testing, unless I can find some way of faking the connection failing just for the Listen part.

pauloevpr · 2018-09-20T22:08:53Z

For case 1, I have tested two different scenarios:

Network shutdown

This scenario can be emulated by disconnecting the network cable or by disabling the network interface on Control Panel\Network and Internet\Network Connections. The ListenerTask fails pretty much immediately:

21/09/2018 7:44:16 AM: Connection is up
21/09/2018 7:44:17 AM: Update received
21/09/2018 7:45:06 AM: Connection is down
21/09/2018 7:45:08 AM: ListenerTask terminated with status Faulted

Connection timeout

This scenario can be emulated by disconnecting the WAN port of the LAN router or by setting the default gateway of the network interface to an unreachable IP address. The ListenerTask fails after a very long timeout:

21/09/2018 7:49:05 AM: Connection is up
21/09/2018 7:49:08 AM: Update received
21/09/2018 7:49:22 AM: Connection is down
21/09/2018 8:01:47 AM: ListenerTask terminated with status Faulted

This is the console application that produced the output above:

internal class Program
{
    public static void Main(string[] args)
    {
        MonitorConnectionState();

        var db = CreateFirestoreDb(); // TODO
        var documentReference = db.Collection("Temp").Document("Temp");

        documentReference.Listen(snapshot =>
        {
            Console.WriteLine($"{DateTime.Now}: Update received");
        })
        .ListenerTask.ContinueWith(task =>
        {
            Console.WriteLine($"{DateTime.Now}: ListenerTask terminated with status {task.Status}");
        });

        Console.ReadLine();
    }
    public static void MonitorConnectionState()
    {
        var connected = false;
        var connectedNow = false;

        Task.Run(() =>
        {
            while (true)
            {
                try
                {
                    using (var client = new TcpClient())
                    {
                        var timeout = TimeSpan.FromSeconds(1);
                        client.ConnectAsync("firestore.googleapis.com", 443).Wait(timeout);

                        connectedNow = client.Connected;
                    }
                }
                catch (Exception)
                {
                    connectedNow = false;
                }

                if (connected != connectedNow)
                    Console.WriteLine($"{DateTime.Now}: Connection is {(connectedNow ? "up" : "down")}");

                connected = connectedNow;

                Thread.Sleep(100);
            }
        });
    }
}

pauloevpr · 2018-09-21T03:07:32Z

After testing this solution more carefully, I found out that there was an issue with my network which was causing the strange behavior of not receiving updates after a short network outage (30-60 seconds). So I confirm that case 1 is actually working as expected. The callback is fired whenever the connection is restored before timeout.

Only three cases cause the listener to fail:

Connection timeout: as verified by the console application above, the connection will timeout after a very long time;
Network interface shutdown: as verified by the console application above, the listener will fail as soon as the network interface is shutdown;
Network interface change: the listener will also fail as soon as IP settings get changed;

Based on that, it is possible to conclude the API is behaving as expected and therefore there is no real bug.

pauloevpr · 2018-09-21T03:19:34Z

Regarding case 2, it seems pretty simple to restart the listener manually. The only issue is, if the same IWatchState.ResumeToken is not used when creating the new listener, the state will be lost.

So my suggestion is to leave this thread as feature request to allow the manual creation of a listener using an existent state.

jskeet · 2018-09-21T06:06:53Z

Thanks very much for all your diligence in investigating this - it's much appreciated.

Making this a feature request works for me. I've edited the title of the issue and the first comment. I haven't looked into how hard it would be to expose enough state to resume listening - and there may be other reasons why it's not feasible at the server side, such as resume tokens only being relatively short-lived - but I'll ask the Firestore team what they think.

pauloevpr · 2018-09-21T08:50:10Z

In the mean time, I would like to share a temporary workaround which I am using to auto recover faulty listeners. It basically follows Jon's suggestion to add task continuation to detect faulty listeners. The listener is then re-created using the original parameters.

There is very little work for the caller to do. Instead of calling documentReference.Listen(snapshot => { }), call documentReference.AutoListen(snapshot => { }). The hidden AutoFirestoreChangeListener will do the job for you transparently.

Important: Because the listener is recovered by creating a complete new listener, there is no state preservation. So you may want to introduce a post fault callback:

documentReference.AutoListen(snapshot => { }, () => { Console.WriteLine("Listener failed. About to attempt to recover it."); });

Finally the code:

public static class AutoListenExtensions
{
    #region Methods

    public static FirestoreChangeListener AutoListen(this DocumentReference reference, Func<DocumentSnapshot, CancellationToken, Task> callback, Action postFaultCallback = null, CancellationToken cancellationToken = default(CancellationToken))
    {
        return new AutoFirestoreChangeListener(reference, callback, postFaultCallback, cancellationToken);
    }
    public static FirestoreChangeListener AutoListen(this DocumentReference reference, Action<DocumentSnapshot> callback, Action postFaultCallback = null, CancellationToken cancellationToken = default(CancellationToken))
    {
        return new AutoFirestoreChangeListener(reference, callback, postFaultCallback, cancellationToken);
    }
    public static FirestoreChangeListener AutoListen(this CollectionReference reference, Func<QuerySnapshot, CancellationToken, Task> callback, Action postFaultCallback = null, CancellationToken cancellationToken = default(CancellationToken))
    {
        return new AutoFirestoreChangeListener(reference, callback, postFaultCallback, cancellationToken);
    }
    public static FirestoreChangeListener AutoListen(this CollectionReference reference, Action<QuerySnapshot> callback, Action postFaultCallback = null, CancellationToken cancellationToken = default(CancellationToken))
    {
        return new AutoFirestoreChangeListener(reference, callback, postFaultCallback, cancellationToken);
    }
    public static FirestoreChangeListener AutoListen(this Query query, Func<QuerySnapshot, CancellationToken, Task> callback, Action postFaultCallback = null, CancellationToken cancellationToken = default(CancellationToken))
    {
        return new AutoFirestoreChangeListener(query, callback, postFaultCallback, cancellationToken);
    }
    public static FirestoreChangeListener AutoListen(this Query query, Action<QuerySnapshot> callback, Action postFaultCallback = null, CancellationToken cancellationToken = default(CancellationToken))
    {
        return new AutoFirestoreChangeListener(query, callback, postFaultCallback, cancellationToken);
    }

    #endregion
    #region Nested

    private sealed class AutoFirestoreChangeListener
    {
        #region Fields

        private readonly Action _postFaultCallback;
        private FirestoreChangeListener _currentListener;

        #endregion
        #region Constructors

        public AutoFirestoreChangeListener(DocumentReference documentReference, Func<DocumentSnapshot, CancellationToken, Task> callback, Action postFaultCallback = null, CancellationToken cancellationToken = default(CancellationToken))
        {
            if (documentReference == null) throw new ArgumentNullException(nameof(documentReference));
            if (callback == null) throw new ArgumentNullException(nameof(callback));

            _postFaultCallback = postFaultCallback;

            StartDocumentListener(documentReference, callback, cancellationToken);
        }
        public AutoFirestoreChangeListener(DocumentReference documentReference, Action<DocumentSnapshot> callback, Action postFaultCallback = null, CancellationToken cancellationToken = default(CancellationToken))
        {
            if (documentReference == null) throw new ArgumentNullException(nameof(documentReference));
            if (callback == null) throw new ArgumentNullException(nameof(callback));

            _postFaultCallback = postFaultCallback;

            StartDocumentListener(documentReference, callback, cancellationToken);
        }
        public AutoFirestoreChangeListener(CollectionReference collectionReference, Action<QuerySnapshot> callback, Action postFaultCallback = null, CancellationToken cancellationToken = default(CancellationToken))
        {
            if (collectionReference == null) throw new ArgumentNullException(nameof(collectionReference));
            if (callback == null) throw new ArgumentNullException(nameof(callback));

            _postFaultCallback = postFaultCallback;

            StartCollectionListener(collectionReference, callback, cancellationToken);
        }
        public AutoFirestoreChangeListener(CollectionReference collectionReference, Func<QuerySnapshot, CancellationToken, Task> callback, Action postFaultCallback = null, CancellationToken cancellationToken = default(CancellationToken))
        {
            if (collectionReference == null) throw new ArgumentNullException(nameof(collectionReference));
            if (callback == null) throw new ArgumentNullException(nameof(callback));

            _postFaultCallback = postFaultCallback;

            StartCollectionListener(collectionReference, callback, cancellationToken);
        }
        public AutoFirestoreChangeListener(Query query, Action<QuerySnapshot> callback, Action postFaultCallback = null, CancellationToken cancellationToken = default(CancellationToken))
        {
            if (query == null) throw new ArgumentNullException(nameof(query));
            if (callback == null) throw new ArgumentNullException(nameof(callback));

            _postFaultCallback = postFaultCallback;

            StartQueryListener(query, callback, cancellationToken);
        }
        public AutoFirestoreChangeListener(Query query, Func<QuerySnapshot, CancellationToken, Task> callback, Action postFaultCallback = null, CancellationToken cancellationToken = default(CancellationToken))
        {
            if (query == null) throw new ArgumentNullException(nameof(query));
            if (callback == null) throw new ArgumentNullException(nameof(callback));

            _postFaultCallback = postFaultCallback;

            StartQueryListener(query, callback, cancellationToken);
        }

        #endregion
        #region Methods

        public static implicit operator FirestoreChangeListener(AutoFirestoreChangeListener value)
        {
            return value?._currentListener;
        }
        private void StartCollectionListener(CollectionReference collectionReference, Func<QuerySnapshot, CancellationToken, Task> callback, CancellationToken cancellationToken)
        {
            _currentListener = collectionReference.Listen(callback, cancellationToken);
            _currentListener.ListenerTask.ContinueWith(task =>
                {
                    if (!task.IsFaulted) return;

                    _postFaultCallback?.Invoke();
                    StartCollectionListener(collectionReference, callback, cancellationToken);
                },
                cancellationToken);
        }
        private void StartCollectionListener(CollectionReference collectionReference, Action<QuerySnapshot> callback, CancellationToken cancellationToken)
        {
            _currentListener = collectionReference.Listen(callback, cancellationToken);
            _currentListener.ListenerTask.ContinueWith(task =>
                {
                    if (!task.IsFaulted) return;

                    _postFaultCallback?.Invoke();
                    StartCollectionListener(collectionReference, callback, cancellationToken);
                },
                cancellationToken);
        }
        private void StartDocumentListener(DocumentReference documentReference, Func<DocumentSnapshot, CancellationToken, Task> callback, CancellationToken cancellationToken)
        {
            _currentListener = documentReference.Listen(callback, cancellationToken);
            _currentListener.ListenerTask.ContinueWith(task =>
                {
                    if (!task.IsFaulted) return;

                    _postFaultCallback?.Invoke();
                    StartDocumentListener(documentReference, callback, cancellationToken);
                },
                cancellationToken);
        }
        private void StartDocumentListener(DocumentReference documentReference, Action<DocumentSnapshot> callback, CancellationToken cancellationToken)
        {
            _currentListener = documentReference.Listen(callback, cancellationToken);
            _currentListener.ListenerTask.ContinueWith(task =>
                {
                    if (!task.IsFaulted) return;

                    _postFaultCallback?.Invoke();
                    StartDocumentListener(documentReference, callback, cancellationToken);
                },
                cancellationToken);
        }
        private void StartQueryListener(Query query, Func<QuerySnapshot, CancellationToken, Task> callback, CancellationToken cancellationToken)
        {
            _currentListener = query.Listen(callback, cancellationToken);
            _currentListener.ListenerTask.ContinueWith(task =>
                {
                    if (!task.IsFaulted) return;

                    _postFaultCallback?.Invoke();
                    StartQueryListener(query, callback, cancellationToken);
                },
                cancellationToken);
        }
        private void StartQueryListener(Query query, Action<QuerySnapshot> callback, CancellationToken cancellationToken)
        {
            _currentListener = query.Listen(callback, cancellationToken);
            _currentListener.ListenerTask.ContinueWith(task =>
                {
                    if (!task.IsFaulted) return;

                    _postFaultCallback?.Invoke();
                    StartQueryListener(query, callback, cancellationToken);
                },
                cancellationToken);
        }

        #endregion
    }

    #endregion
}

jskeet · 2019-08-20T16:45:58Z

I've spoken with the Firestore team about this, and we'll wait until we hear more users requesting this feature so that we can make sure that the design handles a wide selection of use cases.
The underlying protocol already allows resuming from a timestamp, so it's possible that we'd do that instead - but it makes it tricky to know what's changed.

#3381 adds this feature request to the backlog, so I'll close this issue. Many thanks for all the thoughtful comments though - even if we can't implement all feature requests, it's great to have users who are so involved.

jskeet self-assigned this Sep 17, 2018

jskeet added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Sep 17, 2018

jskeet changed the title ~~Firestore: FirestoreChangeListener does not handle network outage~~ Firestore: Expose sufficient state to manually restart a Listen operation Sep 21, 2018

jskeet added the api: firestore Issues related to the Firestore API. label Apr 17, 2019

jskeet added a commit to jskeet/google-cloud-dotnet that referenced this issue Aug 20, 2019

Add to googleapis#2513 to backlog

72fb4e7

jskeet closed this as completed Aug 20, 2019

jskeet added a commit that referenced this issue Aug 20, 2019

Add to #2513 to backlog

c377588

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Firestore: Expose sufficient state to manually restart a Listen operation #2513

Firestore: Expose sufficient state to manually restart a Listen operation #2513

pauloevpr commented Sep 17, 2018 •

edited by jskeet

Loading

jskeet commented Sep 17, 2018

pauloevpr commented Sep 18, 2018

jskeet commented Sep 18, 2018

pauloevpr commented Sep 20, 2018 •

edited

Loading

pauloevpr commented Sep 21, 2018 •

edited

Loading

pauloevpr commented Sep 21, 2018

jskeet commented Sep 21, 2018 •

edited

Loading

pauloevpr commented Sep 21, 2018 •

edited

Loading

jskeet commented Aug 20, 2019 •

edited

Loading

Firestore: Expose sufficient state to manually restart a Listen operation #2513

Firestore: Expose sufficient state to manually restart a Listen operation #2513

Comments

pauloevpr commented Sep 17, 2018 • edited by jskeet Loading

jskeet commented Sep 17, 2018

pauloevpr commented Sep 18, 2018

jskeet commented Sep 18, 2018

pauloevpr commented Sep 20, 2018 • edited Loading

pauloevpr commented Sep 21, 2018 • edited Loading

pauloevpr commented Sep 21, 2018

jskeet commented Sep 21, 2018 • edited Loading

pauloevpr commented Sep 21, 2018 • edited Loading

jskeet commented Aug 20, 2019 • edited Loading

pauloevpr commented Sep 17, 2018 •

edited by jskeet

Loading

pauloevpr commented Sep 20, 2018 •

edited

Loading

pauloevpr commented Sep 21, 2018 •

edited

Loading

jskeet commented Sep 21, 2018 •

edited

Loading

pauloevpr commented Sep 21, 2018 •

edited

Loading

jskeet commented Aug 20, 2019 •

edited

Loading