[Bug Report] Microsoft.Azure.Devices.Client can throw uncaught NREs causing applications to crash #2107
Replies: 20 comments
-
The relevant bits of the class which make use of DeviceClient are:
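A minimal sketch of the shape of that class (hypothetical names; _IOTHubClient, IsSending, and DisconnectClientAsync match the reply below, the rest is assumption):

// Sketch only: a hypothetical store-and-forward sender built around DeviceClient.
// Member names other than _IOTHubClient / IsSending are assumptions, not the
// original code.
using System.Threading.Tasks;
using Microsoft.Azure.Devices.Client;

public class IotHubSender
{
    private DeviceClient _IOTHubClient;
    public bool IsSending { get; private set; }

    private Task ConnectClientAsync(string connectionString)
    {
        _IOTHubClient = DeviceClient.CreateFromConnectionString(connectionString, TransportType.Mqtt);
        IsSending = true;
        return _IOTHubClient.OpenAsync();
    }

    private async Task SendAsync(byte[] payload)
    {
        // Message is IDisposable; dispose it once the send completes.
        using var message = new Message(payload);
        await _IOTHubClient.SendEventAsync(message);
    }
}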
-
Thanks for bringing this to our attention. An unexplained NRE is always undesirable, so we'll definitely get this looked at. We'll use the code you've shared to reproduce this at our end. In the meanwhile, would it be possible for you to enable logging and share the logs with us: https://github.com/Azure/azure-iot-sdk-csharp/tree/master/tools/CaptureLogs

A couple of points on the code:
private async Task DisconnectClientAsync()
{
    IsSending = false;
    if (_IOTHubClient != null)
    {
        // Close the connection gracefully before releasing the client.
        await _IOTHubClient.CloseAsync();
        _IOTHubClient.Dispose();
        _IOTHubClient = null;
    }
}
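One point along these lines (a sketch under our own assumptions, not necessarily the original feedback): if DisconnectClientAsync can run concurrently with a send, the null-out can race with a caller that already read _IOTHubClient. A SemaphoreSlim can serialize the swap:

// Sketch: serialize disposal against any in-flight send that also
// acquires _clientLock before touching _IOTHubClient.
private readonly System.Threading.SemaphoreSlim _clientLock = new(1, 1);

private async Task DisconnectClientAsync()
{
    IsSending = false;
    await _clientLock.WaitAsync();
    try
    {
        if (_IOTHubClient != null)
        {
            await _IOTHubClient.CloseAsync();
            _IOTHubClient.Dispose();
            _IOTHubClient = null;
        }
    }
    finally
    {
        _clientLock.Release();
    }
}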
-
Logging with logman seems to be generating about 1 MB/minute of log on one of our lower-bandwidth instances. Given that reproducing the problem takes ... a while, and I'm not sure I can truncate the log files to a particular day, this is likely to generate dozens of GB of logs before anything happens. Are you OK with that, or do you have suggestions for ways to reduce the disk space consumption? (Apart from logging in every once in a while, stopping the trace, deleting the log, and starting it up again.)

You're correct in your assumption: credentials do not change, and device availability does not change. We've occasionally had cases where the IoT hub has run out of allocated messages for the day, and brief network outages, but that happens more rarely than the crash.

The application itself is a store-and-forward service: it picks up messages from a queue on disk, sends them on to the IoT hub, then deletes them when they're successfully transmitted, however long that takes. There are never more than 10k messages pending transmission at any one time. I'll add the suggested changes to the code.
-
Logman supports circular logging; you could use that to retain the last x MB of log data. If you have recovery code added to your application, maybe you could monitor for the NRE and stop log collection at that point. Otherwise, your application (and logman) should stop on hitting the exception.
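For example, something like the following creates a circular session capped at 512 MB (the session name iot-sdk-trace and the size are placeholders, not values from this thread; the GUID is the device client provider discussed later on):

logman create trace iot-sdk-trace -o iot-sdk-trace.etl -f bincirc -max 512 -ets -p "{ddbee999-a79e-5050-ea3c-6d1a8a7bafdd}"
logman stop iot-sdk-trace -ets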
-
@ccruden-shookiot I thought I recognized your name. Attaching #1737 for reference.
-
@abhipsaMisra Thanks for the logging help - rolling logs in place at all sites. Unfortunately we haven't found a way to handle or even catch the NRE. It's not caught by the task scheduler's unobserved-task handler, and because it's triggered on a callback, we can't wrap it in a try/catch.
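For reference, a sketch of the handlers in question; neither observes an exception thrown directly on a thread-pool callback, which is why the process still dies:

using System;
using System.Threading.Tasks;

// Neither hook helps here: the NRE is thrown on a raw thread-pool work item
// (see the stack trace in the original post), not on an unobserved faulted Task.
TaskScheduler.UnobservedTaskException += (sender, e) =>
{
    e.SetObserved(); // only suppresses escalation for faulted, collected Tasks
};
AppDomain.CurrentDomain.UnhandledException += (sender, e) =>
{
    // This one does fire for the crash, but purely as a notification;
    // e.IsTerminating is true and the process still exits afterwards.
};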
-
@abhipsaMisra Actually, it seems that when I turn on logging with the full set of providers in tools/CaptureLogs, transmission of data is slowed down so much that it can't keep up with the rate at which data is being generated. Is it going to affect things if I trace just the entries for MS.Azure.Devices.Client? (I believe that's ddbee999-a79e-5050-ea3c-6d1a8a7bafdd.)
-
@ccruden-shookiot You should be OK limiting the trace providers to the DeviceClient traces (ddbee999-a79e-5050-ea3c-6d1a8a7bafdd); that should be enough for us to investigate the MQTT layer operations.
-
Actually, could you add the MQTT trace provider as well (d079e771-0495-4124-bd2f-ab63c2b50525)? We'll need that information too.
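If the session was created as sketched earlier, the extra provider can be enabled on the running session with something like (session name is a placeholder):

logman update trace iot-sdk-trace -p "{d079e771-0495-4124-bd2f-ab63c2b50525}" -ets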
-
Hate to say it, but I don't think I can enable that one. If I do - even if that's the only trace provider I use - data gets generated faster than it's transmitted.
-
Finally captured a trace from this crash. How would you like it sent to you? It's a 55 MB file, zipped, and it doesn't seem to want to attach to a reply here.
-
@ccruden-uptake |
-
I've shared a link to the file on OneDrive with @jamdavi, as I had his email address. (While I don't think the trace contains anything client-confidential, I figure better safe than sorry.) Hopefully you can pick it up from him....
-
@ccruden-uptake
-
@azabbasi Sorry, no. As mentioned above, if I start capturing the MQTT provider, the program can no longer keep up with the data it's being fed.
-
Unfortunately, the provided ETL logs are missing some metadata, and after plenty of attempts we failed to find the root cause. We are deploying a small app to a long-running agent to try to repro the issue ourselves again.
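For anyone following along, a minimal sketch of such a long-running repro app (the environment variable name, payload size, and send loop are assumptions, not the team's actual test app):

using System;
using Microsoft.Azure.Devices.Client;

// Sketch: send continuously over MQTT until the process dies, mirroring the
// store-and-forward workload described above.
var client = DeviceClient.CreateFromConnectionString(
    Environment.GetEnvironmentVariable("IOTHUB_DEVICE_CONN"), TransportType.Mqtt);
await client.OpenAsync();

while (true)
{
    using var message = new Message(new byte[64 * 1024]); // ~64 KB payload
    await client.SendEventAsync(message);
}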
-
We'll keep logging then and send you the logs when there's another crash? Hopefully the metadata will show up in one of them....
-
That would be ideal; we will also keep running the app on our side to try to repro the behavior.
-
@azabbasi I believe we now have traces from a crash that include both the MSADC and MQTT providers. It's about a 6 MB file, zipped. Given that I'd rather not attach it directly here, on the off chance it contains customer-sensitive data, how can I send it to you?
-
@ccruden-uptake
-
Context
Description of the issue
MSADC can throw an uncaught, uncatchable null reference exception that causes applications to crash. This happens infrequently - on the order of once every couple of weeks for applications continually transmitting roughly 30-60 GB/day - but often enough to increase downtime on applications that are meant to have near-100% uptime.
Event viewer logs the following:
Application: StoreAndForward.exe
CoreCLR Version: 5.0.120.57516
.NET Version: 5.0.1
Description: The process was terminated due to an unhandled exception.
Exception Info: System.NullReferenceException: Object reference not set to an instance of an object.
at Microsoft.Azure.Devices.Client.Transport.Mqtt.OrderedTwoPhaseWorkQueue`2.Abort(Exception exception)
at Microsoft.Azure.Devices.Client.Transport.Mqtt.MqttIotHubAdapter.ShutdownOnErrorAsync(IChannelHandlerContext context, Exception exception)
at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__140_1(Object state)
at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
at System.Threading.ThreadPoolWorkQueue.Dispatch()