[Bug Report] Microsoft.Azure.Devices.Client can throw uncaught NREs causing applications to crash #2107
Replies: 20 comments
-
The relevant bits of the class which make use of DeviceClient are:
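A minimal sketch of the shape of that class (hypothetical names; _IOTHubClient, IsSending, and DisconnectClientAsync match the reply below, the rest is assumption):

// Sketch only: a hypothetical store-and-forward sender built around DeviceClient.
// Member names other than _IOTHubClient / IsSending are assumptions, not the
// original code.
using System.Threading.Tasks;
using Microsoft.Azure.Devices.Client;

public class IotHubSender
{
    private DeviceClient _IOTHubClient;
    public bool IsSending { get; private set; }

    private Task ConnectClientAsync(string connectionString)
    {
        _IOTHubClient = DeviceClient.CreateFromConnectionString(connectionString, TransportType.Mqtt);
        IsSending = true;
        return _IOTHubClient.OpenAsync();
    }

    private async Task SendAsync(byte[] payload)
    {
        // Message is IDisposable; dispose it once the send completes.
        using var message = new Message(payload);
        await _IOTHubClient.SendEventAsync(message);
    }
}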
-
Thanks for bringing this to our attention. An unexplained NRE is always undesirable, so we'll definitely get this looked at. We'll use the code you've shared to reproduce this at our end. In the meanwhile, would it be possible for you to enable logging and share the logs with us: https://github.com/Azure/azure-iot-sdk-csharp/tree/master/tools/CaptureLogs

A couple of points on the code:
private async Task DisconnectClientAsync()
{
    IsSending = false;
    if (_IOTHubClient != null)
    {
        // Close the connection gracefully before releasing the client.
        await _IOTHubClient.CloseAsync();
        _IOTHubClient.Dispose();
        _IOTHubClient = null;
    }
}
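One point along these lines (a sketch under our own assumptions, not necessarily the original feedback): if DisconnectClientAsync can run concurrently with a send, the null-out can race with a caller that already read _IOTHubClient. A SemaphoreSlim can serialize the swap:

// Sketch: serialize disposal against any in-flight send that also
// acquires _clientLock before touching _IOTHubClient.
private readonly System.Threading.SemaphoreSlim _clientLock = new(1, 1);

private async Task DisconnectClientAsync()
{
    IsSending = false;
    await _clientLock.WaitAsync();
    try
    {
        if (_IOTHubClient != null)
        {
            await _IOTHubClient.CloseAsync();
            _IOTHubClient.Dispose();
            _IOTHubClient = null;
        }
    }
    finally
    {
        _clientLock.Release();
    }
}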
-
Logging with logman seems to be generating about 1 MB/minute of log on one of our lower-bandwidth instances. Given that reproducing the problem takes ... a while, and I'm not sure I can truncate the log files to a particular day, this is likely to generate dozens of GB of logs before anything happens. Are you OK with that, or do you have suggestions for ways to reduce the disk space consumption? (Apart from logging in every once in a while, stopping the trace, deleting the log, and starting it up again.)

You're correct in your assumption: credentials do not change, and device availability does not change. We've occasionally had cases where the IoT hub has run out of allocated messages for the day, and brief network outages, but that happens more rarely than the crash.

The application itself is a store-and-forward service: it picks up messages from a queue on disk, sends them on to the IoT hub, then deletes them when they're successfully transmitted, however long that takes. There are never more than 10k messages pending transmission at any one time. I'll add the suggested changes to the code.
-
Logman supports circular logging; you could use that to retain the last x MB of log data. If you have recovery code added to your application, maybe you could monitor for the NRE and stop log collection at that point. Otherwise, your application (and logman) should stop on hitting the exception.
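For example, something like the following creates a circular session capped at 512 MB (the session name iot-sdk-trace and the size are placeholders, not values from this thread; the GUID is the device client provider discussed later on):

logman create trace iot-sdk-trace -o iot-sdk-trace.etl -f bincirc -max 512 -ets -p "{ddbee999-a79e-5050-ea3c-6d1a8a7bafdd}"
logman stop iot-sdk-trace -ets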
-
@ccruden-shookiot I thought I recognized your name. Attaching #1737 for reference.
-
@abhipsaMisra Thanks for the logging help - rolling logs in place at all sites. Unfortunately we haven't found a way to handle or even catch the NRE. It's not caught by the task scheduler's unobserved-task handler, and because it's triggered on a callback, we can't wrap it in a try/catch.
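For reference, a sketch of the handlers in question; neither observes an exception thrown directly on a thread-pool callback, which is why the process still dies:

using System;
using System.Threading.Tasks;

// Neither hook helps here: the NRE is thrown on a raw thread-pool work item
// (see the stack trace in the original post), not on an unobserved faulted Task.
TaskScheduler.UnobservedTaskException += (sender, e) =>
{
    e.SetObserved(); // only suppresses escalation for faulted, collected Tasks
};
AppDomain.CurrentDomain.UnhandledException += (sender, e) =>
{
    // This one does fire for the crash, but purely as a notification;
    // e.IsTerminating is true and the process still exits afterwards.
};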
-
@abhipsaMisra Actually, it seems that when I turn on logging with the full set of providers in tools/CaptureLogs, transmission of data is slowed down so much that it can't keep up with the rate at which data is being generated. Is it going to affect things if I trace just the entries for MS.Azure.Devices.Client? (I believe that's ddbee999-a79e-5050-ea3c-6d1a8a7bafdd.)
-
@ccruden-shookiot You should be OK limiting the trace providers to the DeviceClient traces (ddbee999-a79e-5050-ea3c-6d1a8a7bafdd); that should be enough for us to investigate the MQTT layer operations.
-
Actually, could you add the MQTT trace provider as well (d079e771-0495-4124-bd2f-ab63c2b50525)? We'll need that information too.
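If the session was created as sketched earlier, the extra provider can be enabled on the running session with something like (session name is a placeholder):

logman update trace iot-sdk-trace -p "{d079e771-0495-4124-bd2f-ab63c2b50525}" -ets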
-
Hate to say it, but I don't think I can enable that one. If I do - even if that's the only trace provider I use - data gets generated faster than it's transmitted.
-
Finally captured a trace from this crash. How would you like it sent to you? It's a 55 MB file, zipped, and it doesn't seem to want to attach to a reply here.
-
@ccruden-uptake |
-
I've shared a link to the file on OneDrive with @jamdavi, as I had his email address. (While I don't think the trace contains anything client-confidential, I figure better safe than sorry.) Hopefully you can pick it up from him....
-
@ccruden-uptake
-
@azabbasi Sorry, no. As mentioned above, if I start capturing the MQTT provider, the program can no longer keep up with the data it's being fed.
-
Unfortunately, the provided ETL logs are missing some metadata, and after plenty of attempts we failed to find the root cause. We are deploying a small app to a long-running agent to try to repro the issue ourselves again.
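For anyone following along, a minimal sketch of such a long-running repro app (the environment variable name, payload size, and send loop are assumptions, not the team's actual test app):

using System;
using Microsoft.Azure.Devices.Client;

// Sketch: send continuously over MQTT until the process dies, mirroring the
// store-and-forward workload described above.
var client = DeviceClient.CreateFromConnectionString(
    Environment.GetEnvironmentVariable("IOTHUB_DEVICE_CONN"), TransportType.Mqtt);
await client.OpenAsync();

while (true)
{
    using var message = new Message(new byte[64 * 1024]); // ~64 KB payload
    await client.SendEventAsync(message);
}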
-
We'll keep logging then and send you the logs when there's another crash? Hopefully the metadata will show up in one of them....
-
That would be ideal; we will also keep running the app on our side to try to repro the behavior.
-
@azabbasi I believe we now have traces from a crash that include both the MSADC and MQTT providers. It's about a 6 MB file, zipped. Given that I'd rather not attach it directly here, on the off chance it contains customer-sensitive data, how can I send it to you?
-
@ccruden-uptake
-
Context
Description of the issue
MSADC can throw an uncaught, uncatchable null reference exception that causes applications to crash. This happens infrequently - on the order of once every couple of weeks for applications continually transmitting roughly 30-60 GB/day - but often enough to increase downtime on applications that are meant to have near-100% uptime.
Event viewer logs the following:
Application: StoreAndForward.exe
CoreCLR Version: 5.0.120.57516
.NET Version: 5.0.1
Description: The process was terminated due to an unhandled exception.
Exception Info: System.NullReferenceException: Object reference not set to an instance of an object.
at Microsoft.Azure.Devices.Client.Transport.Mqtt.OrderedTwoPhaseWorkQueue`2.Abort(Exception exception)
at Microsoft.Azure.Devices.Client.Transport.Mqtt.MqttIotHubAdapter.ShutdownOnErrorAsync(IChannelHandlerContext context, Exception exception)
at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__140_1(Object state)
at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
at System.Threading.ThreadPoolWorkQueue.Dispatch()