Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug Report] NullReferenceException in AuthenticationWithTokenRefresh #1911

Closed
its-phil opened this issue Apr 26, 2021 · 16 comments
Closed

[Bug Report] NullReferenceException in AuthenticationWithTokenRefresh #1911

its-phil opened this issue Apr 26, 2021 · 16 comments
Assignees
Labels
bug Something isn't working. fix-checked-in Fix checked into main or preview, but not yet released. IoTSDK Tracks all IoT SDK issues across the board

Comments

@its-phil
Copy link

Context

  • OS, version, SKU and CPU architecture used: Windows 10 Desktop x64
  • Application's .NET Target Framework : netcoreapp3.1, net5.0
  • Device: Laptop
  • SDK version used: 1.36.0

Description of the issue

I was testing the DeviceClient connection using TPM for attestation when I stumbled across a NullReferenceException.
The desired sequence is:

  1. Create a DeviceClient instance for direct to the IoT Hub (no gateway).
  2. Disconnect the first DeviceClient and connect again, this time via an Edge gateway.

If the IoT hub address is valid it's all good. However, if for some reason the IoT Hub address is invalid (e.g. not reachable), i get said NullReferenceException in the second step.

Code sample exhibiting the issue

See this repo for a demo: https://github.com/its-phil/GatewayTest

I debugged into the SDK code and found that the exception occurs in AuthenticationWithTokenRefresh.cs in line 78 (Debug.Assert(_lock != null);).

Am I using the SDK wrong or might this be an issue in the SDK?

Console log of the issue

Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.
   at Microsoft.Azure.Devices.Client.AuthenticationWithTokenRefresh.GetTokenAsync(String iotHub)
   at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTCbsTokenProvider.GetTokenAsync(Uri namespaceAddress, String appliesTo, String[] requiredClaims)
   at Microsoft.Azure.Amqp.TaskHelpers.EndAsyncResult(IAsyncResult asyncResult)
   at Microsoft.Azure.Amqp.IteratorAsyncResult`1.<>c.<CallTask>b__24_1(TIteratorAsyncResult thisPtr, IAsyncResult r)
   at Microsoft.Azure.Amqp.IteratorAsyncResult`1.StepCallback(IAsyncResult result)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.Amqp.ExceptionDispatcher.Throw(Exception exception)
   at Microsoft.Azure.Amqp.AsyncResult.End[TAsyncResult](IAsyncResult result)
   at Microsoft.Azure.Amqp.AsyncResult`1.End(IAsyncResult asyncResult)
   at Microsoft.Azure.Amqp.AmqpCbsLink.EndSendToken(IAsyncResult result)
   at Microsoft.Azure.Amqp.AmqpCbsLink.<>c__DisplayClass4_0.<SendTokenAsync>b__1(IAsyncResult a)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTCbsLink.SendTokenAsync(ICbsTokenProvider tokenProvider, Uri namespaceAddress, String audience, String resource, String[] requiredClaims, TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpAuthenticationRefresher.InitLoopAsync(TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTConnection.CreateRefresherAsync(DeviceIdentity deviceIdentity, TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpConnectionHolder.CreateRefresherAsync(DeviceIdentity deviceIdentity, TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpUnit.EnsureSessionIsOpenAsync(TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpUnit.OpenAsync(TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpTransportHandler.OpenAsync(CancellationToken cancellationToken)
   at Microsoft.Azure.Devices.Client.Transport.ProtocolRoutingDelegatingHandler.OpenAsync(CancellationToken cancellationToken)
   at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.<>c__DisplayClass27_0.<<ExecuteWithErrorHandlingAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.ExecuteWithErrorHandlingAsync[T](Func`1 asyncOperation)
   at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.<>c__DisplayClass38_0.<<OpenInternalAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.EnsureOpenedAsync(CancellationToken cancellationToken)
   at GatewayTest.Program.Main(String[] args) in W:\GatewayTest\Program.cs:line 35
   at GatewayTest.Program.<Main>(String[] args)
@its-phil its-phil added the bug Something isn't working. label Apr 26, 2021
@github-actions github-actions bot added the IoTSDK Tracks all IoT SDK issues across the board label Apr 26, 2021
@abhipsaMisra
Copy link
Member

Thanks for reporting this @its-phil . The debug assert referenced asserts that the lock obtained during token generation is valid. You hitting a NullReferenceException indicates that the lock has possibly been disposed at some point and is not ready for reuse.

Could you explain a bit about this operation that you're carrying out:

Disconnect the first DeviceClient and connect again, this time via an Edge gateway.
If the IoT hub address is valid it's all good. However, if for some reason the IoT Hub address is invalid (e.g. not reachable), i get said NullReferenceException in the second step.

Do you mean that you call CloseAsync() and then OpenAsync() again, or do you mean that you dispose the existing client instance and reinitialize it with the credentials required to connect to an Edge gateway?
A note here that reopening a closed client instance is not supported, since the transport layer is destroyed on calling close. To reuse a closed client instance you'll need to dispose it first and then reinitialize it again.

Also, do you hit this NullReferenceException if you try to connect to an invalid IoT Hub address right from the beginning?

@its-phil
Copy link
Author

Thanks. So here's the normal sequence of operation I'd like to achieve:

Preconditions:

  • The TPM's endorsement key is registered in an individual enrollment on the DPS.
  • There's an IoT device entry on the IoT Hub for the device (created by the DPS) and the Edge is set as parent for the device.
  • Device and Edge are connected to the internet.
  1. The device connects to the DPS at least once to properly enroll. The device stores the IoT hub address it gets from the DPS.
  2. The device connects to the IoT Hub directly to fetch desired properties and disconnects from the IoT Hub after this. The software calls CloseAsync and disposes the DeviceClient instance because it is created in a using statement.
  3. After that the device connects to the IoT Hub via the gateway (OpenAsync with the IoT Hub address and gateway address). It uses the same DeviceAuthenticationWithTpm instance as in step 2 for this.

So far, this procedure works as expected. All instances can be connected correctly. Now the preconditions change:

  • Device and Edge are not connected to the internet.

After steps 1 through 3 ran successfully at least once the device can communicate with the Edge even if both are not connected to the internet any longer. This is where the example I posted comes in: with the example I simulated that the device has an error in its stored IoT Hub address. Note that in the example I omitted the DPS connection attempt because it wouldn't work in the offline case anyway.

I admit, this scenario it's a bit artificial, but I wanted to see what happens anyway. I am expecting an error because of the invalid IoT Hub address, so the behavior I observed is not a real issue for me. But the NullReferenceException was somewhat unexpected.

@abhipsaMisra abhipsaMisra self-assigned this Apr 28, 2021
@abhipsaMisra
Copy link
Member

Thank you for the detailed explanation @its-phil . You are correct, connecting to an invalid IoT Hub endpoint should have returned an exception of type IotHubException and not a NullReferenceException. I'll investigate this.

@abhipsaMisra
Copy link
Member

abhipsaMisra commented Apr 30, 2021

@its-phil I have not been able to reproduce the above scenario. If I connect to an invalid IoT Hub endpoint, I get a socket exception saying that the host couldn't be resolved. On connecting to a different IoT hub endpoint returns an exception that says that the device registration couldn't be found (which isn't the scenario that you mentioned, but I just wanted to point it out).

Would you be able to share the complete device logs for your scenario: https://github.com/Azure/azure-iot-sdk-csharp/tree/master/tools/CaptureLogs

@abhipsaMisra
Copy link
Member

Also, in my tests I've been connecting directly to hub, without any gateway in between (even though the sample uses localhost as your local gateway): https://github.com/its-phil/GatewayTest/blob/master/Program.cs#L36.
Do you see the same behavior at your end if you remove local gateway from the client initializer?

@abhipsaMisra abhipsaMisra added the customer-input-needed Issue lacks enough data for a proper investigation. label May 3, 2021
@its-phil
Copy link
Author

its-phil commented May 5, 2021

@abhipsaMisra Thanks for testing.

If I connect to an invalid IoT Hub endpoint, I get a socket exception saying that the host couldn't be resolved.

That's what I would have expected, too. I attached a trace log. Does that help you?
2021-05-05 GatewayTest_traces.zip

Also, just FYI: I run the gateway on my local machine using iotedgehubdev. That's why my gateway hostname is localhost.

Also, in my tests I've been connecting directly to hub, without any gateway in between (even though the sample uses localhost as your local gateway)

I ran the test program without the gateway hostname parameter in https://github.com/its-phil/GatewayTest/blob/master/Program.cs#L36 (i.e. trying to connect to the IoT Hub directly once more after trying to do so in L24). The program doesn't crash with a NullReferenceException. Instead, I get the following (understandable) exception:

Unhandled exception. Microsoft.Azure.Devices.Client.Exceptions.IotHubCommunicationException: Transient network error occurred, please retry.
 ---> System.OperationCanceledException: The operation was canceled.
   at System.Threading.CancellationToken.ThrowOperationCanceledException()
   at System.Threading.CancellationToken.ThrowIfCancellationRequested()
   at Microsoft.Azure.Devices.Client.Transport.ProtocolRoutingDelegatingHandler.OpenAsync(CancellationToken cancellationToken)
   at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.<>c__DisplayClass27_0.<<ExecuteWithErrorHandlingAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.ExecuteWithErrorHandlingAsync[T](Func`1 asyncOperation)
   --- End of inner exception stack trace ---
   at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.ExecuteWithErrorHandlingAsync[T](Func`1 asyncOperation)
   at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.<>c__DisplayClass38_0.<<OpenInternalAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.EnsureOpenedAsync(CancellationToken cancellationToken)

@abhipsaMisra
Copy link
Member

Thanks for the confirmation @its-phil ; it looks like the gateway hostname route is the one that is broken here. I'll try to repro this with a gateway in between and investigate it.

@abhipsaMisra abhipsaMisra removed the customer-input-needed Issue lacks enough data for a proper investigation. label May 10, 2021
@abhipsaMisra
Copy link
Member

I believe I was able to reproduce the issue that you're hitting:

The DeviceAuthenticationWithTpm instance that you supplied to the client initialization method is IDisposable, and is getting disposed by the client once the client itself is disposed. As a result, any subsequent token generation request will hit an NRE on the _lock object being referenced internally.
Since DeviceAuthenticationWithTpm is a user supplied object, the lifetime management should be delegated to the user, and the sdk should not be disposing it. I will fix this.

In addition, there is another issue wherein reusing a token refresh enabled authentication method can result in the client not refreshing the sas tokens properly, I'll put a fix out for this as well.
In your scenario (3) below where you mention that the flow works as expected for a valid endpoint and gateway address, I suspect that you would hit an authorization related exception once your existing tokens expire and the client attempts token renewal. Since the default time to live is an hour, it is possible that this issue might be missed unless a long running test is initiated. Is this something that you have tested/ encountered at your end?

  1. The device connects to the DPS at least once to properly enroll. The device stores the IoT hub address it gets from the DPS.
  2. The device connects to the IoT Hub directly to fetch desired properties and disconnects from the IoT Hub after this. The software calls CloseAsync and disposes the DeviceClient instance because it is created in a using statement.
  3. After that the device connects to the IoT Hub via the gateway (OpenAsync with the IoT Hub address and gateway address). It uses the same DeviceAuthenticationWithTpm instance as in step 2 for this.

@its-phil
Copy link
Author

@abhipsaMisra Thanks for investigating this issue. I haven't seen the token-refresh issue you mentioned yet. If the time to live is one hour, there's a good chance I haven't tested my artificial scenario long enough.

Could you clear up one thing, please? I understood that the DeviceClient tries to refresh its token after a while (depending on the time-to-live). I assume the DeviceClient negotiates this token refresh with the gateway when it is connected to the gateway. Does the gateway have to have an online connection to the IoT Hub for token refresh to succeed?

@abhipsaMisra
Copy link
Member

@its-phil For a regular device client - IoT hub connection, yes, as you've said that the client will negotiate the token with the hub service and there needs to be an active connection. As for the scenario when you have a gateway in between, I believe that will depend on how the gateway has been designed. If the gateway that you're using is IoT Edge, I believe that it does support offline capabilities for a connected client.
Let me know if you're using IoT Edge and then I can get this confirmed from the Edge team.

@its-phil
Copy link
Author

Thanks, yes, the gateway is an Azure IoT Edge.

@abhipsaMisra
Copy link
Member

abhipsaMisra commented May 20, 2021

The Edge team has confirmed that reauthentication can happen offline.
Also, FYI - you can reach out to the Edge team here, in case you have any questions/ issues: https://github.com/Azure/iotedge/issues

@its-phil
Copy link
Author

Thank you very much for your support!

@abhipsaMisra
Copy link
Member

@its-phil Until we get this release out, a temporary workaround would be to create a new authentication method for each client instance that you initialize; i.e. instead of reusing the authentication method here, initialize a new instance of the DeviceAuthenticationWithTpm and pass it in.

@its-phil
Copy link
Author

@abhipsaMisra Thanks for the hint. As a matter of fact, that's what I already do to work around this.

@abhipsaMisra
Copy link
Member

This fix has been released in our latest release: https://github.com/Azure/azure-iot-sdk-csharp/releases/tag/2021-8-11
Closing this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working. fix-checked-in Fix checked into main or preview, but not yet released. IoTSDK Tracks all IoT SDK issues across the board
Projects
None yet
Development

No branches or pull requests

2 participants