-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NameResolutionPal.Unix enabled async name resolution #34633
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor notes for review
src/libraries/Common/src/Interop/Unix/System.Native/Interop.GetHostName.cs
Outdated
Show resolved
Hide resolved
So this change was a bad idea ;-)
Before I didn't last the async operation as it was on the stack. Now it's part of the GetAddrInfoAsyncState which gets heap allocted, so it last. I'm not sure if addrinfo needs to last the async operation, as the native tests pass either way. This change makes it more correct, nevertheless.
static void GetHostEntryForNameCreateHint(struct addrinfo* hint) | ||
{ | ||
// Get all address families and the canonical name | ||
|
||
memset(hint, 0, sizeof(struct addrinfo)); | ||
hint->ai_family = AF_UNSPEC; | ||
hint->ai_flags = AI_CANONNAME; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the parameter to getaddrinfo()
is constant, this could be defined statically and the address passed like that everywhere where it's needed:
static void GetHostEntryForNameCreateHint(struct addrinfo* hint) | |
{ | |
// Get all address families and the canonical name | |
memset(hint, 0, sizeof(struct addrinfo)); | |
hint->ai_family = AF_UNSPEC; | |
hint->ai_flags = AI_CANONNAME; | |
} | |
static const struct addrinfo s_hostEntryForNameCreateHint = { .ai_family = AI_UNSPEC, .ai_flags = AI_CANONNAME }; |
int result = gai_error(&state->gai_request); | ||
HostEntry* entry = state->entry; | ||
|
||
int ret = ConvertGetAddrInfoAndGetNameInfoErrorsToPal(result); | ||
|
||
if (result == 0) | ||
{ | ||
const uint8_t* address = state->address; | ||
struct addrinfo* info = state->gai_request.ar_result; | ||
|
||
ret = GetHostEntries(address, info, entry); | ||
} | ||
|
||
free(state); | ||
|
||
if (callback != NULL) | ||
{ | ||
callback(entry, ret); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int result = gai_error(&state->gai_request); | |
HostEntry* entry = state->entry; | |
int ret = ConvertGetAddrInfoAndGetNameInfoErrorsToPal(result); | |
if (result == 0) | |
{ | |
const uint8_t* address = state->address; | |
struct addrinfo* info = state->gai_request.ar_result; | |
ret = GetHostEntries(address, info, entry); | |
} | |
free(state); | |
if (callback != NULL) | |
{ | |
callback(entry, ret); | |
} | |
int ret = ConvertGetAddrInfoAndGetNameInfoErrorsToPal(gai_error(&state->gai_request)); | |
if (ret == 0) | |
{ | |
const uint8_t* address = state->address; | |
struct addrinfo* info = state->gai_request.ar_result; | |
ret = GetHostEntries(address, info, entry); | |
} | |
if (callback != NULL) | |
{ | |
callback(state->entry, ret); | |
} | |
free(state); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had the free(state)
intentionally before calling the callback.
entry
is passed from managed to native and via callback back to managed -- so the position offree(state)
doesn't change anything- if the callback (managed side) has an exception
state
won't be freed an so this can be a memory leak -- althoughProcessResult
on the managed side (the "callback worker") has atry-finally
I prefer leaving the free(state)
before the callback. When there is a good reason to move it after the callback, I'll do it for sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both are equivalent, yes, but the suggested change is slightly shorter (and doesn't have a variable that's attributed to but never read from, which is cleaner).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(It's just a suggestion, no need to change it at all.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but I like it as the code looks cleaner.
Anyway I highly appreciate the feedback here! (My C is a bit rusty, so it's good to get some hints.)
int32_t SystemNative_PlatformSupportsGetAddrInfoAsync() | ||
{ | ||
#if HAVE_GETADDRINFO_A | ||
return 1; | ||
#else | ||
return 0; | ||
#endif | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to avoid this P/Invoke every time?
Also, since cmakedefine01 is used, maybe this could be rewritten as:
int32_t SystemNative_PlatformSupportsGetAddrInfoAsync() | |
{ | |
#if HAVE_GETADDRINFO_A | |
return 1; | |
#else | |
return 0; | |
#endif | |
} | |
int32_t SystemNative_PlatformSupportsGetAddrInfoAsync() | |
{ | |
return HAVE_GETADDRINFO_A; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's cached on the managed side.
Nice trick to return the define. I didn't think about that. Thx.
Will address the other points tomorrow.
struct GetAddrInfoAsyncState* state = malloc(sizeof(struct GetAddrInfoAsyncState)); | ||
|
||
state->gai_request.ar_name = (const char*)address; | ||
state->gai_request.ar_service = NULL; | ||
state->gai_request.ar_request = &hint; | ||
state->gai_request.ar_result = NULL; | ||
state->gai_requests = &state->gai_request; | ||
|
||
state->sigevent.sigev_notify = SIGEV_THREAD; | ||
state->sigevent.sigev_value.sival_ptr = state; | ||
state->sigevent.sigev_notify_function = GetHostEntryForNameAsyncComplete; | ||
|
||
state->address = address; | ||
state->entry = entry; | ||
state->callback = callback; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
struct GetAddrInfoAsyncState* state = malloc(sizeof(struct GetAddrInfoAsyncState)); | |
state->gai_request.ar_name = (const char*)address; | |
state->gai_request.ar_service = NULL; | |
state->gai_request.ar_request = &hint; | |
state->gai_request.ar_result = NULL; | |
state->gai_requests = &state->gai_request; | |
state->sigevent.sigev_notify = SIGEV_THREAD; | |
state->sigevent.sigev_value.sival_ptr = state; | |
state->sigevent.sigev_notify_function = GetHostEntryForNameAsyncComplete; | |
state->address = address; | |
state->entry = entry; | |
state->callback = callback; | |
struct GetAddrInfoAsyncState* state = malloc(sizeof(*state)); | |
if (state == NULL) return oops_no_memory; | |
*state = (struct GetAddrInfoAsyncState) { | |
.gai_request = { | |
.ar_name = ..., | |
.ar_service = ..., | |
..., | |
}, | |
.sigevent = { | |
.sigev_notify = ..., | |
..., | |
}, | |
.address = ..., | |
... | |
}; |
// Actually not needed, but otherwise the parameters are unused. | ||
if (address != NULL && entry != NULL && callback != NULL) | ||
{ | ||
return -1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Actually not needed, but otherwise the parameters are unused. | |
if (address != NULL && entry != NULL && callback != NULL) | |
{ | |
return -1; | |
} | |
(void)address; | |
(void)entry; | |
(void)calback; |
{ | ||
struct GetAddrInfoAsyncState* state = (struct GetAddrInfoAsyncState*)context.sival_ptr; | ||
|
||
atomic_thread_fence(memory_order_acquire); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the fence necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This callback comes in on another thread. It ensures visibility to this thread of everything in state
.
getaddrinfo_a
probably does something like this already, so it might be safe to remove, but it isn't documented.
} | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just do this in a static field initializer?
private static readonly bool s_supportsGetAddrInfoAsync = Interop.Sys.PlatformSupportsGetAddrInfoAsync();
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to have it the same as the windows-implementation. But I see (now) that there is a need to start the socket, which isn't needed on linux.
I'll update to this.
Are you still experiencing the crash/hang you mentioned in the PR message? |
Yes, unfortunately. |
Have you tried either running under Valgrind or building with AddressSanitizer to have a better insight? |
With pure native execution the bug does not show up. Only when called via P/Invoke, i.e. from the managed unit test with testhost. Do you know how to incorporate AddressSanitizer with the testhost? |
I've never tested this with testhost, but it looks like if you build a debug build with the |
Thanks. Will try this tomorrow. When I read the GDB dump (see attachment in top post) correct, I have the assumption that P/Invoke for Linux and threads has an issue. But I'm not really able to interpret this correctly, because of lack of knowledge with these internals. Hopefully that's not the case and P/Invoke is correct. |
Valgrind outputs quite some messages -- even for Running -vex amd64->IR: unhandled instruction bytes: 0x48 0xE9 0xB0 0x35 0x42 0xC9 0xE8 0x2B
+vex amd64->IR: unhandled instruction bytes: 0x48 0xE9 0xB0 0x35 0x41 0xC9 0xE8 0x2B
vex amd64->IR: REX=1 REX.W=1 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0
-==xxxx== valgrind: Unrecognised instruction at address 0x3d9bafca.
-==xxxx== at 0x3D9BAFCA: ???
-==xxxx== by 0x3D41F58F: ??? (in /home/gfoidl/dotnet-repos/runtime/.dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.3.20169.1/System.Private.CoreLib.dll)
-==xxxx== by 0x3D443245: ??? (in /home/gfoidl/dotnet-repos/runtime/.dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.3.20169.1/System.Private.CoreLib.dll)
-==xxxx== by 0x3D45BCC2: ??? (in /home/gfoidl/dotnet-repos/runtime/.dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.3.20169.1/System.Private.CoreLib.dll)
+==xxxx== valgrind: Unrecognised instruction at address 0x3d9cafca.
+==xxxx== at 0x3D9CAFCA: ???
+==xxxx== by 0x3D42F58F: ??? (in /home/gfoidl/dotnet-repos/runtime/.dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.3.20169.1/System.Private.CoreLib.dll)
+==xxxx== by 0x3D453245: ??? (in /home/gfoidl/dotnet-repos/runtime/.dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.3.20169.1/System.Private.CoreLib.dll)
+==xxxx== by 0x3D46BCC2: ??? (in /home/gfoidl/dotnet-repos/runtime/.dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.3.20169.1/System.Private.CoreLib.dll)
==xxxx== by 0x6DDD8BE: CallDescrWorkerInternal (in /home/gfoidl/dotnet-repos/runtime/.dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.3.20169.1/libcoreclr.so)
==xxxx== by 0x6D09CCA: MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) (in /home/gfoidl/dotnet-repos/runtime/.dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.3.20169.1/libcoreclr.so)
==xxxx== by 0x6C5675D: CorHost2::_CreateAppDomain(char16_t const*, unsigned int, char16_t const*, char16_t const*, int, char16_t const**, char16_t const**, unsigned int*) (in /home/gfoidl/dotnet-repos/runtime/.dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.3.20169.1/libcoreclr.so) See attached files for raw output. Will investigate further and try to incorparate some sanitizers. |
I have a clue now. The This can be seen in the following trace:
In the callback the address is garbage. A trivial workaround / fix would be duplicating the string in the native side ( |
No (and if there were, the best you could hope for would be "until it's manually free'd by the native code", since the runtime has no idea what the lifetime is if it's not tied to the synchronous method call's lifetime). If you want to manage it yourself, you can just change the P/Invoke to take a |
It looks like One way to deal with this is to make the library optional and make the feature conditional on |
yes, this is part of glibc - so it will not be there in MUSL based distributions @VSadov . I look as I was concerned about extra dependency but I think it is ok to use swhen part of glibc bundle. |
If you go the route of making this conditional on Linux, you would need to add similar probe for We should think about reducing duplication like this going forward, but for the time being it could be acceptable solution. |
Do you have plan how to move forward @gfoidl ? (I'm not pushing but making sure you are not stuck) It seems like this got more complicated beyond base DNS changes ;( |
Completed
Open questions (from review)
What's left?Async name resolution on Unix relies on glibc-functionality ( Here comes the trouble with "single file host builds in installer partition" and for me it's hard to follow what's going on, since this is pretty much down the entrails of .NET 😉 From #34633 (comment): If I read #34633 (comment) correct, so the check for |
@wfurt good news: CI is green 🎉 -- PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM @gfoidl. thanks for the contribution.
I caused one more merge conflict by rushing fix to servicing. I think that should be simple to resolve.
I will merge as soon as the PR is green again (and sorry for the troubles)
@@ -66,29 +64,31 @@ public void GetHostEntry_InvalidHost_LogsError() | |||
|
|||
[ConditionalFact] | |||
[PlatformSpecific(~TestPlatforms.Windows)] // Unreliable on Windows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. Could be use CI stability while back. Perhaps best would be do this separately e.g. put this in and do separate PR to enable this in ci and watch the outcome. If we see failures, we can try to reproduce locally or we can try to collect more info.
[Theory] | ||
[InlineData(false)] | ||
[InlineData(true)] | ||
public async Task GetAddrInfoAsync_ExternalHost(bool justAddresses) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I probably would not at this point. Since all DNS tests depend on external server, we would need to make all of them outer loop IMHO. For now, I would assume that base resolution work and I would leave it as inner loop. was thinking while back about running batch of resolutions to make sure DNS work and perhaps prime caches so test runs are more predictable. I think we can leave is is for now and react if we see CI failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for the cmake changes
iOS.Simulator.Aot.Test Work Item fails due to this test TryGetAddrInfo_EmptyHost. The check for the operating system got taken over from existing tests, which have the comment // On Unix, we are not guaranteed to be able to resove the local host. The ability to do so depends on the
// machine configurations, which varies by distro and is often inconsistent. As more and more different OS come into play, maybe it's safer to flip this check and test for |
@wfurt CI looks green 😃 For the test on windows I'll follow #34633 (comment) and create a separate PR for this, so it can be handled in isolation. |
@gfoidl thanks a lot for your PR!!! Also, appreciate your patience with this one ;) |
This was an interesting journey with some very valuable input (some C hints, and being down at the PAL to see how things work / a place where a usual dotnet-developer won't be), so I'd say: perfect (except that it took that long, but now it's in so this doesn't matter). |
dotnet/runtime#34633 introduced a dependency on this.
dotnet/runtime#34633 introduced a dependency on this.
Fixes #33378
Also
System.Net.NameResolution.Pal.Tests
Edit: this issue is fixed.
Marking this PR as draft, as there is some bug anywhere which I can't find.
To check the native implementation 8c955bc adds some (c)tests -- this will be reverted later -- and they pass without any problem.
The managed implementation is analogous to the windows' one.
The bug causes either a hang or a segfault.
For the latter I can see GDB Output (see core-dump.txt).
Maybe it's a dumb failure on my side, which I don't see atm, or something different.
If the test (added with 4614267) doesn't hang, the native side it will call into the OS with return code
0
. The crash happens anywhere after the OS-call and the invocation of the callback. Unfortunately I'm not able to debug it further (due to lack of knowledge here deep down that stack).