Spurious name conflicts #117
I agree that I have seen this from time to time; unfortunately I am not currently sure what causes it. I think in some cases it might be related to the reflector, but if that is not in use I am not sure. How often is this happening? I wonder if we can set up a long-term pcap capture to try and figure out what happens.
Rather frequently, I see it almost every other day. It seems to be associated with a lease expiring on the interface getting its address from DHCP, and probably has to do with the fact that there are both an IPv4 and an IPv6 address configured for the interface...
I don't think the reflector should be on anywhere; it should be disabled by default, shouldn't it?
Preventing avahi-daemon from using the interface that receives its address from a DHCP server makes the issue disappear, but obviously it is not a solution.
Avahi can't handle inter-connected multi-homed systems. We have to use the option to disable one of the interfaces to keep the daemon from seeing multiple name registration requests (one from each network). Best I can tell there isn't a better solution, since this is really an issue with the design of the protocol.
Still I wonder...
I totally had this happen on one of my own systems, with a very similar-looking log to yours. Downed and upped a bunch of interfaces rapidly. There must definitely be a bug there; I'll have to try and figure out if I can make it reproducible. Some kind of race to do with the new interfaces appearing while probing, perhaps... There is a related issue for services that get stuck registering, so maybe the logic for interfaces coming and going needs to be reviewed.
OK, I think I figured it out. What's happening is that an address is withdrawn before it finishes probing, but we receive a copy of our own probe immediately after and thus assume a conflict (our own multicast probes are mirrored back to us by the kernel). A bit of a race condition. This happens a lot with IPv6, where we withdraw the fe80 link-local address once we receive a global address, and it can happen very rapidly on boot. Of note, you are using IPv6 on your site, as am I on mine where I am seeing this. On IPv4, address withdrawals while probing are quite uncommon. So we'll need to identify those in some way, either with a ghost list or by otherwise determining that the probe looped back. I'll look at that.
Confirmed the issue as I suspected: we withdraw our address record but then receive a copy of our own probe and decide it is a conflict:
This happens because we revoke the link-local address from being advertised once we receive a global address. Hope to have a fix for this shortly.
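For illustration, here is a minimal sketch of the "ghost list" idea mentioned above; all names, sizes, and the grace period are hypothetical, not actual Avahi internals. The point is to remember recently withdrawn records for a few seconds and treat an incoming "conflicting" probe that matches one as a loopback of our own packet.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define GHOST_TTL_SECS 5    /* grace period; the value is an assumption */
#define MAX_GHOSTS     32

/* A recently withdrawn record: name plus textual rdata of the address. */
struct ghost_entry {
    char   name[64];        /* e.g. "foo.local" */
    char   rdata[64];       /* e.g. "fe80::1" */
    time_t withdrawn_at;
};

static struct ghost_entry ghosts[MAX_GHOSTS];
static size_t n_ghosts;

/* Called when a record is withdrawn before probing has finished. */
void ghost_add(const char *name, const char *rdata) {
    if (n_ghosts == MAX_GHOSTS)          /* full: drop the oldest entry */
        memmove(&ghosts[0], &ghosts[1], --n_ghosts * sizeof ghosts[0]);
    struct ghost_entry *g = &ghosts[n_ghosts++];
    snprintf(g->name, sizeof g->name, "%s", name);
    snprintf(g->rdata, sizeof g->rdata, "%s", rdata);
    g->withdrawn_at = time(NULL);
}

/* Called from the probe-conflict path: true means the "conflict" is a
 * looped-back copy of our own, just-withdrawn probe and can be ignored. */
bool ghost_matches(const char *name, const char *rdata) {
    time_t now = time(NULL);
    for (size_t i = 0; i < n_ghosts; i++)
        if (now - ghosts[i].withdrawn_at <= GHOST_TTL_SECS &&
            strcmp(ghosts[i].name, name) == 0 &&
            strcmp(ghosts[i].rdata, rdata) == 0)
            return true;
    return false;
}
```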
Would a workaround be to disable IPv6 in the config when you're not using it?
Any updates on this?
Hey @lathiat, sorry for bugging you. I think this just happened to me as well. I have a typical IPv4/IPv6 dual-stack network at home and run avahi-daemon in a Docker container with network_mode host. This is the log:
Make sure you are not hitting a shortcoming in the protocol involving the daemon seeing other daemons through multiple paths: the system announces itself through one interface and gets rejected after announcing itself on subsequent interfaces, because it looks like a duplicate. I always use allow-interfaces/deny-interfaces to force avahi to use only a single interface (in my industry this is typically the management interface). After that I have not had this issue.
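For anyone wanting to apply this workaround, the setting lives in /etc/avahi/avahi-daemon.conf; the interface name here is just an example:

```ini
[server]
# Restrict Avahi to one interface so the daemon cannot see its own
# announcements arrive back over a second path.
allow-interfaces=eth0
```

Restart the daemon afterwards, e.g. with `sudo systemctl restart avahi-daemon`.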
There are two interfaces on my system, although only one is actually up and connected to the network. As far as I can see avahi-daemon only works on the connected interface (enp5s0), but I'll try manually allowing it.
Is anyone working on a fix for this? @lathiat said
exactly a year ago? Any progress? Thanks
allow-interfaces will work around the issue, as it's a bug in handling interfaces rapidly adding and removing addresses (particularly noticeable if you have globally routable IPv6 addresses, as we add and then remove the link-local address). Still planning a fix.
So, this is still happening to me. I've set
I tried the allow-interfaces method as well but it's not working. Is there another workaround for this? How about a fix? @lathiat Any ETA? Many distros are reporting the same bug.
The only workaround for me is a daily restart of avahi-daemon. I'll soon replace this with an automatic restart whenever the daemon logs the error message, but for now this works for me. Not ideal, but eh… it's just for my homelab and nothing critical.
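For reference, the daily-restart stopgap can be a single root crontab entry; the time of day is arbitrary:

```
# Restart avahi-daemon every day at 04:00 as a stopgap for the
# spurious-conflict bug (add via "crontab -e" as root).
0 4 * * * /usr/bin/systemctl restart avahi-daemon.service
```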
Seems like a working workaround is
Could you please test my attempt to log conflicts better in PR #554? That should help find what exactly triggers the error. Is some reflection service responsible for the conflict? I am sorry, but I am quite confident that conflict resolution is needed and is the right thing to do. What is not the right thing to do is conflicting with our own announcements. But we need to understand the primary cause of the issue before correcting it properly. That is not yet clear, at least to me.
This has been a "Heisenbug" of the highest order; that's the simple reason why. There have also been very misguided attempts to make reproduction LESS frequent, which is the exact opposite of how you should deal with Heisenbugs.
What I find much more amazing than those 6-7 years is this:
After 6-7 years, why is it still required to patch and rebuild from source just to GET BETTER LOGS? This should really be as easy as editing a verbosity level in a configuration file. Is this software still maintained by anyone? Sure, verbosity levels often affect the reproduction rate of Heisenbugs. But that's not a reason not to at least TRY. Also, the influence (if any) of the verbosity level on the reproduction rate is better data than no data at all.
+1
I've been experiencing this race condition on most of my single-interface Ubuntu and Debian machines and VMs on my home network for so many years now I cannot even remember when it started. I recently had it happen on a host immediately after booting. Very frustrating to wake-on-lan a host, see it light up, yet find it doesn't appear to be on the network by name. Typically I see it happen after a host has been up for days, during some routinely triggered network interface event that causes Avahi to withdraw all interface records and immediately re-register them. Exactly the thing that'll eventually trigger the race condition and wind up falsely treating its own advertisement as a conflict. I don't know what causes these frequent withdrawals and re-registrations (v4 and v6 addresses and network interface names remain the same); the default logs emitted by the system daemons that might do it are insufficient to tell. It is a really rotten experience and entirely sours the supposed utility of mDNS for small "just a collection of hosts" LANs.
I'm coming back here because I found the root cause of my problem. In my case it was a Docker container running an mDNS responder inside it that conflicted with the avahi mDNS on my host. This caused the conflicts; however, I still think there should be a toggle in avahi to ignore collisions. I hope this helps some people, as I saw a few other Docker users here. Also, a good warning is to never use docker's
So I ran into this for a few months and could have sworn something was amiss; in case anyone else runs into this: just look for things listening on port 5353. In my case it was KDE Connect (along with having Docker interfaces) that caused this almost every time after a user was logged in. To check, use:
And it should just be avahi.
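One standard way to list everything bound to the mDNS port (a generic iproute2 command offered as a suggestion here, not necessarily the one the commenter used):

```sh
# List UDP sockets on port 5353 with the owning process (root needed for -p)
sudo ss -lunp | grep 5353
```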
Conflict check can be disabled by adding a
Disabling conflict checks completely makes Avahi non-compliant with the mDNS RFC. Please do not advise that to anyone. It must conflict if there is a different device using the same name, and one of them must choose a different name. This problem happens because Avahi fails to recognize that it conflicts only with itself and that there is no other device with the same name on any network it is connected to.
Anyway, in my use case I know there won't be any different device using the same name, and this fixed my issue perfectly. It just works!
Do we have any proof this can happen without another device doing mDNS reflection? That is, there must be another device sending my machine's announced record from a different interface (and source address) than the one I am sending it on? Probably with a different link address too, since we should recognize our own link address announcing when connected to the same network through a bridge/switch. The best candidate for a solvable description seems to be #117 (comment). On the other hand, #117 (comment) suggests the issue is in forgetting our own addresses too fast, when some device creates a duplicate that arrives a bit later. A solution for that might be storing removed local addresses, marked with a special flag, for at least the same time we wait for non-responding (non-existent) names. If we receive a query for one shortly after it was removed, just consider it still our own and ignore the conflict with it.
This was only sharing a well-tested, one-line change; there was no pull request submitted, and in fact not even a plain diff was shared. This was a great data point; please keep stuff like this coming. It's not like Linux distributions are going to excavate random experiments buried very deep down this bug and ship them tomorrow. Also, this bug has been open for 7 years now, so at this stage I think everyone should be free to share pretty much whatever they like!
Interesting question, @happyme531 can you answer?
The main problem we have with this is unclear reproduction steps. Yes, we have seen it happen sometimes. But for creating and testing a fix, we need to identify the exact cases in which it happens and rule out wrong network configurations. Ideally logs combined with an mDNS traffic recording on all used interfaces. Someone who sees this issue often could especially help with that.
Never, ever run binaries provided by random people on the Internet. No offence @happyme531, but no one has any idea who you are. Nor from me, from @pemensik or from anyone else. (sorry I missed this earlier)
I have no trouble reproducing this on an isolated VLAN containing three devices:
None of the systems have mDNS reflection enabled.
I used SNMP to virtually unplug and replug the test device from the ethernet switch (disable port, wait 2 seconds, enable port, wait 10 seconds, repeat in an infinite loop; a sketch of this loop follows after this comment). Within a few minutes the following happened (IPv6 addresses partially censored):
Also, I just checked my laptop's avahi hostname, it's
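A rough sketch of the port-flap loop described above; the switch address, community string, and ifIndex are all placeholders that would need adapting:

```sh
#!/bin/sh
# Flap a switch port via IF-MIB::ifAdminStatus (1 = up, 2 = down).
SWITCH=192.0.2.1 COMMUNITY=private IFINDEX=4
while true; do
    snmpset -v2c -c "$COMMUNITY" "$SWITCH" "IF-MIB::ifAdminStatus.$IFINDEX" i 2
    sleep 2
    snmpset -v2c -c "$COMMUNITY" "$SWITCH" "IF-MIB::ifAdminStatus.$IFINDEX" i 1
    sleep 10
done
```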
One of the reasons you see those issues might be the use of systemd-networkd. If I understand it correctly, it swaps the used addresses after a lost link, from a normal DHCP lease to an IPv4 link-local address. That is not common behaviour with NetworkManager, at least not in the default configuration. I hoped you could record this interaction with wireshark or tcpdump, so we would see the timing of the packets and their source addresses too. A solution might be what @evverx mentioned at #554 (comment): not giving up immediately, but retrying one second later. If there is a conflict indeed, it will get the same result again. If not, then it might have been a stray repetition of our own packet, delayed by whatever network elements. Remembering our own recently withdrawn addresses should help too. Thank you for those steps. I guess we would need something similar done between VMs or containers, so it can be tested in an automated way and not depend on a specific network device.
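A compilable sketch of that retry idea, with purely hypothetical names (this is not Avahi's real state machine): treat the first apparent conflict during probing as possibly our own delayed packet, and re-probe once before renaming.

```c
#include <stdbool.h>
#include <stdio.h>

struct probe_entry {
    const char *name;       /* e.g. "foo.local" */
    bool        retried;    /* have we already re-probed once? */
};

/* Stub: in the real daemon this would arm a timer event. */
static void schedule_probe(struct probe_entry *e, int delay_ms) {
    printf("re-probing %s in %d ms\n", e->name, delay_ms);
}

/* Stub: conflict confirmed, pick foo-2.local and re-announce. */
static void rename_and_reannounce(struct probe_entry *e) {
    printf("confirmed conflict for %s, renaming\n", e->name);
}

/* On the first apparent conflict, retry after one second: a looped-back
 * or delayed copy of our own probe will not repeat, so the retry wins.
 * Only a second conflict is treated as a real duplicate host. */
void on_probe_conflict(struct probe_entry *e) {
    if (!e->retried) {
        e->retried = true;
        schedule_probe(e, 1000);
        return;
    }
    rename_and_reannounce(e);
}
```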
Though we need to check whether the recent development version behaves this way. I am not sure why you are testing the avahi-daemon 0.7 version; an ideal reproducer should run on a snapshot version from master. There might be related issues fixed since the 0.7 release which make it less reproducible. But yes, using
Just to add to the distributions' bug lists: we had one also on Fedora. https://bugzilla.redhat.com/show_bug.cgi?id=1657325 I think we need to log the source address of the conflict, like I started at #554. It would then be clear from the logs alone that it was our own address. The existing log does not help much.
I think most of the conflicts mentioned here come from #554 (comment)
so that part can somewhat mitigate the issue. Ideally #554 (comment) should be implemented too, to fix it once and for all. (I have a test suite that can trigger all those conflicts, and unfortunately none of the patches suggested here and in some other PRs fully fix them.)
I agree that it can be improved.
In light of https://www.bleepingcomputer.com/news/security/github-comments-abused-to-push-password-stealing-malware-masked-as-fixes/ and things like that, I don't think it's even safe to quote comments with links, because GitHub algorithms can decide that they push malware (or whatever) too, and their authors can get blocked just in case.
That is incorrect; all IPv4 addresses are removed when the link is lost. Link-local IPv4 is disabled by default in systemd-networkd, although it is enabled on the test device. The older version of systemd-networkd running on the test device will unconditionally acquire a link-local IPv4 address, independent of DHCP. We saw avahi's self-conflicts prior to enabling link-local IPv4 on these devices too, so I don't think it's related. Link-local IPv4 is disabled on my laptop, which also has this problem.
I'll see if I can do that quickly; I don't really have a ton of time to spend on this right now.
Because that's what's running on this device (an embedded system), which is part of a test setup for which I already have scripting tools that let me "unplug" its ethernet link programmatically. Note that my laptop runs avahi-daemon 0.8-10 (the latest package from Debian bookworm) and also experiences the problem; I've checked the logs and found that at least one occurrence was on a network which definitely doesn't have an mDNS reflector. Until the root cause is found, changes that make it less reproducible just make it harder to debug.
The issue is that avahi doesn't follow the RFC. #554 (comment) (where "MUST" is violated) and #554 (comment) (where stale probes aren't handled) lead to spurious conflicts.
Here's another capture (still avahi-daemon 0.7) with link-local IPv4 and All mDNS packets logged are coming from the local avahi-daemon; there are no incoming mDNS packets within this timeframe. IPv6 addresses have been abbreviated to
avahi itself receives those packets (including probes as well), and when they match they are marked
@mvduin could you run Received conflicting probe [C.local IN A 192.168.141.46 ; ttl=120]. should appear. It should make it easier to figure out what happens in places like
Yeah I know, I just meant that there were no mDNS packets being received from other hosts or stacks; hence the source of the conflict is necessarily Avahi's own probes. (I also explicitly mentioned it since I was abbreviating all IPv6 addresses in a way that would be ambiguous if more than one host were present in the logs.) Here's with
To me your remark raises the question: in which way could a probe that avahi has sent and received back differ from all the records in the list it compares against in the incoming_probe() function?
(I have been trying to collect data points to compare against this theory by logging more detail from inside incoming_probe(), but the "spurious name conflict" I had for weeks has just vanished for now)
On the systems where I have the problem, there is a way to reproducibly switch the issue (avahi appending '-2' to the host name because of a conflict) on and off: just have librespot running or not. librespot integrated its own mDNS responder in commit b25585a41b7a3cf35776e20345e5718c3abf16b7 back in 2016, which could be the source of the problem. Maybe this is the root cause for many of the people affected, rather than possible shortcomings in the avahi code.
That just sounds like librespot publishing conflicting records, and the resulting rename is just Avahi behaving correctly. This thread is specifically about Avahi spuriously conflicting with itself.
It 100% is; you don't want multiple mDNS stacks running on the same machine. The correct way to publish an mDNS service is by using the system mDNS stack, i.e. Avahi on Linux. Using a custom mDNS stack should be avoided unless no system mDNS stack is available; if some software insists on using one anyway, then it must use a randomly generated hostname for publishing records, not the system hostname.
I'm not sure what librespot does, but if it does what the RFC says, it advertises its link-local and global addresses at the same time, and that should be handled by avahi properly with no conflicts. It doesn't work like that currently, though (#243). If that part of the RFC were implemented, the spurious conflicts (including the ones where avahi conflicts with its own link-local addresses) would be much less likely to pop up as well.
Hi, hope this is the right place for reporting issues with avahi-daemon, as the README on my system points to the bug tracker on freedesktop.org, which does not list avahi as a bug report target.
I am experiencing spurious name conflicts on various systems, all of which share a common trait: two interfaces, one on the local LAN with a static IP address, and the other getting a DHCP address from somewhere (typically an ADSL router).
What happens is the following. Suppose that the host is called "foo". Initially, it is correctly advertised as foo.local. After some time the name conflict occurs and the host starts being advertised as foo-2.local, foo-3.local, etc., even though it is certainly the sole host named foo on the network. In practice there is a spurious name conflict of the host with itself, probably due to some race in avahi. The unfortunate result is that no other system can find "foo" on the network anymore, since they look for foo.local.
I see the issue on a couple of Debian jessie systems (avahi version 0.6.31); on a Raspbian jessie system (same); and on an OpenWrt Chaos Calmer system (avahi version 0.6.31 again).
I see a lot of reports of this same issue (or possibly something similar) on many distro bug trackers, application bug trackers and question sites:
I wonder if there is something misconfigured on my systems (in which case some hint at diagnosing it would be appreciated) or if this is an issue (possibly a race) with the avahi daemon.
Even if this cannot be fixed rapidly, I'd like to suggest an interim point release of avahi with an option to disable the name conflict analysis, for administrators who are absolutely sure it won't be needed on their network.