
Serialize embedded resolver Start and Stop #1561

Merged (1 commit) Nov 21, 2016

Conversation


@sanimej commented Nov 18, 2016

If a container comes up and is brought down quickly, it is possible for resolver.Stop() to be called before resolver.Start() completes. In that case r.conn can be nil.

s := &dns.Server{Handler: r, PacketConn: r.conn}

The dns library code only checks whether PacketConn is nil, and PacketConn is an interface type. Because there is no check for the interface's underlying data being nil, this results in a panic.

https://github.com/miekg/dns/blob/3f1f7c8ec9ead89493df11f2c3d8bec353a2c2c0/server.go#L388
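For readers unfamiliar with the typed-nil gotcha being described, here is a minimal self-contained sketch (assuming the resolver stores its UDP socket as a *net.UDPConn, as the surrounding discussion implies):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Stand-in for r.conn after Stop() raced ahead of Start(): the
	// concrete UDP socket pointer is still nil.
	var conn *net.UDPConn

	// Assigning the typed nil pointer to an interface-typed variable
	// makes the interface value itself non-nil.
	var pc net.PacketConn = conn

	fmt.Println(pc == nil) // false: the interface carries a type but nil data

	// A check like the dns library's `if srv.PacketConn != nil` therefore
	// passes, and the first method call on the nil *net.UDPConn panics.
}
```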

related to
docker #28112
docker #28465

Signed-off-by: Santhosh Manohar [email protected]

// valid before Activating the DNS server. This avoids a panic in dns
// library code because of missing nil check for interface's data.
// github.com/docker/docker/issues/28112
if r.conn == nil {
Contributor
This reduces the chances of hitting the bug but does not eliminate it. We need to Lock the resolver before accessing/modifying its members, and we need to pass a reference to r.conn once acquired under the lock.

Also, wouldn't the same issue exist when we construct the tcpServer?
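For illustration, a minimal sketch of the locking approach suggested above; the struct shape and helper name here are simplified stand-ins, not the actual libnetwork resolver code:

```go
package main

import (
	"errors"
	"net"
	"sync"

	"github.com/miekg/dns"
)

// Simplified resolver shape for illustration only.
type resolver struct {
	sync.Mutex
	conn *net.UDPConn
}

// newUDPServer snapshots r.conn while holding the lock and hands that
// snapshot to the dns server, so a concurrent Stop() cannot change the
// field between the nil check and its use.
func (r *resolver) newUDPServer() (*dns.Server, error) {
	r.Lock()
	conn := r.conn
	r.Unlock()

	if conn == nil {
		return nil, errors.New("resolver was stopped before Start completed")
	}
	return &dns.Server{PacketConn: conn}, nil
}

func main() {
	r := &resolver{} // conn is nil, as in the race being discussed
	if _, err := r.newUDPServer(); err != nil {
		println(err.Error())
	}
}
```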

@sanimej (Author) Nov 18, 2016

Yes, it doesn't eliminate the bug completely. But setupIPTable() re-execs and programs the iptables rules in the container's network namespace, which is an expensive operation, so adding the check after setupIPTable() significantly reduces the chance of hitting this bug. I prefer not to add more locks in our code. Instead, I am going to push a PR to miekg/dns to check for the interface data being nil; that will avoid the panic completely.

@sanimej (Author)

wouldn't the same issue exist when we construct the tcpServer?

With this change we will return from Start() if the resolver's UDPConn is nil. It's not possible to have resolver.conn != nil && resolver.tcpListen == nil. But I think there is no harm in checking resolver.tcpListen as well. Will change it.

Contributor

It's not possible to have resolver.conn != nil && resolver.tcpListen == nil

Yes it is: stop() may be called after start() has started the udp listener but before it has started the tcp listener.

Instead, I am going to push a PR to miekg/dns to check for the interface data being nil

I do not think it is the library's responsibility if we nil out the net.PacketConn after initializing its server structure.

I prefer not to add more locks in our code

I am not sure I understand the opposition to locks when we have the same structure being accessed/modified by different threads.

@sanimej (Author)

Yes it is: stop() may be called after start() has started the udp listener but before it has started the tcp listener.

Both the udp and tcp sockets are created in SetupFunc, before Start(). Unless both are created, resolver.err will be non-nil, and that is already checked. So it's not possible to have resolver.conn != nil && resolver.tcpListen == nil.

I do not think it is the library's responsibility if we nil out the net.PacketConn after initializing its server structure.

We are not setting net.PacketConn to nil after creating dns.Server; rather, net.PacketConn is set to r.conn, which happens to be nil. Since net.PacketConn is an interface type, the library should check the interface value before accessing it, i.e. the following is not safe:

https://github.com/miekg/dns/blob/3f1f7c8ec9ead89493df11f2c3d8bec353a2c2c0/server.go#L392
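For illustration, a minimal sketch of the kind of defensive check being argued for here; this is a stand-alone demonstration, not the actual miekg/dns code:

```go
package main

import (
	"fmt"
	"net"
	"reflect"
)

// usablePacketConn treats a non-nil interface that wraps a nil pointer the
// same way as a nil interface, which is the check the library is missing.
func usablePacketConn(pc net.PacketConn) bool {
	if pc == nil {
		return false
	}
	v := reflect.ValueOf(pc)
	return !(v.Kind() == reflect.Ptr && v.IsNil())
}

func main() {
	var conn *net.UDPConn        // typed nil, as in the race above
	var pc net.PacketConn = conn // non-nil interface wrapping nil data

	fmt.Println(pc != nil)            // true: the plain nil check passes
	fmt.Println(usablePacketConn(pc)) // false: the underlying pointer is nil
}
```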

@sanimej (Author)

I am not sure I understand the opposition to locks when we have the same structure being accessed/modified by different threads.

For the reasons mentioned earlier, it's not needed for this issue. But to be safe we can lock around the resolver fields.

@aboch (Contributor) commented Nov 18, 2016

This also fixes #1405

@sanimej force-pushed the intfnil branch 2 times, most recently from afccf9c to c5bb4cd on November 20, 2016 03:01
@sanimej changed the title from "Check for valid socket before activating DNS server" to "Serialize embedded resolver Start and Stop" on Nov 20, 2016
@@ -140,38 +142,67 @@ func (r *resolver) Start() error {
	if r.err != nil {
		return r.err
	}
	// startCh is to serialize resolver Start and Stop
	r.Lock()
	r.startCh = make(chan struct{})
Contributor

Can we initialize the channel in the constructor NewResolver()?
That way I think we can avoid adding the mutex.

@sanimej (Author)

NewResolver() is called once per sandbox, but for a given resolver Start and Stop can happen more than once, i.e. once per container restart. The intent of the change is to serialize a pair of Start and Stop calls; that is the reason for not creating startCh in the constructor.

@aboch (Contributor) Nov 21, 2016

I do not understand the problem in serializing any Start()/Stop() sequence which may span across container restarts.

Given that the base code does not support concurrent Start()/Stop(), the safest approach is in fact to serialize this pair of functions, regardless.

Also, I am not comfortable with the amount of code refactoring in this PR given it targets a panic fix for docker 1.13 while we are in the 1.13 RC process.

In less abstract terms, I am comparing these changes with the following minimal change: https://github.com/aboch/libnetwork_new/commit/e424dce7938d2cbc5ec003354597c3e98a343939

	}
	if r.tcpServer != nil {
		r.tcpServer.Shutdown()

func (r *resolver) waitForStart() {
@aboch (Contributor) Nov 21, 2016

Why do we need this function?
It should be enough to call

r.startCh <- true
defer func() { <-r.startCh }()

at the beginning of Start() and Stop()

@sanimej (Author)

As clarified earlier, there can be more than one Start and Stop call for a resolver. This change serializes a pair of Start and Stop calls. With the suggested change, a subsequent Start can block on a previous Stop, which is to be avoided.

Besides, this is a matter of personal preference rather than technical correctness. To me, having a waitForStart function conveys the intent more clearly and is easier to follow.

@zgfh commented Nov 21, 2016

LGTM

@sanimej (Author) commented Nov 21, 2016

@aboch Changed to the buffered channel approach. PTAL
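For context, a minimal self-contained sketch of the buffered-channel serialization pattern discussed above; the names and details are illustrative, not the exact code in this PR:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Illustrative resolver; only the serialization plumbing is shown.
type resolver struct {
	startCh chan struct{}
}

func newResolver() *resolver {
	// A buffered channel of capacity one acts as a token: whichever of
	// Start/Stop holds the token runs, the other waits for it.
	return &resolver{startCh: make(chan struct{}, 1)}
}

func (r *resolver) Start() {
	r.startCh <- struct{}{}           // take the token
	defer func() { <-r.startCh }()    // release it when done
	fmt.Println("starting resolver")
	time.Sleep(10 * time.Millisecond) // stand-in for socket/iptables setup
}

func (r *resolver) Stop() {
	r.startCh <- struct{}{}
	defer func() { <-r.startCh }()
	fmt.Println("stopping resolver")
}

func main() {
	r := newResolver()
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); r.Start() }()
	go func() { defer wg.Done(); r.Stop() }() // serialized with Start
	wg.Wait()
}
```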

@aboch (Contributor) commented Nov 21, 2016

Thanks @sanimej

LGTM
