This repository has been archived by the owner on May 6, 2020. It is now read-only.

Scaling an app down while a build is running leads to unpredictable results #1224

Open
deis-admin opened this issue Jan 19, 2017 · 8 comments

@deis-admin

From @jeff-lee on November 5, 2015 22:47

I'm running into an issue in v1.12.0 where scaling down an app while a build is running can result in either:

a) The new containers getting shut down and the build hanging
b) Zero running containers

I started a new cluster and scaled the example-go app up to 3.

$ fleetctl list-units|grep jefftest
jefftest_v74.web.1.service  a5ea5dc1.../10.10.17.144    active      running
jefftest_v74.web.2.service  6b548706.../10.10.19.9      active      running
jefftest_v74.web.3.service  6b548706.../10.10.19.9      active      running

I then started a build (v75) and scaled the app down from 3 to 2 once the node started pulling the new containers down.

$ deis ps:scale web=2 -a jefftest
Scaling processes... but first, coffee!
done in 5s
=== jefftest Processes
--- web:
web.1 up (v74)
web.2 up (v74)

At this point, the v75 container gets stopped and the build (with HEALTHCHECK_URL set) hangs.

Thu Nov  5 22:06:45 UTC 2015
cda30e1fda8e        10.10.16.243:5000/jefftest:v75   "/runner/init start    1 seconds ago        Up Less than a second   0.0.0.0:32901->5000/tcp   jefftest_v75.web.1
2598c80e0985        10.10.16.243:5000/jefftest:v74   "/runner/init start    About a minute ago   Up About a minute       0.0.0.0:32900->5000/tcp   jefftest_v74.web.3
9d4614e6fb3f        10.10.16.243:5000/jefftest:v74   "/runner/init start    2 minutes ago        Up 2 minutes            0.0.0.0:32899->5000/tcp   jefftest_v74.web.2
Thu Nov  5 22:06:46 UTC 2015
9d4614e6fb3f        10.10.16.243:5000/jefftest:v74   "/runner/init start    2 minutes ago       Up 2 minutes        0.0.0.0:32899->5000/tcp   jefftest_v74.web.2
Thu Nov  5 22:06:47 UTC 2015
9d4614e6fb3f        10.10.16.243:5000/jefftest:v74   "/runner/init start    2 minutes ago       Up 2 minutes        0.0.0.0:32899->5000/tcp   jefftest_v74.web.2
Thu Nov  5 22:06:48 UTC 2015
9d4614e6fb3f        10.10.16.243:5000/jefftest:v74   "/runner/init start    2 minutes ago       Up 2 minutes        0.0.0.0:32899->5000/tcp   jefftest_v74.web.2
Thu Nov  5 22:06:49 UTC 2015
9d4614e6fb3f        10.10.16.243:5000/jefftest:v74   "/runner/init start    2 minutes ago       Up 2 minutes        0.0.0.0:32899->5000/tcp   jefftest_v74.web.2
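The capture above looks like the output of a one-second polling loop. A minimal sketch of such a loop (a reconstruction, not the reporter's exact command; it assumes a cluster host with Docker and the app named jefftest):

```shell
# Hypothetical watch loop (reconstructed): print a UTC timestamp and the
# app's containers once per second to watch them appear and disappear.
while true; do
  date -u
  docker ps | grep jefftest
  sleep 1
done
```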

I have also seen all of the containers get stopped when scaling from 3 to 2, though so far I have only been able to reproduce that when HEALTHCHECK_URL is not set.

14:42:50 [ds12] - /Users/jefflee
$ deis ps:scale web=2 -a jefftest
Scaling processes... but first, coffee!
done in 6s
=== jefftest Processes

14:44:49 [ds12] - /Users/jefflee
$ deis info -a jefftest
=== jefftest Application
updated:  2015-11-05T22:44:49UTC
uuid:     20949ab0-ffbd-4442-b490-f7b01951976b
created:  2015-11-05T18:43:43UTC
url:      jefftest.ds12.therealreal.com
owner:    jefflee
id:       jefftest

=== jefftest Processes

=== jefftest Domains

Copied from original issue: deis/deis#4719

@deis-admin

From @carmstrong on November 5, 2015 23:37

Though I have only been able to reproduce this when HEALTHCHECK_URL is not set so far.

Have you seen any issues with HEALTHCHECK_URL set? We strongly recommend using this as a best practice for app deploys, since by default we will consider all running containers to be live and healthy.
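For context, a health check in Deis v1 was configured through an app config variable. A sketch of that setup (the /healthz path is just an example; use whatever endpoint your app actually serves):

```shell
# Point Deis at an endpoint that returns HTTP 200 when the app is ready.
# Containers that fail the check are not treated as healthy.
deis config:set HEALTHCHECK_URL=/healthz -a jefftest
```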

@deis-admin deis-admin added the bug label Jan 19, 2017
@deis-admin

From @jeff-lee on November 6, 2015 0:09

I haven't been able to reproduce the container=0 issue yet when HEALTHCHECK_URL is set, but the shutdown of the new containers and the hang of the builder do still happen.

I have also seen 502 Bad Gateway and 404's when it's set if I scale down late enough in the deploy process.

@deis-admin

From @carmstrong on November 6, 2015 0:11

I have also seen 502 Bad Gateway and 404's when it's set if I scale down late enough in the deploy process.

In general, I don't know that we make any guarantees when scaling an app up or down while a deploy is already running: the controller decides how many containers to start or stop based on the current number it sees at that moment.

Is there a use case for this, @jeff-lee, or are you just doing resiliency testing?

@deis-admin

From @jeff-lee on November 6, 2015 0:56

@carmstrong I was doing resiliency testing of the build process and this popped up.

Having said that, our CI is pushing builds to staging and qa throughout the day so I don't think it would be unusual for someone to try to scale an app without knowing that a build might be in progress.

It would be less of an issue in production since that's a more controlled process.

@deis-admin

From @mboersma on November 11, 2015 15:54

I don't think it would be unusual for someone to try to scale an app without knowing that a build might be in progress

Sounds like a common case that Deis should handle gracefully.

@deis-admin

From @bacongobbler on January 22, 2016 0:50

I'm not sure there's an easy way to resolve this reliably. There are a lot of concurrency issues in Deis, and this is one of them. Perhaps at some point we could use something that acts as a single source of truth to tell us when the builder is performing a build, but I don't see an easy solution to this problem that we could tackle for the LTS release.

@deis-admin

From @bacongobbler on January 22, 2016 0:52

see also deis/deis#4746

@Cryptophobia

This issue was moved to teamhephy/controller#34
