Persistent data on nodes #150

Closed
F21 opened this issue Sep 29, 2015 · 117 comments

Comments

@F21

F21 commented Sep 29, 2015

Nomad should have some way for tasks to acquire persistent storage on nodes. In a lot of cases, we might want to run our own hdfs or ceph cluster on nomad.

That means things like hdfs' datanodes need to be able to reserve persistent storage on the node they are launched on. If the whole cluster goes down, once it's brought back up, the appropriate tasks should be launched on their original nodes (where possible), so that they can regain access to data they have previously written.

@zrml

zrml commented Sep 30, 2015

+1
we also want to mount a specific FS of a shared storage volume...
we need an API/flag to be able to specify this affinity

@F21
Author

F21 commented Oct 12, 2015

It would be awesome if direct attached storage could be implemented as a preview of some sort.

One of the things that came to my mind is the idea of updating containers while keeping the storage:

For example, let's say we have a MySQL 5.6.16 container running and it was allocated 20GB of storage on a client to store its data. If there's a new version of the MySQL container (5.6.17), we want to be able to swap out the container but still keep the storage (persistent data) and have it mounted into the updated container. This way, we can upgrade the infrastructure containers without data loss or having a complicated upgrade process that requires backing up the data, upgrading then restoring it.

@zrml

zrml commented Oct 13, 2015

@F21 good use case too.

@cbednarski
Contributor

Just to add some details to this, we need two overarching features: one is the notion of persistent storage that is either mounted into or otherwise persisted on the node, and is accounted for in terms of total disk space available for allocations.

The second is global tracking of where state is located in the cluster so jobs can be rescheduled with hard (mysql) / soft (riak) affinity for nodes that already have their data, and possibly a mechanism to reserve the resources even if the job fails or is not running.

Since these features are quite large we would implement them gradually. For example, floating persistence (a la EBS), node-based persistence, global storage state tracking, soft affinity, hard affinity, and offline reservations could be independent milestones. I can't yet speak to when / if we will implement these features.

@zrml

zrml commented Oct 15, 2015

@cbednarski sounds like you guys are going in the right direction! Cool!

@melo

melo commented Oct 18, 2015

Hi @cbednarski, thanks for the explanation of your goals.

This is something that we would also like to see. Other services that we would like to manage that require data storage are Redis Sentinel, Consul, and NSQ.

Currently we dedicate nodes to these tasks, so affinity is something that we manage manually. I understand that other use cases might need/want some more magical targeting, but it would be interesting to see some way of manually deciding this before deciding on how this fits into the overall model.

My point is that if nomad provides some way of manually assigning data volumes to containers, and leaves the logic of making sure the containers only start on the correct hosts to manual configuration, then we could start to get a feeling for how it all works and, with that experience, design better models afterwards.

I found this in the code:

func (d *DockerDriver) containerBinds(alloc *allocdir.AllocDir, task *structs.Task) ([]string, error) {
I'm guessing SharedDir is a step in that direction, right?

Thank you,

@diptanu
Contributor

diptanu commented Oct 18, 2015

+1 to @cbednarski's thoughts.

We might also need to think about the identity of data volumes. Data volumes are slightly different from other compute resources like cpu/memory/disk, which are scalar in nature: those can be loaned to any process that needs them, whereas a volume is usually used by the same type of process that created it in the first place. Users might therefore need to refer to a volume by name when specifying a task definition.

For example:

resources {
  ....
  volumes = [
    {
      name = "db_personalization_001",
      size = 100000,
    },
  ]
}

In this example we are asking Nomad to place the task on a machine where the volume named db_personalization_001 exists. If it doesn't exist, Nomad can create the volume on a machine that can provide 100GB of disk space and that matches the other constraints the user has specified. If a volume with that identity wasn't already present in the cluster when the new volume is created, we would also need to persist the identity of the volume in a manner that can be restored during a disaster recovery operation.

@F21
Author

F21 commented Nov 18, 2015

Maybe it's possible to lean on https://github.com/emccode/rexray for non-local storage such as Amazon EBS. It doesn't manage persistence on local disks though, so that portion would still need to be implemented.

@gourao

gourao commented Nov 19, 2015

I am one of the maintainers of https://github.com/libopenstorage/openstorage. The goal of this project is to provide persistent cluster aware storage to Linux containers (Docker in particular). It supports both data volumes as well as the Graph driver interface. So your images and data are persisted in a multi node scheduler aware manner. I hope this project can help what Nomad would like to achieve.

The open storage daemon (OSD) itself runs on every node as a Docker container. The daemons discover other OSD nodes in the cluster via a KV DB. A container run via the Docker remote API can leverage volumes and graph support from OSD. OSD in turn can support multiple persistent backends. Ideally this would work for Nomad without doing much.

The specs are also available at openstorage.org

@F21
Author

F21 commented Nov 19, 2015

@gourao That sounds really exciting! Are there any plans to support things beyond docker: qemu, rkt, raw exec etc?

@gourao

gourao commented Nov 19, 2015

Yes @F21, that's the plan. There are a few folks looking at rkt support, and as the OCI spec becomes more concrete, this will hopefully be a solved problem.

@erSitzt

erSitzt commented Dec 7, 2015

+1
Access to persistent storage mounted or available directly on the node would be great.

While testing nomad with simple containers, I did not realize that the job syntax has no option for bind mounts, which I used when dealing with docker directly. :(

I like @diptanu's proposal.
But wouldn't it be easier to just let users specify volumes to mount into a container, the way we do with docker directly? Nomad could check the existence of the path and the free space for that mountpoint.

As @melo mentioned nomad is already doing something like this

func (d *DockerDriver) containerBinds(alloc *allocdir.AllocDir, task *structs.Task) ([]string, error) {

Most other tools for managing docker containers allow users to specify volumes on container creation (Marathon and Shipyard, for example... Kubernetes too, I think?).

I'm a novice in Go, so I have not tried anything myself yet. :)

@ketzacoatl
Contributor

Both #62 and #630 are tracking this simpler use case of mounting a path from the host as a volume mount for the docker container.

@bscott

bscott commented Dec 28, 2015

+1

Any timeframe on this? Volumes not being supported in Nomad is a huge deal breaker for us.

@jefflaplante

+1
I agree with Brian on this.

@ketzacoatl
Contributor

Yea, for my initial use it is acceptable to enable raw_exec and work around this issue, but that is only because this is not yet truly production use. I too could not put nomad in production without the most basic docker volume mount to the host being supported by the docker driver.

@adrianlop
Contributor

+1 need docker volumes too for production.

@wyattanderson

I'm interested in this not just for Docker, but also for qemu and an in-house Xen implementation. That is, it would be nice if the solution was generic enough to be useful for all task drivers.

@dkerwin

dkerwin commented Jan 7, 2016

+1 no way to use in production without docker volumes

@calvn

calvn commented Jan 8, 2016

👍

@supernomad

So I have been running into this issue myself, as it's a pretty fundamental idea to use volumes in conjunction with docker.

I understand there is a much larger architecture and design discussion to have around how to manage storage using Nomad in general. However when I was thinking about the issue, I came to the idea of specifying arbitrary commands to pass on down to docker.

Something like this:

            config {
                image = "registry.your.domain/awesome_image:latest"
                command = "/bin/bash"
                args = ["-c", "/usr/bin/start_awesome_image.sh"]
                docker_args = ["-v", "/host/path:/container/path", "--volume-driver=vDriver"]
            }

This would be entirely un-monitored via Nomad, and placing the container so that its volumes worked would be up to the end user, i.e. they would specify the necessary constraints on the job.

No idea if this is even possible, but figured I would voice the idea at the very least.

@jhartman86

👍 for --volumes flag. another great use case: running a cadvisor container as a system service on all nodes that can pipe stats to oh, say, influxdb. In this sense, it has less to do w/ persistent storage than providing volume mounts to the container to monitor the underlying host. Per the cadvisor docs on getting it running:

docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:rw \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --publish=8080:8080 \
  --detach=true \
  --name=cadvisor \
  google/cadvisor:latest

@let4be

let4be commented Feb 5, 2016

Any way I could use
https://docs.docker.com/engine/extend/plugins_volume/
and
https://github.com/ClusterHQ/flocker
with docker and nomad today?

this seems like an ultimate solution for my needs

@let4be

let4be commented Feb 5, 2016

or probably I should use something simpler, like https://github.com/leg100/docker-ebs-attach
hm...

@dadgar
Contributor

dadgar commented Feb 5, 2016

@let4be: Not currently. Nomad has no support for persistent volumes yet.

@faddat

faddat commented Aug 22, 2016

@dvusboy @far-blue @a86c6f7964

THE SOLUTION WAS SO BLINDINGLY OBVIOUS! (yet I couldn't see it)
:).

Thanks!

@dadgar
The Hashicorp suite of tools is fan-freaking-tastic: You guys just keep doing what ya do :).

@cetex

cetex commented Aug 28, 2016

@dadgar The most important and very simple feature i'd like to see is that we can do a simple bind-mount into the containers (docker's -v option, something similar for rkt and whatever else there is)
This makes it so that we can run stuff in containers and keep control of the data, and actually dare to run more important stateful services like databases inside containers. Since all data is stored outside of the container environment, there's much less risk of data loss because of screwups in the container service. (Docker in our env has had its fair share of those.)

Other features like integrations with dockers "storage" containers won't get near our persistent data since those introduce quite a bit of complexity (and dependencies on the container service to make sure data is migrated whenever we update the container service, be it docker, rkt or anything similar)

We run services like zookeeper, kafka, mesos, cassandra, haproxy, docker-registry, nginx and similar inside the containers we manage with our service, but we'd like to manage those services/containers fully through nomad instead, which means "system" jobs for most deploys.
Mesos is then used to manage our api/web and similar services, at least for now. To do this the mesos-slave container needs to mount the docker socket and a couple of other paths from the host os into the container as well. Support for things like this is a requirement and this works quite well with our docker setup today.

Since services have quite varying requirements we define roles for hosts with different specs, the requirements of services on the infrastructure-level vary so much that it's not really useful to try and launch stuff fully dynamically on random nodes.
It's for example not the right thing to do to run something cpu-intensive like a compute-task on a node specced to run kafka, (not much cpu or mem but lots of not-so-fast disk), and it's not really the best option to allocate all storage on a compute-node (small amount of not-so-fast-disk) for a cassandra node that won't make use of all cpu but will choke on disk throughput and allocate all available storage making the node and most of the node's cpu unusable for other services.

In our case we don't need or even want any magic for finding or managing storage, we want to tell nomad what servers to run which task on (through system tasks in this case.) All nodes with role/class "cassandra" run cassandra container/service and all those nodes have decently specced storage that we guarantee will be available at the same place on the host. This is also a requirement to be able to monitor diskspace and disk utilization for each class/role of service properly.

Regarding security:
Each cluster in our environment only has one "customer", us. There are no requirements or needs to limit access to the host os from inside these containers for us; containers are just a compatibility layer in our case. (mesos requires java version X, a service we run requires version Y, aurora's thermos-executor doesn't work with anything else than python 2.7 while we require python 3.5 for some services, the host runs ubuntu 14.04 while we want 16.04 to be able to compile some libraries properly).
If some person has access to deploy through nomad, they most likely also have access to become root on the hosts.
The paths that are allowed to be mounted into containers can in our case be limited by the nomad client through a whitelist or similar (can, but doesn't have to), but it's important that we can setup a relatively relaxed whitelist like '/data/*' and not have to explicitly specify every allowed path since this is subject to change and would slow down management / development and similar if it's too strict.
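
The role-based placement described in this comment can be approximated with a system job constrained to a node class. The sketch below is illustrative only: the job name, datacenter, image tag and the "cassandra" class value are assumptions, and it presumes each storage node sets node_class = "cassandra" in its client configuration. The bind mount itself is left out, since at the time of this comment the docker driver had no option for it.

```hcl
# Hypothetical job: run one Cassandra task on every node of class "cassandra".
job "cassandra" {
  datacenters = ["dc1"]   # assumed datacenter name
  type        = "system"  # system jobs are placed on every eligible node

  # Only nodes whose client config sets node_class = "cassandra" are eligible.
  constraint {
    attribute = "${node.class}"
    value     = "cassandra"
  }

  group "cassandra" {
    task "cassandra" {
      driver = "docker"

      config {
        image = "cassandra:3" # assumed image tag
      }
    }
  }
}
```

With this shape, placement is decided entirely by the operator's node classes rather than by any storage-aware scheduling in Nomad.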

@csawyerYumaed
Contributor

I've made a generic workaround to handle docker volumes using the raw_exec driver, available on github here: https://github.com/csawyerYumaed/nomad-docker
It handles cleaning up after itself (stopping container/etc). It's not perfect, but it seems to do the trick for now.

@carlanton

If you want to use Docker bind mounts in Nomad but still want to use the docker driver, you should totally check out this new as-good-as-ready-for-production tool I just made: https://github.com/carlanton/nomad-docker-wrapper
It wraps the Docker socket with a new socket that allows you to specify bind mounts as environment variables. Still hacky, but just a bit less hacky than using raw_exec :)

@diptanu
Contributor

diptanu commented Sep 22, 2016

We are going to start working on volume plugins in the next Nomad release. In the interim (in the upcoming 0.5 release), we will enable users to pass the volume configuration option in the docker driver configuration.

Also, operators will have to explicitly opt into allowing users to pass the volume/volume driver related configuration option in their jobs by enabling it in Nomad client config.

Users should keep in mind that Nomad won't be responsible for cleaning up things behind the scenes with respect to network-based file systems until support for Nomad's own volume plugins comes out.
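
A minimal sketch of what that operator opt-in might look like, assuming the 0.5-era client `options` map and its docker.volumes.enabled key. The default value and the location of this setting have differed between releases (later versions moved it into the docker plugin configuration), so treat this as an illustration and check the documentation for the release in use.

```hcl
# Nomad client (agent) configuration, not a job file.
client {
  enabled = true

  options = {
    # Allow jobs on this node to bind-mount host paths and to use
    # Docker volume drivers through the docker task driver.
    "docker.volumes.enabled" = "true"
  }
}
```

Because the switch lives in the client configuration, it is the node operator, not the job author, who decides whether host paths can be mounted at all.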

@far-blue

That's great news and a sensible intermediate step

@tlvenn

tlvenn commented Sep 29, 2016

@diptanu is there any chance to bring that to rkt as well?

@w-p

w-p commented Oct 26, 2016

Is there a schedule attached to that release by chance? Hard to sell people on Nomad without mounts.

@diptanu
Contributor

diptanu commented Oct 26, 2016

@w-p We are trying to do the RC release this week, and main release next week.

@w-p

w-p commented Nov 2, 2016

Thanks for getting the RC out.

@ekarlso

ekarlso commented Nov 11, 2016

Is there any support now for doing stuff like MySQL with persistent data volumes?

@donovanmuller

@ekarlso It looks like 0.5 (currently at 0.5.0-rc2) supports both Docker (driver/docker: Support Docker volumes [GH-1767]) and rkt volumes.

@kaskavalci
Contributor

@diptanu, do you have any milestone or ETA for volume drivers?

@far-blue

I believe they are now supported. You can pass, in the docker config section of the job spec, an array of strings with the same format you would use in the docker run -v command.

@erickhan

erickhan commented Mar 2, 2017

If the crux of this issue is Docker volume driver, I think you guys addressed it with the recent PR*.

If it's about extending the resource model, I'd suggest that'll take quite some time and maybe become its own design. I defer to others, as I'm just learning about Nomad myself. Thanks!

*#2351

@c4milo
Contributor

c4milo commented May 22, 2017

@dadgar is this one on track for 0.6.0?

@dadgar
Contributor

dadgar commented May 24, 2017

@c4milo No, this isn't being tackled in 0.6.0

@maticmeznar

Since Nomad 0.7.0, what is the recommended best practice for running a database Docker container that requires a persistent data volume? ephemeral_disk does not offer any guarantee and only works if the database is clustered. Should constraint be used to lock the job to a specific node and then use volumes Docker driver option?
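
One way to read the approach asked about here, pinning the job to a single node and bind-mounting a host directory, is sketched below. The hostname db-node-1, the host path /srv/db-data, the container path and the image name are all hypothetical placeholders. This is an illustration of the idea, not a recommended practice: if that node is lost, so is access to the data.

```hcl
job "database" {
  datacenters = ["dc1"] # assumed datacenter name

  group "db" {
    # Hard-pin the group to the one node that holds the data directory.
    # "db-node-1" is a hypothetical hostname.
    constraint {
      attribute = "${attr.unique.hostname}"
      value     = "db-node-1"
    }

    task "db" {
      driver = "docker"

      config {
        image = "some-database-image" # placeholder image name

        volumes = [
          # host path : container path (both are placeholders)
          "/srv/db-data:/var/lib/db-data",
        ]
      }
    }
  }
}
```

Absolute host paths like this only work if the client allows Docker volumes, and Nomad does not track or clean up the directory.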

@alexey-guryanov

@maticmeznar, I cannot speak for "recommended best practice", and there is more than one way to achieve it, but I can share an approach that we are using at the moment.
When we want persistent storage for anything running in a Docker container managed by Nomad, we want this storage to be redundant on its own (regardless of the content we put there) and available on all Nomad nodes, so a particular Docker container can be rescheduled to another node in the Nomad cluster and still access the same data.

That can be achieved in more than one way. For instance, there are REX-Ray and similar solutions, which look attractive for using cloud provider storage (like AWS S3, Google Cloud Storage, etc.), but we haven't tried them.

What we are using at the moment is a separate distributed replicated storage cluster (we use GlusterFS at the moment, there are alternatives), mounting GlusterFS volume(s) on each node in Nomad cluster, and mapping an appropriate folder from mounted volume into Docker container.
For instance:

  • mount some GlusterFS volume as /shared_data on all nodes in Nomad cluster
  • create a folder in there for a particular application, say /shared_data/some_app_postgresql
  • define a volume in Nomad job specification:
job "some_app" {
    group "some_app_db" {
        task "some_app_db" {
            driver = "docker"
            config {
                image = "some-postgresql-image"
                volumes = [
                    "/shared_data/some_app_postgresql:/var/lib/postgresql/data/pgdata"
                ]
            }
        }
    }
}

Again, there are multiple ways to go about data persistence with managed Docker containers; I hope our perspective may be helpful to somebody.

@moritzheiber
Contributor

The absence of a proper solution to volume management with Nomad is literally the only reason I cannot recommend it to our clients and/or use it instead of Kubernetes. Its Vault and Consul integration, ease of use, minimal installation overhead and workload support are intriguing, but none of it matters because it cannot be trusted with persistent data 😞

I wish this was higher up the product backlog.

@ketzacoatl
Contributor

ketzacoatl commented Feb 5, 2018 via email

@jsilberm
Contributor

jsilberm commented Feb 5, 2018

If you look at this thread's origin --- "Nomad should have some way for tasks to acquire persistent storage on nodes." --- it doesn't say that Nomad itself should procure/acquire the persistent storage, only that the task should have a way.

One way is through "container storage on demand". Assuming use of the Nomad 'docker' driver, if the volume-driver plugin can present relevant meta-data at run-time, then it's possible for the storage to be provisioned on-demand when the task starts.

Here's what this might look like:

task "my-app" {
  driver = "docker"

  config {
    image = "myapp/my-image:latest"
    volumes = [
      "name=myvol,size=10,repl=3:/mnt/myapp",
    ]
    volume_driver = "pxd"
  }
}

In this case, a 10GB volume named "myvol" gets created, with synchronous replication on 3 nodes and is mapped into the container at "/mnt/myapp". The task acquires the persistent storage.

This capability is available today through the Portworx volume-driver plugin, as documented here: https://docs.portworx.com/scheduler/nomad/install.html

(*) disclaimer: I work at Portworx.

@iwvelando

iwvelando commented Apr 5, 2018

Hello, I've seen a lot of discussion about persistent storage with Docker containers which I've been using effectively. However I'm also keenly interested in persistent storage for qemu VMs scheduled through nomad. I may have overlooked something but I don't see this as an option.

Is there any expectation of adding this? Or is there any path with existing configuration to achieving some form of persistent storage?

@endocrimes
Contributor

👋 Hey Folks,

We're currently planning on implementing support for persistent storage across various task drivers via support for Host Volume Mounts (#5377), and the Container Storage Interface (#5378).

Please follow along with the respective issues for updates as they're available 😄.
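
For readers following the pointer to host volumes: in later Nomad releases the feature took roughly the shape sketched below, with the operator declaring a named volume in the client configuration and the job claiming it by name. The volume name mysql, the host path and the image are placeholders, and the field names should be checked against the documentation for the release in use.

```hcl
# --- Client (agent) configuration on the node that owns the data ---
client {
  host_volume "mysql" {
    path      = "/opt/mysql/data" # placeholder host path
    read_only = false
  }
}

# --- Job file (a separate file from the client configuration) ---
job "mysql-server" {
  datacenters = ["dc1"] # assumed datacenter name

  group "mysql-server" {
    # Claim the host volume by name; the scheduler only places the
    # group on nodes that advertise a host volume called "mysql".
    volume "mysql" {
      type      = "host"
      source    = "mysql"
      read_only = false
    }

    task "mysql-server" {
      driver = "docker"

      # Mount the claimed volume into the task's filesystem.
      volume_mount {
        volume      = "mysql"
        destination = "/var/lib/mysql"
        read_only   = false
      }

      config {
        image = "mysql:5.6" # placeholder image tag
      }
    }
  }
}
```

Since the mount is declared on the task rather than inside the docker config block, the same pattern is not tied to a single task driver.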

@akamac

akamac commented Mar 26, 2019

> @far-blue @a86c6f7964: I too am using raw_exec + docker-compose as a workaround. The trouble with that is clean-up when one kills a job. When the Nomad executor sends a SIGINT to docker-compose, it does not clean up the containers and volumes by default; you have to explicitly do docker-compose down. For that and other reasons, we have a wrapper shell script to trap SIGINT. There is an outstanding feature request for 'pre-' and 'post-' task hooks. That should help as long as the post-task hooks get run even when it's triggered by nomad stop.

@dvusboy Could you share your wrapper code please?

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 25, 2022