As described in issue #13, we need to persist our container data using Docker volumes.
Docker Volume Documentation
Data that needs to be persisted
This should only affect the HDFS service of our Hadoop image. The three HDFS daemons that store data requiring persistence, and their associated properties in hdfs-site.xml, are:
HDFS NameNode: dfs.namenode.name.dir (default: file://${hadoop.tmp.dir}/dfs/name)
HDFS Secondary NameNode: dfs.namenode.checkpoint.dir (default: file://${hadoop.tmp.dir}/dfs/namesecondary)
HDFS DataNode: dfs.datanode.data.dir (default: file://${hadoop.tmp.dir}/dfs/data)
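If we relocate all three directories under a single root, one volume can cover them all. A minimal sketch of what that could look like in hdfs-site.xml, assuming a /hadoop/dfs root (the path is an assumption, not a decision):

```xml
<!-- hdfs-site.xml: group the three persistent HDFS directories under one root
     so that a single volume (/hadoop/dfs, assumed) covers all of them -->
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>file:///hadoop/dfs/namesecondary</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///hadoop/dfs/data</value>
  </property>
</configuration>
```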
Other data
The other properties that use the hadoop.tmp.dir property as a variable:
core-site.xml:
io.seqfile.local.dir = ${hadoop.tmp.dir}/io/local
fs.s3.buffer.dir = ${hadoop.tmp.dir}/s3
fs.s3a.buffer.dir = ${hadoop.tmp.dir}/s3a
yarn-site.xml:
yarn.resourcemanager.fs.state-store.uri = ${hadoop.tmp.dir}/yarn/system/rmstore
yarn.nodemanager.local-dirs = ${hadoop.tmp.dir}/nm-local-dir
yarn.nodemanager.recovery.dir = ${hadoop.tmp.dir}/yarn-nm-recovery
yarn.timeline-service.leveldb-timeline-store.path = ${hadoop.tmp.dir}/yarn/timeline
mapred-site.xml:
mapreduce.cluster.local.dir = ${hadoop.tmp.dir}/mapred/local
mapreduce.jobtracker.system.dir = ${hadoop.tmp.dir}/mapred/system
mapreduce.jobtracker.staging.root.dir = ${hadoop.tmp.dir}/mapred/staging
mapreduce.cluster.temp.dir = ${hadoop.tmp.dir}/mapred/staging
mapreduce.jobhistory.recovery.store.fs.uri = ${hadoop.tmp.dir}/mapred/history/recoverystore
Most of these directories store temporary or intermediate data, or are related to the MapReduce framework. Since we won't support MapReduce in phase one, it is probably fine to keep the default values for these properties.
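For reference, all of the defaults above derive from a single knob, hadoop.tmp.dir, which is set in core-site.xml. A minimal illustration with its stock default value:

```xml
<!-- core-site.xml: the base path every default above is derived from -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
  </property>
</configuration>
```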
Approach 1: Using the VOLUME instruction in the Dockerfile
I've experimented a bit with the VOLUME instruction (in the Dockerfile), and here's how it works:
When the container is created, 2 read-write layers are created: 1 for the container data (as usual), and 1 for the volume.
Another container on the same host can share that container's volume by using docker run --volumes-from.
The original container can be safely destroyed, and the volume won't be deleted.
When all containers using the volume are deleted, then the volume becomes unavailable.
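A minimal sketch of this behaviour; the volume path (/hadoop/dfs), the image name (our-hadoop-image) and the container names are assumptions:

```dockerfile
# Dockerfile: declare the HDFS data root as a volume
VOLUME /hadoop/dfs
```

```bash
# creating a container from the image also creates an anonymous volume for /hadoop/dfs
docker run -d --name hdfs1 our-hadoop-image

# a second container on the same host can share that volume
docker run -d --name hdfs2 --volumes-from hdfs1 our-hadoop-image
```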
As stated in the docs, if one deletes the container and forgets to use docker rm -v, the volume data is NOT deleted:
"If you remove containers without using the -v option, you may end up with 'dangling' volumes; volumes that are no longer referenced by a container. Dangling volumes are difficult to get rid of and can take up a large amount of disk space. We're working on improving volume management and you can check progress on this in pull request moby/moby#8484."
To work around this limitation, Docker recommends the Data Volume Container pattern, which consists of creating a container for the sole purpose of keeping a reference to the volume layer.
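A sketch of that pattern, under the same assumed names and paths:

```bash
# a container whose only purpose is to hold a reference to the volume
docker create -v /hadoop/dfs --name hdfs-data our-hadoop-image

# service containers attach to the volume through it; hdfs-data itself never runs
docker run -d --name hdfs1 --volumes-from hdfs-data our-hadoop-image
```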
Approach 2: Mounting a volume from the host
An alternative is to skip the VOLUME instruction entirely and instead mount a volume from the host when creating the container, using docker run -v /path-in-the-host:/path-in-the-container.
With this approach, Docker does not create an additional layer, and the data can still be shared among containers on the same host. It is probably even slightly more efficient performance-wise.
The only benefit we lose compared to the first approach is that the VOLUME instruction is quite explicit: it helps inform users of the directories that need persistence.
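For example, with an assumed host directory of /data/hdfs:

```bash
# bind-mount a host directory instead of letting Docker manage a volume layer
docker run -d --name hdfs1 -v /data/hdfs:/hadoop/dfs our-hadoop-image
```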
Approach 3: best of both worlds
It turns out that we can use the VOLUME instruction in the Dockerfile AND override it at container creation time using the '-v' parameter of docker run. When doing so, Docker will NOT create a layer for the volume, since it can use the mount point from the host.
And a user wishing to use the Data Volume Container pattern from the first approach can still do so.
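A sketch of the combined approach, with the same assumed names and paths:

```bash
# the Dockerfile still declares "VOLUME /hadoop/dfs" as documentation;
# overriding it with a host mount at creation time avoids the extra volume layer
docker run -d --name hdfs1 -v /data/hdfs:/hadoop/dfs our-hadoop-image

# alternatively, the Data Volume Container pattern from the first approach still works
docker create -v /hadoop/dfs --name hdfs-data our-hadoop-image
docker run -d --name hdfs1 --volumes-from hdfs-data our-hadoop-image
```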
Proposal
To format the namenode (run this only once):
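A minimal sketch of what this step could look like using approach 3; the image name, mount paths and command are assumptions, not the final proposal:

```bash
docker run --rm -v /data/hdfs:/hadoop/dfs our-hadoop-image hdfs namenode -format
```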
To run the namenode:
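And, under the same assumptions:

```bash
docker run -d --name namenode -v /data/hdfs:/hadoop/dfs our-hadoop-image hdfs namenode
```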