Huge pages KEP: support multiple sizes huge pages
This change adds support for multiple huge page sizes at the
container level to cover the following use cases:

 - VMs running on a Kubernetes infrastructure (QEMU, libvirt, etc)
 - Applications using more than one huge page size
bart0sh committed Oct 2, 2019
1 parent ac738ac commit 404fb3d
Showing 1 changed file with 100 additions and 21 deletions.
121 changes: 100 additions & 21 deletions keps/sig-node/20190129-hugepages.md
@@ -115,6 +115,7 @@ Example applications include:
- Java applications can back the heap with huge pages using the
`-XX:+UseLargePages` and `-XX:LargePageSizeInBytes` options.
- packet processing systems (DPDK)
- VMs running on top of Kubernetes infrastructure (libvirt, QEMU, etc.)

Applications can generally use huge pages by calling
- `mmap()` with `MAP_ANONYMOUS | MAP_HUGETLB` and use it as anonymous memory
@@ -212,12 +213,7 @@ If a pod consumes huge pages via `shmget`, it must run with a supplemental group
that matches `/proc/sys/vm/hugetlb_shm_group` on the node. Configuration of
this group is outside the scope of this specification.
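
The group requirement above can be illustrated with a minimal sketch. It only covers the group match described in this paragraph; the function name `shm_hugepages_group_ok` is hypothetical, and the sketch assumes the file holds a single numeric group ID.

```python
from pathlib import Path

# Default location named in the text above.
DEFAULT_SHM_GROUP_FILE = "/proc/sys/vm/hugetlb_shm_group"


def shm_hugepages_group_ok(supplemental_gids, shm_group_file=DEFAULT_SHM_GROUP_FILE):
    """Return True if one of the given group IDs matches the node's
    hugetlb_shm_group value.

    Hypothetical helper: it checks the group match only and does not
    model any other way a process might be allowed to use shmget().
    """
    # Assumption: the file contains one integer gid, possibly followed by a newline.
    required_gid = int(Path(shm_group_file).read_text().strip())
    return required_gid in set(supplemental_gids)
```

Inside a container, a call such as `shm_hugepages_group_ok(os.getgroups())` would show whether the pod's supplemental groups include the configured group.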

Initially, a pod may not consume multiple huge page sizes in a single pod spec.
Attempting to use `hugepages-2Mi` and `hugepages-1Gi` in the same pod spec will
fail validation. We believe it is rare for applications to attempt to use
multiple huge page sizes. This restriction may be lifted in the future with
community presented use cases. Introducing the feature with this restriction
limits the exposure of API changes needed when consuming huge pages via volumes.
A pod can consume multiple huge page sizes in a single pod spec.

In order to consume huge pages backed by the `hugetlbfs` filesystem inside the
specified container in the pod, it is helpful to understand the set of mount
```
@@ -231,22 +227,53 @@ mount -t hugetlbfs \
```

The proposal recommends extending the existing `EmptyDirVolumeSource` to satisfy
this use case. A new `medium=HugePages` option would be supported. To write
into this volume, the pod must make a request for huge pages. The `pagesize`
argument is inferred from the `hugepages-<hugepagesize>` from the resource
request. If in the future, multiple huge page sizes are supported in a single
pod spec, we may modify the `EmptyDirVolumeSource` to provide an optional page
size. The existing `sizeLimit` option for `emptyDir` would restrict usage to
the minimum value specified between `sizeLimit` and the sum of huge page limits
of all containers in a pod. This keeps the behavior consistent with memory
backed `emptyDir` volumes whose usage is ultimately constrained by the pod
cgroup sandbox memory settings. The `min_size` option is omitted as it is not
necessary. The `nr_inodes` mount option is omitted at this time in the same
this use case. A new `medium=HugePages[-<hugepagesize>]` option would be
supported. To write into this volume, the pod must make a request for huge
pages. The `pagesize` argument is inferred from the `hugepages-<hugepagesize>`
resource request. The existing `sizeLimit` option for `emptyDir` would
restrict usage to the smaller of `sizeLimit` and the sum of the huge page
limits of all containers in the pod. This keeps the behavior consistent
with memory-backed `emptyDir` volumes, whose usage is ultimately constrained by
the pod cgroup sandbox memory settings. The `min_size` option is omitted as it
is not necessary. The `nr_inodes` mount option is omitted at this time in the same
manner it is omitted with `medium=Memory` when using `tmpfs`.
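
The `sizeLimit` rule can be made concrete with a small sketch. This is not the kubelet's code: `parse_quantity` and `effective_size_limit` are hypothetical names, only the `Ki`/`Mi`/`Gi` suffixes are handled, and the limits passed in are assumed to be the per-container limits for the page size backing the volume.

```python
import re

# Assumption: only these binary suffixes appear, as in the samples in this KEP.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_quantity(quantity):
    """Convert a quantity such as '1Gi' into a number of bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)", quantity)
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def effective_size_limit(size_limit, container_hugepage_limits):
    """Return the usable volume size in bytes.

    The result is the smaller of `size_limit` and the sum of the huge page
    limits of all containers; with no `size_limit`, the sum alone applies.
    """
    total = sum(parse_quantity(limit) for limit in container_hugepage_limits)
    if size_limit is None:
        return total
    return min(parse_quantity(size_limit), total)
```

For example, a `sizeLimit` of `512Mi` on a pod whose only container has a `hugepages-2Mi` limit of `1Gi` would cap the volume at 512Mi, while a `sizeLimit` of `4Gi` would be capped by the 1Gi container limit instead.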

The following is a sample pod that is limited to 1Gi huge pages of size 2Mi. It
can consume those pages using `shmget()` or via `mmap()` with the specified
volume.
The following is a sample pod that is limited to 1Gi of 2Mi huge pages and
2Gi of 1Gi huge pages. It can consume those pages using `shmget()` or via
`mmap()` with the specified volumes.

```
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
...
    volumeMounts:
    - mountPath: /hugepages-2Mi
      name: hugepage-2Mi
    - mountPath: /hugepages-1Gi
      name: hugepage-1Gi
    resources:
      requests:
        hugepages-2Mi: 1Gi
        hugepages-1Gi: 2Gi
      limits:
        hugepages-2Mi: 1Gi
        hugepages-1Gi: 2Gi
  volumes:
  - name: hugepage-2Mi
    emptyDir:
      medium: HugePages-2Mi
  - name: hugepage-1Gi
    emptyDir:
      medium: HugePages-1Gi
```

The following is an example of a pod that is backward compatible with the
current implementation. It uses the `medium: HugePages` notation and
requests huge pages of a single size.

```
apiVersion: v1
@@ -270,6 +297,58 @@ spec:
medium: HugePages
```

This is an example of an invalid pod that requests huge pages of two
different sizes, but doesn't use the `medium: HugePages-<size>` notation.

```
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
...
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      requests:
        hugepages-2Mi: 1Gi
        hugepages-1Gi: 2Gi
      limits:
        hugepages-2Mi: 1Gi
        hugepages-1Gi: 2Gi
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```

This is another example of an invalid pod. It requests hugepages of 2Mi
size, but specifies 1Gi in `medium: HugePages-1Gi`.

```
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
...
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      requests:
        hugepages-2Mi: 1Gi
      limits:
        hugepages-2Mi: 1Gi
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages-1Gi
```
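
The valid and invalid cases above suggest two checks, sketched here in Python. This is an illustration of the rules as read from the samples, not the actual Kubernetes validation code; `requested_page_sizes` and `validate_hugepage_volumes` are hypothetical names, and the inputs are simplified to a resource-name mapping and a list of `medium` strings.

```python
RESOURCE_PREFIX = "hugepages-"   # resource names such as hugepages-2Mi
MEDIUM_PLAIN = "HugePages"       # backward-compatible notation
MEDIUM_PREFIX = "HugePages-"     # sized notation such as HugePages-1Gi


def requested_page_sizes(resources):
    """Collect page sizes ('2Mi', '1Gi', ...) from huge page resource names."""
    return {
        name[len(RESOURCE_PREFIX):]
        for name in resources
        if name.startswith(RESOURCE_PREFIX)
    }


def validate_hugepage_volumes(resources, mediums):
    """Return a list of error messages; an empty list means the checks pass."""
    sizes = requested_page_sizes(resources)
    errors = []
    for medium in mediums:
        if medium == MEDIUM_PLAIN:
            # The page size is inferred, so exactly one size must be requested.
            if len(sizes) != 1:
                errors.append(
                    f"medium {medium!r} needs exactly one requested huge page "
                    f"size, found {len(sizes)}"
                )
        elif medium.startswith(MEDIUM_PREFIX):
            size = medium[len(MEDIUM_PREFIX):]
            # The size named in the medium must match a requested resource.
            if size not in sizes:
                errors.append(f"medium {medium!r} has no matching hugepages-{size} request")
    return errors
```

Applied to the samples, the pod with `HugePages-2Mi` and `HugePages-1Gi` volumes passes, the pod that pairs two requested sizes with `medium: HugePages` fails the first check, and the pod that requests only `hugepages-2Mi` but mounts `HugePages-1Gi` fails the second.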

#### CRI Updates

The `LinuxContainerResources` message should be extended to support specifying
@@ -363,4 +442,4 @@ Beta support for huge pages
### Version 1.14

GA support for huge pages proposed based on feedback from user community
using the feature without issue.
