mm: support vector address ranges for process_madvise
This patch changes the process_madvise interface:

  a) support vector address ranges in a single system call
  b) support vector address ranges for the local process as well as
     external processes
  c) remove pid and keep only pidfd as an argument - [1][2]
  d) change the type of flags to unsigned int

Android apps have thousands of vmas due to zygote, so it is a waste of
CPU and power to call the syscall once for each vma.  (In testing, one
vectored syscall covering 2000 vmas showed a 15% performance improvement
over 2000 single-vma syscalls.  I think the gain would be bigger in
practice because the test ran in a very cache-friendly environment.)

Another potential use case for the vector range is to amortize the cost of
TLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
benefit users like TCP receive zerocopy and malloc implementations.  In
the future, we could find more use cases for other advice values, so let's
make it part of the API now, since we are introducing a new syscall at this
moment.  With that, existing madvise(2) users could replace it with
process_madvise(2) targeting their own process if they want batched
address range support.

So finally, the API is as follows,

      ssize_t process_madvise(int pidfd, const struct iovec *iovec,
      		unsigned long vlen, int advice, unsigned int flags);

    DESCRIPTION
      The process_madvise() system call is used to give advice or directions
      to the kernel about the address ranges of an external process as well
      as the local process. It applies the advice to the address ranges of
      the process described by iovec and vlen. The goal of such advice is to
      improve system or application performance.

      The pidfd argument selects the process referred to by the PID file
      descriptor. (See pidfd_open(2) for further information.)

      The pointer iovec points to an array of iovec structures, defined in
      <sys/uio.h> as:

        struct iovec {
            void *iov_base;         /* starting address */
            size_t iov_len;         /* number of bytes to be advised */
        };

      Each iovec element describes an address range beginning at address
      iov_base and spanning iov_len bytes.

      The vlen represents the number of elements in iovec.
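
As an illustration of how a caller might describe several ranges, the following user-space sketch builds an iovec array and wraps the raw syscall. The helper names (make_range, total_bytes, process_madvise_sketch) are hypothetical and not part of this patch, and the wrapper assumes the installed kernel headers define SYS_process_madvise; where they do not, it simply fails with ENOSYS.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: describe one address range as an iovec element. */
static inline struct iovec make_range(void *start, size_t len)
{
	struct iovec iov = { .iov_base = start, .iov_len = len };

	return iov;
}

/* Hypothetical helper: total number of bytes requested across all ranges. */
static inline size_t total_bytes(const struct iovec *iov, size_t vlen)
{
	size_t sum = 0;
	size_t i;

	for (i = 0; i < vlen; i++)
		sum += iov[i].iov_len;
	return sum;
}

/*
 * Hypothetical wrapper using the argument order given in the API above.
 * Assumption: SYS_process_madvise comes from the system headers; if it
 * is missing, the call is reported as unsupported.
 */
static inline ssize_t process_madvise_sketch(int pidfd,
		const struct iovec *iov, unsigned long vlen,
		int advice, unsigned int flags)
{
#ifdef SYS_process_madvise
	return syscall(SYS_process_madvise, pidfd, iov, vlen, advice, flags);
#else
	(void)pidfd; (void)iov; (void)vlen; (void)advice; (void)flags;
	errno = ENOSYS;
	return -1;
#endif
}
```

A caller would fill one element per range with make_range() and pass the array together with its element count as vlen.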

      The advice is indicated in the advice argument, which is currently
      one of the following if the target process specified by pidfd is
      external:

        MADV_COLD
        MADV_PAGEOUT
        MADV_MERGEABLE
        MADV_UNMERGEABLE

      Permission to provide a hint to an external process is governed by a
      ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).

      The process_madvise() call supports every advice that madvise(2)
      supports if the target process is in the same thread group as the
      calling process, so users can use process_madvise(2) to extend
      existing madvise(2) usage with vector address ranges.

    RETURN VALUE
      On success, process_madvise() returns the number of bytes advised.
      This return value may be less than the total number of requested
      bytes if an error occurred. The caller should check the return value
      to determine whether the advice was only partially applied.
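
One way a caller could act on a short return value is to map the byte count back to an index in the iovec array. The sketch below uses a hypothetical helper, first_unadvised, which is not part of this patch; it assumes the kernel advances one whole iovec element at a time, as the loop in this patch does, so the count is always a sum of complete elements.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: given the value returned by process_madvise(),
 * return the index of the first range that was not advised, or vlen if
 * every range was covered. A negative return value (error) maps to 0.
 */
static inline size_t first_unadvised(const struct iovec *iov, size_t vlen,
		ssize_t advised)
{
	size_t done = advised > 0 ? (size_t)advised : 0;
	size_t i;

	for (i = 0; i < vlen; i++) {
		if (done < iov[i].iov_len)
			break;		/* this range was not fully covered */
		done -= iov[i].iov_len;
	}
	return i;
}
```

If the returned index is smaller than vlen, the caller can retry from that element or report the failure.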

[1] https://lore.kernel.org/linux-mm/20200509124817.xmrvsrq3mla6b76k@wittgenstein/
[2] https://lore.kernel.org/linux-mm/[email protected]/

Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: Suren Baghdasaryan <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Arjun Roy <[email protected]>
Cc: Tim Murray <[email protected]>
Cc: Daniel Colascione <[email protected]>
Cc: Sonny Rao <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: John Dias <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Oleksandr Natalenko <[email protected]>
Cc: Sandeep Patil <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Christian Brauner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
minchank authored and sfrothwell committed Jun 5, 2020
1 parent 364c32a commit 90a50b7
Showing 1 changed file with 40 additions and 7 deletions.
47 changes: 40 additions & 7 deletions mm/madvise.c
@@ -1212,20 +1212,39 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 	return do_madvise(current, current->mm, start, len_in, behavior);
 }

-SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid, unsigned long, start,
-		size_t, len_in, int, behavior, unsigned long, flags)
+static int do_process_madvise(struct task_struct *target_task,
+		struct mm_struct *mm, struct iov_iter *iter, int behavior)
 {
-	int ret;
+	struct iovec iovec;
+	int ret = 0;
+
+	while (iov_iter_count(iter)) {
+		iovec = iov_iter_iovec(iter);
+		ret = do_madvise(target_task, mm, (unsigned long)iovec.iov_base,
+					iovec.iov_len, behavior);
+		if (ret < 0)
+			break;
+		iov_iter_advance(iter, iovec.iov_len);
+	}
+
+	return ret;
+}
+
+SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid,
+		const struct iovec __user *, vec, unsigned long, vlen,
+		int, behavior, unsigned long, flags)
+{
+	ssize_t ret;
 	struct pid *pid;
 	struct task_struct *task;
 	struct mm_struct *mm;
+	struct iovec iovstack[UIO_FASTIOV];
+	struct iovec *iov = iovstack;
+	struct iov_iter iter;

 	if (flags != 0)
 		return -EINVAL;

-	if (!process_madvise_behavior_valid(behavior))
-		return -EINVAL;
-
 	switch (which) {
 	case P_PID:
 		if (upid <= 0)
@@ -1253,13 +1272,27 @@ SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid, unsigned long, start,
 		goto put_pid;
 	}

+	if (task->mm != current->mm &&
+			!process_madvise_behavior_valid(behavior)) {
+		ret = -EINVAL;
+		goto release_task;
+	}
+
 	mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
 	if (IS_ERR_OR_NULL(mm)) {
 		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
 		goto release_task;
 	}

-	ret = do_madvise(task, mm, start, len_in, behavior);
+	ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
+	if (ret >= 0) {
+		size_t total_len = iov_iter_count(&iter);
+
+		ret = do_process_madvise(task, mm, &iter, behavior);
+		if (ret >= 0)
+			ret = total_len - iov_iter_count(&iter);
+		kfree(iov);
+	}
 	mmput(mm);
 release_task:
 	put_task_struct(task);
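
The control flow of the new loop and the return-value computation can be modelled in user space. This is only a sketch under stated assumptions: advise_ranges_model, advise_fn and reject_unaligned are hypothetical names, a callback stands in for do_madvise(), and a plain array index replaces the iov_iter. As far as the diff shows, a failing range makes this version return the error code rather than a partial byte count, and the model mirrors that.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical stand-in for do_madvise(): 0 on success, negative errno. */
typedef int (*advise_fn)(void *start, size_t len, int behavior);

/* Model of the per-range loop: stop at the first failing range. */
static inline ssize_t advise_ranges_model(const struct iovec *iov,
		size_t vlen, int behavior, advise_fn advise)
{
	size_t total = 0, remaining;
	size_t i;
	int ret = 0;

	for (i = 0; i < vlen; i++)
		total += iov[i].iov_len;
	remaining = total;	/* plays the role of iov_iter_count() */

	for (i = 0; i < vlen && remaining; i++) {
		ret = advise(iov[i].iov_base, iov[i].iov_len, behavior);
		if (ret < 0)
			break;
		remaining -= iov[i].iov_len;	/* like iov_iter_advance() */
	}

	/* Error wins; otherwise report the bytes consumed. */
	return ret < 0 ? ret : (ssize_t)(total - remaining);
}

/* Hypothetical callback for illustration: reject non-4KiB-aligned starts. */
static inline int reject_unaligned(void *start, size_t len, int behavior)
{
	(void)len;
	(void)behavior;
	return ((uintptr_t)start % 4096) ? -EINVAL : 0;
}
```

Feeding the model an array whose second element is misaligned yields -EINVAL, while a fully valid array yields the sum of all iov_len values.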
