-
Notifications
You must be signed in to change notification settings - Fork 103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ring buffers mapped to userspace #2259
Conversation
fw/ringbuffer.c
Outdated
} | ||
|
||
smp_store_release(&rb->head, head); | ||
smp_mb(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you use mem barrier with per-cpu data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, it always executes in softirq context, and preemption disabling is excessive. Will get rid of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the point is that atomics imply full barriers on x86-64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I answered to the comment about preempt_disable.
So we can do without smp_mb, don't we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we can do without smp_mb, don't we?
Yes, we don't need smp_mb
fw/ringbuffer.c
Outdated
if (atomic_read(&rba->unmapped)) | ||
return -EAGAIN; | ||
|
||
preempt_disable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need preempt_disable
as we discussed in private discussion since this function is always executed in softirq
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
fw/ringbuffer.c
Outdated
static bool proc_file_is_open; | ||
|
||
int | ||
tfw_ringbuffer_write(TfwStr **strs, unsigned int count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure but it seems that we can use just
int
tfw_ringbuffer_write(TfwStr *str)
because TfwStr can contains a lot of TfwStr as chunks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought a compound TfwStr can't contain compound chunks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We agreed on the call that this API introduces unwanted overhead on forming an array, so probably we should just provide a raw memory and a commit call, which moves the tail
fw/ringbuffer.c
Outdated
head = rb->head; | ||
tail = smp_load_acquire(&rb->tail); | ||
|
||
for (i = 0; i < count; ++i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we pass single TfwStr with a lot of chunks, we can skip this loop.
fw/ringbuffer.c
Outdated
} | ||
|
||
void | ||
tfw_ringbuffer_test_set_unmapped(int unmapped) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that this function is used only for test purpose, maybe move it to unit tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this function we change a field of an internal object.
OK, maybe I can move this field to TfwRingbufer structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM after cleanups mentioned by @const-t and @EvgeniiMekhanik . But please also have a look onto other RB implementations - maybe we can get ready to use implementation. For now the implementation lacks the file operations, but later we might face a bigger problem to let user-space threads sleep on no events and make it low overhead
fw/ringbuffer.h
Outdated
* to be copied between the kernel and user space. Instead, user-space threads | ||
* can directly access the data in the kernel’s memory, greatly improving | ||
* performance by avoiding the overhead of traditional system calls and memory | ||
* copying. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite a good explanation what's going on in the subsystem.
Having that the kernel already has plenty of ringbuffers, could you please describe why didn't we use any of them, why didn't we port the prototype of a generic ring buffer? RBs to consider:
- relay (also see kernel and user space examples) - provides kernel-write-only and user-read-only per-cpu buffers with fixed-size subbufers. In our case, either with this task or Kernel-User Space Transport #77, we have to deal with records of varying length, which may hard to determine subbufers length and use them efficiently. Also both
relay_reserve()
andrelay_write()
require data length knowledge before the call, which will require us to traverse data twice. The implementation provides a sleeping mechanism for the user-space, so the ringbuffer files can bepoll()
'ed. At some point we'll need a sleeping for user space functionality, but for now the sleepable kernel functions can not be used in softirq. - the new generic ringbuffer isn't merged yet and also usese sleepable functions, which can not be used in softirq. Also seems lack of per-cpu mode and this functionality must be done though different files manually.
io_uring
is very powerful, but requires CQ and SQ, which seems just an additional overhead for uspacket_ring_buffer
(net/packet/internal.h
, used in packet mmap) is suitable only for pages (network frames) transmission, not a generic many-per-page transmission (I did only a quick check though)- I didn't check the ring buffer, used in trafecfs, but it seems also for page pointers to store/read.
- Also I had only a quick look onto the perf ring buffer - seems has perf-specific logic embedded into the buffer.
The point is that we need a very good motivation to write our own ring buffer, especially with that we're going to extend it for #77. This research of other implementations is also crucial to understand how can we introduce sleeping functionality for the user-space threads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually my current version also require data length knowledge before the call.
Maybe we get rid of it if we allow incomplete data blocks. As you mentioned early, we could send fields as TLV and mark the end of the data block with a special "end" TLV. So we could get the size of the room between tail and head before writing, and on every field addition we have to make sure if there is enough place for this field and ending TLV.
Will add the motivation to the comments in the code.
fw/ringbuffer.c
Outdated
} | ||
|
||
smp_store_release(&rb->head, head); | ||
smp_mb(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the point is that atomics imply full barriers on x86-64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some logic absent - please address it in a next PR
In #537 we need a way to deliver log data to userspace. Introduce a set of per-cpu ring buffer mapped to userspace. Signed-off-by: Alexander Ivanov <[email protected]>
Signed-off-by: Alexander Ivanov <[email protected]>
Signed-off-by: Alexander Ivanov <[email protected]>
__alloc_percpu_gfp() is not required, we can use alloc_percpu_gfp(). Signed-off-by: Alexander Ivanov <[email protected]>
Signed-off-by: Alexander Ivanov <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I commented some issues, but there are also pending reviews from @const-t and @EvgeniiMekhanik , so I just approve since there is no any design issues, and leave the rest for your fixes and reviews.
void | ||
tfw_mmap_buffer_get_room(TfwMmapBufferHolder *holder, | ||
char **part1, unsigned int *size1, | ||
char **part2, unsigned int *size2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Identation problem, please :set tabstop=8
in your vim config file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE
|
||
*size2 = 0; | ||
|
||
if (!atomic_read(&buf->is_ready)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!atomic_read(&buf->is_ready)) { | |
if (unlikely(!atomic_read(&buf->is_ready))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
|
||
head = buf->head % buf->size; | ||
tail = smp_load_acquire(&buf->tail) % buf->size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
buf->size
isn't a power of 2, so compiler can't optimize the code to and
and div
is the slowest CPU instruction, which we call here twice. We can loose one page to get aligned RB pointers or you can add 32
on wrapping up.
This division is on data plane, so we're not OK with it as with division in dev_file_mmap()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is - we allocate the buffer by alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, order)
, and there is two bad way: we use division or we lose almost a half of the buffer size (minus header size).
Maybe we can use another function for allocation which doesn't restrict memory size by a power of 2. But I haven't found appropriate one.
I don't know... Maybe we can use __alloc_percpu and get struct page pointers walking through every allocated page. I considered this way but it looked a little bit ugly for me. Is there another ways?
EXPECT_EQ(r, expect_wr); | ||
if (r) | ||
return; | ||
tfw_mmap_buffer_commit(holder, size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API looks good and I didn't get from #2259 (comment)
Actually my current version also require data length knowledge before the call.
It seems now we get as much memory as RB has and can write as much data as we need (and can), so there is no need to compute the size of all written data upfront.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in this variant we don't need to compute whole event size before writing.
int i; | ||
|
||
for (i = 0; i < holders_cnt; ++i) { | ||
if (strcmp(holders[i]->dev_name, (char *)filp->f_path.dentry->d_iname) == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
too wide line: we that's OK to slightly exceed 80 characters per line, but not so much
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
device_create(holder->dev_class, NULL, | ||
MKDEV(holder->dev_major, 0), NULL, filename); | ||
strscpy(holder->dev_name, filename, sizeof(holder->dev_name)); | ||
holders[holders_cnt++] = holder; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
holders_cnt
can exceed MAX_HOLDERS
int dev_major; | ||
struct class *dev_class; | ||
struct page *pg[]; | ||
} TfwMmapBufferHolder; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coding style: could you please align the data structure members and adjust comments like for example in https://github.com/tempesta-tech/tempesta/blob/master/fw/cache.c#L129 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
|
||
holder = kmalloc(sizeof(TfwMmapBufferHolder) + | ||
sizeof(struct page *) * num_online_cpus(), | ||
GFP_KERNEL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems you need to zeroize the memory (or GFP_ZERO) since you check holder->pg[cpu]
for NULL
in tfw_mmap_buffer_free()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I saw it and replaced by kzalloc.
return -EAGAIN; | ||
|
||
vma->vm_ops = &dev_vm_ops; | ||
(void)dev_vm_ops; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I din't get the sense of (void)dev_vm_ops
statement...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't notice, left after debugging.
#undef NTH_ONLINE_CPU | ||
} | ||
|
||
static void dev_file_vm_close(struct vm_area_struct *vma) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static void dev_file_vm_close(struct vm_area_struct *vma) | |
static void | |
dev_file_vm_close(struct vm_area_struct *vma) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
Should we just close it in favor of #2273 ? |
return NULL; | ||
|
||
holder = kmalloc(sizeof(TfwMmapBufferHolder) + | ||
sizeof(struct page *) * num_online_cpus(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Broken aligment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used 4-spaces tabs. Fixed.
int cpu; | ||
|
||
if (size < TFW_MMAP_BUFFER_MIN_SIZE | ||
|| size > TFW_MMAP_BUFFER_MAX_SIZE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Broken aligment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used 4-spaces tabs. Fixed.
fw/mmap_buffer.c
Outdated
|
||
holder->dev_major = -1; | ||
holder->buf = __alloc_percpu_gfp(sizeof(TfwMmapBuffer *), | ||
sizeof(u64), GFP_KERNEL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
broken aligment
fw/mmap_buffer.c
Outdated
order = get_order(size); | ||
|
||
holder->dev_major = -1; | ||
holder->buf = __alloc_percpu_gfp(sizeof(TfwMmapBuffer *), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that __alloc_percpu_gfp
can return NULL, so we should handle this case
goto err; | ||
} | ||
|
||
holder->dev_class = class_create(THIS_MODULE, filename); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class_create can fails, so we should check dev_class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
} | ||
|
||
holder->dev_class = class_create(THIS_MODULE, filename); | ||
device_create(holder->dev_class, NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
device_create can also fails.so we should handle this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
|
||
#undef MAX_SIZE | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add simple test to check device createion.
I add
TEST(tfw_mmap_buffer, create_dev)
{
holder = tfw_mmap_buffer_create("test", TFW_MMAP_BUFFER_MIN_SIZE);
EXPECT_NOT_NULL(holder);
EXPECT_NULL(tfw_mmap_buffer_create("test", TFW_MMAP_BUFFER_MIN_SIZE));
tfw_mmap_buffer_free(holder);
}
And kernel crashes because we don't handle invalid device creation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
|
||
if (holder->dev_major > 0) { | ||
device_destroy(holder->dev_class, MKDEV(holder->dev_major, 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that deinitialization should be more complicated, because device can be NULL and class can be NULL, if class or device creation fails
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
atomic_set(&holder->is_freeing, 1); | ||
|
||
for_each_online_cpu(cpu) { | ||
TfwMmapBuffer *buf = *per_cpu_ptr(holder->buf, cpu); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
buf can be NULL here if holder->pg[cpu] = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, order);
fails during buffer creation and kenrel crashes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
order = get_order(size); | ||
|
||
holder->dev_major = -1; | ||
holder->buf = (TfwMmapBuffer **)alloc_percpu_gfp(sizeof(TfwMmapBuffer *), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alloc_percpu_gfp
can fails we should handle this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, will fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need fixes
Looks through all the comments. All issues are fixed. |
No description provided.