Add ring buffers mapped to userspace #2259

ai-tmpst · 2024-10-10T10:29:20Z

No description provided.

const-t · 2024-10-11T09:33:07Z

fw/ringbuffer.c

+	}
+
+	smp_store_release(&rb->head, head);
+	smp_mb();


Why do you use a memory barrier with per-cpu data?

You are right, it always executes in softirq context, and preemption disabling is excessive. Will get rid of it.

I think the point is that atomics imply full barriers on x86-64

Oops, I was answering the comment about preempt_disable.

So we can do without smp_mb, can't we?

Yes, we don't need smp_mb
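
The conclusion of this thread can be illustrated with a small userspace model. This is only a sketch under assumptions: it uses C11 atomics, where a release store is roughly analogous to smp_store_release() and an acquire load to smp_load_acquire(), and every name in it (demo_rb, demo_rb_write, demo_rb_read, DEMO_RB_SIZE) is hypothetical rather than taken from the PR. The producer publishes the new head with a release store and issues no additional full barrier afterwards.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_RB_SIZE 64u	/* hypothetical size, must be a power of two */

struct demo_rb {
	_Atomic unsigned int head;	/* advanced by the producer only */
	_Atomic unsigned int tail;	/* advanced by the consumer only */
	char data[DEMO_RB_SIZE];
};

/* Producer: copy the record first, then publish it via the head. */
static int
demo_rb_write(struct demo_rb *rb, const char *src, unsigned int len)
{
	unsigned int head = atomic_load_explicit(&rb->head,
						 memory_order_relaxed);
	unsigned int tail = atomic_load_explicit(&rb->tail,
						 memory_order_acquire);
	unsigned int i;

	if (DEMO_RB_SIZE - (head - tail) < len)
		return -1;	/* not enough room */

	for (i = 0; i < len; ++i)
		rb->data[(head + i) & (DEMO_RB_SIZE - 1)] = src[i];

	/*
	 * The release store orders the data writes above before the new
	 * head value; no separate full barrier follows it.
	 */
	atomic_store_explicit(&rb->head, head + len, memory_order_release);
	return 0;
}

/* Consumer: the acquire load of head pairs with the release store. */
static unsigned int
demo_rb_read(struct demo_rb *rb, char *dst, unsigned int max)
{
	unsigned int tail = atomic_load_explicit(&rb->tail,
						 memory_order_relaxed);
	unsigned int head = atomic_load_explicit(&rb->head,
						 memory_order_acquire);
	unsigned int n = head - tail, i;

	if (n > max)
		n = max;
	for (i = 0; i < n; ++i)
		dst[i] = rb->data[(tail + i) & (DEMO_RB_SIZE - 1)];

	/* Hand the consumed space back to the producer. */
	atomic_store_explicit(&rb->tail, tail + n, memory_order_release);
	return n;
}

/* Usage: one write followed by one read of the same record. */
static void
demo_rb_example(void)
{
	static struct demo_rb rb;	/* zero-initialized */
	char out[8];

	assert(demo_rb_write(&rb, "abc", 3) == 0);
	assert(demo_rb_read(&rb, out, sizeof(out)) == 3);
}
```

The model assumes exactly one producer and one consumer per buffer, which matches the per-cpu layout discussed above only if each buffer has a single reader.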

EvgeniiMekhanik · 2024-10-11T17:34:53Z

fw/ringbuffer.c

+	if (atomic_read(&rba->unmapped))
+		return -EAGAIN;
+
+	preempt_disable();


As we discussed privately, we don't need preempt_disable here, since this function is always executed in softirq context.

EvgeniiMekhanik · 2024-10-11T17:37:49Z

fw/ringbuffer.c

+static bool proc_file_is_open;
+
+int
+tfw_ringbuffer_write(TfwStr **strs, unsigned int count)


I am not sure but it seems that we can use just

int tfw_ringbuffer_write(TfwStr *str)

because a TfwStr can contain many TfwStr chunks.

I thought a compound TfwStr can't contain compound chunks.

We agreed on the call that this API introduces unwanted overhead for forming an array, so we should probably just provide raw memory and a commit call, which moves the tail.

EvgeniiMekhanik · 2024-10-11T17:38:36Z

fw/ringbuffer.c

+	head = rb->head;
+	tail = smp_load_acquire(&rb->tail);
+
+	for (i = 0; i < count; ++i)


If we pass single TfwStr with a lot of chunks, we can skip this loop.

EvgeniiMekhanik · 2024-10-11T17:42:02Z

fw/ringbuffer.c

+}
+
+void
+tfw_ringbuffer_test_set_unmapped(int unmapped)


It seems that this function is used only for testing; maybe move it to the unit tests?

In this function we change a field of an internal object.
OK, maybe I can move this field to TfwRingbufer structure.

krizhanovsky

LGTM after the cleanups mentioned by @const-t and @EvgeniiMekhanik. But please also have a look at other RB implementations - maybe we can find a ready-to-use implementation. For now the implementation lacks the file operations, but later we might face a bigger problem: letting user-space threads sleep when there are no events while keeping the overhead low.

krizhanovsky · 2024-10-13T21:41:40Z

fw/ringbuffer.h

+ * to be copied between the kernel and user space. Instead, user-space threads
+ * can directly access the data in the kernel’s memory, greatly improving
+ * performance by avoiding the overhead of traditional system calls and memory
+ * copying.


Quite a good explanation of what's going on in the subsystem.

Given that the kernel already has plenty of ring buffers, could you please describe why we didn't use any of them, and why we didn't port the prototype of a generic ring buffer? RBs to consider:

relay (also see the kernel and user space examples) - provides kernel-write-only and user-read-only per-cpu buffers with fixed-size subbuffers. In our case, either with this task or Kernel-User Space Transport #77, we have to deal with records of varying length, which may make it hard to determine the subbuffer length and to use the subbuffers efficiently. Also, both relay_reserve() and relay_write() require knowing the data length before the call, which would require us to traverse the data twice. The implementation provides a sleeping mechanism for user space, so the ring buffer files can be poll()'ed. At some point we'll need sleeping functionality for user space, but for now the sleepable kernel functions cannot be used in softirq.

the new generic ringbuffer isn't merged yet and also uses sleepable functions, which cannot be used in softirq. It also seems to lack a per-cpu mode, so this functionality must be done through different files manually.

io_uring is very powerful, but requires CQ and SQ, which seems to be just additional overhead for us

packet_ring_buffer (net/packet/internal.h, used in packet mmap) is suitable only for transmitting pages (network frames), not for generic many-per-page transmission (I did only a quick check though)

I didn't check the ring buffer used in tracefs, but it also seems to be for storing and reading page pointers.

Also, I had only a quick look at the perf ring buffer - it seems to have perf-specific logic embedded into the buffer.

The point is that we need a very good motivation to write our own ring buffer, especially given that we're going to extend it for #77. This research into other implementations is also crucial to understand how we can introduce sleeping functionality for the user-space threads.

Actually, my current version also requires knowing the data length before the call.
Maybe we can get rid of that if we allow incomplete data blocks. As you mentioned earlier, we could send fields as TLVs and mark the end of the data block with a special "end" TLV. Then we could get the size of the room between tail and head before writing, and on every field addition we would have to make sure there is enough space for this field and the ending TLV.

Will add the motivation to the comments in the code.

krizhanovsky · 2024-10-14T11:06:06Z

fw/ringbuffer.c

+	}
+
+	smp_store_release(&rb->head, head);
+	smp_mb();


I think the point is that atomics imply full barriers on x86-64

krizhanovsky

Some logic is missing - please address it in the next PR.

krizhanovsky

I commented on some issues, but there are also pending reviews from @const-t and @EvgeniiMekhanik, so I'm approving since there are no design issues, and I leave the rest to your fixes and their reviews.

krizhanovsky · 2024-10-26T14:30:20Z

fw/mmap_buffer.c

+void
+tfw_mmap_buffer_get_room(TfwMmapBufferHolder *holder,
+						 char **part1, unsigned int *size1,
+						 char **part2, unsigned int *size2)


Indentation problem; please :set tabstop=8 in your vim config file

krizhanovsky · 2024-10-27T15:48:58Z

fw/mmap_buffer.c

+
+	*size2 = 0;
+
+	if (!atomic_read(&buf->is_ready)) {


Suggested change

-	if (!atomic_read(&buf->is_ready)) {
+	if (unlikely(!atomic_read(&buf->is_ready))) {

krizhanovsky · 2024-10-27T16:41:45Z

fw/mmap_buffer.c

+	}
+
+	head = buf->head % buf->size;
+	tail = smp_load_acquire(&buf->tail) % buf->size;


buf->size isn't a power of 2, so the compiler can't optimize the modulo into an AND, and div is one of the slowest CPU instructions, which we execute here twice. We can lose one page to get aligned RB pointers, or you can add 32 on wrapping up.

This division is on the data plane, so we're not OK with it the way we are with the division in dev_file_mmap().

The problem is that we allocate the buffer with alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, order), and there are two bad options: either we use division, or we lose almost half of the buffer size (minus the header size).
Maybe we can use another allocation function that doesn't restrict the memory size to a power of 2, but I haven't found an appropriate one.
I don't know... Maybe we can use __alloc_percpu and get the struct page pointers by walking through every allocated page. I considered this approach, but it looked a little ugly to me. Are there other ways?
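
To make the trade-off concrete, here is a minimal sketch of two division-free ways to wrap a position. It is not the code from this PR: demo_wrap_pow2 and demo_advance are hypothetical helpers, and the second variant is only one possible reading of "wrapping up" without a power-of-two size, namely keeping offsets inside [0, size) and wrapping with a compare and a subtract.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Option 1: the data area size is a power of two, so the modulo of a
 * free-running position collapses into a single AND.
 */
static inline uint32_t
demo_wrap_pow2(uint64_t pos, uint32_t size)
{
	assert(size && (size & (size - 1)) == 0);
	return (uint32_t)(pos & (size - 1));
}

/*
 * Option 2: arbitrary size. The offset is kept inside [0, size), and
 * advancing it wraps with a compare and a subtract instead of a div.
 * This works only while a single step never exceeds the buffer size.
 */
static inline uint32_t
demo_advance(uint32_t off, uint32_t step, uint32_t size)
{
	uint64_t next = (uint64_t)off + step;

	assert(off < size && step <= size);
	if (next >= size)
		next -= size;
	return (uint32_t)next;
}
```

For example, demo_wrap_pow2(4100, 4096) yields 4, and demo_advance(4000, 200, 4064) yields 136. With the second option head and tail are no longer free-running counters, so distinguishing a full buffer from an empty one needs separate handling.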

krizhanovsky · 2024-10-27T17:01:26Z

fw/t/unit/test_mmap_buffer.c

+	EXPECT_EQ(r, expect_wr);
+	if (r)
+		return;
+	tfw_mmap_buffer_commit(holder, size);


The API looks good, but I didn't get this statement from #2259 (comment):

Actually my current version also require data length knowledge before the call.

It seems that now we get as much memory as the RB has and can write as much data as we need (and can), so there is no need to compute the size of all written data up front.

Yes, in this variant we don't need to compute the whole event size before writing.
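
The get-room/commit pattern discussed here can be sketched in plain userspace C as follows. This is a single-threaded model under assumptions, not the implementation from the PR: demo_buf, demo_get_room, demo_commit and demo_write are hypothetical names that only mirror the shape of tfw_mmap_buffer_get_room() and tfw_mmap_buffer_commit(), and the memory ordering between producer and consumer is deliberately left out.

```c
#include <assert.h>
#include <string.h>

#define DEMO_SIZE 16u	/* hypothetical data area size, power of two */

struct demo_buf {
	unsigned int head;	/* producer position, free-running */
	unsigned int tail;	/* consumer position, free-running */
	char data[DEMO_SIZE];
};

/*
 * Return all currently free space as up to two contiguous parts:
 * part1 runs from head towards the end of the data area, part2 starts
 * at the beginning of the data area if the free space wraps around.
 */
static void
demo_get_room(struct demo_buf *b, char **part1, unsigned int *size1,
	      char **part2, unsigned int *size2)
{
	unsigned int off = b->head & (DEMO_SIZE - 1);
	unsigned int room = DEMO_SIZE - (b->head - b->tail);
	unsigned int to_end = DEMO_SIZE - off;

	*part1 = b->data + off;
	*size1 = room < to_end ? room : to_end;
	*part2 = b->data;
	*size2 = room - *size1;
}

/* Make the first @size bytes of the returned room visible to readers. */
static void
demo_commit(struct demo_buf *b, unsigned int size)
{
	assert(size <= DEMO_SIZE - (b->head - b->tail));
	b->head += size;
}

/* Usage: write as much of a record as fits, then commit that amount. */
static unsigned int
demo_write(struct demo_buf *b, const char *src, unsigned int len)
{
	char *p1, *p2;
	unsigned int s1, s2, n1, n2;

	demo_get_room(b, &p1, &s1, &p2, &s2);
	if (len > s1 + s2)
		len = s1 + s2;
	n1 = len < s1 ? len : s1;
	n2 = len - n1;
	memcpy(p1, src, n1);
	memcpy(p2, src + n1, n2);
	demo_commit(b, len);
	return len;
}
```

The caller sees the whole free space before writing and commits only the bytes actually produced, so the total record size does not have to be known in advance.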

krizhanovsky · 2024-10-27T17:04:13Z

fw/mmap_buffer.c

+	int i;
+
+	for (i = 0; i < holders_cnt; ++i) {
+		if (strcmp(holders[i]->dev_name, (char *)filp->f_path.dentry->d_iname) == 0) {


The line is too wide: it's OK to slightly exceed 80 characters per line, but not by this much.

krizhanovsky · 2024-10-27T17:07:12Z

fw/mmap_buffer.c

+		device_create(holder->dev_class, NULL,
+					  MKDEV(holder->dev_major, 0), NULL, filename);
+		strscpy(holder->dev_name, filename, sizeof(holder->dev_name));
+		holders[holders_cnt++] = holder;


holders_cnt can exceed MAX_HOLDERS

krizhanovsky · 2024-10-27T17:09:16Z

fw/mmap_buffer.h

+	int dev_major;
+	struct class *dev_class;
+	struct page *pg[];
+} TfwMmapBufferHolder;


Coding style: could you please align the data structure members and adjust the comments as is done, for example, in https://github.com/tempesta-tech/tempesta/blob/master/fw/cache.c#L129 ?

krizhanovsky · 2024-10-27T17:18:06Z

fw/mmap_buffer.c

+
+	holder = kmalloc(sizeof(TfwMmapBufferHolder) +
+					 sizeof(struct page *) * num_online_cpus(),
+					 GFP_KERNEL);


It seems you need to zero the memory (kzalloc() or __GFP_ZERO), since you check holder->pg[cpu] for NULL in tfw_mmap_buffer_free()

Yes, I saw it and replaced by kzalloc.

krizhanovsky · 2024-10-27T17:24:04Z

fw/mmap_buffer.c

+		return -EAGAIN;
+
+	vma->vm_ops = &dev_vm_ops;
+	(void)dev_vm_ops;


I didn't get the point of the (void)dev_vm_ops statement...

I didn't notice it; it was left over from debugging.

krizhanovsky · 2024-10-27T17:24:29Z

fw/mmap_buffer.c

+#undef NTH_ONLINE_CPU
+}
+
+static void dev_file_vm_close(struct vm_area_struct *vma)


Suggested change

-static void dev_file_vm_close(struct vm_area_struct *vma)
+static void
+dev_file_vm_close(struct vm_area_struct *vma)

krizhanovsky · 2024-11-03T21:21:25Z

Should we just close it in favor of #2273 ?

EvgeniiMekhanik · 2024-11-04T10:16:49Z

fw/mmap_buffer.c

+		return NULL;
+
+	holder = kmalloc(sizeof(TfwMmapBufferHolder) +
+					 sizeof(struct page *) * num_online_cpus(),


Broken alignment

I used 4-space tabs. Fixed.

EvgeniiMekhanik · 2024-11-04T10:16:59Z

fw/mmap_buffer.c

+	int cpu;
+
+	if (size < TFW_MMAP_BUFFER_MIN_SIZE
+		|| size > TFW_MMAP_BUFFER_MAX_SIZE


Broken alignment

I used 4-space tabs. Fixed.

EvgeniiMekhanik · 2024-11-04T10:17:10Z

fw/mmap_buffer.c

+
+	holder->dev_major = -1;
+	holder->buf = __alloc_percpu_gfp(sizeof(TfwMmapBuffer *),
+									 sizeof(u64), GFP_KERNEL);


Broken alignment

EvgeniiMekhanik · 2024-11-04T10:17:39Z

fw/mmap_buffer.c

+	order = get_order(size);
+
+	holder->dev_major = -1;
+	holder->buf = __alloc_percpu_gfp(sizeof(TfwMmapBuffer *),


It seems that __alloc_percpu_gfp can return NULL, so we should handle this case

EvgeniiMekhanik · 2024-11-04T10:22:48Z

fw/mmap_buffer.c

+			goto err;
+		}
+
+		holder->dev_class = class_create(THIS_MODULE, filename);


class_create can fail, so we should check dev_class

EvgeniiMekhanik · 2024-11-04T10:26:57Z

fw/mmap_buffer.c

+		}
+
+		holder->dev_class = class_create(THIS_MODULE, filename);
+		device_create(holder->dev_class, NULL,


device_create can also fail, so we should handle this case

EvgeniiMekhanik · 2024-11-04T10:41:16Z

fw/t/unit/test_mmap_buffer.c

+
+#undef MAX_SIZE
+}
+


We should add a simple test to check device creation.
I added

TEST(tfw_mmap_buffer, create_dev)
{
	holder = tfw_mmap_buffer_create("test", TFW_MMAP_BUFFER_MIN_SIZE);
	EXPECT_NOT_NULL(holder);
	EXPECT_NULL(tfw_mmap_buffer_create("test", TFW_MMAP_BUFFER_MIN_SIZE));
	tfw_mmap_buffer_free(holder);
}

and the kernel crashes because we don't handle failed device creation.

EvgeniiMekhanik · 2024-11-04T10:43:43Z

fw/mmap_buffer.c

+	}
+
+	if (holder->dev_major > 0) {
+		device_destroy(holder->dev_class, MKDEV(holder->dev_major, 0));


It seems that deinitialization should be more complicated, because the device and the class can each be NULL if class or device creation fails
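
One common way to address this group of comments is a single teardown routine that tolerates a partially built object, combined with goto-based unwinding in the constructor. The sketch below is a userspace model under assumptions: demo_holder, demo_alloc and the fail_at fault-injection knob are hypothetical, and malloc()/free() merely stand in for the per-cpu allocation, class creation and device creation. In the kernel, class_create() and device_create() report failure through ERR_PTR values, so the real checks would use IS_ERR() rather than a NULL test.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_holder {
	void *buf;	/* stands in for the per-cpu buffer pointers */
	void *cls;	/* stands in for the device class */
	void *dev;	/* stands in for the device */
};

/* Release only what was created; safe on a partially built holder. */
static void
demo_holder_free(struct demo_holder *h)
{
	if (!h)
		return;
	free(h->dev);	/* free(NULL) is a no-op */
	free(h->cls);
	free(h->buf);
	free(h);
}

/* Hypothetical fault injection: fail the step numbered @fail_at. */
static void *
demo_alloc(int step, int fail_at)
{
	return step == fail_at ? NULL : malloc(1);
}

/* fail_at: 0 = no failure, 1 = buf, 2 = cls, 3 = dev. */
static struct demo_holder *
demo_holder_create(int fail_at)
{
	/* calloc() zeroes the holder, so unset members are NULL. */
	struct demo_holder *h = calloc(1, sizeof(*h));

	if (!h)
		return NULL;

	h->buf = demo_alloc(1, fail_at);
	if (!h->buf)
		goto err;
	h->cls = demo_alloc(2, fail_at);
	if (!h->cls)
		goto err;
	h->dev = demo_alloc(3, fail_at);
	if (!h->dev)
		goto err;

	return h;
err:
	/* Every creation step is checked; one path undoes all of them. */
	demo_holder_free(h);
	return NULL;
}
```

Because the holder is zeroed on allocation, the teardown can be called from any failure point without tracking how far creation got.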

EvgeniiMekhanik · 2024-11-04T10:52:32Z

fw/mmap_buffer.c

+	atomic_set(&holder->is_freeing, 1);
+
+	for_each_online_cpu(cpu) {
+		TfwMmapBuffer *buf = *per_cpu_ptr(holder->buf, cpu);


buf can be NULL here if holder->pg[cpu] = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, order); fails during buffer creation, and then the kernel crashes

EvgeniiMekhanik · 2024-11-04T10:54:12Z

fw/mmap_buffer.c

+	order = get_order(size);
+
+	holder->dev_major = -1;
+	holder->buf = (TfwMmapBuffer **)alloc_percpu_gfp(sizeof(TfwMmapBuffer *),


alloc_percpu_gfp can fail, so we should handle this case

Agree, will fix.

EvgeniiMekhanik

Need fixes

ai-tmpst · 2024-11-04T16:46:08Z

Looked through all the comments. All issues are fixed.
This PR can be closed. Let's continue the work in PR 2273.

ai-tmpst requested a review from krizhanovsky October 10, 2024 10:29

ai-tmpst linked an issue Oct 10, 2024 that may be closed by this pull request

Fast access logging for analytics #537

Open

ai-tmpst removed a link to an issue Oct 10, 2024

Fast access logging for analytics #537

Open

krizhanovsky requested a review from EvgeniiMekhanik October 10, 2024 15:27

const-t reviewed Oct 11, 2024

EvgeniiMekhanik reviewed Oct 11, 2024

krizhanovsky approved these changes Oct 14, 2024

krizhanovsky reviewed Oct 14, 2024

ai-tmpst added 2 commits October 21, 2024 13:15

Add a ring buffer mapped to userspace

582b89c

In #537 we need a way to deliver log data to userspace. Introduce a set of per-cpu ring buffer mapped to userspace. Signed-off-by: Alexander Ivanov <[email protected]>

Add tests for the ring buffer mapped to userspace

d370695

Signed-off-by: Alexander Ivanov <[email protected]>

ai-tmpst force-pushed the ai-537 branch from c4cbdfb to d370695 Compare October 21, 2024 11:16

ai-tmpst added 3 commits October 21, 2024 13:24

Remove trailing tabs in access_log.c

562010b

Signed-off-by: Alexander Ivanov <[email protected]>

Use alloc_percpu_gfp() in mmap_buffer

1a74ab4

__alloc_percpu_gfp() is not required, we can use alloc_percpu_gfp(). Signed-off-by: Alexander Ivanov <[email protected]>

Fix memory leak of holder->buf in mmap_buffer

0bbc087

Signed-off-by: Alexander Ivanov <[email protected]>

krizhanovsky approved these changes Oct 27, 2024

EvgeniiMekhanik reviewed Nov 4, 2024

EvgeniiMekhanik requested changes Nov 4, 2024
ai-tmpst closed this Nov 4, 2024
