Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ring buffers mapped to userspace #2259

Closed
wants to merge 5 commits into from
Closed

Add ring buffers mapped to userspace #2259

wants to merge 5 commits into from

Conversation

ai-tmpst
Copy link
Contributor

No description provided.

@ai-tmpst ai-tmpst linked an issue Oct 10, 2024 that may be closed by this pull request
fw/ringbuffer.c Outdated
}

smp_store_release(&rb->head, head);
smp_mb();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you use mem barrier with per-cpu data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, it always executes in softirq context, and preemption disabling is excessive. Will get rid of it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the point is that atomics imply full barriers on x86-64

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I answered to the comment about preempt_disable.

So we can do without smp_mb, don't we?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we can do without smp_mb, don't we?

Yes, we don't need smp_mb

fw/ringbuffer.c Outdated
if (atomic_read(&rba->unmapped))
return -EAGAIN;

preempt_disable();
Copy link
Contributor

@EvgeniiMekhanik EvgeniiMekhanik Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need preempt_disable as we discussed in private discussion since this function is always executed in softirq

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

fw/ringbuffer.c Outdated
static bool proc_file_is_open;

int
tfw_ringbuffer_write(TfwStr **strs, unsigned int count)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure but it seems that we can use just

int
tfw_ringbuffer_write(TfwStr *str)

because TfwStr can contains a lot of TfwStr as chunks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought a compound TfwStr can't contain compound chunks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We agreed on the call that this API introduces unwanted overhead on forming an array, so probably we should just provide a raw memory and a commit call, which moves the tail

fw/ringbuffer.c Outdated
head = rb->head;
tail = smp_load_acquire(&rb->tail);

for (i = 0; i < count; ++i)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we pass single TfwStr with a lot of chunks, we can skip this loop.

fw/ringbuffer.c Outdated
}

void
tfw_ringbuffer_test_set_unmapped(int unmapped)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this function is used only for test purpose, maybe move it to unit tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this function we change a field of an internal object.
OK, maybe I can move this field to TfwRingbufer structure.

Copy link
Contributor

@krizhanovsky krizhanovsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after cleanups mentioned by @const-t and @EvgeniiMekhanik . But please also have a look onto other RB implementations - maybe we can get ready to use implementation. For now the implementation lacks the file operations, but later we might face a bigger problem to let user-space threads sleep on no events and make it low overhead

fw/ringbuffer.h Outdated
* to be copied between the kernel and user space. Instead, user-space threads
* can directly access the data in the kernel’s memory, greatly improving
* performance by avoiding the overhead of traditional system calls and memory
* copying.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite a good explanation what's going on in the subsystem.

Having that the kernel already has plenty of ringbuffers, could you please describe why didn't we use any of them, why didn't we port the prototype of a generic ring buffer? RBs to consider:

  • relay (also see kernel and user space examples) - provides kernel-write-only and user-read-only per-cpu buffers with fixed-size subbufers. In our case, either with this task or Kernel-User Space Transport #77, we have to deal with records of varying length, which may hard to determine subbufers length and use them efficiently. Also both relay_reserve() and relay_write() require data length knowledge before the call, which will require us to traverse data twice. The implementation provides a sleeping mechanism for the user-space, so the ringbuffer files can be poll()'ed. At some point we'll need a sleeping for user space functionality, but for now the sleepable kernel functions can not be used in softirq.
  • the new generic ringbuffer isn't merged yet and also usese sleepable functions, which can not be used in softirq. Also seems lack of per-cpu mode and this functionality must be done though different files manually.
  • io_uring is very powerful, but requires CQ and SQ, which seems just an additional overhead for us
  • packet_ring_buffer (net/packet/internal.h, used in packet mmap) is suitable only for pages (network frames) transmission, not a generic many-per-page transmission (I did only a quick check though)
  • I didn't check the ring buffer, used in trafecfs, but it seems also for page pointers to store/read.
  • Also I had only a quick look onto the perf ring buffer - seems has perf-specific logic embedded into the buffer.

The point is that we need a very good motivation to write our own ring buffer, especially with that we're going to extend it for #77. This research of other implementations is also crucial to understand how can we introduce sleeping functionality for the user-space threads.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually my current version also require data length knowledge before the call.
Maybe we get rid of it if we allow incomplete data blocks. As you mentioned early, we could send fields as TLV and mark the end of the data block with a special "end" TLV. So we could get the size of the room between tail and head before writing, and on every field addition we have to make sure if there is enough place for this field and ending TLV.

Will add the motivation to the comments in the code.

fw/ringbuffer.c Outdated Show resolved Hide resolved
fw/ringbuffer.c Outdated
}

smp_store_release(&rb->head, head);
smp_mb();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the point is that atomics imply full barriers on x86-64

fw/ringbuffer.h Outdated Show resolved Hide resolved
fw/ringbuffer.h Outdated Show resolved Hide resolved
fw/ringbuffer.c Outdated Show resolved Hide resolved
Copy link
Contributor

@krizhanovsky krizhanovsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some logic absent - please address it in a next PR

fw/ringbuffer.c Outdated Show resolved Hide resolved
fw/ringbuffer.c Outdated Show resolved Hide resolved
In #537 we need a way to deliver log data to userspace. Introduce a set of
per-cpu ring buffer mapped to userspace.

Signed-off-by: Alexander Ivanov <[email protected]>
__alloc_percpu_gfp() is not required, we can use alloc_percpu_gfp().

Signed-off-by: Alexander Ivanov <[email protected]>
Copy link
Contributor

@krizhanovsky krizhanovsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented some issues, but there are also pending reviews from @const-t and @EvgeniiMekhanik , so I just approve since there is no any design issues, and leave the rest for your fixes and reviews.

void
tfw_mmap_buffer_get_room(TfwMmapBufferHolder *holder,
char **part1, unsigned int *size1,
char **part2, unsigned int *size2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Identation problem, please :set tabstop=8 in your vim config file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE


*size2 = 0;

if (!atomic_read(&buf->is_ready)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!atomic_read(&buf->is_ready)) {
if (unlikely(!atomic_read(&buf->is_ready))) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

head = buf->head % buf->size;
tail = smp_load_acquire(&buf->tail) % buf->size;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buf->size isn't a power of 2, so compiler can't optimize the code to and and div is the slowest CPU instruction, which we call here twice. We can loose one page to get aligned RB pointers or you can add 32 on wrapping up.

This division is on data plane, so we're not OK with it as with division in dev_file_mmap().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is - we allocate the buffer by alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, order), and there is two bad way: we use division or we lose almost a half of the buffer size (minus header size).
Maybe we can use another function for allocation which doesn't restrict memory size by a power of 2. But I haven't found appropriate one.
I don't know... Maybe we can use __alloc_percpu and get struct page pointers walking through every allocated page. I considered this way but it looked a little bit ugly for me. Is there another ways?

EXPECT_EQ(r, expect_wr);
if (r)
return;
tfw_mmap_buffer_commit(holder, size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API looks good and I didn't get from #2259 (comment)

Actually my current version also require data length knowledge before the call.

It seems now we get as much memory as RB has and can write as much data as we need (and can), so there is no need to compute the size of all written data upfront.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in this variant we don't need to compute whole event size before writing.

int i;

for (i = 0; i < holders_cnt; ++i) {
if (strcmp(holders[i]->dev_name, (char *)filp->f_path.dentry->d_iname) == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too wide line: we that's OK to slightly exceed 80 characters per line, but not so much

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

device_create(holder->dev_class, NULL,
MKDEV(holder->dev_major, 0), NULL, filename);
strscpy(holder->dev_name, filename, sizeof(holder->dev_name));
holders[holders_cnt++] = holder;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

holders_cnt can exceed MAX_HOLDERS

int dev_major;
struct class *dev_class;
struct page *pg[];
} TfwMmapBufferHolder;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coding style: could you please align the data structure members and adjust comments like for example in https://github.com/tempesta-tech/tempesta/blob/master/fw/cache.c#L129 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


holder = kmalloc(sizeof(TfwMmapBufferHolder) +
sizeof(struct page *) * num_online_cpus(),
GFP_KERNEL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems you need to zeroize the memory (or GFP_ZERO) since you check holder->pg[cpu] for NULL in tfw_mmap_buffer_free()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I saw it and replaced by kzalloc.

return -EAGAIN;

vma->vm_ops = &dev_vm_ops;
(void)dev_vm_ops;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I din't get the sense of (void)dev_vm_ops statement...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't notice, left after debugging.

#undef NTH_ONLINE_CPU
}

static void dev_file_vm_close(struct vm_area_struct *vma)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static void dev_file_vm_close(struct vm_area_struct *vma)
static void
dev_file_vm_close(struct vm_area_struct *vma)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@krizhanovsky
Copy link
Contributor

Should we just close it in favor of #2273 ?

return NULL;

holder = kmalloc(sizeof(TfwMmapBufferHolder) +
sizeof(struct page *) * num_online_cpus(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Broken aligment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used 4-spaces tabs. Fixed.

int cpu;

if (size < TFW_MMAP_BUFFER_MIN_SIZE
|| size > TFW_MMAP_BUFFER_MAX_SIZE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Broken aligment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used 4-spaces tabs. Fixed.

fw/mmap_buffer.c Outdated

holder->dev_major = -1;
holder->buf = __alloc_percpu_gfp(sizeof(TfwMmapBuffer *),
sizeof(u64), GFP_KERNEL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

broken aligment

fw/mmap_buffer.c Outdated
order = get_order(size);

holder->dev_major = -1;
holder->buf = __alloc_percpu_gfp(sizeof(TfwMmapBuffer *),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that __alloc_percpu_gfp can return NULL, so we should handle this case

goto err;
}

holder->dev_class = class_create(THIS_MODULE, filename);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

class_create can fails, so we should check dev_class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

}

holder->dev_class = class_create(THIS_MODULE, filename);
device_create(holder->dev_class, NULL,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

device_create can also fails.so we should handle this case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


#undef MAX_SIZE
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add simple test to check device createion.
I add

TEST(tfw_mmap_buffer, create_dev)
{
	holder = tfw_mmap_buffer_create("test", TFW_MMAP_BUFFER_MIN_SIZE);
	EXPECT_NOT_NULL(holder);
	EXPECT_NULL(tfw_mmap_buffer_create("test", TFW_MMAP_BUFFER_MIN_SIZE));
	tfw_mmap_buffer_free(holder);
}

And kernel crashes because we don't handle invalid device creation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

if (holder->dev_major > 0) {
device_destroy(holder->dev_class, MKDEV(holder->dev_major, 0));
Copy link
Contributor

@EvgeniiMekhanik EvgeniiMekhanik Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that deinitialization should be more complicated, because device can be NULL and class can be NULL, if class or device creation fails

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

atomic_set(&holder->is_freeing, 1);

for_each_online_cpu(cpu) {
TfwMmapBuffer *buf = *per_cpu_ptr(holder->buf, cpu);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buf can be NULL here if holder->pg[cpu] = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, order); fails during buffer creation and kenrel crashes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

order = get_order(size);

holder->dev_major = -1;
holder->buf = (TfwMmapBuffer **)alloc_percpu_gfp(sizeof(TfwMmapBuffer *),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alloc_percpu_gfp can fails we should handle this case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, will fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Contributor

@EvgeniiMekhanik EvgeniiMekhanik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need fixes

@ai-tmpst
Copy link
Contributor Author

ai-tmpst commented Nov 4, 2024

Looks through all the comments. All issues are fixed.
This PR can be closed. Let's continue work in PR 2273.

@ai-tmpst ai-tmpst closed this Nov 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants