
Thoughts on improving section cache effect and false sharing #9

Open
idoleat opened this issue May 18, 2024 · 1 comment

Comments

idoleat (Collaborator) commented May 18, 2024

Benchmark atomic instruction latency and data throughput

We provide a benchmark program to

  1. Show how cache coherence costs performance.
  2. Show that atomics guarantee forward progress but are not necessarily faster than locks.
  3. Pave the way for trying out other mechanisms to lower cache-line contention.

Considering how each microarchitecture differs in atomic instruction implementation, core topology, and cache coherence protocol, the benchmark should run on a wide variety of hardware platforms. Maybe we can invite volunteers to run the benchmark.

I am looking into Evaluating the Cost of Atomic Operations on Modern Architectures and its citations, examining how we can conduct the benchmark.
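
As a starting point, here is a minimal sketch of what such a benchmark could look like. It only contrasts a relaxed atomic increment with a mutex-guarded increment under contention; `N_THREADS` and `N_ITERS` are arbitrary placeholders, and a real benchmark would also need per-operation latency measurements. Compile with `-pthread`.

```c
/* Contention sketch: atomic fetch_add vs. mutex-guarded increment.
 * Thread and iteration counts are placeholders, not tuned values. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define N_THREADS 4
#define N_ITERS 1000000L

static atomic_long atomic_counter;
static long plain_counter;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *atomic_worker(void *arg)
{
    (void) arg;
    for (long i = 0; i < N_ITERS; i++)
        atomic_fetch_add_explicit(&atomic_counter, 1, memory_order_relaxed);
    return NULL;
}

static void *mutex_worker(void *arg)
{
    (void) arg;
    for (long i = 0; i < N_ITERS; i++) {
        pthread_mutex_lock(&lock);
        plain_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Run N_THREADS copies of a worker and return the wall-clock time. */
static double run(void *(*worker)(void *))
{
    pthread_t tid[N_THREADS];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("atomic: %.3f s\n", run(atomic_worker));
    printf("mutex:  %.3f s\n", run(mutex_worker));
    return 0;
}
```

The numbers will of course vary with core count and topology, which is exactly why running it on many platforms matters.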

False sharing example

We provide an example showing how false sharing affects performance. The benchmark provided by Zeosleus or examples in previous sections could be used.
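
A minimal sketch of such an example is below: two threads each increment their own counter, first with both counters packed into the same cache line, then with each counter padded onto its own line. The 64-byte alignment is an assumption about the cache-line size; a real example should query it per platform.

```c
/* False-sharing sketch: same workload, two memory layouts. */
#include <pthread.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define N_ITERS 50000000L

struct shared_pair {
    atomic_long a, b;          /* adjacent: likely the same cache line */
};

struct padded_pair {
    alignas(64) atomic_long a; /* each counter on its own (assumed 64 B) line */
    alignas(64) atomic_long b;
};

static void *bump(void *arg)
{
    atomic_long *c = arg;
    for (long i = 0; i < N_ITERS; i++)
        atomic_fetch_add_explicit(c, 1, memory_order_relaxed);
    return NULL;
}

/* Time two threads hammering the two given counters in parallel. */
static double run(atomic_long *x, atomic_long *y)
{
    pthread_t t1, t2;
    struct timespec s, e;
    clock_gettime(CLOCK_MONOTONIC, &s);
    pthread_create(&t1, NULL, bump, x);
    pthread_create(&t2, NULL, bump, y);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &e);
    return (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) / 1e9;
}

int main(void)
{
    static struct shared_pair s;
    static struct padded_pair p;
    printf("same cache line: %.3f s\n", run(&s.a, &s.b));
    printf("padded:          %.3f s\n", run(&p.a, &p.b));
    return 0;
}
```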

Before working on this proposal, I will first add an example for discussing the ABA problem in section 6, as well as HTML export.

idoleat (Collaborator) commented Jun 20, 2024

> Considering how each microarchitecture differs in atomic instruction implementation

I think this should be a separate subsection, since it could be a long story about how compilers and different architectures ensure that an operation is logically atomic.

An overview of the subsection could look like this:

  1. Explain cache coherence protocols.
  2. Explain that on x86, the LOCK signal is used with the cache coherence protocol to lock the cache line and ensure atomicity. RISC-V has LR/SC and AMOs (which could be implemented using LR/SC or in memory controllers) for the same purpose, plus FENCE and the aq/rl bits for additional ordering. Arm has similar mechanisms. (I need to acquire a greater depth of knowledge on this.)
  3. Explain how compilers map the C/C++11 memory model onto the facilities provided by the processor, such as selecting atomic instructions, adding the LOCK prefix, or generating an LR/SC loop or a CAS loop (a sketch of inspecting this mapping follows below).

The benchmark results could thus reflect the performance and scalability of each implementation.
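
As a rough illustration of point 3, the snippet below can be compiled with `gcc -O2 -S` (or fed to Compiler Explorer) for different targets to compare the lowering. The comments describe typical code generation; the exact instructions depend on the compiler, ISA extensions, and optimization level.

```c
/* Inspecting how a compiler lowers C11 atomics onto the ISA.
 * Comments describe typical output, not a guaranteed mapping. */
#include <stdatomic.h>
#include <stdbool.h>

atomic_int counter;

void add(void)
{
    /* x86-64: usually a single "lock addl" (result unused).
     * RISC-V with the A extension: usually "amoadd.w". */
    atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}

bool cas(int expected, int desired)
{
    /* x86-64: "lock cmpxchg".
     * RISC-V: an LR/SC retry loop.
     * Arm pre-LSE: LDXR/STXR loop; with LSE: a single CAS instruction. */
    return atomic_compare_exchange_strong(&counter, &expected, desired);
}

void publish(_Atomic int *flag)
{
    /* Release ordering shows where fences/annotations appear:
     * x86-64 (TSO): a plain store suffices.
     * RISC-V: "fence rw,w" before the store (or an rl-annotated AMO).
     * AArch64: "stlr". */
    atomic_store_explicit(flag, 1, memory_order_release);
}
```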
