
Don't assume cache line sizes are fixed during program execution #8

Open
brucehoult opened this issue Sep 1, 2020 · 5 comments

@brucehoult

... not even from one instruction to the next one.

Any scheme where software reads the cache line size from a CSR or queries the OS for it and then remembers it is prone to failure as soon as you have multi-core heterogeneous systems with different cache block sizes and process migration between them.

It doesn't make any difference whether the cache line size is queried once at the start of the program, before every loop, or even immediately before/after the CMO instruction. The process can and eventually will get migrated at exactly the wrong time.
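For concreteness, here is the racy pattern in C, with a hypothetical CSR read returning the current hart's line size and a hypothetical per-line CMO (neither is a real RISC-V name):

#include <stddef.h>

extern size_t read_line_size_csr(void); /* hypothetical: this hart's line size */
extern void cmo_clean_line(void *addr); /* hypothetical: CMO on one line */

void clean_range_racy(char *addr, char *end)
{
    size_t stride = read_line_size_csr();
    /* If the process migrates (here, or on any iteration) to a hart
     * whose lines are smaller than `stride`, some lines are skipped. */
    for (; addr < end; addr += stride)
        cmo_clean_line(addr);
}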

This issue caused actual, extremely hard to reproduce and debug, system crashes on one model of mobile phone at a previous job. The phone contained standard ARM "LITTLE" cores and company-designed "big" cores, and the cores had different cache line sizes. When the problem was diagnosed, ARM was asked how they dealt with SoCs whose cores have different line sizes. Their answer: "We don't do that!"

I think mixing cores with different line sizes is an entirely reasonable thing to do, and it should be allowed for in the design of CMOs intended to be used in cores from many organisations over a long period of time.

My suggestion is that the actual CMO instruction should return the number of bytes it operated on for that particular execution -- and hence the amount the pointer should be advanced by.

If the address given for the CMO is in the middle of a cache line then the return value should be the number of bytes in the rest of the cache line, to allow software to align the pointer to the cache line start for the next iteration.
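Under this suggestion the loop becomes self-correcting. A minimal sketch in C, with cmo_clean() standing in for the proposed instruction (hypothetical name):

#include <stddef.h>

extern size_t cmo_clean(void *addr); /* hypothetical: executes one CMO and
                                        returns the bytes actually covered */

void clean_range(char *addr, char *end)
{
    /* Each iteration advances by whatever the executing core reports,
     * so migration between cores with different line sizes is harmless. */
    while (addr < end)
        addr += cmo_clean(addr);
}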

In the case of a destructive operation such as DCBZ, the hardware could choose whether to zero the partial block and report that as normal, or the return value could somehow indicate that nothing was done. Software could then either ignore it (if it doesn't really care whether the contents are zero or not, and the line is either already in the cache or will be fetched as usual when next referenced) or manually zero those bytes. The most natural way to indicate this might be to return the negation of the number of bytes that should have been operated on but weren't. Or perhaps set the high bit, which would allow an unconditional & 0x7FF or similar for software that doesn't care (this would fail if cache lines can ever be 2K or more).

NB this can be needed on any iteration, not only the first, if the process is migrated from a core with a small cache line size to a core with a large cache line size.
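A sketch of the destructive case under the negation convention suggested above (cmo_zero() is hypothetical, and the range is assumed to end on a line boundary on every core, as DCBZ-style ops require):

#include <stddef.h>
#include <string.h>

extern long cmo_zero(void *addr); /* hypothetical: bytes zeroed, or minus
                                     the bytes it declined to zero */

void zero_range(char *addr, char *end)
{
    while (addr < end) {
        long n = cmo_zero(addr);
        if (n < 0) {
            /* Partial block was skipped: zero it with plain stores and
             * continue from the next line boundary. */
            memset(addr, 0, (size_t)-n);
            addr += -n;
        } else {
            addr += n;
        }
    }
}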

@dkruckemyer-ventana
Collaborator

This is a good point. Because of migration, you do want an atomic way of determining how much data was operated on. Reading a CSR and executing a CMO is not atomic, even if you read the CSR before, after, or both.

I would generally advocate for returning the number of bytes operated on, independent of the start address, however. This allows a simple implementation to return a constant rather than put an incrementer/adder in the result path. For most designs (?), the result would represent an aligned block of memory that contains the address (but I suppose some SW might want a different interpretation?). SW would be responsible for fixing up the next address (most likely just adding the return value), as in the sketch below.
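The fix-up is cheap in software. A sketch, assuming the return value r is the power-of-two size of the aligned block containing the address:

#include <stdint.h>

/* Align down to the block containing addr, then step past it; when addr
 * is already block-aligned this reduces to addr + r. */
uintptr_t next_address(uintptr_t addr, uintptr_t r)
{
    return (addr & ~(r - 1)) + r;
}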

This works for certain use cases, but wouldn't work if you wanted to operate on a precise number of bytes (e.g. memzero).

@strikerdw

I have a lengthy proposal, covering both instruction shape and a way to deal with time-variant cache line sizes, that will be dropped in the group a bit later this week (a quick pass is being done now to remove spelling/grammar mistakes). The proposal provides for time-variance where it's safe and well-defined to do so.

Look for this proposal sometime this week.

@ingallsj

Additionally, the cache line size / number of bytes affected may differ at different levels of the cache hierarchy, or between different caches at the same level (e.g. Instruction versus Data caches).

I wonder whether a CMO returning the number of bytes affected is too CISC-y.

Could software instead do the Right Thing (TM), i.e. functionally correct but simpler and slower, if the cache line size discovery mechanism returned a mask of all the cache line sizes in use in the coherence domain?
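The loop that falls out of that idea might look like the following sketch (the mask CSR and per-line CMO are hypothetical; bit n set would mean a line size of 2^n bytes is in use somewhere in the domain):

#include <stdint.h>

extern uintptr_t cmo_line_size_mask(void); /* hypothetical discovery read */
extern void cmo_clean_line(void *addr);    /* hypothetical per-line CMO */

void clean_range_portable(char *addr, char *end)
{
    uintptr_t mask = cmo_line_size_mask();
    uintptr_t smallest = mask & -mask; /* lowest set bit = smallest size */
    /* Striding by the smallest size in the domain is correct everywhere;
     * cores with larger lines just see some redundant CMOs. */
    for (; addr < end; addr += smallest)
        cmo_clean_line(addr);
}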

@gfavor

gfavor commented Sep 15, 2020 via email

@ingallsj

ingallsj commented Sep 15, 2020

Thanks for the thorough reply, Greg! Yours are always worth the read.

"the hardware of any hart with a larger or smaller cache size must already understand at a hardware level the coherence granule size for the coherence domain it is participating in, and should perform CMOs effectively to that coherence granule size"

I worry that this places an additional complexity cost or configuration restriction on composing hardware blocks into a system, and that this is avoidable by exposing the range of cache line / coherence granule sizes in a system to software.

  • Sure, if the same IP block hard macro can be instantiated with same-or-different next-level cache line size, then add inputs to it.
  • Sure, if a chip can be hot-plugged to another chip with same-or-different cross-chip cache line size, then either assume it's always there, or discover and program it on the fly.

So yes, it is solvable in hardware, but it is hardware's (and hardware verification's) burden, and I'm brainstorming ways to shift that burden to software (RISC vs CISC again).

I'll loosely use ARMv8 terms for a contrived example, with the instruction DCZVA writing to a NAPOT memory block of size reported by DCZID.
Greg, you suggest:

  1. In a system with DCZID reporting 64-byte memory block size but running on hardware with underlying 32-byte cache line size, hardware executing a DCZVA instruction would need to zero two cache lines. Do-able, but making a single instruction span multiple cache lines isn't free.
  2. In a system with DCZID reporting 32-byte memory block size but running on hardware with underlying 64-byte cache line size, hardware executing a DCZVA instruction would need to zero a sub-cache-line sector. Do-able, but tracking cache sectors isn't free.

I suggest that we instead augment DCZID to report both the 32- and 64-byte memory block / cache line sizes in the system, and define the DCZVA instruction to operate NAPOT on at least 32 bytes and at most 64 bytes. All of the following instruction sequences are agnostic to whether they are operating on 32-byte or 64-byte cache lines (a C sketch generalizing the three cases follows the list).

  1. Software wishing to zero 32 bytes would see that range as smaller than 64 bytes (the maximum cache line size), and execute store instructions:
STR XZR, [#0x00]
STR XZR, [#0x08]
STR XZR, [#0x10]
STR XZR, [#0x18]
  2. Software wishing to zero 64 bytes would see that range as evenly divisible by 64 bytes (the maximum cache line size), and execute DCZVA instructions at 32-byte granularity (the smallest cache line size). This is relying on hardware with underlying 64-byte cache line size gathering the back-to-back DCZVA instructions, which isn't free, but is easier than spanning cache lines or tracking sectors.
DCZVA [#0x00]
DCZVA [#0x20]  ; redundant on 64-byte cache lines: should combine with the previous instruction
  3. Software wishing to zero 96 bytes would execute DCZVA instructions at 32-byte granularity (the smallest cache line size) for the first 64 bytes (the maximum cache line size), then execute stores for the remainder.
DCZVA [#0x00]
DCZVA [#0x20]
STR XZR, [#0x40]
STR XZR, [#0x48]
STR XZR, [#0x50]
STR XZR, [#0x58]
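Putting the three cases together, a C sketch of the zeroing routine this scheme implies; dczid_min(), dczid_max(), and dczva() are hypothetical stand-ins for the augmented DCZID report and the DCZVA instruction:

#include <stddef.h>
#include <stdint.h>

extern size_t dczid_min(void); /* smallest block size reported, e.g. 32 */
extern size_t dczid_max(void); /* largest block size reported, e.g. 64 */
extern void dczva(void *p);    /* zeroes the NAPOT block containing p */

void zero_bytes(char *p, size_t len)
{
    size_t max = dczid_max(), min = dczid_min();
    char *end = p + len;
    /* Bounds of the region made of whole max-size blocks. */
    char *lo = (char *)(((uintptr_t)p + max - 1) & ~(uintptr_t)(max - 1));
    char *hi = (char *)((uintptr_t)end & ~(uintptr_t)(max - 1));

    for (; p < lo && p < end; p++) *p = 0; /* head: plain stores */
    for (; p < hi; p += min) dczva(p);     /* middle: DCZVA at min stride */
    for (; p < end; p++) *p = 0;           /* tail: plain stores */
}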

I wonder if this portable software (tolerant of multiple cache line sizes) would have worked as a software work-around for the specific case that @brucehoult encountered!

Aside: you mentioned I-Cache software coherency, but I think Derek and the J-Extension are leading that, so this riscv-CMOs group is focusing on CMOs to handle data sharing, per https://github.com/riscv/riscv-CMOs/wiki/CMOs-WG-Draft-Proposed-Charter.
