Don't assume cache line sizes are fixed during program execution #8
Comments
This is a good point. Because of migration, you do want an atomic way of determining how much data was operated on. Reading a CSR and executing a CMO is not atomic, even if you read the CSR before, after, or both. I would generally advocate for returning the number of bytes operated on, independent of the start address, however. This allows a simple implementation to return a constant rather than put an incrementor/adder in the result path. For most designs (?), the result would represent an aligned block of memory that contains the address (but I suppose some SW might want a different interpretation?). SW would be responsible for fixing up the next address (most likely just adding the return value). This works for certain use cases, but wouldn't work if you wanted to operate on a precise number of bytes (e.g. memzero).
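A minimal sketch of that shape, assuming a hypothetical cmo_clean_block() wrapper (not anything defined by this proposal) around a CMO that reports the number of bytes it affected as an aligned, power-of-two block containing the given address:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: expands to one CMO instruction that cleans the aligned
 * block containing 'addr' and returns the block size in bytes (possibly
 * just a constant on a simple implementation). */
extern size_t cmo_clean_block(const void *addr);

static void clean_range(const char *p, size_t len)
{
    const char *end = p + len;
    while (p < end) {
        size_t done = cmo_clean_block(p);          /* e.g. 64 on this hart */
        /* The affected block is the aligned block containing p, so round
         * p down to that block before stepping one block forward; a plain
         * p += done from an unaligned start could skip the final block. */
        uintptr_t blk = (uintptr_t)p & ~((uintptr_t)done - 1);
        p = (const char *)(blk + done);
    }
}
```

Because the return value describes what the executing hart actually did, the loop stays correct even if the thread migrates between iterations.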
I have a lengthy proposal both on instruction shape and on a way to deal with time-variant cache line sizes that will be getting dropped in the group a bit later this week (a quick pass is being done now to remove spelling/grammar mistakes). The proposal provides for time-variance where it's safe and well defined to do so. Look for this proposal sometime this week.
Additionally, the cache line size / number of bytes affected may also differ at different levels of the cache hierarchy, or different caches at the same level (e.g. Instruction versus Data caches). I wonder whether a CMO returning the number of bytes affected is too CISC-y. Could software instead do the Right Thing (TM), i.e. functionally correct but simpler and slower, if the cache line size discovery mechanism returned a mask of all the cache line sizes in use in the coherence domain?
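One way that could look in software, as a sketch only: assume a made-up read_cmo_line_mask() that returns a bit mask with one bit set per line size present in the coherence domain (e.g. bits 6 and 7 for 64- and 128-byte lines), plus a made-up single-block cmo_clean(). Portable code strides by the smallest advertised size, which is functionally correct everywhere and merely issues redundant CMOs on harts with larger lines:

```c
#include <stddef.h>
#include <stdint.h>

extern uintptr_t read_cmo_line_mask(void);    /* hypothetical discovery read */
extern void      cmo_clean(const void *addr); /* hypothetical one-block clean */

static void clean_range_portable(const char *p, size_t len)
{
    uintptr_t mask   = read_cmo_line_mask();  /* assumed non-zero */
    uintptr_t stride = mask & -mask;          /* lowest set bit = smallest size */
    const char *end  = p + len;

    /* Round down to the stride so no partial block at the start is missed. */
    for (p = (const char *)((uintptr_t)p & ~(stride - 1)); p < end; p += stride)
        cmo_clean(p);
}
```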
It seems like "differing cache line sizes in a system" overstates the issue. All the caching agents within a coherence domain need to understand one common coherence granule that coherence protocol actions are in terms of (at least for virtually all commercial coherence protocols). Within that domain there may be caches with larger or smaller line sizes. Caches with smaller line sizes still need to perform coherence protocol actions (requests, snoops, etc.) in terms of the coherence granule size (e.g. give up two cache lines when a request or snoop for ownership is received). Caches with larger line sizes either have sectors equal in size to the coherence granule, or they again must privately deal with the mismatch between their local cache line size and the size of coherence protocol actions. Put differently, a hart and its cache can locally perform CMOs of that cache's line size, but all that has to be locally and privately reconciled with all resulting global coherence protocol actions being in terms of the coherence granule size.
Where the problem can creep in is when code loops through a series of CMOs with a stride length taken from an initial cache line size, and then that code migrates to a hart with a smaller cache line size. But if CMOs are instead defined in terms of the domain-wide coherence granule size, and software uses a stride length equal to that coherence granule size, then everything can work out alright. In particular, the hardware of any hart with a larger or smaller cache line size must already understand, at a hardware level, the coherence granule size for the coherence domain it is participating in, and should perform CMOs effectively to that coherence granule size (or larger).
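A sketch under that framing, with made-up names (coherence_granule_size(), cmo_flush()): the granule size is a property of the domain, not of the executing hart, so reading it once is safe even across migration, and each hart's hardware is responsible for covering at least one granule per CMO:

```c
#include <stddef.h>
#include <stdint.h>

extern size_t coherence_granule_size(void);  /* hypothetical: constant per domain */
extern void   cmo_flush(const void *addr);   /* hypothetical: affects >= one granule */

static void flush_range(const char *p, size_t len)
{
    size_t g = coherence_granule_size();     /* safe to read once: migration within
                                                the domain cannot change it */
    const char *end = p + len;
    for (p = (const char *)((uintptr_t)p & ~(uintptr_t)(g - 1)); p < end; p += g)
        cmo_flush(p);
}
```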
I expect the counter-argument to all this is that people want to have non-coherent hart caches that depend on software to manage coherency, such as arises with ARM big.LITTLE systems that have non-coherent instruction caches and potentially differing cache line sizes. But is that what this group is trying to cater to (especially since RISC-V starts off with a bias or expectation that hart instruction caches are hardware coherent)? Or is it about providing CMOs to handle data sharing between coherently-caching harts and other non-caching agents (e.g. DMA masters wanting to do non-coherent I/O to/from memory)?
If the answer is the former, then one solution (albeit sub-optimal) could be for all software to assume the smallest cache line size in the system. (Or Derek's coming proposal probably has a better solution.) But is this type of system design, with its differing non-coherent hart cache line sizes, the tail that's wagging the dog?
OK, I'll stop there - having stirred the pot enough.
Greg
Thanks for the thorough reply, Greg! Yours are always worth the read.
I worry that this places an additional complexity cost or configuration restriction on composing hardware blocks into a system, and that this is avoidable by exposing the range of cache lines / coherence granules in a system to software.
So yes, it is solvable in hardware, but it is hardware's (and hardware verification's) burden, and I'm brainstorming ways to shift that burden to software (RISC vs CISC again). I'll loosely use ARMv8 terms for a contrived example, with the instruction
I suggest that we augment
I wonder if this portable software (tolerant of multiple cache line sizes) would have worked as a software work-around for the specific case that @brucehoult encountered! Aside: you mentioned I-Cache software coherency, but I think Derek and the J-Extension are leading that, so this
... not even from one instruction to the next one.
Any scheme where software reads the cache line size from a CSR or queries the OS for it and then remembers it is prone to failure as soon as you have multi-core heterogeneous systems with different cache block sizes and process migration between them.
It doesn't make any difference whether the cache line size is queried once at the start of the program, before every loop, or even immediately before/after the CMO instruction. The process can and eventually will get migrated at exactly the wrong time.
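For concreteness, a sketch of the failing pattern (with invented read_cache_line_size_csr() and cmo_inval() wrappers): the CSR read and the loop are not atomic, so a migration anywhere in between, or between iterations, can leave the stride too large for the hart now executing the CMOs:

```c
#include <stddef.h>

extern size_t read_cache_line_size_csr(void); /* hypothetical CSR read */
extern void   cmo_inval(const void *addr);    /* hypothetical per-line invalidate */

static void inval_range_racy(const char *p, size_t len)  /* broken by design */
{
    size_t line = read_cache_line_size_csr(); /* e.g. 64 on a big core */
    const char *end = p + len;

    /* <-- migration to a core with 32-byte lines can happen here, or
     *     between any two iterations below, after which the stride of 64
     *     skips every other 32-byte line in the range. */
    for (; p < end; p += line)
        cmo_inval(p);
}
```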
This issue caused actual system crashes that were extremely hard to reproduce and debug on one model of mobile phone in a previous job. The phone contained standard ARM "LITTLE" cores and company-designed "big" cores. The cores had different cache line sizes. When the problem was diagnosed, ARM was asked how they dealt with SoCs containing cores with different line sizes. Their answer: "We don't do that!"
I think it's an entirely reasonable thing to do and should be allowed for in the design of CMOs intended to be used in cores from many organisations over a long period of time.
My suggestion is that the actual CMO instruction should return the number of bytes it operated on for that particular execution -- and hence the amount the pointer should be advanced by.
If the address given for the CMO is in the middle of a cache line then the return value should be the number of bytes in the rest of the cache line, to allow software to align the pointer to the cache line start for the next iteration.
In the case of a destructive operation such as DCBZ, the hardware could choose whether to zero the partial block and report that as normal, or the return value could somehow indicate that nothing was done. Software could then either ignore it (if it doesn't really care whether the contents are zero or not, and the line is either already in the cache or will be fetched as usual when next referenced) or else manually zero those bytes. The most natural way to indicate this might be to return the negation of the number of bytes that should have been operated on but weren't. Or perhaps set the high bit, which would allow an unconditional & 0x7FF or similar for software that doesn't care (though that would fail if cache lines can ever be 2K or more).
NB this can be needed on any iteration, not only the first, if the process is migrated from a core with a small cache line size to a core with a large cache line size.
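A sketch of how software might consume the negative-return convention described above, using an invented cmo_zero_block() with exactly those semantics (positive: bytes zeroed; negative: nothing done, magnitude is the bytes remaining to the block boundary); note a real memzero would also need to handle the tail precisely, since a full-block zero can overshoot the requested length:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical DCBZ-like op: returns >0 bytes zeroed, or <0 meaning the
 * partial block was left untouched and |ret| bytes remain to its end. */
extern intptr_t cmo_zero_block(void *addr);

static void zero_range(char *p, size_t len)
{
    char *end = p + len;
    while (p < end) {
        intptr_t ret = cmo_zero_block(p);    /* assumed never to return 0 */
        if (ret < 0) {
            size_t n = (size_t)-ret;         /* hardware declined: zero by hand */
            if (n > (size_t)(end - p))
                n = (size_t)(end - p);       /* clamp to the requested range */
            memset(p, 0, n);
            p += n;
        } else {
            p += (size_t)ret;                /* whole block zeroed by hardware */
        }
    }
}
```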