Use cases for CMOs - collect #15
Comments
I've talked to our firmware guys about their cache management use cases. Here's what I've gathered:

The typical cache operations we use relate to the data cache when accessing memory shared with HW. The FW performs writes to the memory, and then flushes the cache to push the dirty writes out before the HW consumes them.

The same thing occurs in the other direction, when HW updates the memory and FW reads it; in that case the FW would invalidate the range before reading. To add a little more, FW use cases are: …

The importance and frequency of these use cases depends on the existence of coherency logic. [Our current processor] supports 2 modes: …

Today we invalidate by address and range with a FW routine (see API below). What we do within that function …

It would be best for our use case to be able to specify the Start and Length of the range and have the HW …

Just to be explicit about your questions: …
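A minimal sketch of the range arithmetic such a Start/Length routine needs before iterating per-line operations. The 64-byte line size and all names here are assumptions for illustration; the thread does not state them:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumed line size; not stated in the thread */

/* Round an address down to its cache-line base. */
uintptr_t line_align_down(uintptr_t addr) {
    return addr & ~(uintptr_t)(CACHE_LINE - 1);
}

/* Number of lines an invalidate-by-range routine must touch to
 * cover the byte range [start, start + len), len > 0. */
size_t lines_in_range(uintptr_t start, size_t len) {
    uintptr_t first = line_align_down(start);
    uintptr_t last  = line_align_down(start + len - 1);
    return (size_t)((last - first) / CACHE_LINE) + 1;
}
```

Note that an unaligned start or end must round outward to whole lines, which is why a 2-byte range can still require two line operations.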
I've used CMOs to assist coherence in hybrid systems, where some processors are coherent but accelerators are not. In this scenario correctness is an imperative, but these applications are using accelerators, which means performance is also vital (without performance, the value-add of the accelerator is significantly weakened). The accelerator may be closely coupled to the processor, even sharing an L2 or a LLC; alternatively, it may be located more remotely across a NoC.

Cases:

A) Range-based WRITEBACK of dirty data from the processor cache to a level of cache that may be shared with the accelerator; if none, then writeback to main memory.

B) Range-based EVICT of any clean or dirty data from processor caches, all the way down to (but excluding) the level of cache shared with the accelerator.

C) Range-based WRITEBACK + EVICT (combination of A and B).

Since this is done by the application, ranges are based on virtual addresses. The virtual address ranges may span large blocks of data (e.g., large matrices), even though only a small fraction may be held in the cache (e.g., a 4GB matrix and a 32kB primary cache).

Case A ensures the accelerator will read the latest data. The processor may have modified or initialized the data or a portion of it. It can be difficult to track which portions of the data have been modified, so large ranges are the norm (to be safe and avoid tracking overhead).

Case B ensures that writes done by the accelerator will be visible to the processor; that is, it removes stale data from the processor so the processor will get the latest copy. Some processor implementations may be able to snoop on external writes, but this assumes the accelerator connection is located nearby (tightly coupled?) and observable by the processor (not true in most NoCs).

In Case C, a programmer may sometimes wish to combine the writeback with an evict of clean data; this can save a step over also doing Case B later when those address ranges overlap.
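A toy model of why Cases A and B are both needed: a word-granular write-back "cache" sits in front of a backing memory that the accelerator accesses directly. Everything here is illustrative (invented names, no real CMO instructions), just the stale-data hazards in miniature:

```c
#include <assert.h>

enum { MEM_WORDS = 8 };

int mem[MEM_WORDS];          /* backing memory, seen directly by the accelerator */
int cache_data[MEM_WORDS];   /* processor's cached copies */
int cache_valid[MEM_WORDS];
int cache_dirty[MEM_WORDS];

int cpu_read(int i) {
    if (!cache_valid[i]) {   /* miss: fill from memory */
        cache_data[i]  = mem[i];
        cache_valid[i] = 1;
        cache_dirty[i] = 0;
    }
    return cache_data[i];    /* hit: may return a stale copy */
}

void cpu_write(int i, int v) {   /* write-back, write-allocate */
    cache_data[i]  = v;
    cache_valid[i] = 1;
    cache_dirty[i] = 1;
}

/* Case A: range WRITEBACK -- push dirty data so the accelerator sees it. */
void cmo_writeback(int lo, int hi) {
    for (int i = lo; i < hi; i++)
        if (cache_valid[i] && cache_dirty[i]) {
            mem[i] = cache_data[i];
            cache_dirty[i] = 0;
        }
}

/* Case B: range EVICT -- drop cached copies so the CPU re-reads memory.
 * Note it discards dirty data too; combine with writeback (Case C) to keep it. */
void cmo_evict(int lo, int hi) {
    for (int i = lo; i < hi; i++)
        cache_valid[i] = 0;
}

int  acc_read(int i)         { return mem[i]; }    /* accelerator bypasses the cache */
void acc_write(int i, int v) { mem[i] = v; }
```

Without `cmo_writeback`, an accelerator read misses the CPU's dirty data; without `cmo_evict`, the CPU keeps returning its stale copy after an accelerator write.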
Workarounds involve changing caching policies. I see two cases here:
Case 1 eliminates the need to perform Case A, but retains the need to perform Case B. Case 2 eliminates the need to do all Cases (A, B, C and 1), but can be a lower-performance option.

Note that Cases A, B and C are historic (affecting data already in the cache), whereas Cases 1 and 2 are forward-looking (affecting what will be placed in the cache in the future). As a result, the CMO TG may choose to only tackle Cases A, B and C, and pass Cases 1 and 2 along to a TG charged with looking after VM and PMAs. Very likely, the CMO TG already has to tackle A, B and C no matter what (to fix up what has been placed in the cache already). MIPS, for example, can cover Cases 1 and 2 by updating caching policies for each page in its page tables.

Aside: Cases 1 and 2 do operate similarly to range-based CMOs. However, they will have to iterate over page tables rather than cache lines. Is there anything we can learn about range-based CMOs when we think of these cases? (E.g., if we want similar interfaces to both features, it is better to plan ahead.)

Using a trap to perform any of these Cases (A, B, C, 1 or 2) will impact the effectiveness of the accelerator. For Cases 1 and 2, the operations can be done ahead of time (which implies that programmers should not change such properties at fine granularity). In contrast, Cases A, B and C will always be placed inline with performance-oriented code.
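To make the line-vs-page iteration difference concrete, here is back-of-envelope arithmetic using the 4GB-matrix / 32kB-cache example above. The 64 B line and 4 KiB page granules are assumptions, not fixed by the thread:

```c
#include <assert.h>
#include <stdint.h>

/* Operations needed to cover a range by iterating at a given granule
 * (one op per line, or one PTE update per page). */
uint64_t ops_to_cover(uint64_t range_bytes, uint64_t granule_bytes) {
    return range_bytes / granule_bytes;
}
```

A 4 GiB range is ~67M per-line ops but only ~1M per-page PTE updates, and a 32 KiB cache can hold at most 512 of those lines anyway, which is why a range-based HW operation (rather than software iteration over the address range) is attractive for Cases A, B and C.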
Another use case: changing the cacheability/memory type, e.g. via the PTEs or PMAs.

It looks pretty likely that the virtual memory group is going to provide a few bits to specify memory type in the page table entries. Also, in some systems the PMAs (physical memory attributes) may change dynamically. (The presence of PMAs is assumed by RISC-V, but as far as I know there is no definition of PMA registers or formats, although some aspects may be associated with PMPs, depending on how issues related to speculation when paging is disabled are resolved.) Obviously, if a page that was cacheable (WB/WT) is changed to uncacheable (UC), there may need to be cache flushes.

BTW, this is a good example of a cross-tech-group issue that should probably be put in JIRA. I will finish writing it here and then consider moving it if I can figure out where to move it to. This use case also immediately raises a question that should probably move to a new issue. TBD: complete this issue... my PC needs to reboot :-(

Basically, if we transition WB --> UC directly, we must be ready for the possibility of UC accesses to memory that is still in a cache. Many systems dislike this (mostly if UC is a speculatable type; perhaps not if it is a non-speculatable type or mode). If we transition indirectly, break before make (WB --> invalid --> UC), then: our current POR is for CMOs to use virtual addresses, but if the mapping is invalid... do we need physical addresses? Maybe WB --> UC-non-spec --> UC-spec? But then the Virt Mem TG needs to have UCS and UCNS memory types.

There are fewer issues with UC --> WB/WT, but there will be some, related to memory ordering and transactions in flight.
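The break-before-make sequence above can be sketched as ordered steps. The helpers are stubs that just record ordering; real code would program PTEs, issue sfence.vma, and run the flush, and the exact recipe (and whether the flush takes a virtual or physical address once the mapping is invalid) is precisely the open question here:

```c
#include <assert.h>

enum { MAX_STEPS = 8 };
const char *trace[MAX_STEPS];   /* records the order of operations */
int nsteps;

void step(const char *s) { trace[nsteps++] = s; }

/* Hypothetical stubs standing in for real PTE / TLB / CMO operations. */
void pte_set_invalid(void)   { step("pte=INV"); }
void tlb_shootdown(void)     { step("sfence.vma"); }
void cache_flush_range(void) { step("cmo-flush"); } /* may need a physical address: the VA no longer maps */
void pte_set_uc(void)        { step("pte=UC"); }

/* Break-before-make transition WB --> invalid --> UC. */
void wb_to_uc(void) {
    pte_set_invalid();   /* 1. break: no new cacheable accesses can start */
    tlb_shootdown();     /* 2. remove stale WB translations */
    cache_flush_range(); /* 3. flush: cannot now race against new fills */
    pte_set_uc();        /* 4. make: remap as uncacheable */
}
```

Flushing only after the invalidation plus shootdown is what prevents a concurrent cacheable access from re-filling the lines mid-transition.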
Note that ARM pretty much solves this issue and provides recipes for both TLB and cache management. The key is that CMOs ignore the cacheability attributes and always check caches.

Another issue you bring up: "our current POR is for CMOs to use virtual addresses." I think this should be stated as: CMOs use effective addresses; VM is orthogonal in my mind. Defining CMOs to use effective addresses side-steps all the mode stuff, including VM, virtualization, etc.

Edit: additionally, we need to define these attribute transitions in the context of the PMAs. There are no defined "memory types" like in other architectures... :(
Agreed: CMOs must ignore the cacheability attributes. That's not what I was talking about.

Direct: PTE.WB --> PTE.UC
Break before make: PTE.WB --> PTE.INV --> PTE.UC

When do we flush?
Could be the same as DMA I/O.
We need to collect use cases for CMOs and document them.

This issue is to collect them: brief summary up top; add more cases in the comments; link - to wiki, email, wherever - to more detail.
---- ZBB? --------