
Draft: Fixed Block Size Prefetches and CMOs


Proposed name: PREFETCH.64B.R

  • encoding: ORI with RD=R0, addressing M[rs1+offset12]

    • imm12.rs1:5.110.rd=00000.0010011

  • affects cache line containing virtual address M[rs1+offset12]

  • see Mnemonics and Names for a discussion of proposed mnemonics and names

Proposed name: PREFETCH.64B.W

  • encoding: ANDI with RD=R0, addressing M[rs1+offset12]

    • imm12.rs1:5.111.rd=00000.0010011

  • affects cache line containing virtual address M[rs1+offset12]

  • see Mnemonics and Names for a discussion of proposed mnemonics and names
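
Because the proposed encodings reuse ORI and ANDI with RD=R0, a stock RISC-V assembler can already emit them, and implementations without the extension execute them as ordinary instructions whose result is discarded into x0. A minimal sketch in C with GCC-style inline assembly, assuming these encodings are adopted (the wrapper names prefetch_r and prefetch_w are illustrative, not part of the proposal):

```c
/* Hypothetical wrappers for the proposed prefetches (RISC-V only).
 * PREFETCH.64B.R is encoded as ORI with rd=x0 and PREFETCH.64B.W as
 * ANDI with rd=x0, so no assembler support is needed; on machines
 * without the extension these are architectural no-ops. */
static inline void prefetch_r(const void *p)
{
    asm volatile("ori x0, %0, 0" : : "r"(p));
}

static inline void prefetch_w(void *p)
{
    asm volatile("andi x0, %0, 0" : : "r"(p));
}
```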

OBSOLETE: Fixed Block Size Clean and Flush CMOs

Note
Obsolete

Earlier drafts of this proposal contained fixed block size CMOs, e.g. cache flushes: like the PREFETCHes, but without the full addressing mode, in order to save instruction encoding space. These have been removed from the proposal, subsumed by the prefetch flavors of the variable address range CMO.VAR instructions.

DETAILS

  • Page Fault: NOT taken for PREFETCH

    • The intent is that loops may access data right up to a page boundary beyond which access is not allowed, and may contain prefetches an arbitrary stride past the current ordinary memory access. Therefore, such out-of-range prefetches should be ignored rather than faulting (see the sketch after this list).

      • ⇒ Not useful for initiating virtual memory swaps from disk, copy-on-write, and prefetches in some "Two Level Memory" systems, e.g. with NVRAM, etc., which may involve OS page table management in a deferred manner. (TBD: link to paper (CW))

  • Debug exceptions, e.g. data address breakpoints: YES taken.

Note that page table protections are sometimes used as part of a debugging strategy. Therefore, ignoring page table faults is inconsistent with permitting debug exceptions.

  • ECC and other machine check exceptions: taken?

    • In the interest of finding bugs earlier.

    • Although this is somewhat incompatible with allowing these prefetches to be treated as NOPs.
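
The page fault behavior above is what makes unguarded software prefetching practical. A minimal sketch, reusing the hypothetical prefetch_r wrapper from earlier (the 128-element distance is illustrative):

```c
#include <stddef.h>

/* Sketch: prefetch 512 bytes (128 ints) ahead on every iteration,
 * with no guard at the end of the array. The last 128 prefetches
 * reach past a[n-1], possibly across a page boundary into unmapped
 * memory; because PREFETCH never takes a page fault, they are
 * simply dropped and no guard code is needed. */
long sum(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        prefetch_r(&a[i + 128]);
        s += a[i];
    }
    return s;
}
```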


Note
Rationale: Addressing Modes

Want the full Reg+Offset addressing mode for the fixed block size prefetches, so that the compiler can simply add the prefetch stride to the offset and does not need to allocate an extra register for the prefetch address.
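
A sketch of what the Reg+Offset form buys, in the same hypothetical inline-asm style as above: the prefetch distance lives in the signed 12-bit immediate, so the loop's existing base pointer is the only register the prefetch consumes.

```c
/* Sketch: the prefetch distance (here 512 bytes, which fits in the
 * signed 12-bit offset) is folded into the immediate, so no second
 * register holding base+512 is needed. */
static inline void prefetch_r_512(const void *p)
{
    asm volatile("ori x0, %0, 512" : : "r"(p));
}
```

In the sum() loop sketched earlier, prefetch_r_512(&a[i]) would replace prefetch_r(&a[i + 128]) without computing a separate prefetch address.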

Note
Rationale: Fixed minimum block size, NOT cache line size

These instructions are associated with a fixed block size (actually a fixed minimum block size), which is NOT necessarily the microarchitecture specific cache line size.

Currently the fixed block size is only defined to be 64 bytes. Instruction encodings are reserved for other block sizes, e.g. 256 bytes. However, there is unlikely to be room to support all possible cache line sizes in these instructions.

The fixed block size of these instructions is NOT necessarily a cache line size. The intention is to hide the microarchitecture cache line size, which may even be different on different cache levels in the same machine, while allowing reasonably good performance across machines with different cache line sizes.

The fixed minimum block size (FSZ) is essentially a contract that tells software that it does not need to prefetch more often than once per FSZ bytes. Implementations are permitted to "round up" FSZ: e.g. on a machine with 256 byte cache lines, each PREFETCH.64B.[RW] may affect the entire 256 byte cache line containing the specified address. Conversely, on a machine with 32 byte cache lines, it is recommended that an implementation of these instructions, given address A, apply the same operation to the cache lines containing addresses A and A+32. "It is recommended" because it is permissible for all of the operations defined on this page to be ignored, i.e. treated as NOPs or hints.
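
A hardware-behavior sketch of this rounding, in C pseudocode (fetch_line is a hypothetical stand-in for the microarchitectural action, not software):

```c
#include <stdint.h>

#define FSZ 64u  /* fixed minimum block size, in bytes */

static void fetch_line(uintptr_t p) { (void)p; /* uarch fetch stand-in */ }

/* How an implementation with cache lines of `line` bytes might honor
 * a PREFETCH.64B.* to address a. With 256 byte lines this fetches
 * the single line containing a ("rounding up"); with 32 byte lines
 * it fetches the two lines covering [a, a+63]. Dropping the request
 * entirely is also always legal, since these are hints. */
static void prefetch_block(uintptr_t a, uintptr_t line)
{
    uintptr_t first = a & ~(line - 1);               /* line containing a    */
    uintptr_t last = (a + FSZ - 1) & ~(line - 1);    /* line containing a+63 */
    for (uintptr_t p = first; p <= last; p += line)
        fetch_line(p);
}
```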

The intent of the fixed minimum block size is to set an upper bound on prefetch instruction overhead. E.g. when scanning an array of 32-bit items, LOOP A[i] ENDLOOP, one might prefetch at every iteration of the loop: LOOP A[i]; prefetch A[i+delta] ENDLOOP. However, prefetch instruction overhead often outweighs the memory latency benefit of prefetch instructions. If one knows that the cache line size is 256 bytes, i.e. that a new cache line is touched only once every 256/4=64 iterations of the loop, one might unroll the loop 64 times, LOOP A[i+0]; … A[i+63]; prefetch A[i+63+delta] ENDLOOP, thereby reducing the prefetch instruction overhead to 1/64. But if the cache line size is 64 bytes you only need to unroll 64/4=16 times: LOOP A[i+0]; … A[i+15]; prefetch A[i+15+delta] ENDLOOP. The prefetches are a relatively larger fraction of the instructions executed, but the overhead of unrolling the code to exactly match the line size is greatly reduced.
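
The 64 byte, unroll-by-16 case written out in C, again reusing the hypothetical prefetch_r wrapper (DELTA is an illustrative prefetch distance in elements; n is assumed to be a multiple of 16 to keep the sketch short):

```c
#include <stddef.h>

enum { DELTA = 128 };  /* illustrative prefetch distance, in elements */

/* One prefetch per 64 byte block of 4-byte items: unroll by
 * 64/4 = 16 so the prefetch overhead is 1/16 of the loop body. */
long sum_unrolled(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i += 16) {
        prefetch_r(&a[i + 15 + DELTA]);
        for (size_t j = 0; j < 16; j++)
            s += a[i + j];
    }
    return s;
}
```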

The fixed minimum block size is an indication that the user does not need to place prefetches any closer together than FSZ to get the benefit of prefetching all of a contiguous memory region.
