
Fragment Shader ISA

Dmitry Osipenko edited this page Mar 25, 2018 · 118 revisions

Overview

The Tegra fragment shader ISA is not as straightforward as the vertex shader ISA.

The fragment shader is separated into five different instruction streams:

  • PSEQ - fetches data from memory and feeds it to the pipeline as register data or instructions
  • MFU - multi-function unit: varying interpolation and special functions
  • TEX - texture lookups
  • ALU - arithmetic logic unit
  • DW - writing to the output surface / buffers

The different units seem to be synchronized by separate timing streams.

Instructions flow

Sequence: 1 2 2 2 3 4 4 4 5
Stage 1: PSEQ
Stage 2: MFU MFU MFU
Stage 3: TEX
Stage 4: ALU ALU ALU
Stage 5: DW

Program sequencer

Patent: https://www.google.com/patents/US8411096

Instructions scheduling

Patent: https://www.google.com/patents/US8856499

The instruction schedule specifies how many MFU and ALU instructions the respective stage executes, from 1 to 3 each, per batch of fragment pipeline instructions.

Bits Meaning
2..7 Address
0..1 Number of instructions to execute

If the "number of instructions to execute" is 0, the pipeline stage is a NOP, but it still takes 1 clock cycle. The "address" is the number of instructions pushed before the first instruction to execute, so the instructions unit[Address] ... unit[Address + Number of instructions to execute - 1] are executed, where unit stands for MFU or ALU.
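As a rough illustration of the layout above, a schedule entry could be unpacked like this. The function names are hypothetical, and the half-open range reflects the reading that exactly "number of instructions to execute" instructions run, starting at the address.

```python
def decode_sched(entry):
    """Split one 8-bit schedule entry into (address, count)."""
    count = entry & 0x3            # bits 0..1
    address = (entry >> 2) & 0x3F  # bits 2..7
    return address, count


def scheduled_indices(entry):
    """Indices of the MFU/ALU instructions selected by the entry.

    Assumption: `count` instructions are executed starting at `address`.
    A count of 0 selects nothing (the stage is a NOP).
    """
    address, count = decode_sched(entry)
    return list(range(address, address + count))


# Example entry: address 5, three instructions.
example_entry = (5 << 2) | 3
example_indices = scheduled_indices(example_entry)
```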

Registers

Registers / embedded constants can be treated either as one FP20 value or as two FX10 values. They are encoded as follows:

FP20

The FP20 format is similar to IEEE 754 FP32 and IEEE 754 FP16, but with both range and precision somewhere in the middle of the two.

Bits Meaning
19 Sign
13..18 Exponent
0..12 Significand

This means that there's a 14 bit significand, with 13 bits explicitly stored. A 6 bit exponent gives an exponent bias of 31, with a minimum exponent value of -31, and a maximum exponent value of 32.
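The bit layout above can be sketched as a small decoder. This is only an illustration: the function names are made up, and the sketch assumes that every encoding is a normal number with an implicit leading 1, which matches the stated exponent range of -31..32 but says nothing about how the hardware treats zero or special values.

```python
def fp20_fields(bits):
    """Return (sign, exponent, significand) of a 20-bit FP20 value."""
    sign = (bits >> 19) & 0x1
    exponent = (bits >> 13) & 0x3F   # bits 13..18, bias 31
    significand = bits & 0x1FFF      # bits 0..12, 13 stored bits
    return sign, exponent, significand


def fp20_to_float(bits):
    """Decode FP20 assuming an implicit leading 1 for all encodings."""
    sign, exponent, significand = fp20_fields(bits)
    value = (1.0 + significand / 8192.0) * 2.0 ** (exponent - 31)
    return -value if sign else value
```

Under these assumptions, an exponent field of 31 with a zero significand decodes to 1.0.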

FX10

The FX10 format is similar to most signed fixed-point formats, using 10 bits of storage, and a scaling-factor of 1/256.

For a signed 10-bit value, this gives a minimum of -2.0 (-512/256) and a maximum of ~1.996 (511/256).
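A minimal conversion sketch for FX10 follows. It assumes two's-complement storage, which the text does not state explicitly, and the helper names are hypothetical.

```python
def fx10_to_float(bits):
    """Decode a 10-bit fixed-point value with a scaling factor of 1/256."""
    raw = bits & 0x3FF
    if raw & 0x200:      # assumed two's-complement sign bit
        raw -= 0x400
    return raw / 256.0


def float_to_fx10(value):
    """Encode a float as FX10; out-of-range inputs simply wrap around."""
    return int(round(value * 256.0)) & 0x3FF
```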

ALU instruction word encoding

ALU instructions come in packets of 3 or 4 scalar instructions (the fourth instruction can be traded for embedded constants). Each ALU instruction packet seems to run pipelined, and each instruction in a packet can use partial results from the previous instruction.

Bits Meaning
62..63 opcode
61 accumulate result: other
60 accumulate result: this
59 addition disable
57..58 scale result
56 saturate result
54..55 condition code
47..53 destination register
46 write high subregister
45 write low subregister
32..44 operand rA
19..31 operand rB
6..18 operand rC
0..5 operand rD
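The field table above maps directly onto shifts and masks. The sketch below works on the logical 64-bit word; the field names are illustrative, and any word swapping that happens in the instruction stream is outside its scope.

```python
# (name, lowest bit, width) for each field of the 64-bit ALU instruction word.
ALU_FIELDS = [
    ("opcode", 62, 2),
    ("accumulate_other", 61, 1),
    ("accumulate_this", 60, 1),
    ("addition_disable", 59, 1),
    ("scale", 57, 2),
    ("saturate", 56, 1),
    ("condition_code", 54, 2),
    ("dst", 47, 7),
    ("write_high", 46, 1),
    ("write_low", 45, 1),
    ("rA", 32, 13),
    ("rB", 19, 13),
    ("rC", 6, 13),
    ("rD", 0, 6),
]


def decode_alu(word):
    """Unpack a 64-bit ALU instruction word into a dict of raw field values."""
    return {name: (word >> lsb) & ((1 << width) - 1)
            for name, lsb, width in ALU_FIELDS}


def encode_alu(**fields):
    """Pack field values into a 64-bit word; unspecified fields are 0."""
    word = 0
    for name, lsb, width in ALU_FIELDS:
        word |= (fields.get(name, 0) & ((1 << width) - 1)) << lsb
    return word
```

The field widths add up to exactly 64 bits, so the table leaves no gaps in the word.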

ALU buffer and pixel packet:

Patent: https://www.google.com/patents/US7710427

ALU embedded constants:

Patent: https://www.google.com/patents/US8775777

Unlike regular ALU instruction words, the ALU3 instruction words that hold immediate constants shouldn't be swapped. Each constant is either one fp20 or two fx10 values.

Bits Meaning
44..63 Immediate constant 2
24..43 Immediate constant 1
4..23 Immediate constant 0
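For illustration, the three 20-bit constants could be pulled out of the word as shown below. The function name is hypothetical, and bits 0..3 are ignored because the table does not describe them.

```python
def decode_alu_constants(word):
    """Return [constant 0, constant 1, constant 2] as raw 20-bit values."""
    return [(word >> shift) & 0xFFFFF for shift in (4, 24, 44)]
```

Each raw value can then be interpreted either as one fp20 or as two fx10 halves.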

Opcodes:

opcode  Mnemonic  Meaning             pseudo-code
0       MAD       Multiply-Add        rA * rB + rC * rD
1       MIN       Minimum             min(rA * rB, rC * rD)
2       MAX       Maximum             max(rA * rB, rC * rD)
3       CSEL      Conditional select  (rA < 0) ? rB : (rC * rD)
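The pseudo-code column can be written out as a small reference function. This sketch uses ordinary Python floats, so it ignores FP20/FX10 precision as well as the scale, saturate and condition-code stages; the function name is made up.

```python
def alu_op(opcode, a, b, c, d):
    """Evaluate one ALU opcode on already-fetched operand values."""
    if opcode == 0:    # MAD
        return a * b + c * d
    if opcode == 1:    # MIN
        return min(a * b, c * d)
    if opcode == 2:    # MAX
        return max(a * b, c * d)
    if opcode == 3:    # CSEL
        return b if a < 0 else c * d
    raise ValueError("opcode must be in 0..3")
```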

Scale result:

Value Meaning
0 No scale
1 x2
2 x4
3 /2

Condition code:

The conditional operation is applied to the ALU's result, so the final result is either 0.0 (false) or 1.0 (true).

Value Meaning
0 no comparison
1 zero
2 greater than zero
3 greater or equal to zero
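A sketch of how the condition code could be applied to an ALU result, following the table above. The function name is hypothetical, and how this stage is ordered relative to scaling and saturation is not covered here.

```python
def apply_condition(code, result):
    """Return the result unchanged for code 0, otherwise 0.0 or 1.0."""
    if code == 0:
        return result
    if code == 1:
        passed = result == 0.0
    elif code == 2:
        passed = result > 0.0
    elif code == 3:
        passed = result >= 0.0
    else:
        raise ValueError("condition code must be in 0..3")
    return 1.0 if passed else 0.0
```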

Condition registers:

The condition register comprises two fixed10 values, each either 0.0 or 1.0. To write a value to a condition register, select one of the condition registers 64..71 as the destination register. To write to the higher part of the condition register, the "write high subregister" bit needs to be set; otherwise the lower part is written regardless of the state of the "write low subregister" bit.

Condition register stored value = !!(ALU result)

It looks like the ALU can address up to 2 source CRs (condition registers) in one ALU[0-3] sub-instruction.

Kill register:

A non-zero value written to the KILL register discards the fragment. Writing zero keeps a live fragment alive, but does not resurrect a fragment that has already been killed. The destination register write mask (low/high halves) is ignored.

Position registers:

The fragment position X register returns the fragment's x coordinate offset by 8192.0. The fragment position Y register returns the negated y coordinate offset by 8192.0 plus the render target height minus 1.

pos.x = 8192 + frag.x
pos.y = 8192 + (target_height - 1) - frag.y
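The two formulas above and their inverse can be sketched as follows. The function names are illustrative; a shader that needs window coordinates would have to perform the inverse step itself.

```python
def pos_registers(frag_x, frag_y, target_height):
    """Values of the x/y position registers for a given fragment."""
    pos_x = 8192 + frag_x
    pos_y = 8192 + (target_height - 1) - frag_y
    return pos_x, pos_y


def frag_from_pos(pos_x, pos_y, target_height):
    """Recover the fragment coordinates from the position registers."""
    frag_x = pos_x - 8192
    frag_y = 8192 + (target_height - 1) - pos_y
    return frag_x, frag_y
```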

Operands (rA, rB, rC):

Bits Meaning
6..12 register selector
5 sub-register selector
4 fixed10 minus one
3 fixed10
2 absolute value
1 negate
0 scale by two
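As an illustration, a 13-bit rA/rB/rC operand could be split into its fields like this (hypothetical field names, raw values only):

```python
def decode_operand(op):
    """Unpack a 13-bit rA/rB/rC operand into its raw fields."""
    return {
        "register": (op >> 6) & 0x7F,        # bits 6..12
        "sub_register": (op >> 5) & 1,
        "fixed10_minus_one": (op >> 4) & 1,
        "fixed10": (op >> 3) & 1,
        "absolute": (op >> 2) & 1,
        "negate": (op >> 1) & 1,
        "scale_by_two": op & 1,
    }
```

The 7-bit register selector is wide enough to cover all register numbers listed in the "Registers" table.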

Operand rD:

Bits Meaning
5 rD selector (0 = rB, 1 = rC)
4 sub-register selector
3 fixed10 minus one
2 enable rD (scale rC by rD)
1 absolute value
0 fixed10

Registers:

Patents:

Value Meaning
0..15 row registers
16..23 global registers
24..27 ALU result registers
28..30 embedded constants
31 lowp vec2(0, 1)
32..63 uniform registers
64..71 condition registers
72 fragment x-position
73 fragment y-position
75 polygon face
76 kill (discard) fragment
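The register map above could be turned into a lookup helper, for example for a disassembler. The helper name and the returned strings are illustrative, and selectors that the table does not list (such as 74) are reported as unknown.

```python
REGISTER_RANGES = [
    (0, 15, "row register"),
    (16, 23, "global register"),
    (24, 27, "ALU result register"),
    (28, 30, "embedded constant"),
    (31, 31, "lowp vec2(0, 1)"),
    (32, 63, "uniform register"),
    (64, 71, "condition register"),
    (72, 72, "fragment x-position"),
    (73, 73, "fragment y-position"),
    (75, 75, "polygon face"),
    (76, 76, "kill (discard) fragment"),
]


def describe_register(selector):
    """Map a register selector value to a human-readable description."""
    for first, last, name in REGISTER_RANGES:
        if first <= selector <= last:
            if first == last:
                return name
            return f"{name} {selector - first}"
    return "unknown"
```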

Result accumulation:

Patent: https://www.google.com/patents/US8521800

When "accumulate this" bit is set, operand rC is overridden with the accumulation value.

Unit   Accumulate this  Accumulate other
ALU0:  ALU0 += ALU3     No action
ALU1:  ALU1 += ALU0     ALU0 += ALU1
ALU2:  ALU2 += ALU1     ALU0 += ALU2
ALU3:  ALU3 += ALU2     ALU2 += ALU3

Addition disable

When the "addition disable" bit is set, the Multiply-Add operation turns into two multiplies. The fx10 result of each multiply goes to the low / high subregister of the destination register.

write high subregister  write low subregister  destination low  destination high
1                       0                      rC*rD            rA*rB
0                       1                      rA*rB            rC*rD

MFU instruction word encoding

The MFU unit can fetch and interpolate 4 component vectors per instruction and/or evaluate scalar special functions. Based on this design.

Bits Meaning
58..63 register
54..57 opcode
43..53 mul1
32..42 mul0
28..31 ???
21..27 var3
14..20 var2
7..13 var1
0..6 var0

opcode  Mnemonic  Meaning                       pseudo-code
0       NOP       No operation
1       RCP       Reciprocal                    rD = 1.0 / rA
2       RSQ       Reciprocal square root        rD = 1.0 / sqrt(rA)
3       LG2       Logarithm base 2              rD = log2(rA)
4       EX2       Exponent base 2, second step  rD = pow(2.0, rA)
5       SQRT      Square root                   rD = sqrt(rA)
6       SIN       Sine, second step             rD = sin(rA)
7       COS       Cosine, second step           rD = cos(rA)
8       FRC       Fractional value              rD = rA - floor(rA)
9       PREEX2    Exponent base 2, first step   rD = pow(2.0, rA)
10      PRESIN    Sine, first step              rD = sin(rA)
11      PRECOS    Cosine, first step            rD = cos(rA)
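The MFU word layout and opcode table can be sketched as a decoder. The names are hypothetical, the unknown bits 28..31 are skipped, and the mul/var sub-fields are returned raw.

```python
MFU_OPCODES = ["NOP", "RCP", "RSQ", "LG2", "EX2", "SQRT",
               "SIN", "COS", "FRC", "PREEX2", "PRESIN", "PRECOS"]


def decode_mfu(word):
    """Unpack a 64-bit MFU instruction word into its top-level fields."""
    opcode = (word >> 54) & 0xF
    return {
        "register": (word >> 58) & 0x3F,
        # Opcodes 12..15 are not documented, so keep the raw number for them.
        "opcode": MFU_OPCODES[opcode] if opcode < len(MFU_OPCODES) else opcode,
        "mul1": (word >> 43) & 0x7FF,
        "mul0": (word >> 32) & 0x7FF,
        "var": [(word >> (7 * i)) & 0x7F for i in range(4)],  # var0..var3
    }
```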

var0..3:

The varying id corresponds to the TRAM component: var0 is TRAM.x, var1 is TRAM.y, var2 is TRAM.z, var3 is TRAM.w.

Bits Meaning
3..6 TRAM index to fetch from
1..2 opcode
0 saturate

opcode  Mnemonic  Meaning
0       NOP       No operation
1       VAR1      Interpolate one float20 value
2       VAR2      Interpolate two fixed10 values
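A single 7-bit var field could be decoded as in the sketch below; the names are illustrative, and opcode value 3 is reported as unknown because the table does not define it.

```python
VAR_OPCODES = {0: "NOP", 1: "VAR1", 2: "VAR2"}


def decode_var(var):
    """Unpack one 7-bit var0..3 field of an MFU instruction."""
    return {
        "tram_index": (var >> 3) & 0xF,                      # bits 3..6
        "opcode": VAR_OPCODES.get((var >> 1) & 0x3, "unknown"),
        "saturate": bool(var & 1),
    }
```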

mul0..1

Bits Meaning
8..10 destination register
4..7 source register 1
0..3 source register 0

mul0..1 destination registers:

Value Meaning
0 ???
1 barycentric weight
2..3 ???
4 row register 0
5 row register 1
6 row register 2
7 row register 3

mul0..1 source registers:

The SFU result is evaluated first and is available to the MULs via the "SFU result" register.

Value Meaning
0..3 row register 0..3
4..7 ??? (global registers?)
8..9 ???
10 SFU result
11 barycentric coef 0
12 barycentric coef 1
13 1.0
14..15 ???
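The mul0/mul1 fields and the two register tables above could be decoded as follows. This is only a sketch with made-up names; values marked "???" in the tables are reported as unknown.

```python
MUL_DST = {1: "barycentric weight", 4: "row register 0", 5: "row register 1",
           6: "row register 2", 7: "row register 3"}

MUL_SRC = {0: "row register 0", 1: "row register 1", 2: "row register 2",
           3: "row register 3", 10: "SFU result", 11: "barycentric coef 0",
           12: "barycentric coef 1", 13: "1.0"}


def decode_mul(mul):
    """Unpack one 11-bit mul0/mul1 field of an MFU instruction."""
    return {
        "dst": MUL_DST.get((mul >> 8) & 0x7, "unknown"),   # bits 8..10
        "src1": MUL_SRC.get((mul >> 4) & 0xF, "unknown"),  # bits 4..7
        "src0": MUL_SRC.get(mul & 0xF, "unknown"),         # bits 0..3
    }
```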

Interpolation

The barycentric interpolation weights are the MUL results written to the "barycentric weight" destination register. The weight w0 (for the first triangle vertex, fed from the "barycentric coef 0" source register) is hardwired to the result of mul0, and the weight w1 (second vertex, "barycentric coef 1" source register) to the result of mul1. Note that only the "barycentric weight" destination registers are hardwired; the "barycentric coef" sources are not. The third vertex weight is derived from w0 and w1 as "1.0 - w0 - w1". The SFU operation should be set to "rcp r4".

barycentric weight = barycentric coef * 1.0 / w
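Numerically, the weight setup described above amounts to the following. The sketch uses plain Python floats and a hypothetical function name; on the hardware the reciprocal comes from the SFU ("rcp r4") and the two products from mul0 and mul1.

```python
def barycentric_weights(coef0, coef1, w):
    """Compute (w0, w1, w2) from the two barycentric coefs and fragment w."""
    inv_w = 1.0 / w          # SFU: rcp r4
    w0 = coef0 * inv_w       # mul0 -> barycentric weight
    w1 = coef1 * inv_w       # mul1 -> barycentric weight
    w2 = 1.0 - w0 - w1       # derived third-vertex weight
    return w0, w1, w2
```

For example, coefs of 0.5 and 0.25 with w = 2.0 give weights of 0.25, 0.125 and 0.625.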

Fragment w component

The w component is stored in r4 and is available to the first instruction of the scheduled MFU instruction sequence.

Fragment z component

The z component arrives in some form via r3. The fetch operation for r3 should be set to NOP with "saturate" enabled. Once r3 has been populated, the following expansion should be performed:

gl_FragCoord.z = 1/1000 + max(0.0, r3.low) * 1/4000 + max(0.0, r3.high) * 1/4

There is also a dependency on the linker: it should perform the "magic" write to TRAM0.w with "VEC4 select = VEC4.z".
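The gl_FragCoord.z expansion formula given above can be written as a short function. The function name is made up, and r3_low / r3_high stand for the two fx10 halves of r3 already converted to floats.

```python
def frag_coord_z(r3_low, r3_high):
    """Expand the two halves of r3 into gl_FragCoord.z."""
    return (1.0 / 1000.0
            + max(0.0, r3_low) * (1.0 / 4000.0)
            + max(0.0, r3_high) * (1.0 / 4.0))
```

Negative halves are clamped to zero, so the smallest value this expression produces is 1/1000.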

TEX instruction word encoding

The TEX instruction takes the texture coordinates and LOD bias from the first row of the pixel packet (i.e. R0-R3 registers) and writes the sampled data to that first row as well.

Bits Meaning
13..31 ???
12 enable bias
11 ???
10 enable
6..9 ???
5 sampled data destination regs select
4 texcoords / lod regs select
0..3 sampler index

Texture coordinates and LOD registers select:

The texture coordinate components (S, T, R) and level-of-detail bias are loaded from the row registers as fp20's.

Value Meaning (S, T, R, LOD order)
1 R2, R3, R0, R1
0 R0, R1, R2, R3

Sampled data destination registers select:

The sampled RGBA data is stored in the two registers as four fx10's.

Value Meaning
1 R2-R3
0 R0-R1
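Putting the known TEX fields and the two register-select tables together, a decoder might look like this. It is a sketch with hypothetical names, and the bits marked "???" are ignored.

```python
def decode_tex(word):
    """Unpack the documented fields of a 32-bit TEX instruction word."""
    coords_select = (word >> 4) & 1
    dst_select = (word >> 5) & 1
    return {
        "sampler": word & 0xF,                                 # bits 0..3
        # Registers holding (S, T, R, LOD), in that order.
        "coord_regs": ("R2", "R3", "R0", "R1") if coords_select
                      else ("R0", "R1", "R2", "R3"),
        # Registers receiving the sampled RGBA data as four fx10 values.
        "dst_regs": ("R2", "R3") if dst_select else ("R0", "R1"),
        "enable": bool((word >> 10) & 1),
        "enable_bias": bool((word >> 12) & 1),
    }
```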

DW instruction word encoding

The data write (DW) instruction controls the writing of the values held in row registers R0-R1 / R2-R3 to the destination render target.

Bits Meaning
16..31 ???
15 source registers select
11..14 ???
10 stencil write
6..9 ???
2..5 render target index
1 ???
0 enable

Source registers select:

Seems to have no effect when depth or stencil write is enabled.

Value Meaning
1 R2-R3
0 R0-R1
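The documented DW fields could be decoded in the same way. Again this is only a sketch: the names are hypothetical and all "???" bits are ignored.

```python
def decode_dw(word):
    """Unpack the documented fields of a 32-bit DW instruction word."""
    return {
        "enable": bool(word & 1),
        "render_target": (word >> 2) & 0xF,               # bits 2..5
        "stencil_write": bool((word >> 10) & 1),
        "src_regs": ("R2", "R3") if (word >> 15) & 1 else ("R0", "R1"),
    }
```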

Render targets usage

Some of the render targets have an additional special purpose, like depth/stencil store. They are hardwired and their special purpose is active under certain conditions, like when depth/stencil test is enabled.

Render target Usage
0 Depth buffer
2 Stencil buffer

PSEQ instruction word encoding

PSEQ stands for Program Sequencer. It fetches raw data from a selected render target, converts that data into FX10 [FP20(?)] format and loads it into registers.

Bits Meaning
24..31 ???
23 enable (?)
20..22 ???
16..19 render target select
4..15 ???
3 enable something (?)
2 ???
1 dest registers select
0 ???

Destination registers select:

Value Meaning
0 R0-R1
1 R2-R3

XXX: the above is valid for fetching 32-bit RGBA8888 data into FX10 destination registers.
