Fragment Shader ISA
The Tegra fragment shader ISA is not as straightforward as the vertex shader ISA.
The fragment shader is separated into five different instruction streams:
- PSEQ - fetching data from memory and feeding it to the pipeline as register data or instructions
- MFU - multi-function unit, varying interpolation and special functions
- TEX - texture lookups
- ALU - arithmetic logic unit
- DW - writing to the output surface / buffers
The different units seem to be synchronized by separate timing streams.
Sequence: | 1 | 2 | 2 | 2 | 3 | 4 | 4 | 4 | 5 |
---|---|---|---|---|---|---|---|---|---|
Stage 1: | PSEQ | | | | | | | | |
Stage 2: | | MFU | MFU | MFU | | | | | |
Stage 3: | | | | | TEX | | | | |
Stage 4: | | | | | | ALU | ALU | ALU | |
Stage 5: | | | | | | | | | DW |
Allows executing large shaders in multiple passes. Not used in practice: the BLOB driver gives up on large shaders.
Patent: https://www.google.com/patents/US8411096
Patent: https://www.google.com/patents/US8659601
Patent: https://www.google.com/patents/US8856499
The instruction schedule specifies the number of MFU and ALU instructions executed by the respective stage, from 1 to 3 each per fragment pipeline instruction batch.
Bits | Meaning |
---|---|
2..7 | Address |
0..1 | Number of instructions to execute |
If the "number of instructions to execute" is 0, the pipeline stage is a NOP but still takes 1 clock cycle. The "address" is the number of pushed instructions that precede the first instruction to execute, so instructions unit[Address] ... unit[Address + Number of instructions to execute] will be executed, where unit stands for MFU or ALU.
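A minimal sketch of decoding one schedule entry from the bit layout above. The function names are made up for illustration, and the sketch assumes the range covers exactly "number of instructions to execute" instructions starting at the address; that reading is not confirmed.

```python
def decode_schedule(entry: int) -> tuple[int, int]:
    """Split an 8-bit schedule entry into (address, count)."""
    count = entry & 0x3             # bits 0..1: number of instructions
    address = (entry >> 2) & 0x3F   # bits 2..7: index of the first instruction
    return address, count


def scheduled_indices(entry: int) -> list[int]:
    """Indices into the MFU or ALU stream that this entry would execute."""
    address, count = decode_schedule(entry)
    # Assumption: exactly `count` instructions run; count == 0 is a NOP stage.
    return list(range(address, address + count))


print(scheduled_indices((5 << 2) | 2))
```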
Registers / embedded constants can either be treated as one FP20 register, or two FX10 values. Their encoding is like so:
The FP20 format is similar to IEEE 754 FP32 and IEEE 754 FP16, but with both range and precision somewhere in the middle of the two.
Bits | Meaning |
---|---|
19 | Sign |
13..18 | Exponent |
0..12 | Significand |
This means that there's a 14 bit significand, with 13 bits explicitly stored. A 6 bit exponent gives an exponent bias of 31, with a minimum exponent value of -31, and a maximum exponent value of 32.
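The layout above can be turned into a small decoder. This is only a sketch: it assumes an implicit leading one for every exponent code (as the -31..32 exponent range suggests) and treats an all-zero exponent and significand as zero, neither of which is confirmed by the hardware notes. `fp20_to_float` is a hypothetical helper name.

```python
def fp20_to_float(bits: int) -> float:
    """Decode a 20-bit FP20 value into a Python float."""
    sign = (bits >> 19) & 0x1          # bit 19
    exponent = (bits >> 13) & 0x3F     # bits 13..18, bias 31
    significand = bits & 0x1FFF        # bits 0..12, 13 stored bits

    # Assumption: exponent 0 with significand 0 encodes zero.
    if exponent == 0 and significand == 0:
        return -0.0 if sign else 0.0

    # Assumption: implicit leading one, no denormals, no infinities or NaNs.
    value = (1.0 + significand / 8192.0) * 2.0 ** (exponent - 31)
    return -value if sign else value


print(fp20_to_float(31 << 13))
```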
The FX10 format is similar to most signed fixed-point formats, using 10 bits of storage, and a scaling-factor of 1/256.
This means we have a minimum value of -2.0, and a maximum value of ~1.996.
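A sketch of the FX10 decoding, assuming two's complement storage. The placement of the low and high halves inside a 20-bit register (low in bits 0..9, high in bits 10..19) is also an assumption, and both function names are hypothetical.

```python
def fx10_to_float(bits: int) -> float:
    """Decode a 10-bit fixed-point value with a scaling factor of 1/256."""
    bits &= 0x3FF
    if bits & 0x200:        # sign bit set: two's complement (assumed)
        bits -= 0x400
    return bits / 256.0


def split_fx10_pair(reg20: int) -> tuple[float, float]:
    """Treat a 20-bit register as (low, high) FX10 values."""
    # Assumption: low half in bits 0..9, high half in bits 10..19.
    return fx10_to_float(reg20), fx10_to_float(reg20 >> 10)


print(split_fx10_pair((0x100 << 10) | 0x080))
```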
ALU instructions come in packets of 3 or 4 scalar instructions (the fourth instruction can be traded for embedded constants). Each ALU instruction packet seems to run pipelined, and each instruction in a packet can use partial results from the previous instruction.
Bits | Meaning |
---|---|
62..63 | opcode |
61 | accumulate result: other |
60 | accumulate result: this |
59 | addition disable |
57..58 | scale result |
56 | saturate result |
54..55 | condition code |
47..53 | destination register |
46 | write high subregister |
45 | write low subregister |
32..44 | operand rA |
19..31 | operand rB |
6..18 | operand rC |
0..5 | operand rD |
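The field table above maps directly to shifts and masks. The sketch below assumes the instruction has already been assembled into a single 64-bit integer with bit 0 as the least significant bit; the dictionary keys and the `field` helper are names chosen here, not taken from any driver.

```python
def field(word: int, lo: int, hi: int) -> int:
    """Extract bits lo..hi (inclusive) from word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_alu(word: int) -> dict:
    """Split one 64-bit ALU instruction into its fields."""
    return {
        "opcode": field(word, 62, 63),
        "accumulate_other": field(word, 61, 61),
        "accumulate_this": field(word, 60, 60),
        "addition_disable": field(word, 59, 59),
        "scale": field(word, 57, 58),
        "saturate": field(word, 56, 56),
        "condition_code": field(word, 54, 55),
        "dst": field(word, 47, 53),
        "write_high": field(word, 46, 46),
        "write_low": field(word, 45, 45),
        "rA": field(word, 32, 44),
        "rB": field(word, 19, 31),
        "rC": field(word, 6, 18),
        "rD": field(word, 0, 5),
    }


word = (3 << 62) | (1 << 60) | (2 << 57) | (76 << 47) | (1 << 46) | (0x1ABC << 32) | 0x2A
print(decode_alu(word))
```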
Patent: https://www.google.com/patents/US7710427
Patent: https://www.google.com/patents/US8775777
Unlike a regular ALU instruction, the words of an ALU3 instruction that holds immediate constants shouldn't be swapped. Each constant is either one FP20 value or two FX10 values.
Bits | Meaning |
---|---|
44..63 | Immediate constant 2 |
24..43 | Immediate constant 1 |
4..23 | Immediate constant 0 |
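As a sketch, the three 20-bit immediates can be pulled out of the 64-bit ALU3 word as follows. The word is assumed to be already assembled into one integer, and the function name is hypothetical.

```python
def decode_alu3_immediates(word: int) -> list[int]:
    """Return the raw 20-bit immediates [constant 0, constant 1, constant 2]."""
    mask = 0xFFFFF  # 20 bits
    return [
        (word >> 4) & mask,    # bits 4..23
        (word >> 24) & mask,   # bits 24..43
        (word >> 44) & mask,   # bits 44..63
    ]


word = (0x3E000 << 44) | (0x12345 << 24) | (0xABCDE << 4)
print([hex(c) for c in decode_alu3_immediates(word)])
```

Each raw value can then be interpreted either as one FP20 or as two FX10 values, depending on how the operand that reads it is encoded.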
opcode | Mnemonic | Meaning | pseudo-code |
---|---|---|---|
0 | MAD | Multiply-Add | rA * rB + rC * rD |
1 | MIN | Minimum | min(rA * rB, rC * rD) |
2 | MAX | Maximum | max(rA * rB, rC * rD) |
3 | CSEL | Conditional select | (rA < 0) ? rB : (rC * rD) |
Value | Meaning |
---|---|
0 | No scale |
1 | x2 |
2 | x4 |
3 | / 2 |
The conditional operation is applied to the ALU's result, so the final result is either 0.0 (false) or 1.0 (true).
Value | Meaning |
---|---|
0 | no comparison |
1 | zero |
2 | greater than zero |
3 | greater or equal to zero |
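The opcode, scale and condition code tables can be modelled in a few lines. This is a rough sketch using Python floats rather than FP20 or FX10 arithmetic. Saturation is left out, and the order in which the hardware applies scale and condition is not stated here. All names are hypothetical.

```python
# "scale result" field value -> multiplier
SCALE = {0: 1.0, 1: 2.0, 2: 4.0, 3: 0.5}


def alu_op(opcode: int, a: float, b: float, c: float, d: float) -> float:
    """Evaluate one ALU opcode on operands rA, rB, rC, rD."""
    if opcode == 0:                      # MAD
        return a * b + c * d
    if opcode == 1:                      # MIN
        return min(a * b, c * d)
    if opcode == 2:                      # MAX
        return max(a * b, c * d)
    if opcode == 3:                      # CSEL
        return b if a < 0 else c * d
    raise ValueError("opcode is a 2-bit field")


def apply_condition(cc: int, value: float) -> float:
    """Apply the condition code to an ALU result."""
    if cc == 0:                          # no comparison: value passes through
        return value
    if cc == 1:
        result = value == 0
    elif cc == 2:
        result = value > 0
    elif cc == 3:
        result = value >= 0
    else:
        raise ValueError("condition code is a 2-bit field")
    return 1.0 if result else 0.0


print(apply_condition(2, SCALE[1] * alu_op(0, 2.0, 3.0, 4.0, 5.0)))
```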
The condition register comprises two fixed10 values, each 0.0 or 1.0. To write a value to a condition register, select one of the condition registers 64..71 as the destination register. To write to the high part of the condition register, the "write high subregister" bit needs to be set; otherwise the low part is written regardless of the state of the "write low subregister" bit.
Condition register stored value = !!(ALU result)
It looks like the ALU can address up to 2 source CRs in one ALU[0-3] sub-instruction.
A non-zero value written to the KILL register discards the fragment; writing zero keeps it alive but does not resurrect an already killed fragment. The destination register write mask (low/high halves) is ignored.
The returned fragment position X register value is offset by 8192.0; the fragment position Y value is negated and offset by 8192.0 plus the render target height - 1.
pos.x = 8192 + frag.x
pos.y = 8192 + (target_height - 1) - frag.y
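Inverting the two formulas above gives the window-space fragment coordinates back. A minimal sketch, with a hypothetical function name and `target_height` supplied by the caller:

```python
def frag_from_pos(pos_x: float, pos_y: float, target_height: int) -> tuple[float, float]:
    """Recover (frag.x, frag.y) from the position register values."""
    frag_x = pos_x - 8192
    frag_y = 8192 + (target_height - 1) - pos_y
    return frag_x, frag_y


# Fragment (10, 20) on a 480-line render target reads back as
# pos.x = 8202 and pos.y = 8192 + 479 - 20 = 8651.
print(frag_from_pos(8202, 8651, 480))
```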
Bits | Meaning |
---|---|
6..12 | register selector |
5 | sub-register selector |
4 | fixed10 minus one |
3 | fixed10 |
2 | absolute value |
1 | negate |
0 | scale by two |
Bits | Meaning |
---|---|
5 | rD selector (0 = rB, 1 = rC) |
4 | sub-register selector |
3 | fixed10 minus one |
2 | enable rD (scale rC by rD) |
1 | absolute value |
0 | fixed10 |
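A sketch of unpacking the two operand encodings above. Judging by the field widths, the 13-bit layout is taken to be the rA/rB/rC operand and the 6-bit layout the rD operand. The function and key names are made up for illustration.

```python
def decode_operand_abc(op: int) -> dict:
    """Decode a 13-bit rA/rB/rC operand."""
    return {
        "register": (op >> 6) & 0x7F,     # bits 6..12
        "subreg": (op >> 5) & 1,
        "fx10_minus_one": (op >> 4) & 1,
        "fx10": (op >> 3) & 1,
        "abs": (op >> 2) & 1,
        "negate": (op >> 1) & 1,
        "scale_by_two": op & 1,
    }


def decode_operand_d(op: int) -> dict:
    """Decode a 6-bit rD operand."""
    return {
        "selector": "rC" if (op >> 5) & 1 else "rB",
        "subreg": (op >> 4) & 1,
        "fx10_minus_one": (op >> 3) & 1,
        "enable": (op >> 2) & 1,          # scale rC by rD
        "abs": (op >> 1) & 1,
        "fx10": op & 1,
    }


# Register 72 is the fragment x-position; bit 1 negates it.
print(decode_operand_abc((72 << 6) | 0b000010))
```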
Patents:
Value | Meaning |
---|---|
0..15 | row registers |
16..23 | global registers |
24..27 | ALU result registers |
28..30 | embedded constants |
31 | lowp vec2(0, 1) |
32..63 | uniform registers |
64..71 | condition registers |
72 | fragment x-position |
73 | fragment y-position |
75 | polygon face |
76 | kill (discard) fragment |
Patent: https://www.google.com/patents/US8521800
When "accumulate this" bit is set, operand rC is overridden with the accumulation value.
Unit | Accumulate this | Accumulate other |
---|---|---|
ALU0: | ALU0 += ALU3 | No action |
ALU1: | ALU1 += ALU0 | ALU0 += ALU1 |
ALU2: | ALU2 += ALU1 | ALU0 += ALU2 |
ALU3: | ALU3 += ALU2 | ALU2 += ALU3 |
When the "addition disable" bit is set, the Multiply-Add operation turns into two multiplies. The fx10 result of each multiply goes to the low/high subregisters of the destination register.
write high subregister | write low subregister | destination low | destination high |
---|---|---|---|
1 | 0 | rC*rD | rA*rB |
0 | 1 | rA*rB | rC*rD |
The MFU can fetch and interpolate 4-component vectors per instruction and/or evaluate scalar special functions. Based on this design.
Bits | Meaning |
---|---|
58..63 | register |
54..57 | opcode |
43..53 | mul1 |
32..42 | mul0 |
28..31 | ??? |
21..27 | var3 |
14..20 | var2 |
7..13 | var1 |
0..6 | var0 |
opcode | Mnemonic | Meaning | pseudo-code |
---|---|---|---|
0 | NOP | No operation | |
1 | RCP | Reciprocal | rD = 1.0 / rA |
2 | RSQ | Reciprocal square root | rD = 1.0 / sqrt(rA) |
3 | LG2 | Logarithm base 2 | rD = log2(rA) |
4 | EX2 | Exponent base 2, second step | rD = pow(2.0, rA) |
5 | SQRT | Square root | rD = sqrt(rA) |
6 | SIN | Sine, second step | rD = sin(rA) |
7 | COS | Cosine, second step | rD = cos(rA) |
8 | FRC | Fractional value | rD = rA - floor(rA) |
9 | PREEX2 | Exponent base 2, first step | rD = pow(2.0, rA) |
10 | PRESIN | Sine, first step | rD = sin(rA) |
11 | PRECOS | Cosine, first step | rD = cos(rA) |
The varying id corresponds to the TRAM component: var0 is TRAM.x, var1 is TRAM.y, var2 is TRAM.z, var3 is TRAM.w.
Bits | Meaning |
---|---|
3..6 | TRAM index to fetch from |
1..2 | opcode |
0 | saturate |
opcode | Mnemonic | Meaning |
---|---|---|
0 | NOP | No operation |
1 | VAR1 | Interpolate one float20 value |
2 | VAR2 | Interpolate two fixed10 values |
Bits | Meaning |
---|---|
8..10 | destination register |
4..7 | source register 1 |
0..3 | source register 0 |
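The MFU instruction word and its var/mul sub-fields from the tables above can be unpacked as sketched below. The word is assumed to be a single 64-bit integer, and all names are hypothetical.

```python
def field(word: int, lo: int, hi: int) -> int:
    """Extract bits lo..hi (inclusive) from word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_var(v: int) -> dict:
    """Decode a 7-bit varying slot."""
    return {"tram_index": field(v, 3, 6), "opcode": field(v, 1, 2), "saturate": field(v, 0, 0)}


def decode_mul(m: int) -> dict:
    """Decode an 11-bit MUL slot."""
    return {"dst": field(m, 8, 10), "src1": field(m, 4, 7), "src0": field(m, 0, 3)}


def decode_mfu(word: int) -> dict:
    """Split one 64-bit MFU instruction into its fields."""
    return {
        "register": field(word, 58, 63),
        "opcode": field(word, 54, 57),
        "mul1": decode_mul(field(word, 43, 53)),
        "mul0": decode_mul(field(word, 32, 42)),
        # var0..var3 occupy consecutive 7-bit slots starting at bit 0
        "var": [decode_var(field(word, 7 * i, 7 * i + 6)) for i in range(4)],
    }


mul0 = (1 << 8) | (13 << 4) | 10   # dst 1, src1 13, src0 10
var1 = (3 << 3) | (1 << 1) | 1     # TRAM index 3, VAR1, saturate
print(decode_mfu((5 << 58) | (1 << 54) | (mul0 << 32) | (var1 << 7)))
```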
Value | Meaning |
---|---|
0 | ??? |
1 | barycentric weight |
2..3 | ??? |
4 | row register 0 |
5 | row register 1 |
6 | row register 2 |
7 | row register 3 |
The SFU result is evaluated first and is available to the MULs via the "SFU result" register.
Value | Meaning |
---|---|
0..3 | row register 0..3 |
4..7 | ??? (global registers?) |
8..9 | ??? |
10 | SFU result |
11 | barycentric coef 0 |
12 | barycentric coef 1 |
13 | 1.0 |
14..15 | ??? |
The barycentric interpolation weights are the MUL results written to the "barycentric weight" destination register. The weight w0 (related to the first triangle vertex, the "barycentric coef 0" source register) is hardwired to the result of mul0, and the second vertex weight w1 ("barycentric coef 1" source register) to the result of mul1. Again, only the "barycentric weight" destination registers are hardwired; the "barycentric coef" sources are not. The third vertex weight is derived from w0 and w1 as "1.0 - w0 - w1". The SFU operation should be set to "rcp r4".
barycentric weight = barycentric coef * 1.0 / w
The w component is stored in r4 and is available to the first instruction of the scheduled MFU instruction sequence.
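Put together, the weight setup described above amounts to the following arithmetic. This sketch uses plain Python floats and a hypothetical function name; it only mirrors the formulas in the text, not the actual datapath.

```python
def barycentric_weights(coef0: float, coef1: float, w: float) -> tuple[float, float, float]:
    """Compute the three perspective-corrected barycentric weights."""
    rcp_w = 1.0 / w          # SFU: "rcp r4"
    w0 = coef0 * rcp_w       # mul0: barycentric coef 0 * SFU result
    w1 = coef1 * rcp_w       # mul1: barycentric coef 1 * SFU result
    w2 = 1.0 - w0 - w1       # third weight is derived, not computed by a MUL
    return w0, w1, w2


print(barycentric_weights(0.5, 1.0, 2.0))
```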
The z component comes in some form via r3. The fetch operation for r3 should be set to NOP with "saturation" enabled for r3. After r3 has been populated, the following expansion should be performed:
gl_FragCoord.z = 1/1000 + max(0.0, r3.low) * 1/4000 + max(0.0, r3.high) * 1/4
There is also a dependency on the linker: it should perform the "magic" write to the TRAM0.w with "VEC4 select = VEC4.z".
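The gl_FragCoord.z expansion of r3 given above, written out as a function. This is a sketch with plain floats; `r3_low` and `r3_high` stand for the two FX10 halves of r3 already converted to floats, and the function name is hypothetical.

```python
def frag_coord_z(r3_low: float, r3_high: float) -> float:
    """Expand the two halves of r3 into gl_FragCoord.z."""
    return (1.0 / 1000.0
            + max(0.0, r3_low) / 4000.0
            + max(0.0, r3_high) / 4.0)


print(frag_coord_z(0.0, 1.0))
```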
The TEX instruction takes the texture coordinates and LOD bias from the first row of the pixel packet (i.e. R0-R3 registers) and writes the sampled data to that first row as well.
Bits | Meaning |
---|---|
13..31 | ??? |
12 | enable bias |
11 | ??? |
10 | enable |
6..9 | ??? |
5 | sampled data destination regs select |
4 | texcoords / lod regs select |
0..3 | sampler index |
The texture coordinate components (S, T, R) and level-of-detail bias are loaded from the row registers as fp20's.
Value | Meaning (S, T, R, LOD order) |
---|---|
1 | R2, R3, R0, R1 |
0 | R0, R1, R2, R3 |
The sampled RGBA data is stored in the two registers as four fx10's.
Value | Meaning |
---|---|
1 | R2-R3 |
0 | R0-R1 |
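A sketch of decoding the known bits of a TEX instruction word, using the register select tables above. Unknown bits are ignored, the word is assumed to be a 32-bit integer, and the names are hypothetical.

```python
def decode_tex(word: int) -> dict:
    """Decode the documented fields of a 32-bit TEX instruction."""
    coords_select = (word >> 4) & 1
    dst_select = (word >> 5) & 1
    return {
        "enable_bias": bool((word >> 12) & 1),
        "enable": bool((word >> 10) & 1),
        # Registers holding the sampled RGBA data (four fx10 values)
        "dst_regs": "R2-R3" if dst_select else "R0-R1",
        # Registers holding (S, T, R, LOD) in that order
        "coord_regs": ("R2", "R3", "R0", "R1") if coords_select else ("R0", "R1", "R2", "R3"),
        "sampler": word & 0xF,
    }


print(decode_tex((1 << 12) | (1 << 10) | (1 << 5) | 3))
```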
The data write instruction controls writing of the values contained in the row registers R0-R1 / R2-R3 to the destination render target.
Bits | Meaning |
---|---|
16..31 | ??? |
15 | source registers select |
11..14 | ??? |
10 | stencil write |
6..9 | ??? |
2..5 | render target index |
1 | ??? |
0 | enable |
Seems to have no effect when depth or stencil write is enabled.
Value | Meaning |
---|---|
1 | R2-R3 |
0 | R0-R1 |
Some of the render targets have an additional special purpose, like depth/stencil store. They are hardwired and their special purpose is active under certain conditions, like when depth/stencil test is enabled.
Render target | Usage |
---|---|
0 | Depth buffer |
2 | Stencil buffer |
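The documented DW bits can be decoded in the same way. This sketch ignores the unknown bits and assumes a 32-bit instruction word; the names are hypothetical.

```python
def decode_dw(word: int) -> dict:
    """Decode the documented fields of a 32-bit DW instruction."""
    return {
        "source_regs": "R2-R3" if (word >> 15) & 1 else "R0-R1",
        "stencil_write": bool((word >> 10) & 1),
        "render_target": (word >> 2) & 0xF,   # 0 and 2 double as depth / stencil
        "enable": bool(word & 1),
    }


print(decode_dw((1 << 15) | (2 << 2) | 1))
```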
PSEQ stands for Program Sequencer. It fetches raw data from a selected render target, converts that data into FX10 [FP20(?)] format and loads it into registers.
Bits | Meaning |
---|---|
24..31 | ??? |
23 | enable (?) |
20..22 | ??? |
16..19 | render target select |
4..15 | ??? |
3 | enable something (?) |
2 | ??? |
1 | dest registers select |
0 | ??? |
Value | Meaning |
---|---|
1 | R2 - R3 |
0 | R0 - R1 |
XXX: the above is valid for fetching 32bit RGBA8888 into FX10 destination registers.
These seem to be for Tegra:
- US8314803: "Buffering deserialized pixel data in a graphics processor unit pipeline"
- US9183607: "Scoreboard cache coherence in a graphics pipeline"
- US7808512: "Bounding region accumulation for graphics rendering"
- US8441497: "Interpolation of vertex attributes in a graphics processor"
These seem to be for GoForce (based on filing dates):
- US7969446: "Method for operating low power programmable processor"
- US8749576: "Method and system for implementing multiple high precision and low precision interpolators for a graphics pipeline"
- EP1759380: "Low power programmable processor"
- US7298375: "Arithmetic logic units in series in a graphics pipeline"
- US7724263: "System and method for a universal data write unit in a 3-D graphics pipeline including generic cache memories"
- EP1665165: "Pixel processing system and method"
- US8711155: "Early kill removal graphics processing system and method"
- US7199799: "Interleaving of pixels for low power programmable processor"