Geometry Transformation Engine (GTE)

GTE Overview
GTE Registers
GTE Saturation
GTE Opcode Summary
GTE Coordinate Calculation Commands
GTE General Purpose Calculation Commands
GTE Color Calculation Commands
GTE Division Inaccuracy

GTE Overview

GTE Operation

The GTE doesn't have any memory or I/O ports mapped to the CPU memory bus; instead, it is accessed solely via coprocessor opcodes:

  mov  cop0r12,rt          ;-enable/disable COP2 (GTE) via COP0 status register
  mov  cop2r0-63,rt        ;\write parameters to GTE registers
  mov  cop2r0-31,[rs+imm]  ;/
  mov  cop2cmd,imm25       ;-issue GTE command
  mov  rt,cop2r0-63        ;\read results from GTE registers
  mov  [rs+imm],cop2r0-31  ;/
  jt   cop2flg,dest        ;-jump never  ;\implemented (no exception), but,
  jf   cop2flg,dest        ;-jump always ;/flag seems to be always "false"

GTE load and store instructions (possibly only the memory variants?) are said to have a delay of 2 instructions for any GTE commands or operations accessing that register. It is doubtful that this applies to any access; that claim appears to be wrong.
GTE instructions and functions should not be used in:

  - Delay slots of jumps and branches
  - Event handlers or interrupts (sounds like nonsense?) (need push/pop though)

If an instruction that reads a GTE register, or another GTE command, is executed before the current GTE command has finished, the CPU stalls until that command has finished. The number of cycles each GTE command takes is shown in the command list.

GTE Command Encoding (COP2 imm25 opcodes)

  25-31  Must be 0100101b for "COP2 imm25" instructions
  20-24  Fake GTE Command Number (00h..1Fh) (ignored by hardware)
  19     sf - Shift Fraction in IR registers (0=No fraction, 1=12bit fraction)
  17-18  MVMVA Multiply Matrix    (0=Rotation, 1=Light, 2=Color, 3=Reserved)
  15-16  MVMVA Multiply Vector    (0=V0, 1=V1, 2=V2, 3=IR/long)
  13-14  MVMVA Translation Vector (0=TR, 1=BK, 2=FC/Bugged, 3=None)
  11-12  Always zero                        (ignored by hardware)
  10     lm - Saturate IR1,IR2,IR3 result (0=To -8000h..+7FFFh, 1=To 0..+7FFFh)
  6-9    Always zero                        (ignored by hardware)
  0-5    Real GTE Command Number (00h..3Fh) (used by hardware)

The MVMVA bits are used only by the MVMVA opcode (the bits are zero for all other opcodes).
The "sf" and "lm" bits are usually fixed for each command (either set or cleared, depending on the command). For MVMVA the bits are variable, and "sf" can also be changed for some commands like SQR. Although the bits are usually fixed for most other opcodes, changing them might have some effect on some or all opcodes.
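The bit layout above can be illustrated with a small packing/unpacking helper. This is only a sketch: the function names `gte_command` and `decode_gte_command` are made up for illustration, and the values used in the usage note are the command numbers listed elsewhere in this document.

```python
# Bits 25-31 of a "COP2 imm25" instruction (0100101b).
COP2_IMM25_PREFIX = 0b0100101 << 25


def gte_command(real, fake=0, sf=0, lm=0, mx=0, v=0, cv=0):
    """Pack the fields of the table above into a 25bit immediate."""
    return ((fake & 0x1F) << 20    # bits 20-24: fake command number (ignored by hardware)
            | (sf & 1) << 19       # bit 19:     shift fraction
            | (mx & 3) << 17       # bits 17-18: MVMVA matrix
            | (v & 3) << 15        # bits 15-16: MVMVA vector
            | (cv & 3) << 13       # bits 13-14: MVMVA translation vector
            | (lm & 1) << 10       # bit 10:     IR saturation mode
            | (real & 0x3F))       # bits 0-5:   real command number


def decode_gte_command(imm25):
    """Split a 25bit immediate back into its fields."""
    return {
        "fake": (imm25 >> 20) & 0x1F,
        "sf": (imm25 >> 19) & 1,
        "mx": (imm25 >> 17) & 3,
        "v": (imm25 >> 15) & 3,
        "cv": (imm25 >> 13) & 3,
        "lm": (imm25 >> 10) & 1,
        "real": imm25 & 0x3F,
    }
```

For example, `gte_command(0x01, fake=0x01, sf=1)` yields 0180001h (RTPS), and OR-ing that with `COP2_IMM25_PREFIX` gives the full 32bit instruction word.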

GTE Data Register Summary (cop2r0-31)

  cop2r0-1   3xS16 VXY0,VZ0              Vector 0 (X,Y,Z)
  cop2r2-3   3xS16 VXY1,VZ1              Vector 1 (X,Y,Z)
  cop2r4-5   3xS16 VXY2,VZ2              Vector 2 (X,Y,Z)
  cop2r6     4xU8  RGBC                  Color/code value
  cop2r7     1xU16 OTZ                   Average Z value (for Ordering Table)
  cop2r8     1xS16 IR0                   16bit Accumulator (Interpolate)
  cop2r9-11  3xS16 IR1,IR2,IR3           16bit Accumulator (Vector)
  cop2r12-15 6xS16 SXY0,SXY1,SXY2,SXYP   Screen XY-coordinate FIFO  (3 stages)
  cop2r16-19 4xU16 SZ0,SZ1,SZ2,SZ3       Screen Z-coordinate FIFO   (4 stages)
  cop2r20-22 12xU8 RGB0,RGB1,RGB2        Color CRGB-code/color FIFO (3 stages)
  cop2r23    4xU8  (RES1)                Prohibited
  cop2r24    1xS32 MAC0                  32bit Maths Accumulators (Value)
  cop2r25-27 3xS32 MAC1,MAC2,MAC3        32bit Maths Accumulators (Vector)
  cop2r28-29 1xU15 IRGB,ORGB             Convert RGB Color (48bit vs 15bit)
  cop2r30-31 2xS32 LZCS,LZCR             Count Leading-Zeroes/Ones (sign bits)

GTE Control Register Summary (cop2r32-63)

  cop2r32-36 9xS16 RT11RT12,..,RT33 Rotation matrix     (3x3)        ;cnt0-4
  cop2r37-39 3x 32 TRX,TRY,TRZ      Translation vector  (X,Y,Z)      ;cnt5-7
  cop2r40-44 9xS16 L11L12,..,L33    Light source matrix (3x3)        ;cnt8-12
  cop2r45-47 3x 32 RBK,GBK,BBK      Background color    (R,G,B)      ;cnt13-15
  cop2r48-52 9xS16 LR1LR2,..,LB3    Light color matrix source (3x3)  ;cnt16-20
  cop2r53-55 3x 32 RFC,GFC,BFC      Far color           (R,G,B)      ;cnt21-23
  cop2r56-57 2x 32 OFX,OFY          Screen offset       (X,Y)        ;cnt24-25
  cop2r58 BuggyU16 H                Projection plane distance.       ;cnt26
  cop2r59      S16 DQA              Depth cueing parameter A (coeff) ;cnt27
  cop2r60       32 DQB              Depth cueing parameter B (offset);cnt28
  cop2r61-62 2xS16 ZSF3,ZSF4        Average Z scale factors          ;cnt29-30
  cop2r63      U20 FLAG             Returns any calculation errors   ;cnt31

GTE Registers

Note: In some functions, the format is different from the one given here.

Matrix Registers

  Rotation matrix (RT)   Light matrix (LLM)     Light Color matrix (LCM)
  cop2r32.lsbs=RT11      cop2r40.lsbs=L11       cop2r48.lsbs=LR1
  cop2r32.msbs=RT12      cop2r40.msbs=L12       cop2r48.msbs=LR2
  cop2r33.lsbs=RT13      cop2r41.lsbs=L13       cop2r49.lsbs=LR3
  cop2r33.msbs=RT21      cop2r41.msbs=L21       cop2r49.msbs=LG1
  cop2r34.lsbs=RT22      cop2r42.lsbs=L22       cop2r50.lsbs=LG2
  cop2r34.msbs=RT23      cop2r42.msbs=L23       cop2r50.msbs=LG3
  cop2r35.lsbs=RT31      cop2r43.lsbs=L31       cop2r51.lsbs=LB1
  cop2r35.msbs=RT32      cop2r43.msbs=L32       cop2r51.msbs=LB2
  cop2r36     =RT33      cop2r44     =L33       cop2r52     =LB3

Each element is 16bit (1bit sign, 3bit integer, 12bit fraction). Reading the last elements (RT33,L33,LB3) returns the 16bit value sign-expanded to 32bit.

Translation Vector (TR) (Input, R/W?)

  cop2r37 (cnt5) - TRX - Translation vector X (R/W?)
  cop2r38 (cnt6) - TRY - Translation vector Y (R/W?)
  cop2r39 (cnt7) - TRZ - Translation vector Z (R/W?)

Each element is 32bit (1bit sign, 31bit integer).
Used only for MVMVA, RTPS, RTPT commands.

Background Color (BK) (Input?, R/W?)

  cop2r45 (cnt13) - RBK - Background color red component
  cop2r46 (cnt14) - GBK - Background color green component
  cop2r47 (cnt15) - BBK - Background color blue component

Each element is 32bit (1bit sign, 19bit integer, 12bit fraction).

Far Color (FC) (Input?) (R/W?)

  cop2r53 (cnt21) - RFC - Far color red component
  cop2r54 (cnt22) - GFC - Far color green component
  cop2r55 (cnt23) - BFC - Far color blue component

Each element is 32bit (1bit sign, 27bit integer, 4bit fraction).

Screen Offset and Distance (Input, R/W?)

  cop2r56 (cnt24) - OFX - Screen offset X
  cop2r57 (cnt25) - OFY - Screen offset Y
  cop2r58 (cnt26) - H   - Projection plane distance
  cop2r59 (cnt27) - DQA - Depth cueing parameter A.(coeff.)
  cop2r60 (cnt28) - DQB - Depth cueing parameter B.(offset.)

The X and Y values are each 32bit (1bit sign, 15bit integer, 16bit fraction).
The H value is 16bit unsigned (0bit sign, 16bit integer, 0bit fraction). BUG: When reading the H register, the hardware accidentally <sign-expands> the <unsigned> 16bit value (ie. values +8000h..+FFFFh are returned as FFFF8000h..FFFFFFFFh) (this bug applies only to "mov rd,cop2r58" opcodes; the actual calculations via RTPS/RTPT opcodes work correctly).
The DQA value is only 16bit (1bit sign, 7bit integer, 8bit fraction).
The DQB value is 32bit (1bit sign, 7bit integer, 24bit? fraction).
Used only for RTPS/RTPT commands.

Average Z Registers (ZSF3/ZSF4=Input, R/W?) (OTZ=Result, R)

  cop2r61 (cnt29) ZSF3 |  0|ZSF3 1,3,12| Z3 average scale factor (normally 1/3)
  cop2r62 (cnt30) ZSF4 |  0|ZSF4 1,3,12| Z4 average scale factor (normally 1/4)
  cop2r7       OTZ (R) |   |OTZ 0,15, 0| Average Z value (for Ordering Table)

Used only for AVSZ3/AVSZ4 commands.

Screen XYZ Coordinate FIFOs

  cop2r12 - SXY0  rw|SY0 1,15, 0|SX0 1,15, 0| Screen XY fifo (older)
  cop2r13 - SXY1  rw|SY1 1,15, 0|SX1 1,15, 0| Screen XY fifo (old)
  cop2r14 - SXY2  rw|SY2 1,15, 0|SX2 1,15, 0| Screen XY fifo (new)
  cop2r15 - SXYP  rw|SYP 1,15, 0|SXP 1,15, 0| SXY2-mirror with move-on-write
  cop2r16 - SZ0   rw|          0|SZ0 0,16, 0| Screen Z fifo (oldest)
  cop2r17 - SZ1   rw|          0|SZ1 0,16, 0| Screen Z fifo (older)
  cop2r18 - SZ2   rw|          0|SZ2 0,16, 0| Screen Z fifo (old)
  cop2r19 - SZ3   rw|          0|SZ3 0,16, 0| Screen Z fifo (new)

SX,SY,SZ are used as Output for RTPS/RTPT. Additionally, SX,SY are used as Input for NCLIP, and SZ is used as Input for AVSZ3/AVSZ4.
The SZn Fifo has 4 stages (required for AVSZ4 command), the SXYn Fifo has only 3 stages, and a special mirrored register: SXYP is a mirror of SXY2, the difference is that writing to SXYP moves SXY2/SXY1 to SXY1/SXY0, whilst writing to SXY2 (or any other SXYn or SZn registers) changes only the written register, but doesn't move any other Fifo entries.
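The move-on-write behaviour of SXYP can be modelled in a few lines. This is a minimal sketch; the class name `ScreenXYFifo` and its methods are hypothetical, and only the register numbers (cop2r12..cop2r15) come from the table above.

```python
class ScreenXYFifo:
    """Toy model of the SXY0/SXY1/SXY2 registers plus the SXYP mirror."""

    def __init__(self):
        self.sxy = [0, 0, 0]  # [SXY0, SXY1, SXY2]

    def write(self, reg, value):
        """reg is the cop2 register number: 12..14 = SXY0..SXY2, 15 = SXYP."""
        value &= 0xFFFFFFFF
        if reg == 15:
            # Writing SXYP moves SXY1->SXY0, SXY2->SXY1, then stores to SXY2.
            self.sxy = [self.sxy[1], self.sxy[2], value]
        else:
            # Writing SXY0..SXY2 changes only the written register.
            self.sxy[reg - 12] = value

    def read(self, reg):
        """Reading SXYP returns SXY2 (it is a mirror)."""
        return self.sxy[2] if reg == 15 else self.sxy[reg - 12]
```

Writing 1, 2, 3 to cop2r12..14 and then 4 to cop2r15 leaves the Fifo holding 2, 3, 4; a following write to cop2r14 replaces only the newest entry.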

16bit Vectors (R/W)

  Vector 0 (V0)         Vector 1 (V1)       Vector 2 (V2)       Vector 3 (IR)
  cop2r0.lsbs - VX0     cop2r2.lsbs - VX1   cop2r4.lsbs - VX2   cop2r9  - IR1
  cop2r0.msbs - VY0     cop2r2.msbs - VY1   cop2r4.msbs - VY2   cop2r10 - IR2
  cop2r1      - VZ0     cop2r3      - VZ1   cop2r5      - VZ2   cop2r11 - IR3

All elements are signed 16bit. The IRn and VZn elements occupy a whole 32bit register, reading these registers returns the 16bit value sign-expanded to 32bit. Note: IRn can be also indirectly accessed via IRGB/ORGB registers.

Color Register and Color FIFO

  cop2r6  - RGBC  rw|CODE |B    |G    |R    | Color/code
  cop2r20 - RGB0  rw|CD0  |B0   |G0   |R0   | Characteristic color fifo.
  cop2r21 - RGB1  rw|CD1  |B1   |G1   |R1   |
  cop2r22 - RGB2  rw|CD2  |B2   |G2   |R2   |
  cop2r23 - (RES1)  |                       | Prohibited

RES1 seems to be unused; it looks like an unused Fifo stage. RES1 is read/write-able. Unlike SXYP (for the SXYn Fifo), it does not mirror to RGB2, nor does it have a move-on-write function.

Interpolation Factor

  cop2r8   IR0   rw|Sign       |IR0 1, 3,12| Intermediate value 0.

Used as Output for RTPS/RTPT, and as Input for various commands.

XX...

  cop2r24  MAC0  rw|MAC0 1,31,0            | Sum of products value 0

XX...

  cop2r25  MAC1  rw|MAC1 1,31,0            | Sum of products value 1
  cop2r26  MAC2  rw|MAC2 1,31,0            | Sum of products value 2
  cop2r27  MAC3  rw|MAC3 1,31,0            | Sum of products value 3

cop2r28 - IRGB - Color conversion Input (R/W)

Expands 5:5:5 bit RGB (range 0..1Fh) to 16:16:16 bit RGB (range 0000h..0F80h).

  0-4    Red   (0..1Fh) (R/W)  ;multiplied by 80h, and written to IR1
  5-9    Green (0..1Fh) (R/W)  ;multiplied by 80h, and written to IR2
  10-14  Blue  (0..1Fh) (R/W)  ;multiplied by 80h, and written to IR3
  15-31  Not used (always zero) (Read only)

After writing to IRGB, the result can be read from IR3 after TWO NOPs, and from IR1,IR2 after THREE NOPs (for uncached code, ONE NOP would work). When using IR1,IR2,IR3 as parameters for GTE commands, similar timing restrictions might apply, depending on when the specific commands use the parameters.

cop2r29 - ORGB - Color conversion Output (R)

Collapses 16:16:16 bit RGB (range 0000h..0F80h) to 5:5:5 bit RGB (range 0..1Fh). Negative values (8000h..FFFFh/80h) are saturated to 00h, large positive values (1000h..7FFFh/80h) are saturated to 1Fh, there are no overflow or saturation flags set in cop2r63 though.

  0-4    Red   (0..1Fh) (R)  ;IR1 divided by 80h, saturated to +00h..+1Fh
  5-9    Green (0..1Fh) (R)  ;IR2 divided by 80h, saturated to +00h..+1Fh
  10-14  Blue  (0..1Fh) (R)  ;IR3 divided by 80h, saturated to +00h..+1Fh
  15-31  Not used (always zero) (Read only)

Any changes to IR1,IR2,IR3 are reflected to this register (and, actually also to IRGB) (ie. ORGB is simply a read-only mirror of IRGB).
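The IRGB/ORGB conversions described above amount to a multiply by 80h in one direction and a saturating divide by 80h in the other. The sketch below assumes IR1,IR2,IR3 are passed as signed Python integers; the function names are illustrative only, and the timing restrictions are not modelled.

```python
def write_irgb(irgb):
    """Expand a 15bit 5:5:5 color into (IR1, IR2, IR3)."""
    red = irgb & 0x1F
    green = (irgb >> 5) & 0x1F
    blue = (irgb >> 10) & 0x1F
    return (red * 0x80, green * 0x80, blue * 0x80)


def read_orgb(ir1, ir2, ir3):
    """Collapse signed 16bit IR1,IR2,IR3 into a 15bit 5:5:5 color."""
    def component(ir):
        # Divide by 80h (arithmetic shift), then saturate to 00h..1Fh.
        return max(0x00, min(0x1F, ir >> 7))
    return component(ir1) | component(ir2) << 5 | component(ir3) << 10
```

Negative inputs collapse to 00h and inputs of 1000h or more collapse to 1Fh, matching the saturation rule stated above.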

cop2r30 - LZCS - Count Leading Bits Source data (R/W)

cop2r31 - LZCR - Count Leading Bits Result (R)

Reading LZCR returns the leading 0 count of LZCS if LZCS is positive and the leading 1 count of LZCS if LZCS is negative. The results are in range 1..32.
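As a sketch of the rule above, the leading-bit count can be computed by inverting negative inputs and then counting leading zeroes. The function name `lzcr` is used here only for illustration.

```python
def lzcr(lzcs):
    """Count leading zeroes (positive input) or leading ones (negative input)."""
    value = lzcs & 0xFFFFFFFF
    if value & 0x80000000:
        # Negative: invert so that leading ones become leading zeroes.
        value = ~value & 0xFFFFFFFF
    # bit_length() is 0 for value 0, which yields the maximum count of 32.
    return 32 - value.bit_length()
```

Both 00000000h and FFFFFFFFh give 32, and any value whose top two bits differ gives 1.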

cop2r63 (cnt31) - FLAG - Returns any calculation errors.

See GTE Saturation chapter.

GTE Saturation

Maths overflows are indicated in the FLAG register. In most cases, the result is saturated to MIN/MAX values (except MAC0,MAC1,MAC2,MAC3 which aren't saturated). For IR1,IR2,IR3 many commands allow selecting the MIN value via the "lm" bit of the GTE opcode (though not all commands; RTPS/RTPT always act as if lm=0).

cop2r63 (cnt31) - FLAG - Returns any calculation errors.

  31   Error Flag (Bit30..23, and 18..13 ORed together) (Read only)
  30   MAC1 Result larger than 43 bits and positive
  29   MAC2 Result larger than 43 bits and positive
  28   MAC3 Result larger than 43 bits and positive
  27   MAC1 Result larger than 43 bits and negative
  26   MAC2 Result larger than 43 bits and negative
  25   MAC3 Result larger than 43 bits and negative
  24   IR1 saturated to +0000h..+7FFFh (lm=1) or to -8000h..+7FFFh (lm=0)
  23   IR2 saturated to +0000h..+7FFFh (lm=1) or to -8000h..+7FFFh (lm=0)
  22   IR3 saturated to +0000h..+7FFFh (lm=1) or to -8000h..+7FFFh (lm=0)
  21   Color-FIFO-R saturated to +00h..+FFh
  20   Color-FIFO-G saturated to +00h..+FFh
  19   Color-FIFO-B saturated to +00h..+FFh
  18   SZ3 or OTZ saturated to +0000h..+FFFFh
  17   Divide overflow. RTPS/RTPT division result saturated to max=1FFFFh
  16   MAC0 Result larger than 31 bits and positive
  15   MAC0 Result larger than 31 bits and negative
  14   SX2 saturated to -0400h..+03FFh
  13   SY2 saturated to -0400h..+03FFh
  12   IR0 saturated to +0000h..+1000h
  0-11 Not used (always zero) (Read only)

Bit30-12 are read/write-able, ie. they can be set/reset by software, however, that's normally not required - all bits are automatically reset at the beginning of a new GTE command.
Bit31 is apparently intended for RTPS/RTPT commands, since it triggers only on flags that are affected by these two commands, but even for those commands it's totally useless since one could as well check if FLAG is nonzero.
Note: Writing 32bit values to 16bit GTE registers by software does not trigger any overflow/saturation flags (and does not do any saturation), eg. writing 12008900h (positive 32bit) to a signed 16bit register sets that register to FFFF8900h (negative 16bit).
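The two rules above (how Bit31 is derived, and how a 32bit write to a signed 16bit register is truncated) can be sketched as follows. The helper names and the `ERROR_MASK` constant are assumptions made for this example; the mask is simply bits 30..23 and 18..13 from the table.

```python
# Bits 30..23 and 18..13, which are ORed together to form Bit31.
ERROR_MASK = 0x7F800000 | 0x0007E000


def finalize_flag(flag):
    """Keep the writable bits 30-12 and derive the read-only Bit31."""
    flag &= 0x7FFFF000
    if flag & ERROR_MASK:
        flag |= 0x80000000
    return flag


def write_s16_register(value):
    """Model a 32bit write to a signed 16bit register: no saturation,
    only the lower 16 bits are kept and then sign-expanded."""
    low = value & 0xFFFF
    return low - 0x10000 if low & 0x8000 else low
```

For instance, a divide overflow (Bit17) also raises Bit31, whereas an IR3 saturation (Bit22) or IR0 saturation (Bit12) does not.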

GTE Opcode Summary

GTE Command Summary (sorted by Real Opcode bits) (bit0-5)

  Opc  Name   Clk Expl.
  00h  -          N/A (modifies similar registers as RTPS...)
  01h  RTPS   15  Perspective Transformation single
  0xh  -          N/A
  06h  NCLIP  8   Normal clipping
  0xh  -          N/A
  0Ch  OP(sf) 6   Outer product of 2 vectors
  0xh  -          N/A
  10h  DPCS   8   Depth Cueing single
  11h  INTPL  8   Interpolation of a vector and far color vector
  12h  MVMVA  8   Multiply vector by matrix and add vector (see below)
  13h  NCDS   19  Normal color depth cue single vector
  14h  CDP    13  Color Depth Que
  15h  -          N/A
  16h  NCDT   44  Normal color depth cue triple vectors
  1xh  -          N/A
  1Bh  NCCS   17  Normal Color Color single vector
  1Ch  CC     11  Color Color
  1Dh  -          N/A
  1Eh  NCS    14  Normal color single
  1Fh  -          N/A
  20h  NCT    30  Normal color triple
  2xh  -          N/A
  28h  SQR(sf)5   Square of vector IR
  29h  DCPL   8   Depth Cue Color light
  2Ah  DPCT   17  Depth Cueing triple (should be fake=08h, but isn't)
  2xh  -          N/A
  2Dh  AVSZ3  5   Average of three Z values
  2Eh  AVSZ4  6   Average of four Z values
  2Fh  -          N/A
  30h  RTPT   23  Perspective Transformation triple
  3xh  -          N/A
  3Dh  GPF(sf)5   General purpose interpolation
  3Eh  GPL(sf)5   General purpose interpolation with base
  3Fh  NCCT   39  Normal Color Color triple vector

Unknown if/what happens when using the "N/A" opcodes?

GTE Command Summary (sorted by Fake Opcode bits) (bit20-24)

The fake opcode number in bit20-24 has absolutely no effect on the hardware, it seems to be solely used to (or not to) confuse developers. Having the opcodes sorted by their fake numbers gives a more or less well arranged list:

  Fake Name   Clk Expl.
  00h  -          N/A
  01h  RTPS   15  Perspective Transformation single
  02h  RTPT   23  Perspective Transformation triple
  03h  -          N/A
  04h  MVMVA  8   Multiply vector by matrix and add vector (see below)
  05h  -          N/A
  06h  DCPL   8   Depth Cue Color light
  07h  DPCS   8   Depth Cueing single
  08h  DPCT   17  Depth Cueing triple (should be fake=08h, but isn't)
  09h  INTPL  8   Interpolation of a vector and far color vector
  0Ah  SQR(sf)5   Square of vector IR
  0Bh  -          N/A
  0Ch  NCS    14  Normal color single
  0Dh  NCT    30  Normal color triple
  0Eh  NCDS   19  Normal color depth cue single vector
  0Fh  NCDT   44  Normal color depth cue triple vectors
  10h  NCCS   17  Normal Color Color single vector
  11h  NCCT   39  Normal Color Color triple vector
  12h  CDP    13  Color Depth Que
  13h  CC     11  Color Color
  14h  NCLIP  8   Normal clipping
  15h  AVSZ3  5   Average of three Z values
  16h  AVSZ4  6   Average of four Z values
  17h  OP(sf) 6   Outer product of 2 vectors
  18h  -          N/A
  19h  GPF(sf)5   General purpose interpolation
  1Ah  GPL(sf)5   General purpose interpolation with base
  1Bh  -          N/A
  1Ch  -          N/A
  1Dh  -          N/A
  1Eh  -          N/A
  1Fh  -          N/A

For the sort-effect, DPCT should use fake=08h, but Sony seems to have accidentally numbered it fake=0Fh in their devkit (giving it the same fake number as for NCDT). Also, "Wipeout 2097" accidentally uses 0140006h (fake=01h and distorted bit18) instead of 1400006h (fake=14h) for NCLIP.

Additional Functions

The LZCS/LZCR registers offer a Count-Leading-Zeroes/Leading-Ones function.
The IRGB/ORGB registers allow converting between 48bit and 15bit RGB colors.
These registers work without needing to send any COP2 commands. However, unlike for commands (which do automatically halt the CPU when needed), one must insert dummy opcodes between writing and reading the registers.

GTE Coordinate Calculation Commands

COP2 0180001h - 15 Cycles - RTPS - Perspective Transformation (single)

COP2 0280030h - 23 Cycles - RTPT - Perspective Transformation (triple)

RTPS performs final Rotate, translate and perspective transformation on vertex V0. Before writing to the FIFOs, the older entries are moved one stage down. RTPT is same as RTPS, but repeats for V1 and V2. The "sf" bit should be usually set.

  IR1 = MAC1 = (TRX*1000h + RT11*VX0 + RT12*VY0 + RT13*VZ0) SAR (sf*12)
  IR2 = MAC2 = (TRY*1000h + RT21*VX0 + RT22*VY0 + RT23*VZ0) SAR (sf*12)
  IR3 = MAC3 = (TRZ*1000h + RT31*VX0 + RT32*VY0 + RT33*VZ0) SAR (sf*12)
  SZ3 = MAC3 SAR ((1-sf)*12)                           ;ScreenZ FIFO 0..+FFFFh
  MAC0=(((H*20000h/SZ3)+1)/2)*IR1+OFX, SX2=MAC0/10000h ;ScrX FIFO -400h..+3FFh
  MAC0=(((H*20000h/SZ3)+1)/2)*IR2+OFY, SY2=MAC0/10000h ;ScrY FIFO -400h..+3FFh
  MAC0=(((H*20000h/SZ3)+1)/2)*DQA+DQB, IR0=MAC0/1000h  ;Depth cueing 0..+1000h

If the result of the "(((H*20000h/SZ3)+1)/2)" division is greater than 1FFFFh, then the division result is saturated to +1FFFFh, and the divide overflow bit in the FLAG register gets set; that happens if the vertex is exceeding the "near clip plane", ie. if it is very close to the camera (SZ3<=H/2), exactly at the camera position (SZ3=0), or behind the camera (negative Z coordinates are saturated to SZ3=0). For details on the division, see:
GTE Division Inaccuracy
For "far plane clipping", one can use the SZ3 saturation flag (MaxZ=FFFFh), or the IR3 saturation flag (MaxZ=7FFFh) (eg. used by Wipeout 2097), or one can compare the SZ3 value with any desired MaxZ value by software.
Note: The command does saturate IR1,IR2,IR3 to -8000h..+7FFFh (regardless of lm bit). When using RTP with sf=0, then the IR3 saturation flag (FLAG.22) gets set <only> if "MAC3 SAR 12" exceeds -8000h..+7FFFh (although IR3 is saturated when "MAC3" exceeds -8000h..+7FFFh).
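The RTPS formulas above can be followed step by step in a simplified model. This sketch makes several assumptions: it uses the exact division formula rather than the hardware's UNR approximation, it treats "/10000h" and "/1000h" as arithmetic right shifts, and it ignores FLAG bits, FIFO movement and MAC overflow. The function names `clamp` and `rtps`, and the parameter names, are illustrative only.

```python
def clamp(value, lo, hi):
    return max(lo, min(hi, value))


def rtps(v, rt, tr, h, ofx, ofy, dqa, dqb, sf=1):
    """v=(VX0,VY0,VZ0), rt=3x3 rows of RT, tr=(TRX,TRY,TRZ).
    Returns (SX2, SY2, SZ3, IR0)."""
    # MACn = (TRn*1000h + RTn1*VX0 + RTn2*VY0 + RTn3*VZ0) SAR (sf*12)
    mac = [(tr[i] * 0x1000 + sum(rt[i][j] * v[j] for j in range(3))) >> (sf * 12)
           for i in range(3)]
    # IR1..IR3 are saturated to -8000h..+7FFFh regardless of lm.
    ir = [clamp(m, -0x8000, 0x7FFF) for m in mac]
    sz3 = clamp(mac[2] >> ((1 - sf) * 12), 0, 0xFFFF)
    if h < sz3 * 2:
        n = min(0x1FFFF, ((h * 0x20000 // sz3) + 1) // 2)
    else:
        n = 0x1FFFF  # divide overflow (vertex at or beyond the near clip plane)
    sx2 = clamp((n * ir[0] + ofx) >> 16, -0x400, 0x3FF)
    sy2 = clamp((n * ir[1] + ofy) >> 16, -0x400, 0x3FF)
    ir0 = clamp((n * dqa + dqb) >> 12, 0, 0x1000)
    return sx2, sy2, sz3, ir0
```

With an identity rotation matrix (diagonal 1000h), H=500, and a vertex at (100,50,1000), the division factor is 8000h (0.5), so the vertex lands 50 pixels right and 25 pixels below the screen offset.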

COP2 1400006h - 8 Cycles - NCLIP - Normal clipping

  MAC0 =   SX0*SY1 + SX1*SY2 + SX2*SY0 - SX0*SY2 - SX1*SY0 - SX2*SY1

The sign of the result indicates whether the polygon coordinates are arranged clockwise or anticlockwise (ie. whether the front side or backside is visible). If the result is zero, then it's neither one (ie. the vertices are all arranged in a straight line). Note: The GPU probably renders straight lines as invisible 0 pixel width lines?
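The NCLIP formula is small enough to check by hand. The following sketch evaluates it for three screen coordinates given as (SX,SY) pairs; the function name `nclip` is only illustrative, and FLAG handling for MAC0 overflow is left out.

```python
def nclip(sxy0, sxy1, sxy2):
    """Return MAC0 for three screen coordinates given as (SX, SY) tuples."""
    (sx0, sy0), (sx1, sy1), (sx2, sy2) = sxy0, sxy1, sxy2
    return (sx0 * sy1 + sx1 * sy2 + sx2 * sy0
            - sx0 * sy2 - sx1 * sy0 - sx2 * sy1)
```

Swapping any two vertices flips the sign of the result, and three vertices on one straight line give zero.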

COP2 158002Dh - 5 Cycles - AVSZ3 - Average of three Z values (for Triangles)

COP2 168002Eh - 6 Cycles - AVSZ4 - Average of four Z values (for Quads)

  MAC0 =  ZSF3*(SZ1+SZ2+SZ3)       ;for AVSZ3
  MAC0 =  ZSF4*(SZ0+SZ1+SZ2+SZ3)   ;for AVSZ4
  OTZ  =  MAC0/1000h               ;for both (saturated to 0..FFFFh)

Adds three or four Z values together and multiplies them by a fixed point value. The result can be used as an index into the GPU's Ordering Table (OT).
GPU Depth Ordering
The scaling factors would be usually ZSF3=N/30h and ZSF4=N/40h, where "N" is the number of entries in the OT (max 10000h). SZn and OTZ are unsigned 16bit values, for whatever reason ZSFn registers are signed 16bit values (negative values would allow a negative result in MAC0, but would saturate OTZ to zero).
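A minimal sketch of the AVSZ3/AVSZ4 calculation follows, assuming "MAC0/1000h" behaves as an arithmetic right shift by 12; the single helper `avsz` (a made-up name) covers both commands, depending on how many SZ values are passed in.

```python
def avsz(zsf, sz_values):
    """zsf = ZSF3 or ZSF4 (signed 16bit), sz_values = 3 or 4 unsigned SZ entries.
    Returns (MAC0, OTZ)."""
    mac0 = zsf * sum(sz_values)
    otz = max(0, min(0xFFFF, mac0 >> 12))  # saturated to 0..FFFFh
    return mac0, otz
```

With ZSF4=400h (1/4 in 1.3.12 format), four Z values of 1000 average to OTZ=1000. With an OT of N=1000h entries and ZSF3=N/30h, even the largest Z values stay below N.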

GTE General Purpose Calculation Commands

COP2 0400012h - 8 Cycles - MVMVA(sf,mx,v,cv,lm)

Multiply vector by matrix and vector addition.

  Mx = matrix specified by mx  ;RT/LLM/LCM - Rotation, light or color matrix
  Vx = vector specified by v   ;V0, V1, V2, or [IR1,IR2,IR3]
  Tx = translation vector specified by cv  ;TR or BK or Bugged/FC, or None

Calculation:

  MAC1 = (Tx1*1000h + Mx11*Vx1 + Mx12*Vx2 + Mx13*Vx3) SAR (sf*12)
  MAC2 = (Tx2*1000h + Mx21*Vx1 + Mx22*Vx2 + Mx23*Vx3) SAR (sf*12)
  MAC3 = (Tx3*1000h + Mx31*Vx1 + Mx32*Vx2 + Mx33*Vx3) SAR (sf*12)
  [IR1,IR2,IR3] = [MAC1,MAC2,MAC3]

Multiplies a vector with either the rotation matrix, the light matrix or the color matrix and then adds the translation vector or background color vector.
The GTE also allows selection of the far color vector (FC), but this vector is not added correctly by the hardware: The return values are reduced to the last portion of the formula, ie. MAC1=(Mx13*Vx3) SAR (sf*12), and similar for MAC2 and MAC3, nevertheless, some bits in the FLAG register seem to be adjusted as if the full operation would have been executed. Setting Mx=3 selects a garbage matrix (with elements -60h, +60h, IR0, RT13, RT13, RT13, RT22, RT22, RT22).
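For the well-behaved parameter combinations, the MVMVA calculation can be sketched as below. This is not a complete model: the bugged FC case, the garbage matrix for Mx=3, FLAG bits and MAC overflow are all left out, and the function name `mvmva` and its argument layout are assumptions made for this example.

```python
def mvmva(matrix, vector, translation=(0, 0, 0), sf=1, lm=0):
    """matrix = 3 rows of 3 signed 16bit elements, vector = (Vx1,Vx2,Vx3),
    translation = (Tx1,Tx2,Tx3). Returns ([MAC1,MAC2,MAC3], [IR1,IR2,IR3])."""
    mac = []
    for row, t in zip(matrix, translation):
        total = t * 0x1000 + sum(m * x for m, x in zip(row, vector))
        mac.append(total >> (sf * 12))
    # lm selects the lower saturation limit for IR1..IR3.
    lower = 0 if lm else -0x8000
    ir = [max(lower, min(0x7FFF, m)) for m in mac]
    return mac, ir
```

With an identity matrix (diagonal 1000h) and sf=1, the result is simply the input vector plus the translation vector.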

COP2 0A00428h+sf*80000h - 5 Cycles - SQR(sf) - Square vector

  [MAC1,MAC2,MAC3] = [IR1*IR1,IR2*IR2,IR3*IR3] SHR (sf*12)
  [IR1,IR2,IR3]    = [MAC1,MAC2,MAC3]    ;IR1,IR2,IR3 saturated to max 7FFFh

Calculates the square of a vector. The result is, of course, always positive, so the "lm" flag for negative saturation has no effect.

COP2 170000Ch+sf*80000h - 6 Cycles - OP(sf,lm) - Outer product of 2 vectors

  [MAC1,MAC2,MAC3] = [IR3*D2-IR2*D3, IR1*D3-IR3*D1, IR2*D1-IR1*D2] SAR (sf*12)
  [IR1,IR2,IR3]    = [MAC1,MAC2,MAC3]                        ;copy result

Calculates the outer product of two signed 16bit vectors. Note: D1,D2,D3 are meant to be the RT11,RT22,RT33 elements of the RT matrix "misused" as vector. lm should be usually zero.
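The OP formula can be written out directly. In this sketch, `d` stands for the (D1,D2,D3) values, ie. the RT11,RT22,RT33 diagonal, passed in as a plain tuple; the function name `op` is illustrative, and IR saturation and FLAG bits are not modelled.

```python
def op(ir, d, sf=0):
    """ir = (IR1,IR2,IR3), d = (D1,D2,D3). Returns [MAC1,MAC2,MAC3]."""
    ir1, ir2, ir3 = ir
    d1, d2, d3 = d
    shift = sf * 12
    return [(ir3 * d2 - ir2 * d3) >> shift,
            (ir1 * d3 - ir3 * d1) >> shift,
            (ir2 * d1 - ir1 * d2) >> shift]
```

Following the signs in the formula, the result corresponds to the cross product D x IR: with D=(1,0,0) and IR=(0,1,0) the output is (0,0,1).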

LZCS/LZCR registers - ? Cycles - Count-Leading-Zeroes/Leading-Ones

The LZCS/LZCR registers offer a Count-Leading-Zeroes/Leading-Ones function.

GTE Color Calculation Commands

COP2 0C8041Eh - 14 Cycles - NCS - Normal color (single)

COP2 0D80420h - 30 Cycles - NCT - Normal color (triple)

COP2 108041Bh - 17 Cycles - NCCS - Normal Color Color (single vector)

COP2 118043Fh - 39 Cycles - NCCT - Normal Color Color (triple vector)

COP2 0E80413h - 19 Cycles - NCDS - Normal color depth cue (single vector)

COP2 0F80416h - 44 Cycles - NCDT - Normal color depth cue (triple vectors)

In: V0=Normal vector (for triple variants repeated with V1 and V2), BK=Background color, RGBC=Primary color/code, LLM=Light matrix, LCM=Color matrix, IR0=Interpolation value.

  [IR1,IR2,IR3] = [MAC1,MAC2,MAC3] = (LLM*V0) SAR (sf*12)
  [IR1,IR2,IR3] = [MAC1,MAC2,MAC3] = (BK*1000h + LCM*IR) SAR (sf*12)
  [MAC1,MAC2,MAC3] = [R*IR1,G*IR2,B*IR3] SHL 4          ;<--- for NCDx/NCCx
  [MAC1,MAC2,MAC3] = MAC+(FC-MAC)*IR0                   ;<--- for NCDx only
  [MAC1,MAC2,MAC3] = [MAC1,MAC2,MAC3] SAR (sf*12)       ;<--- for NCDx/NCCx
  Color FIFO = [MAC1/16,MAC2/16,MAC3/16,CODE], [IR1,IR2,IR3] = [MAC1,MAC2,MAC3]

COP2 138041Ch - 11 Cycles - CC(lm=1) - Color Color

COP2 1280414h - 13 Cycles - CDP(...) - Color Depth Que

In: [IR1,IR2,IR3]=Vector, RGBC=Primary color/code, LCM=Color matrix, BK=Background color, and, for CDP, IR0=Interpolation value, FC=Far color.

  [IR1,IR2,IR3] = [MAC1,MAC2,MAC3] = (BK*1000h + LCM*IR) SAR (sf*12)
  [MAC1,MAC2,MAC3] = [R*IR1,G*IR2,B*IR3] SHL 4
  [MAC1,MAC2,MAC3] = MAC+(FC-MAC)*IR0                   ;<--- for CDP only
  [MAC1,MAC2,MAC3] = [MAC1,MAC2,MAC3] SAR (sf*12)
  Color FIFO = [MAC1/16,MAC2/16,MAC3/16,CODE], [IR1,IR2,IR3] = [MAC1,MAC2,MAC3]

COP2 0680029h - 8 Cycles - DCPL - Depth Cue Color light

COP2 0780010h - 8 Cycles - DPCS - Depth Cueing (single)

COP2 0x8002Ah - 17 Cycles - DPCT - Depth Cueing (triple)

COP2 0980011h - 8 Cycles - INTPL - Interpolation of a vector and far color

In: [IR1,IR2,IR3]=Vector, FC=Far Color, IR0=Interpolation value, CODE=MSB of RGBC, and, for DCPL, R,G,B=LSBs of RGBC.

  [MAC1,MAC2,MAC3] = [R*IR1,G*IR2,B*IR3] SHL 4          ;<--- for DCPL only
  [MAC1,MAC2,MAC3] = [IR1,IR2,IR3] SHL 12               ;<--- for INTPL only
  [MAC1,MAC2,MAC3] = [R,G,B] SHL 16                     ;<--- for DPCS/DPCT
  [MAC1,MAC2,MAC3] = MAC+(FC-MAC)*IR0
  [MAC1,MAC2,MAC3] = [MAC1,MAC2,MAC3] SAR (sf*12)
  Color FIFO = [MAC1/16,MAC2/16,MAC3/16,CODE], [IR1,IR2,IR3] = [MAC1,MAC2,MAC3]

DPCT executes thrice, and reads the R,G,B values from RGB0 (ie. reads from the Bottom of the Color FIFO, instead of from the RGBC register) (the CODE value is still read from RGBC as usual), so, after DPCT execution, the RGB0,RGB1,RGB2 Fifo entries are modified.

COP2 190003Dh - 5 Cycles - GPF(sf,lm) - General purpose Interpolation

COP2 1A0003Eh - 5 Cycles - GPL(sf,?) - General Interpolation with base

  [MAC1,MAC2,MAC3] = [0,0,0]                            ;<--- for GPF only
  [MAC1,MAC2,MAC3] = [MAC1,MAC2,MAC3] SHL (sf*12)       ;<--- for GPL only
  [MAC1,MAC2,MAC3] = (([IR1,IR2,IR3] * IR0) + [MAC1,MAC2,MAC3]) SAR (sf*12)
  Color FIFO = [MAC1/16,MAC2/16,MAC3/16,CODE], [IR1,IR2,IR3] = [MAC1,MAC2,MAC3]

Note: Although the SHL in GPL is theoretically undone by the SAR, 44bit overflows can occur internally when sf=1.
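The GPF/GPL formulas, together with the final Color FIFO write, can be sketched as follows. The helper names are made up for this example; Python integers are unbounded, so the internal 44bit overflow mentioned in the note is not reproduced, and IR saturation and FLAG bits are omitted.

```python
def gpf(ir, ir0, sf=1):
    """GPF: MAC starts at zero. Returns [MAC1,MAC2,MAC3]."""
    return [(x * ir0) >> (sf * 12) for x in ir]


def gpl(mac, ir, ir0, sf=1):
    """GPL: the old MAC values are shifted left first and used as base."""
    return [((x * ir0) + (m << (sf * 12))) >> (sf * 12)
            for m, x in zip(mac, ir)]


def color_fifo_entry(mac, code):
    """Build the 32bit CODE|B|G|R value that is pushed to the Color FIFO."""
    red, green, blue = (max(0x00, min(0xFF, m >> 4)) for m in mac)
    return red | green << 8 | blue << 16 | (code & 0xFF) << 24
```

With IR0=800h (0.5) and sf=1, GPF halves the IR vector, and GPL halves it and then adds the previous MAC values.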

Details on "MAC+(FC-MAC)*IR0"

  [IR1,IR2,IR3] = (([RFC,GFC,BFC] SHL 12) - [MAC1,MAC2,MAC3]) SAR (sf*12)
  [MAC1,MAC2,MAC3] = (([IR1,IR2,IR3] * IR0) + [MAC1,MAC2,MAC3])

Note: The above "[IR1,IR2,IR3]=(FC-MAC)" is saturated to -8000h..+7FFFh (ie. as if lm=0); further writes to [IR1,IR2,IR3] (within the same command) are saturated as usual (ie. depending on the lm setting).
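To make the two-step interpolation concrete, the sketch below applies it inside a simplified DPCS (the "[R,G,B] SHL 16" variant). The function names are illustrative; FLAG bits, the final lm-dependent IR saturation and the Color FIFO write are left out, and "SAR" is modelled with Python's arithmetic right shift.

```python
def clamp(value, lo, hi):
    return max(lo, min(hi, value))


def interpolate_far_color(mac, fc, ir0, sf=1):
    """MAC+(FC-MAC)*IR0 for three components. fc = (RFC,GFC,BFC)."""
    result = []
    for m, f in zip(mac, fc):
        # Step 1: IR = ((FC SHL 12) - MAC) SAR (sf*12), saturated as if lm=0.
        ir = clamp(((f << 12) - m) >> (sf * 12), -0x8000, 0x7FFF)
        # Step 2: MAC = IR*IR0 + MAC.
        result.append(ir * ir0 + m)
    return result


def dpcs(rgb, fc, ir0, sf=1):
    """Simplified DPCS: returns the final [MAC1,MAC2,MAC3]."""
    mac = [c << 16 for c in rgb]
    mac = interpolate_far_color(mac, fc, ir0, sf)
    return [m >> (sf * 12) for m in mac]
```

With R=G=B=80h and a far color of FF0h (FFh in 1.27.4 format), IR0=0 returns the original color (800h), IR0=1000h returns the far color (FF0h), and IR0=800h lands halfway between them (BF8h). Dividing these by 16 gives the 8bit values 80h, FFh and BFh.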

Details on "(LLM*V0) SAR (sf*12)" and "(BK*1000h + LCM*IR) SAR (sf*12)"

Works like MVMVA command (see there), but with fixed Tx/Vx/Mx parameters, the sf/lm bits can be changed and do affect the results (although normally both bits should be set for use with color matrices).

Notes

The 8bit RGB values written to the top of Color Fifo are the 32bit MACn values divided by 16, and saturated to +00h..+FFh, and of course, the older Fifo entries are moved downwards. Note that, at the GPU side, the meaning of the RGB values depends on whether or not texture blending is used (for untextured polygons FFh is max brightness) (for texture blending FFh is double brightness and 80h is normal brightness).
The 8bit CODE value is intended to contain a GP0(20h..7Fh) Rendering command, so that the 8bit command number is automatically merged with the 24bit color value.
The IRGB/ORGB registers allow converting between 48bit and 15bit RGB colors.
Although the result of the commands in this chapter is written to the Color FIFO, some commands like GPF/GPL may be also used for other purposes (eg. to scale or scale/translate single vertices).

GTE Division Inaccuracy

GTE Division Inaccuracy (for RTPS/RTPT commands)

Basically, the GTE division does (attempt to) work as so (using 33bit maths):

  n = (((H*20000h/SZ3)+1)/2)

alternatively, the formula below gives (almost) the same result (using 32bit maths):

  n = ((H*10000h+SZ3/2)/SZ3)

in both cases, the result is saturated about as so:

  if n>1FFFFh or division_by_zero then n=1FFFFh, FLAG.Bit17=1, FLAG.Bit31=1

However, the real GTE hardware is using a fast, but less accurate division mechanism (based on Unsigned Newton-Raphson (UNR) algorithm):

  if (H < SZ3*2) then                            ;check if overflow
    z = count_leading_zeroes(SZ3)                ;z=0..0Fh (for 16bit SZ3)
    n = (H SHL z)                                ;n=0..7FFF8000h
    d = (SZ3 SHL z)                              ;d=8000h..FFFFh
    u = unr_table[(d-7FC0h) SHR 7] + 101h        ;u=200h..101h
    d = ((2000080h - (d * u)) SHR 8)             ;d=10000h..0FF01h
    d = ((0000080h + (d * u)) SHR 8)             ;d=20000h..10000h
    n = min(1FFFFh, (((n*d) + 8000h) SHR 16))    ;n=0..1FFFFh
  else n = 1FFFFh, FLAG.Bit17=1, FLAG.Bit31=1    ;n=1FFFFh plus overflow flag

the GTE's unr_table[000h..100h] consists of the following values:

  FFh,FDh,FBh,F9h,F7h,F5h,F3h,F1h,EFh,EEh,ECh,EAh,E8h,E6h,E4h,E3h ;\
  E1h,DFh,DDh,DCh,DAh,D8h,D6h,D5h,D3h,D1h,D0h,CEh,CDh,CBh,C9h,C8h ; 00h..3Fh
  C6h,C5h,C3h,C1h,C0h,BEh,BDh,BBh,BAh,B8h,B7h,B5h,B4h,B2h,B1h,B0h ;
  AEh,ADh,ABh,AAh,A9h,A7h,A6h,A4h,A3h,A2h,A0h,9Fh,9Eh,9Ch,9Bh,9Ah ;/
  99h,97h,96h,95h,94h,92h,91h,90h,8Fh,8Dh,8Ch,8Bh,8Ah,89h,87h,86h ;\
  85h,84h,83h,82h,81h,7Fh,7Eh,7Dh,7Ch,7Bh,7Ah,79h,78h,77h,75h,74h ; 40h..7Fh
  73h,72h,71h,70h,6Fh,6Eh,6Dh,6Ch,6Bh,6Ah,69h,68h,67h,66h,65h,64h ;
  63h,62h,61h,60h,5Fh,5Eh,5Dh,5Dh,5Ch,5Bh,5Ah,59h,58h,57h,56h,55h ;/
  54h,53h,53h,52h,51h,50h,4Fh,4Eh,4Dh,4Dh,4Ch,4Bh,4Ah,49h,48h,48h ;\
  47h,46h,45h,44h,43h,43h,42h,41h,40h,3Fh,3Fh,3Eh,3Dh,3Ch,3Ch,3Bh ; 80h..BFh
  3Ah,39h,39h,38h,37h,36h,36h,35h,34h,33h,33h,32h,31h,31h,30h,2Fh ;
  2Eh,2Eh,2Dh,2Ch,2Ch,2Bh,2Ah,2Ah,29h,28h,28h,27h,26h,26h,25h,24h ;/
  24h,23h,22h,22h,21h,20h,20h,1Fh,1Eh,1Eh,1Dh,1Dh,1Ch,1Bh,1Bh,1Ah ;\
  19h,19h,18h,18h,17h,16h,16h,15h,15h,14h,14h,13h,12h,12h,11h,11h ; C0h..FFh
  10h,0Fh,0Fh,0Eh,0Eh,0Dh,0Dh,0Ch,0Ch,0Bh,0Ah,0Ah,09h,09h,08h,08h ;
  07h,07h,06h,06h,05h,05h,04h,04h,03h,03h,02h,02h,01h,01h,00h,00h ;/
  00h    ;<-- one extra table entry (for "(d-7FC0h)/80h"=100h)    ;-100h

Above can be generated as "unr_table[i]=max(0,(40000h/(i+100h)+1)/2-101h)".
Some special cases: NNNNh/0001h uses a big multiplier (d=20000h), in practice, this can occur only for 0000h/0001h and 0001h/0001h (due to the H<SZ3*2 overflow check).
The min(1FFFFh) limit is needed for cases like FE3Fh/7F20h, F015h/780Bh, etc. (these do produce UNR result 20000h, and are saturated to 1FFFFh, but without setting overflow FLAG bits).
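The UNR pseudo code and table can be transcribed almost literally. In this sketch the table is generated with a formula that clamps negative values to zero (which reproduces the listed entries, including the extra entry at index 100h); the names `UNR_TABLE` and `gte_divide` are illustrative, and the overflow result is returned as a boolean instead of being written to FLAG.

```python
# 101h entries; max(0, ...) clamps the last entry (which would be -1) to 00h.
UNR_TABLE = [max(0, (0x40000 // (i + 0x100) + 1) // 2 - 0x101)
             for i in range(0x101)]


def gte_divide(h, sz3):
    """h and sz3 are unsigned 16bit values. Returns (n, overflow)."""
    if h < sz3 * 2:                              # implies sz3 >= 1
        z = 16 - sz3.bit_length()                # leading zeroes of 16bit SZ3
        n = h << z
        d = sz3 << z                             # d=8000h..FFFFh
        u = UNR_TABLE[(d - 0x7FC0) >> 7] + 0x101
        d = (0x2000080 - d * u) >> 8
        d = (0x0000080 + d * u) >> 8
        return min(0x1FFFF, (n * d + 0x8000) >> 16), False
    return 0x1FFFF, True                         # divide overflow
```

For example, `gte_divide(500, 1000)` returns 8000h without overflow, while `gte_divide(0xFE3F, 0x7F20)` is one of the cases where the raw result of 20000h is limited to 1FFFFh without reporting an overflow.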