
SoftGPU: Rasterize triangles in chunks of 4 pixels #9635

Merged: 4 commits into hrydgard:master on Apr 23, 2017

Conversation

@unknownbrackets (Collaborator)

Currently, this is generally a bit slower, but it's a step in the right direction.

The last commit shows the benefit of this change in one area. Sample performance change in Tales of Destiny 2 (before this PR -> after this PR w/throughmode perf):

130% -> 200% - Logos during intro
35% -> 70% - Load save screen
64% -> 50% - 3D overworld

-[Unknown]

@hrydgard (Owner)

I like it. Good prep for mipmapping. As this clearly shows, SIMD becomes quite easy when you write it just like the straight-line code but with one component from each pixel in each lane, and once it's fully applied there's no way this won't be faster than doing a single pixel at a time.

Buildbots are a little unhappy though.

@unknownbrackets (Collaborator, Author)

Sure, although it can be a bit of a pain. We still have some sort of skew issue - a (0, 11)-(0, 11) 1:1 draw doesn't actually draw 1:1 in Crisis Core (the cross button symbol in the bottom right, with nearest filtering.) I guess that means #8282 didn't handle all the cases.

But, probably better to start fixing these things in a four-pixel pipeline anyway.

-[Unknown]

GPU/Math3D.h Outdated
@@ -634,6 +634,13 @@ class Vec4
return Vec4(VecClamp(x, l, h), VecClamp(y, l, h), VecClamp(z, l, h), VecClamp(w, l, h));
}

Vec4 Reciprocal() const
{
// In case we use doubles, maintain accuracy.
@hrydgard (Owner)

Can you clarify this? If T is a double, 1.0f will just be automatically cast to 1.0 and the division will be performed at double precision. Is this intended or not? I'm confused :)

@unknownbrackets (Collaborator, Author)

Sorry, was worrying about the accuracy problems and trying to mess with values to fix things - removed the comment.

-[Unknown]

@hrydgard (Owner) commented Apr 23, 2017

Hm, I was thinking (this is not a call for action, just thoughts for the future): instead of using Vec4 across the lanes everywhere, an alternate way of formulating the math might be to use Vec2, Vec3, and Vec4 composed out of __m128. Like, Vec4<__m128>, then you can still perform "vector operations" and use various operator overloads etc while ignoring the fact that you're doing it for four pixels at a time.

And you'd have a type like "scalar" which would be used where a single float is currently used, just an __m128 with overloads like a Vec4 just not named so.

Not sure how confusing that would be though.
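
A minimal sketch of that shape (hypothetical names, not the actual Math3D.h types), just to illustrate the structure-of-arrays idea:

```cpp
#include <xmmintrin.h>

// Hypothetical SoA wrapper: each member holds one component for four pixels
// at once, so the rasterizer code still reads like single-pixel math.
struct Scalar4 {
	__m128 v;
	Scalar4 operator+(Scalar4 o) const { return { _mm_add_ps(v, o.v) }; }
	Scalar4 operator*(Scalar4 o) const { return { _mm_mul_ps(v, o.v) }; }
};

struct Vec3x4 {
	Scalar4 x, y, z;  // x = the X of four pixels, etc.
	Vec3x4 operator+(const Vec3x4 &o) const { return { x + o.x, y + o.y, z + o.z }; }
	Vec3x4 operator*(Scalar4 s) const { return { x * s, y * s, z * s }; }
};

// Example: interpolate a value across four pixels, written as if scalar.
inline Vec3x4 Lerp(const Vec3x4 &a, const Vec3x4 &b, Scalar4 t) {
	Scalar4 one{ _mm_set1_ps(1.0f) };
	Scalar4 inv{ _mm_sub_ps(one.v, t.v) };
	return a * inv + b * t;
}
```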

@unknownbrackets (Collaborator, Author) commented Apr 23, 2017

Well, that sounds like it'd make the non-SSE paths more complicated.

I'd actually like to move the pipeline to jit, at first built as a chain of func calls (like vertexjit or more like MIPS Comp_Generic really), in steps. Then we can construct a key, select a jit program from a cache, and run it.

In that scenario, it might be ideal to use 16 u8s for colors, or maybe two pairs of 8 u16s to simplify blending. But not sure. Want to be mindful of available regs.

-[Unknown]
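
For what it's worth, a small illustration (not code from this PR) of why two registers of 8 u16s simplify blending: the per-channel multiply needs more than 8 bits of headroom, so the usual pattern is unpack, multiply, shift, repack.

```cpp
#include <emmintrin.h>

// Four RGBA8888 pixels packed as 16 u8s in one register; "factors" holds the
// per-channel blend factors in the same layout.  Widen to 16-bit lanes so the
// multiply doesn't overflow, then narrow back with saturation.
static inline __m128i MulColors8888(__m128i colors, __m128i factors) {
	const __m128i zero = _mm_setzero_si128();
	__m128i clo = _mm_unpacklo_epi8(colors, zero);   // two pixels, 8 u16 lanes
	__m128i chi = _mm_unpackhi_epi8(colors, zero);   // the other two pixels
	__m128i flo = _mm_unpacklo_epi8(factors, zero);
	__m128i fhi = _mm_unpackhi_epi8(factors, zero);
	// (c * f) >> 8 as a cheap approximation of c * f / 255.
	clo = _mm_srli_epi16(_mm_mullo_epi16(clo, flo), 8);
	chi = _mm_srli_epi16(_mm_mullo_epi16(chi, fhi), 8);
	return _mm_packus_epi16(clo, chi);
}
```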

@hrydgard (Owner)

Yeah, good points - lots of this can definitely be done at 16-bit or 8-bit.

@hrydgard merged commit 4f0e1a0 into hrydgard:master on Apr 23, 2017
@unknownbrackets deleted the softgpu branch on April 23, 2017 at 19:21
@unknownbrackets (Collaborator, Author)

A little experiment:
https://github.com/hrydgard/ppsspp/compare/master...unknownbrackets:samplerjit?expand=1

Just wanted to try it quickly and texel lookup was a nice self-contained piece. A bit underwhelming (considering ApplyTexturing is typically 20-40% of wall time), about 10% FPS improvement at best. Not terribly optimal though, and obviously would want to at least decode 16-bit directly to xmms (maybe via a jit ABI, and 4 texels at a time.)

The best profiling results were SampleNearest 21% -> SamplerJit 9% in Hexyz Force (at barely 64 FPS.) Probably need a "texture cache" for better performance...

-[Unknown]

@hrydgard (Owner)

If ApplyTexturing was 20% and you got a 10% total improvement, that means that you approximately doubled the speed of texel fetching, which isn't too bad still. But yeah, would also have expected a little better than that...
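
(Rough arithmetic, assuming the 20% figure: halving that portion leaves 0.8 + 0.2 / 2 = 0.9 of the original frame time, i.e. roughly an 11% FPS gain, which lines up with the ~10% observed.)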

@unknownbrackets (Collaborator, Author)

I wonder if rather than linear sampling from [(u,v), (u+1,v), (u,v+1), (u+1,v+1)] we instead always sampled from an even-aligned base - u0 = u & ~1; v0 = v & ~1;. Based on swizzling I would sorta assume this is what the hardware MIGHT do?

If we did that, it should be possible to simply calculate all 4 addresses after the first one without much effort...

-[Unknown]

@hrydgard (Owner) commented May 11, 2017

Not sure I understand what you mean. The texture coordinates are often very dissimilar from the pixel locations on the screen; imagine any perspective mapping or a rotated mapping.

Of course when drawing 1:1 rectangles, there are many possible optimizations including skipping the UV calculations altogether.

@unknownbrackets (Collaborator, Author)

I mean when doing linear sampling (the 4 samples used to interpolate.) Currently it does winding:

https://github.com/hrydgard/ppsspp/blob/master/GPU/Software/Rasterizer.cpp#L209

I don't mean when drawing multiple pixels, this is for just one pixel.

-[Unknown]

@hrydgard (Owner)

Right, some simplification may be possible. You only need to calculate one address to fetch from, and then just offset by 1 horizontally and by (texw) vertically to get the other three - if it weren't for wrapping and clamping, which might have you fetch from either the same address or from the other side of the texture. Not sure how to do this in the most elegant way.
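
A rough sketch of that addressing (hypothetical helpers, not the Rasterizer.cpp code), with the wrap/clamp handling done per coordinate:

```cpp
// Clamp or wrap a texel coordinate; PSP texture dimensions are powers of two,
// so wrapping can be a simple mask.
static inline int WrapOrClamp(int c, int size, bool clamp) {
	if (clamp)
		return c < 0 ? 0 : (c >= size ? size - 1 : c);
	return c & (size - 1);
}

// The top-left sample plus its three neighbors: +1 in u and +texw in v,
// except where wrap/clamp folds a neighbor back onto the same row or column.
static inline void FourTexelOffsets(int u, int v, int texw, int texh,
                                    bool clampS, bool clampT, int offsets[4]) {
	int u0 = WrapOrClamp(u, texw, clampS), u1 = WrapOrClamp(u + 1, texw, clampS);
	int v0 = WrapOrClamp(v, texh, clampT), v1 = WrapOrClamp(v + 1, texh, clampT);
	offsets[0] = v0 * texw + u0;
	offsets[1] = v0 * texw + u1;
	offsets[2] = v1 * texw + u0;
	offsets[3] = v1 * texw + u1;
}
```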

@unknownbrackets (Collaborator, Author) commented May 11, 2017

Well, my point is that if the U and V are even, then you're guaranteed:

  • U+1 will always be available by incrementing a few bytes.
  • V+1 will always be available based on bufw (not swizzled) or a fixed number (if swizzled.)

(you are not guaranteed these things if U or V are odd - in that case +1 might go to a new tile, when swizzled, so you end up needing to re-examine U and V.)

Wrapping and clamping won't cause problems in that case unless it's a 1x1 mip level, which can be special cased (since they are power of two sized.) In that case one might early-out of linear sampling anyway.

So if we (in linear filtering only) always sample based on even and odd UVs, things get much simpler for sampling all four at once.

-[Unknown]

@hrydgard (Owner) commented May 11, 2017

But that doesn't really work, does it? Let's imagine a one-dimensional texture t[], and your sole texture coordinate U is:

0.5  :     We need lerp(t[0], t[1], 0.5).  Works!
2.25 :     We need lerp(t[2], t[3], 0.25). Works!
1.75 :     We need lerp(t[1], t[2], 0.75). Ooops... 1 is odd. 

Or are you saying that we'll get around that by rewriting the last equation to lerp(t[2], t[1], 0.25), xoring the indices by the low bit and using 1.0 - x as the lerp factor?

@unknownbrackets (Collaborator, Author)

D'oh, right. I wasn't thinking about the lerp later of course. I'm stupid.

-[Unknown]

@unknownbrackets (Collaborator, Author)

Interestingly, I found that with samplerjit, the thread loop (which is really naive) is mostly just waiting longer. I wonder if I have a threading bug somehow, or if it's just showing the naivety of slicing by y...

We could probably "bin" and trivially discard based on say 60x68 tiles or something, right?

-[Unknown]

@hrydgard (Owner)

That would explain some lack of speedup, yeah...

Tiled binning is a good way to go for multithreading rendering, definitely better than slicing by Y if you have many small triangles, which we generally do. Finding the optimal tile size is gonna be quite some trial and error though, I'm sure.
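
A minimal sketch of the binning idea (illustration only, names are made up): bin each triangle's bounding box into fixed-size screen tiles so a worker only touches triangles that can overlap its tiles. The tile size here is an arbitrary placeholder; picking a good one is the trial and error mentioned above.

```cpp
#include <algorithm>
#include <vector>

struct TriBounds {
	int x0, y0, x1, y1;  // screen-space bounding box, inclusive
	int triIndex;
};

// One bin of triangle indices per tile x tile screen region.
static void BinTriangles(const std::vector<TriBounds> &tris, int screenW, int screenH,
                         std::vector<std::vector<int>> &bins, int tile = 64) {
	int tilesX = (screenW + tile - 1) / tile;
	int tilesY = (screenH + tile - 1) / tile;
	bins.assign(tilesX * tilesY, std::vector<int>());
	for (const TriBounds &t : tris) {
		int tx0 = std::max(0, t.x0 / tile), tx1 = std::min(tilesX - 1, t.x1 / tile);
		int ty0 = std::max(0, t.y0 / tile), ty1 = std::min(tilesY - 1, t.y1 / tile);
		for (int ty = ty0; ty <= ty1; ++ty)
			for (int tx = tx0; tx <= tx1; ++tx)
				bins[ty * tilesX + tx].push_back(t.triIndex);
	}
}
```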

@unknownbrackets (Collaborator, Author)

A bit better now with linear in the jit, but just not much faster...

master...unknownbrackets:samplerjit

Fewer rounding errors this way, though.

-[Unknown]
