Optimised compound clouds by xfractalino · Pull Request #6962 · Revolutionary-Games/Thrive

xfractalino · 2026-04-30T13:27:26Z

This replaces the repeated SetPixel calls made inside the compound clouds when copying data to the image, as they caused significant overhead, as shown in flame graphs.
Now each cloud has a buffer, and data is written into this buffer before calling SetData to copy it into the image.
It also replaces scalar logic with SIMD instructions in hotspots and removes the image copy step by using a staging buffer that is written to by the generation logic.
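
As a rough illustration of the buffer approach described above, the sketch below fills a reusable byte buffer and hands it over in one bulk call instead of one call per pixel. `FakeImage` is a hypothetical stand-in for the engine image class (it is not Godot's API), and the RGBA8 layout, the 0..1 density range and the `FillBuffer` helper are assumptions made only for this example.

```csharp
using System;

// Hypothetical stand-in for an engine image. It only models the difference
// between many per-pixel calls and a single bulk upload.
public sealed class FakeImage
{
    private readonly byte[] pixels;

    public FakeImage(int width, int height)
    {
        Width = width;
        Height = height;
        pixels = new byte[width * height * 4];
    }

    public int Width { get; }
    public int Height { get; }

    // Counts how many calls crossed the "engine" boundary
    public int Calls { get; private set; }

    public byte this[int index] => pixels[index];

    // One call per pixel: the call overhead is paid width * height times
    public void SetPixel(int x, int y, byte r, byte g, byte b, byte a)
    {
        ++Calls;
        int i = (y * Width + x) * 4;
        pixels[i] = r;
        pixels[i + 1] = g;
        pixels[i + 2] = b;
        pixels[i + 3] = a;
    }

    // One call per update: a single bulk copy of the prepared buffer
    public void SetData(ReadOnlySpan<byte> data)
    {
        ++Calls;
        data.CopyTo(pixels);
    }
}

public static class CloudImageWriter
{
    // Converts densities (assumed to be in 0..1) into RGBA8 bytes inside a
    // caller-owned buffer, so no image calls happen in the hot loop.
    public static void FillBuffer(float[] density, byte[] buffer)
    {
        for (int i = 0; i < density.Length; ++i)
        {
            byte value = (byte)Math.Clamp(density[i] * 255.0f, 0.0f, 255.0f);
            int offset = i * 4;
            buffer[offset] = value;
            buffer[offset + 1] = value;
            buffer[offset + 2] = value;
            buffer[offset + 3] = 255;
        }
    }
}
```

With this shape, an update becomes `FillBuffer(density, buffer)` followed by one `SetData(buffer)` call, whatever the image size.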

BEFORE:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 130
Absorber score: 129.902
Many spawners score: 124.373
Cloud sim multiplier before under 60 FPS: 3.2615
Stress test spawners: 38
Stress test average FPS: 114.786
Stress test min FPS: 36
Total test duration: 135.4s
CPU: Intel(R) Core(TM) i3-8100 CPU @ 3.60GHz (used tasks: 4, native: 4, sim threads: True)
GPU: NVIDIA GeForce GTX 1050 Ti
OS: Windows

~~AFTER OPTIMISATION:~~

~~Benchmark results for CloudBenchmark v1~~
~~Resolution divisor: 2~~
~~Cloud spawn score: 146~~
~~Absorber score: 147.333~~
~~Many spawners score: 145~~
~~Cloud sim multiplier before under 60 FPS: 3.3154~~
~~Stress test spawners: 39~~
~~Stress test average FPS: 142.337~~
~~Stress test min FPS: 55~~
~~Total test duration: 135.3s~~

Update. The latest benchmark is:

AFTER OPTIMISATION:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 276
Absorber score: 274.706
Many spawners score: 258.784
Cloud sim multiplier before under 60 FPS: 6.4615
Stress test spawners: 90
Stress test average FPS: 193.82
Stress test min FPS: 45
Total test duration: 218.1s

Progress Checklist

Note: before starting this checklist the PR should be marked as non-draft.

PR author has checked that this PR works as intended and doesn't
break existing features:
https://wiki.revolutionarygamesstudio.com/wiki/Testing_Checklist
(this is important as to not waste the time of Thrive team
members reviewing this PR). This includes gameplay testing by the PR author.
Initial code review passed (this and further items should not be checked by the PR author)
Functionality is confirmed working by another person (see above checklist link)
Final code review is passed and code conforms to the
styleguide.

Before merging all CI jobs should finish on this PR without errors, if
there are automatically detected style issues they should be fixed by
the PR author. Merging must follow our
styleguide.

hhyyrylainen · 2026-04-30T13:29:47Z

Looks like my comments don't show on that commit's page directly, so I'll copy them here for future reference:

Pretty sure this won't actually be renting from the pool, as there's a max size after which the allocation just falls back to a plain new allocation immediately (and the cloud buffer should be megabytes large).

So allocating a temp buffer and only resizing it when necessary would be the better option.
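
A minimal sketch of the "resize only when necessary" idea, under the assumption that the old contents do not need to survive a resize. The `GrowOnlyBuffer` class and its `Allocations` counter are hypothetical names used for illustration, not code from this PR.

```csharp
using System;

// Hypothetical helper: keeps one array alive and reallocates only when a
// larger size is requested, so steady-state updates allocate nothing.
public sealed class GrowOnlyBuffer
{
    private byte[] data = Array.Empty<byte>();

    // Number of times a new array had to be allocated
    public int Allocations { get; private set; }

    // Returns a span of exactly `size` bytes. Old contents are not preserved
    // when the backing array grows.
    public Span<byte> Get(int size)
    {
        if (data.Length < size)
        {
            data = new byte[size];
            ++Allocations;
        }

        return data.AsSpan(0, size);
    }
}
```

Requesting the same or a smaller size on later frames reuses the existing array, so the allocation cost is only paid when the cloud resolution actually grows.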

These use tasks because the per-pixel image update calls were slow. Now that we have our own temporary buffer, having the data generation tasks write directly to it would probably be a lot faster (due to data locality).

Patryk26g · 2026-04-30T16:59:48Z

I can confirm that I got a performance increase of a few percent in benchmarks.

xfractalino · 2026-04-30T18:48:51Z

Tomorrow I'll benchmark @hhyyrylainen's suggestion of writing data directly to the buffer and avoiding this copy completely.

xfractalino · 2026-05-01T10:58:46Z

I tried to move the write calls inside the advection step as suggested, but it's actually much slower due to how the advection is implemented. I also tried changing the advection to a semi-Lagrangian scheme (neighbours to cell instead of cell to neighbours), but that is slower because the current cell-to-neighbours implementation effectively skips empty cells.

So I think keeping the copy step is necessary, as it seems to be the best approach.
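
To make the two directions concrete, here is a simplified sketch that is not taken from the PR: it is 1D, uses a single uniform `shift` instead of a velocity field, and drops density that leaves the grid. `Scatter` is the cell-to-neighbours form, which can skip empty cells, while `Gather` is the semi-Lagrangian form, which has to visit every destination cell. Both return the number of cells they had to process.

```csharp
using System;

public static class AdvectionSketch
{
    // Forward ("cell to neighbours") advection: each non-empty source cell
    // pushes its density into the two cells around its destination.
    public static int Scatter(float[] source, float[] destination, float shift)
    {
        Array.Clear(destination, 0, destination.Length);
        int visited = 0;

        for (int x = 0; x < source.Length; ++x)
        {
            float amount = source[x];
            if (amount <= 0)
                continue; // Empty cells are skipped almost for free

            ++visited;
            float target = x + shift;
            int left = (int)MathF.Floor(target);
            float weight = target - left;

            if (left >= 0 && left < destination.Length)
                destination[left] += amount * (1 - weight);

            if (left + 1 >= 0 && left + 1 < destination.Length)
                destination[left + 1] += amount * weight;
        }

        return visited;
    }

    // Semi-Lagrangian ("neighbours to cell") advection: every destination
    // cell traces back along the shift and samples the source.
    public static int Gather(float[] source, float[] destination, float shift)
    {
        int visited = 0;

        for (int x = 0; x < destination.Length; ++x)
        {
            ++visited;
            float origin = x - shift;
            int left = (int)MathF.Floor(origin);
            float weight = origin - left;

            float a = left >= 0 && left < source.Length ? source[left] : 0;
            float b = left + 1 >= 0 && left + 1 < source.Length ? source[left + 1] : 0;
            destination[x] = a * (1 - weight) + b * weight;
        }

        return visited;
    }
}
```

On a mostly empty grid the scatter version touches only the occupied cells, which matches the explanation above for why the gather version benchmarked slower.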

xfractalino · 2026-05-01T11:06:09Z

I ran another benchmark on the last commit and here are the results:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 157
Absorber score: 155.882
Many spawners score: 149.373
Cloud sim multiplier before under 60 FPS: 3.6692
Stress test spawners: 45
Stress test average FPS: 140.705
Stress test min FPS: 57
Total test duration: 144.6s
CPU: Intel(R) Core(TM) i3-8100 CPU @ 3.60GHz (used tasks: 4, native: 4, sim threads: True)
GPU: NVIDIA GeForce GTX 1050 Ti
OS: Windows

hhyyrylainen · 2026-05-01T11:44:16Z

Is it still faster to do the copy from multiple threads? I believe it should now be a straightforward copy between buffers, so even a single core should be able to saturate the RAM -> CPU -> RAM bandwidth thanks to memory prefetching. I think it would be well worth investigating doing the final copy with a single straightforward piece of code without tasks.

xfractalino · 2026-05-01T12:05:06Z

I tried to copy from a single thread, and it appears to be slower:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 102
Absorber score: 102.255
Many spawners score: 99.039
Cloud sim multiplier before under 60 FPS: 2.6385
Stress test spawners: 28
Stress test average FPS: 98.807
Stress test min FPS: 54
Total test duration: 118.3s

I then scheduled one task per cloud, so that the whole buffer is prefetched and all the cores are used, and the result is pretty much comparable to master:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 132
Absorber score: 130.863
Many spawners score: 126.098
Cloud sim multiplier before under 60 FPS: 3.2615
Stress test spawners: 38
Stress test average FPS: 118.592
Stress test min FPS: 36
Total test duration: 134.9s

So I think there's not much benefit in keeping the copy on a single core, and it's arguably worse than the current last commit.

I think the reason is that the buffers are too big to fit in the CPU cache.

xfractalino · 2026-05-01T14:43:05Z

I changed how the clouds are sliced (from squares to slices, to improve cache locality), flattened the 2D density arrays into 1D arrays, and used SIMD to copy quickly from the density arrays to the buffer.
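
The slicing and flattening can be sketched roughly as follows. This is an illustration under assumptions rather than the PR's code: it assumes a row-major layout and that a slice is a contiguous range of rows, the names `Index`, `Split` and `ScaleAll` are hypothetical, and the SIMD part is left out to keep the example short.

```csharp
using System;
using System.Threading.Tasks;

public static class RowSlices
{
    // Index into a flattened row-major grid (stands in for density[x, y])
    public static int Index(int x, int y, int width) => x + y * width;

    // Splits `height` rows into `sliceCount` contiguous ranges, so each task
    // walks memory linearly instead of hopping between rows of a square tile.
    public static (int Start, int End)[] Split(int height, int sliceCount)
    {
        var result = new (int Start, int End)[sliceCount];

        for (int i = 0; i < sliceCount; ++i)
        {
            result[i] = (height * i / sliceCount, height * (i + 1) / sliceCount);
        }

        return result;
    }

    // Example workload: scales every density in parallel, one slice per task
    public static void ScaleAll(float[] density, int width, int height, float factor, int sliceCount)
    {
        var slices = Split(height, sliceCount);

        Parallel.For(0, slices.Length, i =>
        {
            var (start, end) = slices[i];

            // All rows of a slice form one contiguous span in the flat array
            var rows = density.AsSpan(Index(0, start, width), (end - start) * width);

            for (int j = 0; j < rows.Length; ++j)
                rows[j] *= factor;
        });
    }
}
```

Because each slice is one contiguous block, the inner loop is a plain linear pass, which is also the access pattern that vectorised code needs.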

New results:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 171
Absorber score: 168.235
Many spawners score: 164.255
Cloud sim multiplier before under 60 FPS: 3.7154
Stress test spawners: 46
Stress test average FPS: 155.092
Stress test min FPS: 50
Total test duration: 145.9s

xfractalino · 2026-05-01T15:14:53Z

Even better results by moving the copy inside advection:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 196
Absorber score: 196.882
Many spawners score: 192
Cloud sim multiplier before under 60 FPS: 4.7308
Stress test spawners: 62
Stress test average FPS: 173.024
Stress test min FPS: 50
Total test duration: 172s

xfractalino · 2026-05-02T09:51:30Z

Implementing SIMD in the diffusion algorithm yields even better results:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 252
Absorber score: 253.882
Many spawners score: 242.255
Cloud sim multiplier before under 60 FPS: 6.0923
Stress test spawners: 84
Stress test average FPS: 187.601
Stress test min FPS: 59
Total test duration: 208.2s
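
For readers unfamiliar with how a diffusion pass vectorises, here is a simplified sketch. It assumes a 1D row and a basic "blend with the average of both neighbours" rule, which is not necessarily the formula Thrive uses, and it relies on the portable `System.Numerics.Vector<float>` type instead of the `Vector128` intrinsics in the PR. Edge cells are left untouched.

```csharp
using System;
using System.Numerics;

public static class DiffusionSketch
{
    // Scalar reference: blends each interior cell with its two neighbours
    public static void DiffuseScalar(float[] source, float[] destination, float rate)
    {
        float keep = 1 - rate;
        float spread = rate * 0.5f;

        for (int x = 1; x < source.Length - 1; ++x)
        {
            destination[x] = source[x] * keep + (source[x - 1] + source[x + 1]) * spread;
        }
    }

    // Same rule, processing Vector<float>.Count cells per iteration
    public static void DiffuseSimd(float[] source, float[] destination, float rate)
    {
        int lanes = Vector<float>.Count;
        var keep = new Vector<float>(1 - rate);
        var spread = new Vector<float>(rate * 0.5f);

        int x = 1;

        // The right-neighbour load reads up to index x + lanes, so stop
        // while that index is still inside the array.
        for (; x + lanes <= source.Length - 1; x += lanes)
        {
            var centre = new Vector<float>(source, x);
            var left = new Vector<float>(source, x - 1);
            var right = new Vector<float>(source, x + 1);

            (centre * keep + (left + right) * spread).CopyTo(destination, x);
        }

        // Scalar tail for the cells that did not fill a whole vector
        float keepScalar = 1 - rate;
        float spreadScalar = rate * 0.5f;

        for (; x < source.Length - 1; ++x)
        {
            destination[x] = source[x] * keepScalar + (source[x - 1] + source[x + 1]) * spreadScalar;
        }
    }
}
```

Reading from `source` and writing to a separate `destination` is what makes the vector loop safe here, as no lane depends on a value written in the same pass.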

hhyyrylainen · 2026-05-04T07:56:39Z

+                    vPixel = Vector128.Multiply(vPixel, vScale);
+                    vPixel = Vector128.Min(Vector128.Max(vPixel, vZero), v255);
+                    var vInt = Vector128.ConvertToInt32(vPixel);
+                    var packed16 = Sse2.PackSignedSaturate(vInt, vInt);


hhyyrylainen: Hopefully Sse2 works on Apple Silicon; if not, this needs an alternative path for ARM / no SSE support.

xfractalino: Rosetta 2 (Apple Silicon) should fully support Sse2.

hhyyrylainen: We actually don't use Rosetta; Thrive runs natively as ARM code on the newest Macs. So whether Rosetta works or not is irrelevant in our case.

xfractalino: Okay, I took a look at the docs and Vector128 is supported on Arm64, so I think we should be safe on Apple Silicon. In the advection algorithm there's no issue, because we already check for support and fall back to the scalar algorithm.

hhyyrylainen: This line does use the Sse2 class directly, which is what I'm worried about. It lives in the System.Runtime.Intrinsics.X86 namespace, so I think it might be unsupported on Arm (the ARM intrinsics are under the Intrinsics.Arm namespace).

xfractalino: This makes sense. I should add a check and switch to AdvSimd.SaturatingNarrowingUnsignedLower as the docs suggest, though I want to make sure the result is the same.
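
A hedged sketch of what such a check could look like. As far as I know, the `Sse2` class can be referenced on any platform, but calling its methods where `Sse2.IsSupported` is false throws `PlatformNotSupportedException`, so each path is guarded. The ARM call used here, `AdvSimd.ExtractNarrowingSaturateLower`, is my assumption for the closest counterpart of the signed pack; it is not necessarily what the PR ended up using, and `PackToShorts` is a hypothetical helper name.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

public static class PixelPacking
{
    // Narrows four 32-bit ints to four 16-bit shorts with signed saturation,
    // picking whichever instruction set the current CPU supports.
    public static void PackToShorts(Vector128<int> values, Span<short> output)
    {
        if (Sse2.IsSupported)
        {
            // x86 / x64 path: the lower four lanes hold the narrowed values
            Vector128<short> packed = Sse2.PackSignedSaturate(values, values);

            for (int i = 0; i < 4; ++i)
                output[i] = packed.GetElement(i);
        }
        else if (AdvSimd.IsSupported)
        {
            // Arm64 path (assumed equivalent of the signed saturating pack)
            Vector64<short> packed = AdvSimd.ExtractNarrowingSaturateLower(values);

            for (int i = 0; i < 4; ++i)
                output[i] = packed.GetElement(i);
        }
        else
        {
            // Scalar fallback for platforms with neither instruction set
            for (int i = 0; i < 4; ++i)
            {
                output[i] = (short)Math.Clamp(values.GetElement(i), short.MinValue, short.MaxValue);
            }
        }
    }
}
```

Since the values in the reviewed snippet are already clamped to the 0..255 range before conversion, the signed and unsigned saturating variants should produce the same result for them, but that is worth confirming with a test on both architectures.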

hhyyrylainen · 2026-05-04T08:42:51Z

I think there are still a few comments that are unresolved. I'll mark the ones that are resolved now, though.

xfractalino · 2026-05-04T09:10:33Z

The benchmarks are slightly better without forcing inlining of the ProcessPixelAdvection method.

Without forcing inlining:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 277
Absorber score: 281.216
Many spawners score: 261.784
Cloud sim multiplier before under 60 FPS: 6.1538
Stress test spawners: 85
Stress test average FPS: 221.733
Stress test min FPS: 52
Total test duration: 209.9s

Forcing inlining:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 277
Absorber score: 274.588
Many spawners score: 257.294
Cloud sim multiplier before under 60 FPS: 6.1231
Stress test spawners: 85
Stress test average FPS: 209.188
Stress test min FPS: 43
Total test duration: 209.1s
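
For context, "forcing inlining" refers to the `MethodImplOptions.AggressiveInlining` hint. The sketch below shows the two variants side by side. The method names are hypothetical and the bodies are trivial placeholders, not the real ProcessPixelAdvection logic.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class InliningSketch
{
    // Asks the JIT to inline this method where possible. It is a hint, and
    // inlining a large method into a hot loop can also hurt performance.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static float ScaleForced(float density, float factor) => density * factor;

    // No attribute: the JIT applies its own inlining heuristics
    public static float ScaleDefault(float density, float factor) => density * factor;

    public static float Sum(float[] densities, float factor, bool forced)
    {
        float total = 0;

        foreach (float density in densities)
        {
            total += forced ? ScaleForced(density, factor) : ScaleDefault(density, factor);
        }

        return total;
    }
}
```

Both variants compute the same result, so the only way to choose between them is to measure, as was done with the benchmark runs above.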

hhyyrylainen

According to the latest testing reports, this now seems to be functioning correctly.
The code also still seems about the same as when I last reviewed it. I did not re-test on Mac, but hopefully nothing changed that could have affected that.

xfractalino committed: Optimised image write to use SetData instead of calling SetPixel repeatedly (d5a1aa4)

github-project-automation Bot added this to Thrive Planning Apr 30, 2026

github-project-automation Bot moved this to In progress in Thrive Planning Apr 30, 2026

xfractalino committed: Fixed memory leak (a2bf367)

hhyyrylainen added the review label Apr 30, 2026

hhyyrylainen added this to the Release 1.1.0 milestone Apr 30, 2026

xfractalino added 2 commits April 30, 2026 15:35

Switched to resizing the buffer (12f08aa)
Removed pooling (287b3a7)

xfractalino added 2 commits May 1, 2026 15:59

Changed how the clouds are sliced, flattened arrays and used SIMD (cc02beb)
Polishing (59e2850)

xfractalino committed: Abolished copy step by moving copy into advection loop (dcf1fed)

xfractalino requested review from a team, Patryk26g and hhyyrylainen May 1, 2026 15:17

xfractalino changed the title ~~Optimised compound clouds using SetData~~ Optimised compound clouds May 1, 2026

xfractalino added 3 commits May 1, 2026 17:21

Removed now unused SIMD imports (58515a6)
Polished (ecb7018)
Implemented SIMD in the diffusion algorithm. (199621f)

xfractalino added 2 commits May 2, 2026 12:06

Added a few comments (f41f283)
Moved avx supported variable out of the loop (8464c71)

hhyyrylainen reviewed May 4, 2026: comment thread on src/microbe_stage/CompoundCloudPlane.cs (outdated)

hhyyrylainen reviewed May 4, 2026: comment thread on src/microbe_stage/CompoundCloudPlane.cs

hhyyrylainen reviewed May 4, 2026: comment thread on src/microbe_stage/CompoundCloudPlane.cs (outdated)

hhyyrylainen reviewed May 4, 2026: comment thread on src/microbe_stage/CompoundCloudPlane.cs (outdated)

hhyyrylainen reviewed May 4, 2026: comment thread on src/microbe_stage/CompoundCloudPlane.cs (outdated)

xfractalino committed: Cleanup based on review comments (3f1fe64)

xfractalino requested a review from hhyyrylainen May 4, 2026 08:41

xfractalino added 2 commits May 4, 2026 10:44

Cast safety check (21b4c90)
Modified perfect square check and added comments (ffc5fb2)

xfractalino added 4 commits May 4, 2026 11:11

Removed AggressiveInlining hint on ProcessPixelAdvection (a491661)
Linter cleanup (665e649)
Arm support for SIMD in the advection loop and fallback (1210acb)
Linter cleanup (4fee99b)

Patryk26g reviewed May 4, 2026: comment thread on src/microbe_stage/CompoundCloudPlane.cs (outdated)

Patryk26g reviewed May 4, 2026: comment thread on src/microbe_stage/CompoundCloudPlane.cs (outdated)

xfractalino added 7 commits May 7, 2026 10:39

Replaced DiffuseEdges with old scalar algorithm (73d9029)
Replaced parallel DiffuseEdges to squares again (99feb01)
Using correct coordinates in PartialAdvect (bfd020a)
Renamed variables for readability (baa31d1)
Fixed x not being relative to chunkX in advection loop (45edcae)
Polished PartialDiffuseScalar (276e4c5)
Polished advection algorithm (b64ef0c)

hhyyrylainen approved these changes May 8, 2026

Merge branch 'master' into compound-cloud-opt (24aa523)

hhyyrylainen merged commit bc5b643 into master May 8, 2026
4 checks passed

github-project-automation Bot moved this from In progress to Done in Thrive Planning May 8, 2026

hhyyrylainen deleted the compound-cloud-opt branch May 8, 2026 13:46
