Optimised compound clouds #6962

Merged
hhyyrylainen merged 34 commits into master from compound-cloud-opt
May 8, 2026

@xfractalino
Contributor

@xfractalino xfractalino commented Apr 30, 2026

This replaces the repeated SetPixel calls that the compound clouds made when copying data to the image, since flame graphs showed they caused significant overhead.
Now each cloud has a buffer, and data is written into this buffer before calling SetData to copy it into the image.
It also replaces scalar logic with SIMD instructions in hotspots and removes the image copy step by using a staging buffer that the generation logic writes to directly.
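
As a rough illustration of the staging-buffer idea described above, here is a minimal sketch. It is not the PR's code: `StagingBufferSketch`, `FillStagingBuffer`, the single float per cell, and the `scale` parameter are all assumptions made for the example, and no engine types are used so that it runs on plain .NET.

```csharp
using System;

public static class StagingBufferSketch
{
    // Hypothetical helper: converts one float density per cell into one byte
    // per cell, written into a caller-owned staging buffer.
    public static void FillStagingBuffer(ReadOnlySpan<float> density, Span<byte> staging, float scale)
    {
        if (staging.Length < density.Length)
            throw new ArgumentException("Staging buffer is too small", nameof(staging));

        for (int i = 0; i < density.Length; ++i)
        {
            // Clamp to the byte range first, then truncate towards zero
            float value = Math.Clamp(density[i] * scale, 0.0f, 255.0f);
            staging[i] = (byte)value;
        }
    }
}
```

Once the buffer is filled, the whole array would be handed to the image in a single call (SetData in this PR) rather than one SetPixel call per cell.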

BEFORE:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 130
Absorber score: 129.902
Many spawners score: 124.373
Cloud sim multiplier before under 60 FPS: 3.2615
Stress test spawners: 38
Stress test average FPS: 114.786
Stress test min FPS: 36
Total test duration: 135.4s
CPU: Intel(R) Core(TM) i3-8100 CPU @ 3.60GHz (used tasks: 4, native: 4, sim threads: True)
GPU: NVIDIA GeForce GTX 1050 Ti
OS: Windows

AFTER OPTIMISATION:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 146
Absorber score: 147.333
Many spawners score: 145
Cloud sim multiplier before under 60 FPS: 3.3154
Stress test spawners: 39
Stress test average FPS: 142.337
Stress test min FPS: 55
Total test duration: 135.3s

Update. The latest benchmark is:

AFTER OPTIMISATION:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 276
Absorber score: 274.706
Many spawners score: 258.784
Cloud sim multiplier before under 60 FPS: 6.4615
Stress test spawners: 90
Stress test average FPS: 193.82
Stress test min FPS: 45
Total test duration: 218.1s

Progress Checklist

Note: before starting this checklist the PR should be marked as non-draft.

  • PR author has checked that this PR works as intended and doesn't
    break existing features:
    https://wiki.revolutionarygamesstudio.com/wiki/Testing_Checklist
    (this is important as to not waste the time of Thrive team
    members reviewing this PR). This includes gameplay testing by the PR author.
  • Initial code review passed (this and further items should not be checked by the PR author)
  • Functionality is confirmed working by another person (see above checklist link)
  • Final code review is passed and code conforms to the
    styleguide.

Before merging all CI jobs should finish on this PR without errors, if
there are automatically detected style issues they should be fixed by
the PR author. Merging must follow our
styleguide.

@hhyyrylainen
Member

Looks like my comments don't show on that commit's page directly, so I'll copy them here for future reference:

Pretty sure this won't be renting from the pool as there's a max size after which the allocation just goes to new immediately... (as the cloud buffer should be megabytes large).

So allocating a temp buffer and only resizing it when necessary would be the better option.
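
The suggestion above (keep one temporary buffer and only reallocate it when necessary) could look roughly like this. `ReusableBuffer` and `Get` are hypothetical names, and the sketch assumes the consumer wants an array of exactly the requested length, so it reallocates whenever the length changes rather than only when it grows.

```csharp
using System;

public sealed class ReusableBuffer
{
    private byte[] data = Array.Empty<byte>();

    // Returns an array of exactly the requested length. A new array is
    // allocated only when the requested length differs from the current one.
    public byte[] Get(int length)
    {
        if (length < 0)
            throw new ArgumentOutOfRangeException(nameof(length));

        if (data.Length != length)
            data = new byte[length];

        return data;
    }
}
```

With a fixed cloud resolution, every call after the first returns the same array, so the steady state has no allocations. The contents are not cleared between calls, which is fine only if the caller overwrites the whole buffer each time.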

These use tasks because the update image calls with pixels were slow. So now that we have our own temporary buffer, writing directly to that by the data generation tasks would probably be a lot faster (due to data locality).

@hhyyrylainen hhyyrylainen added this to the Release 1.1.0 milestone Apr 30, 2026
@Patryk26g
Contributor

I can confirm that in benchmarks I got a few percent increase in performance.

@xfractalino
Contributor Author

Tomorrow I'll benchmark following @hhyyrylainen's advice of writing data directly to the buffer and avoiding this copy completely.

@xfractalino
Contributor Author

I tried to move the write calls inside the advection step as suggested, but it's actually much slower due to how it's implemented. I tried to change the advection to a semi-Lagrangian scheme (neighbours to cell instead of cell to neighbours), but that is slower because the current cell-to-neighbours implementation skips empty cells effectively.

So, I think that keeping the copy step is necessary as it seems to be the best approach.
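
To make the two advection directions above concrete, here is a deliberately simplified 1D sketch with a single uniform velocity. It is not the PR's algorithm (which is not reproduced here); `AdvectionSketch`, `Push` and `Pull` are hypothetical names, and material that leaves the grid is simply dropped. It only shows why the cell-to-neighbours form can skip empty cells cheaply while the semi-Lagrangian form has to visit every destination cell.

```csharp
using System;

public static class AdvectionSketch
{
    // Cell to neighbours ("push"): each non-empty source cell scatters its
    // content to the two cells around its destination position.
    public static void Push(float[] src, float[] dst, float velocity)
    {
        Array.Clear(dst, 0, dst.Length);

        for (int x = 0; x < src.Length; ++x)
        {
            float amount = src[x];
            if (amount <= 0)
                continue; // an empty cell costs only this comparison

            float target = x + velocity;
            int left = (int)MathF.Floor(target);
            float fraction = target - left;
            Add(dst, left, amount * (1 - fraction));
            Add(dst, left + 1, amount * fraction);
        }
    }

    // Neighbours to cell ("pull", semi-Lagrangian): every destination cell
    // samples the position its content came from, empty or not.
    public static void Pull(float[] src, float[] dst, float velocity)
    {
        for (int x = 0; x < dst.Length; ++x)
        {
            float source = x - velocity;
            int left = (int)MathF.Floor(source);
            float fraction = source - left;
            dst[x] = Read(src, left) * (1 - fraction) + Read(src, left + 1) * fraction;
        }
    }

    private static void Add(float[] cells, int index, float value)
    {
        if (index >= 0 && index < cells.Length)
            cells[index] += value;
    }

    private static float Read(float[] cells, int index)
    {
        return index >= 0 && index < cells.Length ? cells[index] : 0;
    }
}
```

With a uniform velocity both functions produce the same result, but `Pull` does two reads and a blend for every cell, whereas `Push` does real work only where there is something to move. On a mostly empty cloud that difference is consistent with the slowdown reported above.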

@xfractalino
Contributor Author

I ran another benchmark on the last commit and here are the results:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 157
Absorber score: 155.882
Many spawners score: 149.373
Cloud sim multiplier before under 60 FPS: 3.6692
Stress test spawners: 45
Stress test average FPS: 140.705
Stress test min FPS: 57
Total test duration: 144.6s
CPU: Intel(R) Core(TM) i3-8100 CPU @ 3.60GHz (used tasks: 4, native: 4, sim threads: True)
GPU: NVIDIA GeForce GTX 1050 Ti
OS: Windows

@hhyyrylainen
Member

Is it still faster to do the copy from multiple threads? As I believe it should be a straightforward copy between buffers now, so even a single core should be able to just saturate the RAM -> CPU -> RAM bandwidth due to memory prefetching. So I think it would be well worth investigating just doing the final copy with a single straightforward piece of code without tasks.

@xfractalino
Contributor Author

xfractalino commented May 1, 2026

I tried to copy from a single thread, and it appears to be slower:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 102
Absorber score: 102.255
Many spawners score: 99.039
Cloud sim multiplier before under 60 FPS: 2.6385
Stress test spawners: 28
Stress test average FPS: 98.807
Stress test min FPS: 54
Total test duration: 118.3s

But I scheduled one task per cloud, so that the whole buffer is prefetched and all the cores are used, and it's pretty much comparable to master:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 132
Absorber score: 130.863
Many spawners score: 126.098
Cloud sim multiplier before under 60 FPS: 3.2615
Stress test spawners: 38
Stress test average FPS: 118.592
Stress test min FPS: 36
Total test duration: 134.9s

So I think there's not much benefit in keeping the copy on a single core, and it's arguably worse than the current last commit.

I think the reason is that the buffers are too big to fit in the CPU cache.

@xfractalino
Contributor Author

I changed how the clouds are sliced (from squares to slices to ensure cache locality), flattened the density arrays from 2D and used SIMD to copy rapidly from the density arrays to the buffer.
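
A possible reading of the slicing and flattening change, sketched under assumptions: the density grid is stored row by row in a single array, and each task receives a contiguous range of rows so that it walks memory sequentially. `CloudSlicingSketch`, `FlatIndex` and `SplitRows` are hypothetical names, not taken from the PR.

```csharp
using System;

public static class CloudSlicingSketch
{
    // Row-major index into a flattened width * height array
    public static int FlatIndex(int x, int y, int width) => y * width + x;

    // Splits the rows [0, height) into contiguous slices whose sizes differ
    // by at most one row. EndRow is exclusive.
    public static (int StartRow, int EndRow)[] SplitRows(int height, int sliceCount)
    {
        if (sliceCount <= 0)
            throw new ArgumentOutOfRangeException(nameof(sliceCount));

        var slices = new (int StartRow, int EndRow)[sliceCount];
        int baseRows = height / sliceCount;
        int extra = height % sliceCount;
        int start = 0;

        for (int i = 0; i < sliceCount; ++i)
        {
            // The first `extra` slices take one additional row each
            int rows = baseRows + (i < extra ? 1 : 0);
            slices[i] = (start, start + rows);
            start += rows;
        }

        return slices;
    }
}
```

Because each slice covers whole rows, the cells it touches form one contiguous block of the flattened array, starting at `FlatIndex(0, StartRow, width)`. A square tile, by contrast, touches a short run of every row it spans.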

New results:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 171
Absorber score: 168.235
Many spawners score: 164.255
Cloud sim multiplier before under 60 FPS: 3.7154
Stress test spawners: 46
Stress test average FPS: 155.092
Stress test min FPS: 50
Total test duration: 145.9s

@xfractalino
Contributor Author

Even better results by moving the copy inside advection:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 196
Absorber score: 196.882
Many spawners score: 192
Cloud sim multiplier before under 60 FPS: 4.7308
Stress test spawners: 62
Stress test average FPS: 173.024
Stress test min FPS: 50
Total test duration: 172s

@xfractalino xfractalino requested review from a team, Patryk26g and hhyyrylainen May 1, 2026 15:17
@xfractalino xfractalino changed the title from "Optimised compound clouds using SetData" to "Optimised compound clouds" May 1, 2026
@xfractalino
Contributor Author

Implementing SIMD in the diffusion algorithm yields even better results:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 252
Absorber score: 253.882
Many spawners score: 242.255
Cloud sim multiplier before under 60 FPS: 6.0923
Stress test spawners: 84
Stress test average FPS: 187.601
Stress test min FPS: 59
Total test duration: 208.2s
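
For readers unfamiliar with vectorising a diffusion step, the sketch below shows the general shape using the portable `System.Numerics.Vector<float>` type. It is an illustration only: it diffuses a single 1D row with a hypothetical `rate` parameter and leaves the two edge cells unchanged, whereas the PR's actual diffusion code is not shown in this conversation and may differ in every detail.

```csharp
using System;
using System.Numerics;

public static class DiffusionSketch
{
    // Each interior cell keeps (1 - rate) of its value and receives
    // rate / 2 of each of its two neighbours.
    public static void DiffuseRow(ReadOnlySpan<float> src, Span<float> dst, float rate)
    {
        int n = src.Length;
        if (n < 3)
        {
            src.CopyTo(dst);
            return;
        }

        // Edge cells are copied as-is in this sketch
        dst[0] = src[0];
        dst[n - 1] = src[n - 1];

        float side = rate * 0.5f;
        float centre = 1.0f - rate;
        var vSide = new Vector<float>(side);
        var vCentre = new Vector<float>(centre);
        int lanes = Vector<float>.Count;
        int x = 1;

        // Vector loop: processes `lanes` interior cells per iteration
        for (; x + lanes <= n - 1; x += lanes)
        {
            var left = new Vector<float>(src.Slice(x - 1, lanes));
            var mid = new Vector<float>(src.Slice(x, lanes));
            var right = new Vector<float>(src.Slice(x + 1, lanes));
            (mid * vCentre + (left + right) * vSide).CopyTo(dst.Slice(x, lanes));
        }

        // Scalar loop for the remaining interior cells
        for (; x < n - 1; ++x)
            dst[x] = src[x] * centre + (src[x - 1] + src[x + 1]) * side;
    }
}
```

The three overlapping loads per iteration are what makes a flattened, contiguous row layout convenient for this kind of stencil.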

Comment thread src/microbe_stage/CompoundCloudPlane.cs Outdated
Comment thread src/microbe_stage/CompoundCloudPlane.cs
Comment thread src/microbe_stage/CompoundCloudPlane.cs Outdated
vPixel = Vector128.Multiply(vPixel, vScale);
vPixel = Vector128.Min(Vector128.Max(vPixel, vZero), v255);
var vInt = Vector128.ConvertToInt32(vPixel);
var packed16 = Sse2.PackSignedSaturate(vInt, vInt);
Member

Hopefully Sse2 works on Apple Silicon; if not, this needs an alternative path for ARM / no SSE support.

Contributor Author

Rosetta 2 (on Apple Silicon) should fully support Sse2.

Member

We actually don't use Rosetta; Thrive runs natively as ARM code on the newest Macs. So whether Rosetta works or not is irrelevant for our case.

Contributor Author

Okay, I took a look at the docs and Vector128 is supported on Arm64, so I think we should be safe on Apple Silicon. In the advection algorithm there's no issue because we already check for support and fall back to the scalar algorithm.

Member

@hhyyrylainen hhyyrylainen May 4, 2026

This line does use the Sse2 class directly, which is what I'm worried about.

It lives in the System.Runtime.Intrinsics.X86 namespace, so I think it might be unsupported on ARM (as ARM intrinsics are under the Intrinsics.Arm namespace).

Contributor Author

This makes sense. I should make a check and switch to AdvSimd.SaturatingNarrowingUnsignedLower as the docs suggest, though I want to make sure the result is the same.
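
One way to structure such a check, and to compare the two paths for equal results, is sketched below. It reuses the scale, clamp and convert steps from the reviewed lines, guards the x86 path with `Sse2.IsSupported`, and for the non-x86 path swaps in the cross-platform `Vector128.Narrow` as a stand-in for the ARM-specific intrinsic mentioned above (narrowing by truncation is safe here only because the values are already clamped to 0..255). `PixelPackSketch` and its method names are hypothetical, and the final code in the PR may well differ.

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class PixelPackSketch
{
    // Shared front half: scale, clamp to [0, 255], convert to int (truncating)
    private static Vector128<int> ScaleAndClamp(Vector128<float> pixel, float scale)
    {
        var scaled = Vector128.Multiply(pixel, Vector128.Create(scale));
        var clamped = Vector128.Min(Vector128.Max(scaled, Vector128<float>.Zero), Vector128.Create(255.0f));
        return Vector128.ConvertToInt32(clamped);
    }

    // x86 path. Only call this when Sse2.IsSupported is true.
    public static uint PackSse2(Vector128<float> pixel, float scale)
    {
        var ints = ScaleAndClamp(pixel, scale);
        var packed16 = Sse2.PackSignedSaturate(ints, ints);
        var packed8 = Sse2.PackUnsignedSaturate(packed16, packed16);
        return packed8.AsUInt32().ToScalar();
    }

    // Portable path: truncating narrow, valid because of the earlier clamp
    public static uint PackPortable(Vector128<float> pixel, float scale)
    {
        var ints = ScaleAndClamp(pixel, scale).AsUInt32();
        var packed16 = Vector128.Narrow(ints, ints);
        var packed8 = Vector128.Narrow(packed16, packed16);
        return packed8.AsUInt32().ToScalar();
    }

    // Writes the four packed channel bytes, choosing the path at runtime
    public static void Pack(Vector128<float> pixel, float scale, Span<byte> destination)
    {
        uint packed = Sse2.IsSupported ? PackSse2(pixel, scale) : PackPortable(pixel, scale);
        BinaryPrimitives.WriteUInt32LittleEndian(destination, packed);
    }
}
```

On an x86 machine, calling both `PackSse2` and `PackPortable` with the same inputs gives a quick way to confirm that the fallback produces identical bytes before relying on it on ARM.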

Comment thread src/microbe_stage/CompoundCloudPlane.cs Outdated
Comment thread src/microbe_stage/CompoundCloudPlane.cs Outdated
Comment thread src/microbe_stage/CompoundCloudPlane.cs Outdated
@xfractalino xfractalino requested a review from hhyyrylainen May 4, 2026 08:41
@hhyyrylainen
Member

I think there are still a few comments that are unresolved. I'll mark the ones that are resolved now though.

@xfractalino
Contributor Author

xfractalino commented May 4, 2026

The benchmarks are slightly better without forcing inlining of the ProcessPixelAdvection method.

Without forcing inlining:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 277
Absorber score: 281.216
Many spawners score: 261.784
Cloud sim multiplier before under 60 FPS: 6.1538
Stress test spawners: 85
Stress test average FPS: 221.733
Stress test min FPS: 52
Total test duration: 209.9s

Forcing inlining:

Benchmark results for CloudBenchmark v1
Resolution divisor: 2
Cloud spawn score: 277
Absorber score: 274.588
Many spawners score: 257.294
Cloud sim multiplier before under 60 FPS: 6.1231
Stress test spawners: 85
Stress test average FPS: 209.188
Stress test min FPS: 43
Total test duration: 209.1s
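
For context, "forcing inlining" in C# normally means annotating the method with `MethodImplOptions.AggressiveInlining`, as in the sketch below. The method bodies are placeholders and not the PR's ProcessPixelAdvection; the attribute asks the JIT to inline where possible, while an unannotated method leaves the decision to the JIT's own heuristics.

```csharp
using System.Runtime.CompilerServices;

public static class InliningSketch
{
    // Requests inlining at every call site, where the JIT is able to do so
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static float ScaleForced(float value, float factor) => value * factor;

    // No attribute: the JIT decides based on its own size and cost heuristics
    public static float ScaleDefault(float value, float factor) => value * factor;
}
```

Both variants compute the same result. The difference is only in the generated machine code, which is why it has to be judged by benchmarks such as the ones above.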

Comment thread src/microbe_stage/CompoundCloudPlane.cs Outdated
Comment thread src/microbe_stage/CompoundCloudPlane.cs Outdated
Member

@hhyyrylainen hhyyrylainen left a comment

Seems like according to latest testing reports this is now functioning correctly.
And the code still seems about the same as when I last reviewed. I did not re-test on Mac but hopefully nothing changed that could have impacted that.

@hhyyrylainen hhyyrylainen merged commit bc5b643 into master May 8, 2026
4 checks passed
@github-project-automation github-project-automation Bot moved this from In progress to Done in Thrive Planning May 8, 2026
@hhyyrylainen hhyyrylainen deleted the compound-cloud-opt branch May 8, 2026 13:46