Optimised compound clouds#6962
Conversation
|
Looks like my comments don't show on that commit's page directly, so I'll copy them here for future reference:
|
|
I can confirm that in benchmarks I got a few % increase in performance |
|
Tomorrow I'll benchmark following @hhyyrylainen 's advice of writing data directly to the buffer and avoiding this copy completely. |
|
I tried to move the write calls inside the advection step as suggested, but it's actually much slower due to how it's implemented. I tried to change the advection to a semi-lagrangian (neighbours to cell instead of cell to neighbours) but it's slower because the former (current implementation) skips empty cells effectively. So, I think that keeping the copy step is necessary as it seems to be the best approach. |
|
I ran another benchmark on the last commit and here's the results: |
|
Is it still faster to do the copy from multiple threads? As I believe it should be a straightforward copy between buffers now, so even a single core should be able to just saturate the RAM -> CPU -> RAM bandwidth due to memory prefetching. So I think it would be well worth investigating just doing the final copy with a single straightforward piece of code without tasks. |
|
I tried to copy from a single thread, and it appears to be slower: But I scheduled one task per cloud, so that the whole buffer is prefetched and all the cores are used, and it's pretty much comparable to master: So I think there's not much benefit in keeping the copy on a single core, and it's arguably worse than the current last commit. I think the reason is that the buffers are too big to fit in the CPU cache. |
|
I changed how the clouds are sliced (from squares to slices to ensure cache locality), flattened the density arrays from 2D and used SIMD to copy rapidly from the density arrays to the buffer. New results: |
|
Even better results by moving copy inside advection |
|
Implementing SIMD in the diffusion algorithm yields even better results: |
| vPixel = Vector128.Multiply(vPixel, vScale); | ||
| vPixel = Vector128.Min(Vector128.Max(vPixel, vZero), v255); | ||
| var vInt = Vector128.ConvertToInt32(vPixel); | ||
| var packed16 = Sse2.PackSignedSaturate(vInt, vInt); |
There was a problem hiding this comment.
Hopefully Sse2 works on Apple Silicon, if not this needs an alternative path for ARM / no SSE support.
There was a problem hiding this comment.
Rosetta2 (Apple Silicon) should fully support Sse2
There was a problem hiding this comment.
We actually don't use Rosetta, Thrive runs natively as ARM code on the newest Macs. So Rosetta working or not is irrelevant for our case.
There was a problem hiding this comment.
Okay I was taking a look at the docs and Vector128 is supported on Arm64, so I think we should be safe on Apple Silicon. In the advection algorithm there's no issue because we already check that and fall back to the scalar algorithm.
There was a problem hiding this comment.
This line does use Sse2 class directly, which is what I'm worried about.
Which is why I'm worried as System.Runtime.Intrinsics.X86.Sse2 is the namespace it is in so I think it might be unsupported on Arm (as ARM intrinsics are under Intrinsics.Arm namespace).
There was a problem hiding this comment.
This makes sense. I should make a check and switch to AdvSimd.SaturatingNarrowingUnsignedLower as the docs suggest, though I want to make sure the result is the same.
|
I think there's still a few comments that are unsolved. I'll mark the ones that are solved now though. |
|
The benchmarks are slightly better without forcing inlining the |
hhyyrylainen
left a comment
There was a problem hiding this comment.
Seems like according to latest testing reports this is now functioning correctly.
And the code still seems about the same as when I last reviewed. I did not re-test on Mac but hopefully nothing changed that could have impacted that.
This replaces the repeated
SetPixelcalls inside the compound clouds when copying data to the image, as it resulted in an important overhead as shown in flame graphs.Now each cloud has a buffer, and data is wrote into this buffer before calling
SetDatato copy it inside the image.It also replaces scalar logic with SIMD instructions in hotspot and removes the image copy step by using a staging buffer that's written to in the generation logic.
AFTER OPTIMISATION:Benchmark results for CloudBenchmark v1Resolution divisor: 2Cloud spawn score: 146Absorber score: 147.333Many spawners score: 145Cloud sim multiplier before under 60 FPS: 3.3154Stress test spawners: 39Stress test average FPS: 142.337Stress test min FPS: 55Total test duration: 135.3sUpdate. The latest benchmark is:
Progress Checklist
Note: before starting this checklist the PR should be marked as non-draft.
break existing features:
https://wiki.revolutionarygamesstudio.com/wiki/Testing_Checklist
(this is important as to not waste the time of Thrive team
members reviewing this PR). This includes gameplay testing by the PR author.
styleguide.
Before merging all CI jobs should finish on this PR without errors, if
there are automatically detected style issues they should be fixed by
the PR author. Merging must follow our
styleguide.