cputlb: set TLB_NOTDIRTY flag only for executable pages #2337
Unlike QEMU, Unicorn sets the TLB_NOTDIRTY flag unconditionally for all pages (since cpu_physical_memory_is_clean() always returns true). Because of this, every write that hits a freshly filled TLB entry is forced through the slow path (e.g., helper_be_stl_mmu() for PPC), which is much slower than the fast path emitted as inline JIT-compiled code.
TLB_NOTDIRTY's only purpose is to trap the first write to a page that has TBs translated from it (so notdirty_write() can invalidate those stale translations), so there is no reason to set this flag for all pages; it is only needed for executable pages.
This was observed on a 512x512 int matmul running an unmodified AIX PPC32 [1] binary: 30s before this patch, ~8s after.
[1]: https://github.com/Theldus/aix-user
Signed-off-by: Davidson Francis <davidsondfgl@gmail.com>
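The fill-time decision described in the commit message can be sketched as a pair of pure functions. This is a simplified model, not the actual cputlb.c code: the SK_PROT_* values and both function names are placeholders invented for illustration.

```c
#include <stdbool.h>

/* Placeholder bit values for illustration; not the real Unicorn/QEMU constants. */
enum { SK_PROT_READ = 1, SK_PROT_WRITE = 2, SK_PROT_EXEC = 4 };

/* Before the patch, as described: every writable page is marked NOTDIRTY,
 * because the "is this memory clean?" check always answers yes. */
bool notdirty_at_fill_old(unsigned prot)
{
    return (prot & SK_PROT_WRITE) != 0;
}

/* After the patch, as described: only writable pages that are also
 * executable, i.e. pages that could hold translated code. */
bool notdirty_at_fill_new(unsigned prot)
{
    return (prot & SK_PROT_WRITE) != 0 && (prot & SK_PROT_EXEC) != 0;
}
```

Under this model a plain read/write data page is flagged by the old policy but not by the new one, while a read/write/execute page is flagged by both.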
Generally, this looks like an improvement. Generally, it would be better to implement For the test, it's OK to only test whether SMC is handled correctly; there is no good way to test for performance.
Yes, having |
Hi,
While running some micro-benchmarks on my AIX user-mode emulator, I noticed that a simple 'matmul' was running ~4x slower than the same benchmark on QEMU fullsystem ('qemu-system-ppc64').
A perf profile pointed at the softmmu slow path:
helper_be_stl_mmu_ppc() is the softmmu slow-path store helper, and any store whose inline fast-path tag compare fails ends up here.
This helper function (store_helper) eventually calls notdirty_write() (conditioned on TLB_NOTDIRTY), which invalidates any TBs cached from that page and then clears TLB_NOTDIRTY from the entry. This is expected.
However, upon further inspection, the function tlb_set_page_with_attrs() (also in cputlb.c) ORs TLB_NOTDIRTY unconditionally into every freshly filled TLB entry, provided the page has PROT_WRITE permission. In QEMU, this isn't an issue since the function cpu_physical_memory_is_clean() returns a proper value, but in Unicorn it was stubbed out to always return true.
Due to this, Unicorn always sets TLB_NOTDIRTY regardless of whether the underlying memory is clean, and worse, sets it for every writable page, even pages that do not contain TBs.
Conditioning on UC_PROT_EXEC keeps TLB_NOTDIRTY only on pages that could contain TBs. After the patch, execution time drops from ~30s to ~8s on the matmul, roughly matching qemu-system-ppc64.
About tests:
Since this PR is more of a performance improvement than a proper bug fix, it is hard to write an automated test that verifies its correctness.
The test introduced exercises the invalidation of cached TBs after the TLB is already filled, which is not quite what this PR changes: the PR avoids marking RW/RO pages as NOTDIRTY at the first TLB fill, not afterwards. However, it ensures that this patch does not inadvertently break SMC on RWX pages.
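To make the test scenario concrete, here is a toy model of the store path, written as a sketch rather than as the real Unicorn code. Every name in it (toy_page, toy_tlb_fill, toy_store, toy_translate, the SK_PROT_* values) is hypothetical, and the re-arming of the flag in toy_translate is an assumption about how a write is trapped once the TLB entry is already filled.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values for illustration; not the real Unicorn constants. */
enum { SK_PROT_READ = 1, SK_PROT_WRITE = 2, SK_PROT_EXEC = 4 };

/* One guest page together with its (single) TLB entry. */
struct toy_page {
    unsigned prot;      /* permission bits */
    int cached_tbs;     /* translations currently cached from this page */
    bool tlb_notdirty;  /* models TLB_NOTDIRTY on the page's TLB entry */
    int slow_stores;    /* stores that went through the slow path */
    int fast_stores;    /* stores that stayed on the fast path */
};

/* TLB fill with the patched policy: flag only writable + executable pages. */
void toy_tlb_fill(struct toy_page *p)
{
    p->tlb_notdirty = (p->prot & SK_PROT_WRITE) && (p->prot & SK_PROT_EXEC);
}

/* Translating code from a page caches a TB. Assumption: an already-filled
 * writable entry gets the flag again so the next store is trapped. */
void toy_translate(struct toy_page *p)
{
    assert(p->prot & SK_PROT_EXEC);
    p->cached_tbs++;
    if (p->prot & SK_PROT_WRITE) {
        p->tlb_notdirty = true;
    }
}

/* Guest store: a flagged entry takes the slow path, drops stale TBs and
 * clears the flag; an unflagged entry stays on the fast path. */
void toy_store(struct toy_page *p)
{
    assert(p->prot & SK_PROT_WRITE);
    if (p->tlb_notdirty) {
        p->slow_stores++;
        p->cached_tbs = 0;
        p->tlb_notdirty = false;
    } else {
        p->fast_stores++;
    }
}
```

In this model, a fill followed by stores to an RW page never touches the slow path, while on an RWX page the sequence fill, store, translate, store ends with cached_tbs back at zero, which is the SMC property the test is meant to protect.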