Skip to content

cputlb: set TLB_NOTDIRTY flag only for executable pages#2337

Open
Theldus wants to merge 1 commit into
unicorn-engine:devfrom
Theldus:tlb_notdirty-fix
Open

cputlb: set TLB_NOTDIRTY flag only for executable pages#2337
Theldus wants to merge 1 commit into
unicorn-engine:devfrom
Theldus:tlb_notdirty-fix

Conversation

@Theldus
Copy link
Copy Markdown

@Theldus Theldus commented May 20, 2026

Hi,
While running some micro-benchmarks on my AIX user-mode emulator, I noticed a simple 'matmul' was running around ~4x slower than the same benchmark running on QEMU fullsystem/'qemu-system-ppc64'.

A perf profile pointed at the softmmu slow path:

Samples: 133K of event 'cycles:Pu', Event count (approx.): 118136951722
  Children      Self  Command   Shared Object      Symbol
+   51.85%    19.92%  aix-user  libunicorn.so.2    [.] helper_be_stl_mmu_ppc
+   26.16%     7.05%  aix-user  libunicorn.so.2    [.] find_memory_mapping_ppc
+   19.56%     5.60%  aix-user  libunicorn.so.2    [.] notdirty_write.isra.0
+   18.45%    18.28%  aix-user  libunicorn.so.2    [.] flatview_translate_ppc
+   13.68%     0.00%  aix-user  [JIT] tid 2807     [.] 0x00007f8afda25dba
+   13.62%     0.00%  aix-user  [JIT] tid 2807     [.] 0x00007f8afda25d34
+   13.05%     0.00%  aix-user  [JIT] tid 2807     [.] 0x00007f8afda25a91
+   11.18%     0.00%  aix-user  [JIT] tid 2807     [.] 0x00007f8afda25c23
[snip]

helper_be_stl_mmu_ppc() is the softmmu slow-path store helper, and any store whose inline fast-path tag compare fails ends up here.

This helper function (store_helper) eventually calls the 'notdirty_write' (conditioned to TLB_NOTDIRTY) which then invalidates any TBs cached from that page and then clears TLB_NOTDIRTY from the entry, this is expected.

However, upon further inspection, the function tlb_set_page_with_attrs() (also in cputlb.c) ORs TLB_NOTDIRTY unconditionally for every freshly-filled TLB entry, provided the page has a PROT_WRITE permission. In Qemu, this isn't an issue since the function cpu_physical_memory_is_clean() returns a proper value, but on Unicorn it was stubbed out to always true.

Due to this, Unicorn always sets TLB_NOTDIRTY regardless if the underlying memory is clean or not, and worse, sets this for every writable page, even if they do not contain TBs.

Conditioning on UC_PROT_EXEC, keeps TLB_NOTDIRTY only on pages that could contain TBs. After the patch, execution drops from ~30s to ~8s on the matmul, roughly matching qemu-system-ppc64.

About tests:
Since this PR is more related to a 'performance-improvement', rather than a proper bug fix, it is hard to write an automated test to verify its correctness.

The test introduced exercises the cached TBs invalidation after the TLB is already filled, which is not quite the same thing as this PR adds, since this PR avoids marking RW/RO pages as NOTDIRTY at the first TLB fill, not after. However, it ensures this patch does not inadvertently break SMC on RWX pages.

Contrary to Qemu, Unicorn sets the TLB_NOTDIRTY flag unconditionally for
all pages (since cpu_physical_memory_is_clean() always returns true).

Due to this, on every write that hits a freshly filled TLB entry, Unicorn
is forced to go through the slow-path (e.g., helper_be_stl_mmu() for
PPC), which is much slower than the fast path, emitted as inline JITted
code.

Since TLB_NOTDIRTY's only purpose is to trap the first write to a page
that has TBs translated from it (so notdirty_write() can invalidate those
stale translations) there is no reason to set this flag for all pages,
but only for executable pages.

This was observed on a 512x512 int matmul running an unmodified AIX
PPC32 [1] binary: 30s before this patch, ~8s after.

[1]: https://github.com/Theldus/aix-user

Signed-off-by: Davidson Francis <davidsondfgl@gmail.com>
@PhilippTakacs
Copy link
Copy Markdown
Contributor

General looks like an improvement.

General it would be better to implement cpu_physical_memory_is_clean() again. This would also help for situations where a big chunk of memory is mapped rwx and the mmu is used to set permissions. I see this is too much to ask for in this PR.

For the test it's OK to only test if smc is handled correct, there is not a good way to test for performance.

@Theldus
Copy link
Copy Markdown
Author

Theldus commented May 20, 2026

General it would be better to implement cpu_physical_memory_is_clean() again. This would also help for situations where a big chunk of memory is mapped rwx and the mmu is used to set permissions. I see this is too much to ask for in this PR.

Yes, having cpu_physical_memory_is_clean() properly implemented would be the ideal, but it seems to require a major rework in order to support that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants