[ExecuTorch] Partitioner: emit target_device CompileSpec #4272
shoumikhin wants to merge 1 commit into
lanluo-nvidia
left a comment
Could you please raise this on top of main? No more changes will go into 2.12. I will add a release note stating that the device always defaults to cuda:0.
Also make sure to change py/torch_tensorrt/compile.py:1245, since the public ExecuTorch save path currently constructs TensorRTPartitioner() without any compile specs.
Mirror executorch/backends/cuda/cuda_partitioner.py pattern in TensorRTPartitioner: append a target_device="cuda:0" CompileSpec if not already present.

ExecuTorch's PropagateDevicePass (auto-runs in to_executorch()) reads the target_device CompileSpec from delegates and tags their I/O TensorSpec.device. The tag is then serialized into extra_tensor_info.device_type in the .pte.

Today the tag is metadata only: ExecuTorch's default memory planner (enable_non_cpu_memory_planning=False) still allocates all tensors on CPU regardless. But once a future ExecuTorch release wires CUDA-aware memory planning + a CUDA allocator backing memory, the tag becomes load-bearing: TRT subgraph I/O will be allocated directly on CUDA, eliminating per-call host->device staging.

This PR emits the metadata at .pte build time so users get the future fast path automatically, without needing a new torch-tensorrt release.

Also defensive-copies the caller-supplied compile_specs list (was aliased; now copies via list()).

Test plan:
- Verified TRT-delegated .pte emits extra_tensor_info.device_type=1 on delegate I/O tensors (portable runtime).
- Verified runtime numerical outputs unchanged (backend still stages CPU->CUDA via cudaMemcpyAsync as before).
- Caller compile_specs list not mutated post-construction.
Thanks @lanluo-nvidia for the review! Rebased onto main and switched the base branch. Also added the compile.py change you asked for: Docstring on |
What
Mirror executorch/backends/cuda/cuda_partitioner.py's pattern in TensorRTPartitioner: append a target_device="cuda:0" CompileSpec if not already present.

Why
ExecuTorch's PropagateDevicePass (auto-runs in to_executorch()) reads the target_device CompileSpec from delegates and tags their I/O TensorSpec.device. The tag is then serialized into extra_tensor_info.device_type in the .pte.

Today the tag is metadata only: ExecuTorch's default memory planner (enable_non_cpu_memory_planning=False) still allocates all tensors on CPU regardless. But once a future ExecuTorch release wires CUDA-aware memory planning + a CUDA allocator backing memory, the tag becomes load-bearing: TRT subgraph I/O will be allocated directly on CUDA, eliminating per-call host↔device staging.

This PR emits the metadata at .pte build time so users get the future fast path automatically, without needing a new torch-tensorrt release.
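The metadata flow described above can be pictured with a toy model. This is only an illustrative sketch, not ExecuTorch's actual PropagateDevicePass: `ToyTensorSpec`, `parse_target_device`, and `tag_delegate_io` are hypothetical names, and the numeric codes are assumptions (the test plan reports device_type=1 for CUDA; CPU=0 is assumed here).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed numeric device codes: the PR's test plan reports device_type=1 for
# CUDA; CPU=0 is an assumption made for this sketch.
DEVICE_TYPE_CODES: Dict[str, int] = {"cpu": 0, "cuda": 1}


def parse_target_device(value: bytes) -> Tuple[int, int]:
    """Split a spec value such as b"cuda:0" into (device_type_code, device_index)."""
    name, _, index = value.decode("utf-8").partition(":")
    return DEVICE_TYPE_CODES[name], (int(index) if index else 0)


@dataclass
class ToyTensorSpec:
    """Hypothetical stand-in for a delegate I/O tensor spec; defaults to CPU."""
    name: str
    device_type: int = 0
    device_index: int = 0


def tag_delegate_io(io_specs: List[ToyTensorSpec], target_device: bytes) -> List[ToyTensorSpec]:
    """Copy the device named by the target_device spec onto each delegate I/O spec."""
    device_type, device_index = parse_target_device(target_device)
    for spec in io_specs:
        spec.device_type = device_type
        spec.device_index = device_index
    return io_specs


delegate_io = [ToyTensorSpec("trt_input"), ToyTensorSpec("trt_output")]
tag_delegate_io(delegate_io, b"cuda:0")
```

In this model the tag changes only what is recorded on the spec; nothing about allocation changes, which matches the "metadata only" behavior described for today's default memory planner.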
Pattern precedent
Direct copy of executorch/backends/cuda/cuda_partitioner.py:42-52.

Changes
- py/torch_tensorrt/executorch/partitioner.py: append CompileSpec(TARGET_DEVICE_COMPILE_SPEC_KEY, b"cuda:0") to compile_specs in TensorRTPartitioner.__init__ unless the user already provided a target_device spec.
- Also: defensive copy of caller-supplied compile_specs list (was aliased; now list(compile_specs) if compile_specs else []).

Notes
- TensorRTBackend::execute. That was incorrect: the tag is metadata, not a guarantee of pointer provenance. ExecuTorch's cuda_backend.cpp:498-502 documents this exact distinction. The backend continues to use cudaPointerGetAttributes as the source of truth.
- .pte files without extra_tensor_info are still parsed as CPU by the runtime; backend's existing auto-detect path handles them unchanged.

Test plan
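- Verified on a TRT-delegated .pte that extra_tensor_info.device_type=1 is emitted on delegate I/O tensors (portable runtime).
- Verified runtime numerical outputs are unchanged (backend still stages CPU->CUDA via cudaMemcpyAsync as before).
- Caller's compile_specs list is not mutated post-construction.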
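For readers without the diff open, the compile-spec handling described under Changes can be sketched roughly as follows. This is a minimal illustration, not the actual code in partitioner.py: `CompileSpec` is a local stand-in for ExecuTorch's key/value spec so the snippet runs without ExecuTorch installed, the string value of `TARGET_DEVICE_COMPILE_SPEC_KEY` is assumed to be "target_device", and `with_default_target_device` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CompileSpec:
    """Local stand-in for ExecuTorch's CompileSpec (a key with a bytes value)."""
    key: str
    value: bytes


# Assumed value of the constant the PR refers to as TARGET_DEVICE_COMPILE_SPEC_KEY.
TARGET_DEVICE_COMPILE_SPEC_KEY = "target_device"


def with_default_target_device(
    compile_specs: Optional[List[CompileSpec]] = None,
    default_device: bytes = b"cuda:0",
) -> List[CompileSpec]:
    """Hypothetical helper: return a copy of compile_specs that carries a target_device spec."""
    # Defensive copy, so the caller's list is never aliased or mutated.
    specs = list(compile_specs) if compile_specs else []
    # Respect a user-provided target_device spec; append the default only if absent.
    if not any(s.key == TARGET_DEVICE_COMPILE_SPEC_KEY for s in specs):
        specs.append(CompileSpec(TARGET_DEVICE_COMPILE_SPEC_KEY, default_device))
    return specs


caller_specs = [CompileSpec("example_option", b"1")]  # hypothetical user-supplied spec
result = with_default_target_device(caller_specs)
```

After the call, `caller_specs` still holds a single entry while `result` carries the extra target_device spec, which is the non-mutation property the last test-plan item checks.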