Update (2026-04-05): This post was updated after volresample gained mode='cubic' and align_corners=True support for linear and cubic resampling. The benchmark tables below were rerun against the current code (v0.2.0).

TL;DR: I wrote a small Cython + OpenMP library that resamples 3D volumes (e.g. medical images). It’s an (almost) drop-in replacement for torch.nn.functional.interpolate and torch.nn.functional.grid_sample, now supports nearest, linear, area, and cubic interpolation, and runs up to 11× faster than PyTorch on CPU in the current benchmark suite. For cubic resampling, it is also substantially faster than SciPy on the same machine, reaching 19× in the 4-thread benchmark.

GitHub: JoHof/volresample
pip install volresample


The Problem

If you’ve ever built a medical imaging pipeline — segmentation, registration, anything involving CT or MRI — you spend a surprising fraction of your runtime just moving voxels around. Resampling is everywhere:

  • Normalizing spacing before feeding a scan into a network
  • Augmenting training data with resizing or deformations
  • Running inference in a sliding-window fashion, then stitching patches back together at a different resolution
  • Post-processing: upsampling predicted segmentation masks back to the original patient space

For all of these, PyTorch’s F.interpolate and F.grid_sample are the go-to tools. They’re well-tested, GPU-accelerated, and have a clean API. But then comes the awkward reality in most production and research pipelines:

  1. You’re working on CPU anyway. Most preprocessing and postprocessing happens outside the GPU hot path, in DataLoader workers, in MONAI transforms, in post-hoc analysis scripts. You don’t have a CUDA context, and you don’t want one.
  2. PyTorch is at least a 2 GB install. If you’re shipping a lightweight container, a CLI tool, or just a utility script, pulling in all of PyTorch for F.interpolate is embarrassing.
  3. PyTorch CPU performance has some surprising gaps. Area-mode interpolation doesn’t parallelize well for single-image workloads on PyTorch CPU. int16 isn’t natively supported and needs round-trips through float32. These aren’t bugs — they’re just not the priority for a GPU-first library.

What volresample does

It exposes two functions, intentionally (almost) mirroring PyTorch’s API:

import volresample

# Like F.interpolate(..., mode='trilinear', align_corners=False)
resampled = volresample.resample(volume, (128, 128, 128), mode='linear')

# Like F.interpolate(..., mode='trilinear', align_corners=True)
aligned = volresample.resample(volume, (192, 192, 192), mode='linear', align_corners=True)

# Cubic resampling, matching scipy.ndimage.zoom(order=3, mode='reflect')
cubic = volresample.resample(volume, (96, 96, 96), mode='cubic')

# Like F.grid_sample(..., align_corners=False)
sampled = volresample.grid_sample(input, grid, mode='linear', padding_mode='zeros')

Both accept NumPy arrays. The resample() function handles 3D (D, H, W), 4D (C, D, H, W) and 5D (N, C, D, H, W) inputs. Thread count is configurable:

volresample.set_num_threads(8)

That’s essentially the whole API.


Use Cases

Preprocessing Pipelines

In medical imaging, raw scans come in all kinds of resolutions — a CT scan might be 512×512×350 with anisotropic voxel spacing, while your network expects 128×128×128 or an isotropic resolution of 1.5 mm. You need to resample every single scan before training, and potentially again for each augmented copy. For evaluation you might have to resample back to the original spacing.
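The target shape for spacing normalization is simple arithmetic: scale each axis by the ratio of old to new spacing. A small helper (hypothetical, not part of volresample) makes this concrete:

```python
def shape_for_spacing(shape, old_spacing, new_spacing):
    """Compute the output shape that resamples a volume from
    old_spacing (mm/voxel) to new_spacing, preserving physical extent."""
    return tuple(
        max(1, round(sz * old / new))
        for sz, old, new in zip(shape, old_spacing, new_spacing)
    )

# A 512x512x350 CT at 0.7x0.7x1.0 mm resampled to isotropic 1.5 mm:
print(shape_for_spacing((512, 512, 350), (0.7, 0.7, 1.0), (1.5, 1.5, 1.5)))
# -> (239, 239, 233)
```

The resulting tuple is exactly what you would pass as the size argument to a resample call.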

Data Augmentation

Random elastic and affine augmentations are typically implemented as a grid_sample call: compute a displacement field, then sample the volume at displaced coordinates. This is the hot path in augmentation-heavy training.
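Conceptually, the grid_sample step above can be pictured in pure NumPy with a nearest-neighbor lookup. This is a toy illustration of the operation, not the library's implementation, and it uses absolute voxel coordinates rather than PyTorch's normalized [-1, 1] grid:

```python
import numpy as np

def toy_grid_sample_nearest(vol, coords):
    """vol: (D, H, W); coords: (..., 3) absolute voxel coordinates.
    Out-of-bounds samples read as 0, mimicking padding_mode='zeros'."""
    idx = np.rint(coords).astype(np.int64)
    d, h, w = idx[..., 0], idx[..., 1], idx[..., 2]
    inside = ((d >= 0) & (d < vol.shape[0]) &
              (h >= 0) & (h < vol.shape[1]) &
              (w >= 0) & (w < vol.shape[2]))
    out = np.zeros(idx.shape[:-1], dtype=vol.dtype)
    out[inside] = vol[d[inside], h[inside], w[inside]]
    return out

# An elastic augmentation: identity grid plus a random displacement field.
rng = np.random.default_rng(0)
vol = rng.standard_normal((8, 8, 8)).astype(np.float32)
grid = np.stack(np.meshgrid(*[np.arange(8)] * 3, indexing='ij'), axis=-1).astype(np.float32)
warped = toy_grid_sample_nearest(vol, grid + rng.normal(0, 0.5, grid.shape))
```

With zero displacement this reduces to an identity copy; the real library does the same coordinate lookup with linear weights and a compiled inner loop.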

In the current benchmark case (1x2x96x96x96 -> 80x80x80), grid_sample(..., mode='linear', padding_mode='zeros') drops from 27.2 ms at 1 thread to 7.0 ms at 4 threads in volresample, while PyTorch goes from 35.6 ms to 30.7 ms. That turns a modest 1.3× lead into a 4.4× lead once the CPU is actually used in parallel.

For the person whose augmentation pipeline is the DataLoader bottleneck (it’s more people than you’d think), this kind of throughput difference directly translates into faster experiments.

Inference Without a GPU

Lots of clinical deployment happens on CPU-only hardware. Maybe it’s a hospital workstation with no GPU, a laptop, an edge device, or a cloud VM that doesn’t have GPU quota. Your model runs on PyTorch, but most of the work around the model — resampling the input to the right spacing, resampling the output back — can be done faster without PyTorch at all.


The Benchmarks

These numbers are from two fresh runs on my Intel i7-8565U, executed exactly as:

cd /home/jo/projects/volresample
nix develop --command python -u tests/benchmark.py --threads 1
nix develop --command python -u tests/benchmark.py --threads 4

Those runs took about 33.6 seconds at 1 thread and 24.4 seconds at 4 threads. They compare volresample against PyTorch for nearest/linear/area and grid_sample, and against SciPy for cubic mode.

Here is the full benchmark suite in one table:

| Case | Ref | Shape | Ref 1T | volresample 1T | Speedup 1T | Ref 4T | volresample 4T | Speedup 4T |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nearest | PyTorch | 128x128x128 -> 64x64x64 | 0.77 ms | 0.33 ms | 2.33× | 0.34 ms | 0.09 ms | 3.73× |
| nearest (uint8) | PyTorch | 128x128x128 -> 64x64x64 | 0.61 ms | 0.25 ms | 2.49× | 0.22 ms | 0.07 ms | 2.99× |
| nearest (int16) | PyTorch | 128x128x128 -> 64x64x64 | 1.45 ms | 0.33 ms | 4.37× | 0.97 ms | 0.09 ms | 11.38× |
| linear | PyTorch | 128x128x128 -> 64x64x64 | 2.28 ms | 1.13 ms | 2.01× | 0.78 ms | 0.42 ms | 1.84× |
| linear, align_corners=True | PyTorch | 96x96x96 -> 144x144x144 | 25.56 ms | 11.26 ms | 2.27× | 7.41 ms | 3.53 ms | 2.10× |
| area | PyTorch | 160x160x160 -> 80x80x80 | 20.96 ms | 6.37 ms | 3.29× | 22.57 ms | 2.33 ms | 9.69× |
| 4D linear | PyTorch | 4x96x96x96 -> 64x64x64 | 8.98 ms | 4.84 ms | 1.86× | 4.27 ms | 1.77 ms | 2.41× |
| 5D linear | PyTorch | 2x4x80x80x80 -> 48x48x48 | 7.84 ms | 4.45 ms | 1.76× | 2.46 ms | 1.92 ms | 1.28× |
| cubic, align_corners=False | SciPy | 128x128x128 -> 64x64x64 | 168.92 ms | 58.35 ms | 2.90× | 180.29 ms | 22.16 ms | 8.14× |
| cubic, align_corners=True | SciPy | 96x128x80 -> 64x160x48 | 229.36 ms | 36.49 ms | 6.29× | 234.54 ms | 12.27 ms | 19.12× |
| grid linear, zeros | PyTorch | 1x2x96x96x96 -> 80x80x80 | 35.61 ms | 27.23 ms | 1.31× | 30.74 ms | 7.02 ms | 4.38× |
| grid nearest, zeros | PyTorch | 1x2x96x96x96 -> 80x80x80 | 9.26 ms | 4.64 ms | 2.00× | 9.21 ms | 2.80 ms | 3.29× |
| grid linear, reflection | PyTorch | 1x2x80x96x64 -> 72x88x56 | 38.88 ms | 17.60 ms | 2.21× | 44.33 ms | 7.13 ms | 6.21× |

The high-level picture is more interesting when you compare 1 thread and 4 threads side by side:

  • Across the full 13-case suite, the average speedup grows from 2.70× at 1 thread to 5.89× at 4 threads.
  • The biggest scaling wins are the workloads where the reference implementation barely benefits from extra threads. Area mode goes from 3.29× to 9.69×, cubic with align_corners=True goes from 6.29× to 19.12×, and grid_sample with reflection padding goes from 2.21× to 6.21×.
  • grid_sample is a good example of why the 1-thread vs 4-thread comparison matters. At 1 thread, linear/zeros is only 1.31× faster. At 4 threads, the exact same case is 4.38× faster because volresample scales while PyTorch moves only a little.
  • Not everything widens at 4 threads. The 5D batched linear case drops from 1.76× to 1.28× because PyTorch scales relatively well there too. That is a useful reminder that this is not a universal “always 10× faster” story.
  • The single largest PyTorch win is still int16 nearest-neighbor. That path jumps from 4.37× at 1 thread to 11.38× at 4 threads because volresample keeps the direct int16 implementation while PyTorch still pays the dtype-conversion tax.

Two results deserve more explanation because they’re not just “Cython is fast”:

The int16 story (11.4× vs PyTorch at 4T): PyTorch has no native int16 interpolation path. When you want to call F.interpolate on an int16 tensor, you have to cast to float32 and eventually back. volresample compiles a specialized nearest-neighbor path for int16 directly using Cython’s fused types — no conversion, no extra allocation. The largest PyTorch speedup in the current suite still comes almost entirely from avoiding that round-trip.
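The round-trip a float-only backend has to pay can be sketched in NumPy terms. This illustrates the cost, not PyTorch's actual code path, and strided slicing stands in for the nearest-neighbor index lookup:

```python
import numpy as np

vol = np.arange(4 * 4 * 4, dtype=np.int16).reshape(4, 4, 4)

# What a float-only backend must do for nearest on int16:
# two dtype conversions bracketing the sampling work.
as_float = vol.astype(np.float32)          # full-volume conversion in
picked = as_float[::2, ::2, ::2]           # the actual sampling work
round_trip = picked.astype(np.int16)       # full-volume conversion out

# A direct int16 path just indexes, with no conversion or extra allocation:
direct = vol[::2, ::2, ::2]

assert np.array_equal(round_trip, direct)
```

Both paths produce the same voxels; the direct path simply skips two conversions whose cost scales with the volume size.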

The cubic story (up to 19.1× vs SciPy at 4T): cubic mode is implemented as tricubic B-spline interpolation with an IIR prefilter and validated against scipy.ndimage.zoom(order=3, mode='reflect'). The align_corners flag switches between SciPy’s grid_mode=True and grid_mode=False semantics. This makes cubic useful not just as a new checkbox feature, but as a practical high-quality CPU resampler with a clear reference implementation.
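For reference, the align_corners flag boils down to two standard mappings from output index to fractional source coordinate. These are the usual PyTorch conventions, sketched from memory rather than copied from volresample's source:

```python
def src_coord(i, in_size, out_size, align_corners):
    """Map output index i to a (fractional) source coordinate."""
    if align_corners:
        # endpoints of the input and output grids coincide
        if out_size == 1:
            return 0.0
        return i * (in_size - 1) / (out_size - 1)
    # half-pixel convention: voxel centers, not corners, line up
    scale = in_size / out_size
    return (i + 0.5) * scale - 0.5

# Downsampling 8 -> 4: the sampled endpoints differ between conventions.
print(src_coord(0, 8, 4, True), src_coord(3, 8, 4, True))    # 0.0 7.0
print(src_coord(0, 8, 4, False), src_coord(3, 8, 4, False))  # 0.5 6.5
```

The interpolation kernel (linear or cubic) is then evaluated at these fractional coordinates; only the coordinate mapping changes with the flag.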

The area mode story (3.3× at 1T, 9.7× at 4T): Area interpolation remains one of the strongest CPU wins. For straightforward 3D downsampling, volresample parallelizes efficiently over spatial work while PyTorch barely changes between the 1-thread and 4-thread runs on this machine.
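For integer downsampling factors, area interpolation is just a block mean, which NumPy can express with a reshape. A reference sketch of the semantics, not the library's parallel loop:

```python
import numpy as np

def area_downsample(vol, factor):
    """Average non-overlapping factor^3 blocks; requires a divisible shape."""
    d, h, w = vol.shape
    f = factor
    assert d % f == 0 and h % f == 0 and w % f == 0
    return vol.reshape(d // f, f, h // f, f, w // f, f).mean(axis=(1, 3, 5))

vol = np.arange(4 * 4 * 4, dtype=np.float32).reshape(4, 4, 4)
small = area_downsample(vol, 2)
print(small.shape)  # (2, 2, 2)
```

Each output voxel depends only on its own block, which is why the real implementation can parallelize the spatial loop with no synchronization.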


Architecture Choices

Here are some design choices that I think helped with speed.

Fused types for multi-dtype without code duplication

Nearest-neighbor interpolation works on any integer or float type; you’re just reading a value and copying it. Rather than writing three separate implementations, Cython’s fused types let you write one:

ctypedef fused numeric_type:
    uint8_t
    int16_t
    float

cdef void _resample_nearest(numeric_type* data_ptr, numeric_type* output_ptr, ...) noexcept nogil:
    # one implementation, specialized once per type at compile time

At compile time, Cython generates separate specializations for each type. At runtime, it dispatches based on the dtype of the numpy array. The dispatch overhead is a single Python-level type check before the C call; the inner loop runs with no branches.

Linear and area modes only support float32, since interpolation requires fractional weights. This is a deliberate trade-off: doing the weighted arithmetic in integer types would compromise accuracy.

Pre-computed index tables

The most obvious optimization, which many naive implementations miss (including, incidentally, the helpful LLM I used during development): source coordinates depend only on the output index and the scale ratio. Computing them inside the inner loop is wasteful. Pre-computing them into a small per-axis table eliminates redundant floating-point arithmetic for every voxel:

# Computed once, O(D + H + W) work
for od in range(out_d): d_indices[od] = compute_index(od, scale_d)
for oh in range(out_h): h_indices[oh] = compute_index(oh, scale_h)
# ...

# Main loop: pure table lookups
for od in prange(out_d, nogil=True):
    for oh in range(out_h):
        for ow in range(out_w):
            value = data_ptr[d_indices[od] * H * W + h_indices[oh] * W + w_indices[ow]]

For trilinear interpolation the pre-computed tables also hold the fractional weights, which are reused across the 8 neighbors without recomputation.
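In NumPy the same idea is three small index arrays plus fancy indexing. A runnable sketch using one common half-pixel nearest convention (an illustration, not the Cython source):

```python
import numpy as np

def nearest_resample(vol, out_shape):
    """Nearest-neighbor resample via precomputed per-axis index tables."""
    tables = []
    for in_sz, out_sz in zip(vol.shape, out_shape):
        scale = in_sz / out_sz
        # center-sampling convention: floor((i + 0.5) * scale), clipped to range
        idx = np.minimum((np.arange(out_sz) + 0.5) * scale, in_sz - 1).astype(np.int64)
        tables.append(idx)
    d, h, w = tables  # O(D + H + W) work, done once
    # main step: pure table lookups via broadcasting
    return vol[d[:, None, None], h[None, :, None], w[None, None, :]]

vol = np.arange(8 ** 3, dtype=np.float32).reshape(8, 8, 8)
out = nearest_resample(vol, (4, 4, 4))
```

The three tables cost O(D + H + W) to build, after which the O(D·H·W) main loop does no floating-point work at all.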

No branching in inner loops

For grid_sample, different padding modes (zeros, border, reflection) require fundamentally different out-of-bounds handling. Checking a flag inside the innermost loop is cheap in absolute terms, but it may break the compiler’s ability to reason about the code and prevent certain vectorization patterns.

The solution is to specialize: there are 6 separate functions, one for each (mode, padding_mode) combination. The Python entry point dispatches to the right one based on the arguments. The inner loop is clean, branch-free code that the compiler can actually optimize. I wouldn't write code like this without LLM support, though — maintaining that much duplicated code by hand would be a nightmare. Strange new world.

_grid_sample_bilinear_zeros()
_grid_sample_bilinear_border()
_grid_sample_bilinear_reflection()
_grid_sample_nearest_zeros()
...

OpenMP on the outermost spatial dimension

Parallelization happens on the depth dimension (prange over out_d). This is the standard choice because:

  • It maximizes the chunk size per thread, minimizing synchronization overhead
  • Each thread gets a contiguous slab of memory to write, improving cache efficiency
  • There are no data dependencies between output voxels

Thread count defaults to min(cpu_count, 4) on the theory that most machines top out their beneficial scaling around 4 threads for memory-bound workloads, and you don’t want a library quietly saturating all cores. It’s fully configurable though.
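That default policy is one line (a sketch of the policy described above, assuming os.cpu_count() as the core count source):

```python
import os

def default_num_threads():
    # Cap at 4: memory-bound resampling rarely scales past that,
    # and a library shouldn't quietly saturate every core.
    return min(os.cpu_count() or 1, 4)
```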

Build system: architecture-aware compiler flags

# setup.py
if platform.machine() in ('x86_64', 'AMD64', 'i686', 'i386'):
    extra_compile_args = ['-O3', '-mavx2', '-mfma', '-fopenmp']
else:
    extra_compile_args = ['-O3', '-fopenmp']  # ARM, etc.

AVX2 + FMA flags are applied on x86 to let the compiler auto-vectorize. The combined Cython directives (boundscheck=False, wraparound=False, cdivision=True, initializedcheck=False) eliminate nearly all Python overhead from the inner loop.

Testing against PyTorch and SciPy

Every supported mode is tested against the reference implementation it is supposed to match. For nearest, linear, area, and grid_sample, that’s PyTorch. For cubic, it’s SciPy. This isn’t just a correctness check — it’s also documentation. The test suite is the canonical statement of behavioral compatibility:

# vol: a float32 NumPy array, shaped (N, C, D, H, W) for the PyTorch comparison
torch_output = F.interpolate(torch.from_numpy(vol), size, mode='trilinear', align_corners=False)
cython_output = volresample.resample(vol, size, mode='linear')
assert np.allclose(torch_output.numpy(), cython_output, atol=1e-5)

scipy_output = zoom(vol, zoom=factors, order=3, mode='reflect', grid_mode=True)
cython_output = volresample.resample(vol, size, mode='cubic', align_corners=False)
assert np.allclose(scipy_output, cython_output, atol=1e-5)

What it doesn’t do

To be honest about scope:

  • No GPU. If your data is already on a CUDA device, use PyTorch. The library is for CPU workloads.
  • No 2D images. Not directly exposed in the API, but achievable via volumes with a singleton dimension (1, H, W). No multi-threading benefit in that case, though.
  • No autograd. It’s a NumPy library. There’s no gradient computation.
  • float32 only for trilinear/area. By design — interpolation in lower precision gives questionable results, especially for medical images where intensity values carry clinical meaning.
  • No complete duplication of the PyTorch or SciPy APIs. I still don’t want to carry over every historical quirk. linear is linear regardless of dimensionality, cubic is exposed as a single mode instead of mirroring SciPy’s broader spline API.

Hence the following correspondences:

PyTorch correspondence:

| volresample | PyTorch F.interpolate |
| --- | --- |
| mode='nearest' | mode='nearest-exact' |
| mode='linear', align_corners=False | mode='trilinear', align_corners=False |
| mode='linear', align_corners=True | mode='trilinear', align_corners=True |
| mode='area' | mode='area' |

| volresample | PyTorch F.grid_sample |
| --- | --- |
| mode='nearest' | mode='nearest' |
| mode='linear' | mode='bilinear' |

SciPy correspondence:

| volresample | SciPy |
| --- | --- |
| mode='cubic', align_corners=False | scipy.ndimage.zoom(order=3, mode='reflect', grid_mode=True) |
| mode='cubic', align_corners=True | scipy.ndimage.zoom(order=3, mode='reflect', grid_mode=False) |

Prior art

SciPy’s ndimage.zoom also does this and has been doing it for decades. The main differences: volresample is faster for the modes it supports, has explicit batched multi-channel support, matches PyTorch’s coordinate conventions for linear/grid sampling, and now has a cubic mode with an explicit SciPy correspondence instead of leaving users to guess how the coordinates line up.

SimpleITK’s ResampleImageFilter is the standard for actual clinical resampling and handles all the metadata correctly (spacing, direction cosines, etc.). volresample makes no attempt to handle image metadata — it’s a tensor operation, not a full medical image processing library.


Closing

I’m not going to pretend this is a revolutionary piece of work. It’s a fairly small Cython library that does one thing well. But “one thing well” is exactly what some people might need.

If you’re building a medical imaging system in Python and your profiler keeps pointing at resampling, give it a try. If you find something wrong with the numerical results or the performance claims, I want to know.