autocast.benchmarking.inference


Reusable inference benchmarking utilities.

make_synthetic_batch(example_batch, batch_size)

Create a synthetic batch with matching tensor shapes for benchmarking.

Parameters:

example_batch (Batch)
batch_size (int)

Return type:

Batch
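To illustrate what "matching tensor shapes" could mean, here is a minimal sketch that works on shape tuples only. It assumes the batch dimension is the leading one and that only that dimension changes; neither assumption is confirmed by this page. `synthetic_shapes` and the field names are hypothetical and not part of autocast.

```python
def synthetic_shapes(
    example_shapes: dict[str, tuple[int, ...]],
    batch_size: int,
) -> dict[str, tuple[int, ...]]:
    """Return shapes that match the example, with a new leading batch dimension.

    Assumption: dimension 0 is the batch dimension for every field.
    """
    return {
        name: (batch_size, *shape[1:])
        for name, shape in example_shapes.items()
    }


# Hypothetical field names, for illustration only.
example = {"inputs": (2, 3, 64, 64), "targets": (2, 1, 64, 64)}
print(synthetic_shapes(example, batch_size=16))
```

A real synthetic batch would then be filled with random or zero tensors of these shapes, since only the shapes matter for timing.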

measure_flops(model, example_batch)

Measure FLOPs for one forward pass and report GFLOPs/sample.

Parameters:

model (Module)
example_batch (Batch)

Return type:

dict[str, float]
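The GFLOPs/sample figure can be read as a simple normalisation. The sketch below assumes the total FLOP count covers one forward pass over the whole batch and is divided by the batch size; how measure_flops() actually counts FLOPs, and which keys it returns, is not stated here. `gflops_per_sample` is a hypothetical helper.

```python
def gflops_per_sample(total_flops: int, batch_size: int) -> float:
    """Convert a per-batch FLOP count into GFLOPs per sample.

    Assumption: total_flops is for one forward pass over batch_size samples.
    """
    return total_flops / batch_size / 1e9


# 32 billion FLOPs for a batch of 8 samples is 4 GFLOPs per sample.
print(gflops_per_sample(32_000_000_000, batch_size=8))
```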

benchmark_model(model, example_batch, *, n_warmup, n_benchmark, batch_size)

Benchmark model throughput and latency using synthetic batches.

Parameters:

model (Module)
example_batch (Batch)
n_warmup (int)
n_benchmark (int)
batch_size (int)

Return type:

dict[str, float]
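The general warmup-then-measure pattern can be sketched in plain Python as follows. This is not the library's implementation: `time_forward` and its `synchronize` hook are hypothetical, and the returned keys other than latency_ms_per_batch are borrowed from benchmark_rollout() as an assumption. On a GPU, the hook would be a device synchronisation call so that queued work finishes before the clock is read.

```python
import time
from typing import Callable


def time_forward(
    forward: Callable[[], object],
    *,
    n_warmup: int,
    n_benchmark: int,
    batch_size: int,
    synchronize: Callable[[], None] = lambda: None,
) -> dict[str, float]:
    """Time repeated calls to `forward`, discarding the warmup runs."""
    # Warmup runs are not timed, so one-off costs do not skew the result.
    for _ in range(n_warmup):
        forward()
    synchronize()

    total_seconds = 0.0
    for _ in range(n_benchmark):
        start = time.perf_counter()
        forward()
        synchronize()  # wait for pending device work before stopping the clock
        total_seconds += time.perf_counter() - start

    latency_ms_per_batch = 1000.0 * total_seconds / n_benchmark
    return {
        "throughput_samples_per_sec": n_benchmark * batch_size / total_seconds,
        "latency_ms_per_batch": latency_ms_per_batch,
        "latency_ms_per_sample": latency_ms_per_batch / batch_size,
    }


# Stand-in workload: a short sleep in place of a model forward pass.
stats = time_forward(
    lambda: time.sleep(0.001), n_warmup=2, n_benchmark=5, batch_size=4
)
print(stats)
```

Keeping the synchronisation inside the timed region matters on GPU, where kernel launches return before the work completes.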

benchmark_rollout(model, example_batch, *, stride, max_rollout_steps, n_warmup, n_benchmark, batch_size, free_running_only=True)

Benchmark model rollout throughput and latency using synthetic batches.

Uses the same methodology as benchmark_model(): warmup runs are discarded and CUDA synchronisation is applied on GPU for accurate wall-clock timings. tqdm progress output is suppressed during the run.

Returns:

Dictionary with keys:

throughput_samples_per_sec: Batch elements (samples) processed per second: n * batch_size / t. One “sample” is one batch element completing the full rollout. Use latency_ms_per_step to compare per-step cost against benchmark_model().
latency_ms_per_batch: Mean wall-clock time in ms for one full rollout call (analogous to latency_ms_per_batch in benchmark_model()).
latency_ms_per_sample: latency_ms_per_batch / batch_size.
latency_ms_per_step: Mean time in ms per autoregressive step.
peak_gpu_memory_mb: Peak GPU memory allocated in MB (CUDA only).

Parameters:

model (Module)
example_batch (Batch)
stride (int)
max_rollout_steps (int)
n_warmup (int)
n_benchmark (int)
batch_size (int)
free_running_only (bool)
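The relationships between the timing keys returned by benchmark_rollout() can be checked with a small worked example. The sketch assumes n is the number of timed rollout calls, t is the total elapsed time in seconds, and latency_ms_per_step is the per-batch latency divided by the number of autoregressive steps. `rollout_metrics` and `n_steps` are hypothetical; how the step count follows from stride and max_rollout_steps is not documented here.

```python
def rollout_metrics(
    total_seconds: float,
    n_benchmark: int,
    batch_size: int,
    n_steps: int,
) -> dict[str, float]:
    """Derive rollout timing metrics from a total elapsed time.

    Assumption: n_steps is the number of autoregressive steps per rollout.
    """
    latency_ms_per_batch = 1000.0 * total_seconds / n_benchmark
    return {
        "throughput_samples_per_sec": n_benchmark * batch_size / total_seconds,
        "latency_ms_per_batch": latency_ms_per_batch,
        "latency_ms_per_sample": latency_ms_per_batch / batch_size,
        "latency_ms_per_step": latency_ms_per_batch / n_steps,
    }


# 10 timed rollouts of 8 steps each, batch size 4, taking 2 seconds in total.
metrics = rollout_metrics(total_seconds=2.0, n_benchmark=10, batch_size=4, n_steps=8)
print(metrics)
```

With these numbers, throughput is 20 samples per second, one rollout call takes 200 ms, each sample accounts for 50 ms, and each step for 25 ms.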
