autocast.benchmarking.inference

Reusable inference benchmarking utilities.

make_synthetic_batch(example_batch, batch_size)

Create a synthetic batch with matching tensor shapes for benchmarking.

Parameters:
  • example_batch (Batch)

  • batch_size (int)

Return type:

Batch

measure_flops(model, example_batch)

Measure FLOPs for one forward pass and report GFLOPs/sample.

Parameters:
  • model (Module)

  • example_batch (Batch)

Return type:

dict[str, float]
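
The page does not say how the FLOP count is obtained or which keys the returned dictionary holds, so the following is only a sketch of the final unit conversion: a raw FLOP count for one forward pass, divided by the batch size and by 1e9. The helper name `gflops_per_sample` and the input numbers are hypothetical.

```python
def gflops_per_sample(total_flops: float, batch_size: int) -> float:
    """Convert a raw FLOP count for one forward pass into GFLOPs per sample."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # FLOPs per batch -> FLOPs per sample -> GFLOPs per sample.
    return total_flops / batch_size / 1e9


# Hypothetical numbers: 3.2e12 FLOPs counted for a batch of 8 samples.
print(gflops_per_sample(3.2e12, 8))  # 400.0
```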

benchmark_model(model, example_batch, *, n_warmup, n_benchmark, batch_size)

Benchmark model throughput and latency using synthetic batches.

Parameters:
  • model (Module)

  • example_batch (Batch)

  • n_warmup (int)

  • n_benchmark (int)

  • batch_size (int)

Return type:

dict[str, float]

benchmark_rollout(model, example_batch, *, stride, max_rollout_steps, n_warmup, n_benchmark, batch_size, free_running_only=True)

Benchmark model rollout throughput and latency using synthetic batches.

Uses the same methodology as benchmark_model(): warmup runs are discarded and CUDA synchronisation is applied on GPU for accurate wall-clock timings. tqdm progress output is suppressed during the run.
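
The methodology described above (discarded warmup runs, then timed runs bracketed by device synchronisation) can be sketched roughly as follows. This is not the library's implementation: `time_calls` and its `sync` hook are hypothetical names. On a CUDA device, `sync` could be `torch.cuda.synchronize`; on CPU it can be left as `None`.

```python
import time
from typing import Callable, Optional


def time_calls(
    fn: Callable[[], object],
    n_warmup: int,
    n_benchmark: int,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return the total wall-clock seconds taken by n_benchmark calls of fn."""
    # Warmup calls are executed but never timed.
    for _ in range(n_warmup):
        fn()
    if sync is not None:
        sync()  # drain pending device work before starting the clock
    start = time.perf_counter()
    for _ in range(n_benchmark):
        fn()
    if sync is not None:
        sync()  # wait for queued device work before stopping the clock
    return time.perf_counter() - start


# Stand-in workload: record each call so the call count can be checked.
calls = []
elapsed = time_calls(lambda: calls.append(1), n_warmup=2, n_benchmark=5)
print(len(calls))  # 7 (2 warmup calls + 5 timed calls)
```

Synchronising both before and after the timed loop matters on GPU because kernel launches are asynchronous; without it the clock would mostly measure launch overhead.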

Returns:

Dictionary with keys:

throughput_samples_per_sec

Batch elements (samples) processed per second: n * batch_size / t, where n is the number of timed runs and t is the total elapsed time in seconds. One “sample” is one batch element completing the full rollout. Use latency_ms_per_step to compare per-step cost against benchmark_model().

latency_ms_per_batch

Mean wall-clock time in ms for one full rollout call (analogous to latency_ms_per_batch in benchmark_model()).

latency_ms_per_sample

latency_ms_per_batch / batch_size.

latency_ms_per_step

Mean time in ms per autoregressive step.

peak_gpu_memory_mb

Peak GPU memory allocated in MB (CUDA only).
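
As a worked example of how the timing keys relate to each other, the sketch below derives them from a single total elapsed time. The helper `rollout_metrics` and the input numbers are hypothetical, and it assumes that latency_ms_per_step is the per-call latency divided by the number of autoregressive steps in one rollout, which the page does not state explicitly. peak_gpu_memory_mb is omitted because it cannot be derived from timings.

```python
def rollout_metrics(
    total_seconds: float, n_benchmark: int, batch_size: int, n_steps: int
) -> dict[str, float]:
    """Derive throughput and latency figures from one total elapsed time."""
    # Mean wall-clock time of one full rollout call, in milliseconds.
    latency_ms_per_batch = total_seconds * 1000.0 / n_benchmark
    return {
        "throughput_samples_per_sec": n_benchmark * batch_size / total_seconds,
        "latency_ms_per_batch": latency_ms_per_batch,
        "latency_ms_per_sample": latency_ms_per_batch / batch_size,
        # Assumption: every rollout call runs the same number of steps.
        "latency_ms_per_step": latency_ms_per_batch / n_steps,
    }


# Hypothetical run: 10 timed rollouts of 20 steps each, batch size 8, 4 s total.
metrics = rollout_metrics(total_seconds=4.0, n_benchmark=10, batch_size=8, n_steps=20)
for key, value in metrics.items():
    print(f"{key}: {value}")
```

With these inputs the throughput is 20 samples per second, each rollout call takes 400 ms, each sample 50 ms, and each step 20 ms.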

Parameters:
  • model (Module)

  • example_batch (Batch)

  • stride (int)

  • max_rollout_steps (int)

  • n_warmup (int)

  • n_benchmark (int)

  • batch_size (int)

  • free_running_only (bool)