autocast.benchmarking.inference
Reusable inference benchmarking utilities.
- make_synthetic_batch(example_batch, batch_size)
Create a synthetic batch with matching tensor shapes for benchmarking.
- measure_flops(model, example_batch)
Measure FLOPs for one forward pass and report GFLOPs/sample.
- benchmark_model(model, example_batch, *, n_warmup, n_benchmark, batch_size)
Benchmark model throughput and latency using synthetic batches.
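The warmup-then-time pattern this kind of benchmark relies on can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `time_callable`, `step`, and `sync` are hypothetical names, and the returned keys are assumed to mirror those documented for benchmark_rollout().

```python
import time
from typing import Callable, Dict


def time_callable(
    step: Callable[[], None],
    *,
    n_warmup: int,
    n_benchmark: int,
    batch_size: int,
    sync: Callable[[], None] = lambda: None,
) -> Dict[str, float]:
    """Time `step` after discarding warmup runs (hypothetical helper)."""
    if n_benchmark <= 0 or batch_size <= 0:
        raise ValueError("n_benchmark and batch_size must be positive")

    # Warmup runs are executed but never timed.
    for _ in range(n_warmup):
        step()
    sync()  # ensure warmup work has finished before the clock starts

    start = time.perf_counter()
    for _ in range(n_benchmark):
        step()
    sync()  # wait for any queued device work before stopping the clock
    elapsed = time.perf_counter() - start

    latency_ms_per_batch = 1000.0 * elapsed / n_benchmark
    return {
        # Key names are assumptions modelled on the documented rollout keys.
        "throughput_samples_per_sec": n_benchmark * batch_size / elapsed,
        "latency_ms_per_batch": latency_ms_per_batch,
        "latency_ms_per_sample": latency_ms_per_batch / batch_size,
    }
```

On a CUDA device, `sync` would typically be `torch.cuda.synchronize`, since kernel launches are asynchronous and the host clock would otherwise stop before the device has finished.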
- benchmark_rollout(model, example_batch, *, stride, max_rollout_steps, n_warmup, n_benchmark, batch_size, free_running_only=True)
Benchmark model rollout throughput and latency using synthetic batches.
Uses the same methodology as benchmark_model(): warmup runs are discarded and CUDA synchronisation is applied on GPU for accurate wall-clock timings. tqdm progress output is suppressed during the run.
- Returns:
Dictionary with keys:
  - throughput_samples_per_sec: Batch elements (samples) processed per second: n * batch_size / t. One “sample” is one batch element completing the full rollout. Use latency_ms_per_step to compare per-step cost against benchmark_model().
  - latency_ms_per_batch: Mean wall-clock time in ms for one full rollout call (analogous to latency_ms_per_batch in benchmark_model()).
  - latency_ms_per_sample: latency_ms_per_batch / batch_size.
  - latency_ms_per_step: Mean time in ms per autoregressive step.
  - peak_gpu_memory_mb: Peak GPU memory allocated in MB (CUDA only).
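The returned timing metrics are related by simple arithmetic, sketched below. This is an illustration under stated assumptions rather than the library's code: `rollout_metrics` is a hypothetical helper, `n` is taken to be the number of timed rollout calls (`n_benchmark`), `t` the total elapsed time in seconds, and the per-step latency is assumed to be the per-call latency divided by the number of autoregressive steps (`n_steps`).

```python
from typing import Dict


def rollout_metrics(
    total_seconds: float,
    *,
    n_benchmark: int,
    batch_size: int,
    n_steps: int,
) -> Dict[str, float]:
    """Derive rollout metrics from one timed run (hypothetical helper)."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if min(n_benchmark, batch_size, n_steps) <= 0:
        raise ValueError("n_benchmark, batch_size and n_steps must be positive")

    # Mean wall-clock time of one full rollout call, in milliseconds.
    latency_ms_per_batch = 1000.0 * total_seconds / n_benchmark
    return {
        # n * batch_size / t, with n = n_benchmark and t = total_seconds (assumed).
        "throughput_samples_per_sec": n_benchmark * batch_size / total_seconds,
        "latency_ms_per_batch": latency_ms_per_batch,
        "latency_ms_per_sample": latency_ms_per_batch / batch_size,
        # Assumption: per-step cost is the per-call latency spread over n_steps.
        "latency_ms_per_step": latency_ms_per_batch / n_steps,
    }
```

For example, 10 timed rollouts of 5 steps each with a batch size of 4, finishing in 2 seconds, would give 20 samples per second, 200 ms per batch, 50 ms per sample, and 40 ms per step under these assumptions.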
- Parameters: