vllm bench sweep serve_sla¶
JSON CLI Arguments¶
When passing JSON CLI arguments, the following sets of arguments are equivalent:
--json-arg '{"key1": "value1", "key2": {"key3": "value2"}}'--json-arg.key1 value1 --json-arg.key2.key3 value2
Additionally, list elements can be passed individually using +:
--json-arg '{"key4": ["value3", "value4", "value5"]}'--json-arg.key4+ value3 --json-arg.key4+='value4,value5'
Options¶
--serve-cmd¶
The command used to run the server: vllm serve ...
Default: None
--bench-cmd¶
The command used to run the benchmark: vllm bench serve ...
Default: None
--after-bench-cmd¶
After a benchmark run is complete, invoke this command instead of the default ServerWrapper.clear_cache().
Default: None
--show-stdout¶
If set, logs the standard output of subcommands. Useful for debugging but can be quite spammy.
Default: False
--serve-params¶
Path to JSON file containing a list of parameter combinations for the vllm serve command. If both serve_params and bench_params are given, this script will iterate over their Cartesian product.
Default: None
--bench-params¶
Path to JSON file containing a list of parameter combinations for the vllm bench serve command. If both serve_params and bench_params are given, this script will iterate over their Cartesian product.
Default: None
-o, --output-dir¶
The directory to which results are written.
Default: results
--num-runs¶
Number of runs per parameter combination.
Default: 3
--dry-run¶
If set, prints the commands to run, then exits without executing them.
Default: False
--resume¶
Set this to the name of a directory under output_dir (which is a timestamp) to resume a previous execution of this script, i.e., only run parameter combinations for which there are still no output files.
Default: None
sla options¶
--sla-params¶
Path to JSON file containing a list of SLA constraints to satisfy. Each constraint is expressed in {"<KEY>": "<OP><VALUE>"} format, e.g.: {"p99_e2el_ms": "<=500"} means that the E2E latency should be less than 500ms 99%% of the time. Setting this option runs this script in SLA mode, which searches for the maximum sla_variable that satisfies the constraints for each combination of serve_params, bench_params, and sla_params.
Default: None
--sla-variable¶
Possible choices: request_rate, max_concurrency
Whether to tune request rate or maximum concurrency to satisfy the SLA constraints.
Default: request_rate