# Additional Configuration
Additional configuration is a mechanism vLLM provides so that plugins can customize internal behavior on their own. vLLM Ascend uses this mechanism to make the project more flexible.
## How to use
Users can pass additional configuration in either online or offline mode. Take Qwen3 as an example:
Online mode:

```bash
vllm serve Qwen/Qwen3-8B --additional-config='{"config_key":"config_value"}'
```
Offline mode:

```python
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-8B", additional_config={"config_key": "config_value"})
```
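In online mode, the value of `--additional-config` must be a single valid JSON string. A minimal sketch of producing that string from a Python dict (the keys here are placeholders; substitute options from the tables below):

```python
import json

# Placeholder config for illustration only.
additional_config = {"enable_kv_nz": False, "refresh": False}

# json.dumps serializes Python booleans as lowercase true/false,
# which is what the JSON flag expects on the command line.
cli_value = json.dumps(additional_config)
print(cli_value)  # {"enable_kv_nz": false, "refresh": false}
```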
## Configuration options
The following table lists additional configuration options available in vLLM Ascend:
| Name | Type | Default | Description |
|---|---|---|---|
| `xlite_graph_config` | dict | | Configuration options for Xlite graph mode |
| `weight_prefetch_config` | dict | | Configuration options for weight prefetch |
| `finegrained_tp_config` | dict | | Configuration options for module tensor parallelism |
| `ascend_compilation_config` | dict | | Configuration options for Ascend compilation |
| `eplb_config` | dict | | Configuration options for EPLB |
| `npugraph_ex_config` | dict | | Configuration options for the npugraph_ex backend |
| `refresh` | bool | | Whether to refresh the global Ascend configuration. Usually used by RLHF or UT/e2e test cases. |
| | str | | Configuration file path for msprobe dump (eager mode). |
| | bool | | Whether to enable asynchronous exponential overlap. |
| | bool | | When the shared expert runs in DP, it delivers better performance but consumes more memory. Currently only DeepSeek series models are supported. |
| `multistream_overlap_shared_expert` | bool | | Whether to enable the multi-stream shared expert. Only takes effect on MoE models with shared experts. |
| | bool | | Whether to enable the multi-stream overlap gate. Only takes effect on MoE models with shared experts. |
| | bool | | Whether to enable the recompute scheduler. |
| | bool | | Whether to enable CPU binding. Only takes effect on ARM CPUs; when enabled, A3 uses a NUMA-balanced binding strategy and other device types use a NUMA-affinity one. |
| | int | | SLO limits for dynamic batch; a new scheduler supporting the dynamic batch feature. |
| | bool | | Whether to enable npugraph_ex graph mode. |
| | list | | The custom shape list for page attention ops. |
| `enable_kv_nz` | bool | | Whether to enable the KV cache NZ layout. Only takes effect on models using MLA (e.g., DeepSeek). |
| | dict | | Configuration options for Layer Sharding Linear |
| | int | | For dense models, sequence parallelism is enabled only when num_tokens > threshold. |
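Options omitted from `additional_config` keep their defaults. The lookup can be pictured as a simple dict merge; this is an illustrative sketch, not the actual vLLM Ascend implementation, and the default values shown are assumptions:

```python
# Illustrative defaults only; consult the table above for the real options.
DEFAULTS = {
    "enable_kv_nz": False,
    "multistream_overlap_shared_expert": False,
    "weight_prefetch_config": {},
}

def resolve(additional_config):
    """Overlay user-supplied options on top of the defaults."""
    merged = dict(DEFAULTS)
    merged.update(additional_config or {})
    return merged

resolved = resolve({"multistream_overlap_shared_expert": True})
print(resolved["multistream_overlap_shared_expert"])  # True
print(resolved["enable_kv_nz"])                       # False
```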
The details of each configuration option are as follows:
### xlite_graph_config

| Name | Type | Default | Description |
|---|---|---|---|
| | bool | | Whether to enable Xlite graph mode. Currently only Llama, Qwen dense series models, and Qwen3-VL are supported. |
| | bool | | Whether to enable Xlite for both the prefill and decode stages. By default, Xlite is enabled only for the decode stage. |
### weight_prefetch_config

| Name | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | | Whether to enable weight prefetch. |
| `prefetch_ratio` | dict | | Prefetch ratio of each weight. |
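Each leaf value in `prefetch_ratio` appears to be the fraction of that weight to prefetch, as in the example at the end of this page. A hypothetical validation helper (`validate_prefetch_ratio` is not part of vLLM Ascend) might look like:

```python
def validate_prefetch_ratio(ratio_cfg):
    """Check that every prefetch ratio lies in [0.0, 1.0]."""
    for module, weights in ratio_cfg.items():
        for name, ratio in weights.items():
            if not 0.0 <= ratio <= 1.0:
                raise ValueError(f"{module}.{name}: ratio {ratio} out of range")

# Ratios taken from the example at the end of this page.
validate_prefetch_ratio({
    "attn": {"qkv": 1.0, "o": 1.0},
    "moe": {"gate_up": 0.8},
})
```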
### finegrained_tp_config

| Name | Type | Default | Description |
|---|---|---|---|
| `lmhead_tensor_parallel_size` | int | | The custom tensor parallel size of lm_head. |
| `oproj_tensor_parallel_size` | int | | The custom tensor parallel size of o_proj. |
| `embedding_tensor_parallel_size` | int | | The custom tensor parallel size of embedding. |
| `mlp_tensor_parallel_size` | int | | The custom tensor parallel size of mlp. |
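To illustrate what a per-module tensor parallel size means, here is a toy calculation (the vocabulary size is an assumption, not tied to any specific model): with `lmhead_tensor_parallel_size` set to 8, each rank holds one eighth of the lm_head output dimension.

```python
vocab_size = 151_936            # assumed vocabulary size, for illustration only
lmhead_tensor_parallel_size = 8

# Each of the 8 ranks holds an equal slice of the lm_head output dimension.
shard_size = vocab_size // lmhead_tensor_parallel_size
print(shard_size)  # 18992
```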
### ascend_compilation_config

| Name | Type | Default | Description |
|---|---|---|---|
| | bool | | Whether to enable the fuse_norm_quant pass. |
| | bool | | Whether to enable the fuse_qknorm_rope pass. If Triton is not available in the environment, set it to False. |
| | bool | | Whether to enable the fuse_allreduce_rms pass. It is set to False because it conflicts with SP. |
### eplb_config

| Name | Type | Default | Description |
|---|---|---|---|
| | bool | | Whether to enable dynamic EPLB. |
| | str | | When using expert load balancing for an MoE model, an expert map path needs to be passed in. |
| | int | | The forward iteration at which EPLB begins. |
| | int | | The forward iteration at which the EPLB worker finishes its CPU tasks. |
| | str | | Save the expert load calculation results to a new expert table in the specified directory. |
| | int | | The number of redundant experts to specify during initialization. |
### npugraph_ex_config

| Name | Type | Default | Description |
|---|---|---|---|
| | bool | | Whether to enable the npugraph_ex backend. |
| | bool | | Whether to enable static kernels. Suitable for scenarios where shape changes are minimal and some time is available for static kernel compilation. |
| | bool | | Whether to enable the fuse_norm_quant pass. |
| | bool | | Whether to enable the fuse_qknorm_rope pass. If Triton is not available in the environment, set it to False. |
| | bool | | Whether to enable the fuse_allreduce_rms pass. It is set to False because it conflicts with SP. |
## Example
An example of additional configuration is as follows:
```python
{
    "weight_prefetch_config": {
        "enabled": True,
        "prefetch_ratio": {
            "attn": {
                "qkv": 1.0,
                "o": 1.0,
            },
            "moe": {
                "gate_up": 0.8
            },
            "mlp": {
                "gate_up": 1.0,
                "down": 1.0
            }
        },
    },
    "finegrained_tp_config": {
        "lmhead_tensor_parallel_size": 8,
        "oproj_tensor_parallel_size": 8,
        "embedding_tensor_parallel_size": 8,
        "mlp_tensor_parallel_size": 8,
    },
    "enable_kv_nz": False,
    "multistream_overlap_shared_expert": True,
    "refresh": False
}
```