Suffix Speculative Decoding

Suffix Speculative Decoding#

Introduction#

Suffix Decoding is an optimization technique for speculative decoding based on pattern matching. It simultaneously retrieves repetitive sequences from both the prompt and the generated content, using frequency statistics to predict the most likely token continuations. Unlike traditional speculative decoding methods, Suffix Decoding runs entirely on the CPU, eliminating the need for additional GPU resources or draft models, which results in superior acceleration for repetitive tasks such as AI agents and code generation.

This document provides step-by-step guidance on how to deploy and benchmark the Suffix Decoding speculative inference technology supported by vllm-ascend on Atlas A2 hardware. The setup utilizes a single Atlas 800T A2 node with a 4-card deployment of the Qwen3-32B model instance. Benchmarking is conducted using authentic open-source datasets covering the following categories:

Dataset Category	Dataset Name
Code Generation	HumanEval
Common Sense Reasoning	ARC
Mathematical Reasoning	gsm8k
Natural Language Understanding	SuperGLUE_BoolQ
Comprehensive Examination	agieval
Multi-turn Dialogue	sharegpt

The benchmarking tool used in this tutorial is AISBench, which supports performance testing for all the datasets listed above. The final section of this tutorial presents a performance comparison between enabling and disabling Suffix Decoding under the condition of satisfying an SLO TPOT < 50ms across different datasets and concurrency levels. Validations demonstrate that the Qwen3-32B model achieves a throughput improvement of approximately 20% to 80% on various real-world datasets when Suffix Decoding is enabled.

Download vllm-ascend Image#

This tutorial uses the official image, version v0.13.0rc1. Use the following command to download:

docker pull quay.io/ascend/vllm-ascend:v0.13.0rc1

Run with Docker#

Container startup command:

# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.13.0rc1
export NAME=vllm-ascend

# Run the container using the defined variables
# This test uses four Atlas A2 NPU cards to create the container.
# Mount the hccn.conf file from the host node into the container.

docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:\
/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /root/.cache:/root/.cache \
-it $IMAGE bash

Install arctic-inference#

Before enabling Suffix Decoding speculative inference on Ascend, the Arctic Inference plugin must be installed. Arctic Inference is an open-source plugin launched by Snowflake specifically to optimize LLM inference speed. For detailed technical principles, please refer to the following article: Fastest Speculative Decoding in vLLM with Arctic Inference and Arctic Training. Install it within the container using the following command:

pip install arctic-inference

vLLM Instance Deployment#

Use the following command to start the container service instance. Speculative inference is enabled via the --speculative-config parameter, where method is set to suffix. For this test, num_speculative_tokens is uniformly set to 3.

# set the NPU device number:
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
# Set the operator dispatch pipeline level to 1 and disable manual memory control in ACLGraph
export TASK_QUEUE_ENABLE=1
# Enable the AIVector core to directly schedule ROCE communication.
export HCCL_OP_EXPANSION_MODE="AIV"
# Enable MLP prefetch for better performance.
export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1
# Enable FlashComm_v1 optimization when tensor parallel is enabled.
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

vllm serve /data/Qwen3-32B \
  --served-model-name qwen3 \
  --trust-remote-code \
  --distributed-executor-backend mp \
  --tensor-parallel-size 4 \
  --max-model-len 5500 \
  --max-num-batched-tokens 40960 \
  --speculative-config '{"method": "suffix", "num_speculative_tokens": 3}' \
  --gpu-memory-utilization 0.9 \
  --additional-config '{"pa_shape_list":[48,64,72,80]}' \
  --port 8011

AISbench Benchmark Testing#

Performance for all open-source datasets is tested using AISbench. For specific instructions, refer to Using AISBench for performance evaluation.

Model Configuration:

# "ignore_eos" must be set to "False", and "max_out_len" should be set to a large value to allow the model to output completely and naturally.

from ais_bench.benchmark.models import VLLMCustomAPIChatStream

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr='vllm-api-stream-chat',
        path="<path_to_your_model>/Qwen3-32B",
        model="qwen3",
        request_rate = 0,
        retry = 2,
        host_ip = "<your_server_ip>",
        host_port = 8011,
        max_out_len = 4000,
        batch_size= 16,
        trust_remote_code=False,
        generation_kwargs = dict(
            temperature = 0,
            ignore_eos = False
        )
    )
]

Performance Benchmarking Commands:

# Example command to test gsm8k dataset performance using the first 100 prompts. Commands for other datasets are similar.
ais_bench --models vllm_api_stream_chat \
  --datasets gsm8k_gen_0_shot_cot_str_perf \
  --debug --summarizer default_perf --mode perf --num-prompts 100

Test Results#

Below are the detailed test results of the six open-source datasets in this evaluation. Compared to the baseline performance, the improvement in TPOT and throughput performance at different concurrency levels after enabling Suffix Decoding varies across datasets. The extent of improvement after enabling Suffix Decoding differs among the datasets. Below is a summary of the results:

Dataset Category	Typical Representative	Throughput Improvement (BS=1-10)	SLO TPOT
High Gain	AGIEval, GSM8K	> 50%	< 50ms
Medium-Low Gain	ARC, ShareGPT	20% ~ 30%	< 50ms

Below is the raw detailed test results:

Concurrency	Avg Input	Avg Output	Requests	Base TPOT(ms)	Base Throughput(TPS)	Suffix TPOT(ms)	Suffix Throughput(TPS)	Accept Rate	TPOT Gain	TPS Gain
Humaneval
1	150	2700	100	55.1	18.1	37.9	26.3	27.0%	45.2%	45.1%
15	150	2700	100	61.6	233.8	45.8	318.2	27.0%	34.6%	36.1%
26	150	2700	100	64.7	403.8	50.9	519.2	27.0%	27.2%	28.6%
ARC
1	76	960	100	52.8	18.9	39.5	25.4	23.9%	33.7%	34.6%
8	76	960	100	59.1	125.4	47.0	163.1	23.9%	25.7%	30.0%
15	76	960	100	59.8	245.8	48.9	311.7	23.9%	22.3%	26.8%
GSM8K
1	67	1570	100	55.5	18.0	35.7	28.5	31.1%	55.6%	58.4%
17	67	1570	100	61.5	279.8	45.4	403.0	31.1%	35.6%	44.0%
26	67	1570	100	63.9	396.4	50.0	527.6	31.1%	27.8%	33.1%
ShareGPT
1	666	231	327	54.1	18.3	39.2	24.1	23.9%	37.9%	31.5%
8	666	231	327	58.8	125.0	46.2	153.2	23.9%	27.1%	22.5%
14	666	231	327	61.8	227.0	49.9	273.9	23.9%	23.8%	20.7%
SuperGLUE_BoolQ
1	207	314	100	54.1	18.4	36.1	26.8	33.4%	49.8%	45.6%
16	207	314	100	60.0	229.7	43.5	303.9	33.4%	38.0%	32.3%
32	207	314	100	62.7	396.4	47.8	507.5	33.4%	31.3%	28.0%
Agieval
1	735	1880	100	53.1	18.7	31.8	34.1	50.3%	66.8%	81.9%
24	735	1880	100	64.0	381.2	43.3	629.0	50.3%	47.8%	65.0%
34	735	1880	100	70.0	494.6	50.2	768.4	50.3%	39.4%	55.3%