PD-Colocated with Mooncake Multi-Instance#
Getting Started#
vLLM-Ascend now supports PD-colocated deployment with Mooncake features. This guide provides step-by-step instructions to test these features with constrained resources.
Using the Qwen2.5-72B-Instruct model as an example, this guide demonstrates how to use vllm-ascend v0.11.0 (with vLLM v0.11.0) on two Atlas 800T A2 nodes to deploy two vLLM instances. Each instance occupies 4 NPU cards and uses PD-colocated deployment.
Verify Multi-Node Communication Environment#
Physical Layer Requirements#
The two Atlas 800T A2 nodes must be physically interconnected via a RoCE network. Without RoCE interconnection, cross-node KV Cache access performance will be significantly degraded.
All NPU cards must communicate properly. Intra-node communication uses HCCS, while inter-node communication uses the RoCE network.
Verification Process#
The following process serves as a reference example. Please modify parameters such as IP addresses according to your actual environment.
Single Node Verification:
Execute the following commands sequentially. The results must all be
successand the status must beUP:# Check the remote switch ports for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done # Get the link status of the Ethernet ports (UP or DOWN) for i in {0..7}; do hccn_tool -i $i -link -g ; done # Check the network health status for i in {0..7}; do hccn_tool -i $i -net_health -g ; done # View the network detected IP configuration for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done # View gateway configuration for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
Check NPU HCCN Configuration:
Ensure that the hccn.conf file exists in the environment. If using Docker, mount it into the container.
cat /etc/hccn.confGet NPU IP Addresses:
for i in {0..7}; do hccn_tool -i $i -ip -g; done
Cross-Node PING Test:
# Execute the following command on each node, replacing x.x.x.x # with the target node's NPU card address. for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x; done
Check NPU TLS Configuration
# The tls settings should be consistent across all nodes. for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
Run with Docker#
Start a Docker container on each node.
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0
export NAME=vllm-ascend
# Run the container using the defined variables
# This test uses four NPU cards to create the container.
# Mount the hccn.conf file from the host node into the container.
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:\
/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
(Optional) Install Mooncake#
Mooncake is pre-installed and functional in the v0.11.0 image. The following installation steps are optional.
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and compilation guide: kvcache-ai/Mooncake.
First, obtain the Mooncake project using the following command:
git clone -b v0.3.8.post1 --depth 1 https://github.com/kvcache-ai/Mooncake.git
cd Mooncake
git submodule update --init --recursive
Install MPI:
apt-get install mpich libmpich-dev -y
Install the relevant dependencies (Go installation is not required):
bash dependencies.sh -y
Compile and install:
mkdir build
cd build
cmake .. -DUSE_ASCEND_DIRECT=ON
make -j
make install
After installation, verify that Mooncake is installed correctly:
python -c "import mooncake; print(mooncake.__file__)"
# Expected output path:
# /usr/local/Ascend/ascend-toolkit/latest/python/
# site-packages/mooncake/__init__.py
Start Mooncake Master Service#
To start the Mooncake master service in one of the node containers, use the following command:
docker exec -it vllm-ascend bash
cd /vllm-workspace/Mooncake
mooncake_master --port 50088 \
--eviction_high_watermark_ratio 0.95 \
--eviction_ratio 0.05
Parameter |
Value |
Explanation |
|---|---|---|
port |
50088 |
Port for the master service |
eviction_high_watermark_ratio |
0.95 |
High watermark ratio (95% threshold) |
eviction_ratio |
0.05 |
Percentage to evict when full (5%) |
Create a Mooncake Configuration File Named mooncake.json#
The template for the mooncake.json file is as follows:
{
"metadata_server": "P2PHANDSHAKE",
"protocol": "ascend",
"device_name": "",
"use_ascend_direct": true,
"master_server_address": "<your_server_ip>:50088",
"global_segment_size": 107374182400
}
Parameter |
Value |
Explanation |
|---|---|---|
metadata_server |
P2PHANDSHAKE |
Point-to-point handshake mode |
protocol |
ascend |
Ascend proprietary protocol |
use_ascend_direct |
true |
Enable direct hardware access |
master_server_address |
90.90.100.188:50088(for example) |
Master server address |
global_segment_size |
107374182400 |
Size per segment (100 GB) |
vLLM Instance Deployment#
Create containers on both Node 1 and Node 2, and launch the Qwen2.5-72B-Instruct model service in each to test the reusability and performance of cross-node, cross-instance KV Cache. Instance 1 utilizes NPU cards [0-3] on the first Atlas 800T A2 server, while Instance 2 utilizes cards [0-3] on the second server.
Deploy Instance 1#
Replace file paths, host, and port parameters based on your actual environment configuration.
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/\
latest/python/site-packages:$LD_LIBRARY_PATH
export MOONCAKE_CONFIG_PATH="/vllm-workspace/mooncake.json"
# NPU buffer pool: quantity:size(MB)
# Allocates 4 buffers of 8MB each for KV transfer
export ASCEND_BUFFER_POOL=4:8
vllm serve <path_to_your_model>/Qwen2.5-72B-Instruct/ \
--served-model-name qwen \
--dtype bfloat16 \
--max-model-len 25600 \
--tensor-parallel-size 4 \
--host <your_server_ip> \
--port 8002 \
--max-num-batched-tokens 4096 \
--gpu-memory-utilization 0.9 \
--kv-transfer-config '{
"kv_connector": "MooncakeConnectorStoreV1",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"use_layerwise": false,
"mooncake_rpc_port": "0",
"load_async": true,
"register_buffer": true
}
}'
Deploy Instance 2#
The deployment method for Instance 2 is identical to Instance 1. Simply
modify the --host and --port parameters according to your Instance 2
configuration.
Configuration Parameters#
Parameter |
Value |
Explanation |
|---|---|---|
kv_connector |
MooncakeConnectorStoreV1 |
Use StoreV1 version |
kv_role |
kv_both |
Enable both produce and consume |
use_layerwise |
false |
Transfer entire cache (see note) |
mooncake_rpc_port |
0 |
Automatic port assignment |
load_async |
true |
Enable asynchronous loading |
register_buffer |
true |
Required for PD-colocated mode |
Note on use_layerwise:
false: Transfer entire KV Cache (suitable for cross-node with sufficient bandwidth)true: Layer-by-layer transfer (suitable for single-node memory constraints)
Benchmark#
We recommend using the AISBench tool to assess performance. The test uses Dataset A, consisting of fully random data, with the following configuration:
Input/output tokens: 1024/10
Total requests: 100
Concurrency: 25
The test procedure consists of three steps:
Step 1: Baseline (No Cache)#
Send Dataset A to Instance 1 on Node 1 and record the Time to First Token (TTFT) as TTFT1.
Preparation for Step 2#
Before Step 2, send a fully random Dataset B to Instance 1. Due to the unified HBM/DRAM KV Cache with LRU (Least Recently Used) eviction policy, Dataset B’s cache evicts Dataset A’s cache from HBM, leaving Dataset A’s cache only in Node 1’s DRAM.
Step 2: Local DRAM Hit#
Send Dataset A to Instance 1 again to measure the performance when hitting the KV Cache in local DRAM. Record the TTFT as TTFT2.
Step 3: Cross-Node DRAM Hit#
Send Dataset A to Instance 2. With the Mooncake KV Cache pool, this results in a cross-node KV Cache hit from Node 1’s DRAM. Record the TTFT as TTFT3.
Model Configuration:
from ais_bench.benchmark.models import VLLMCustomAPIChatStream
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content
models = [
dict(
attr="service",
type=VLLMCustomAPIChatStream,
abbr='vllm-api-stream-chat',
path="<path_to_your_model>/Qwen2.5-72B-Instruct",
model="qwen",
request_rate = 0,
retry = 2,
host_ip = "<your_server_ip>",
host_port = 8002,
max_out_len = 10,
batch_size= 25,
trust_remote_code=False,
generation_kwargs = dict(
temperature = 0,
ignore_eos = True,
),
)
]
Performance Benchmarking Commands:
ais_bench --models vllm_api_stream_chat \
--datasets gsm8k_gen_0_shot_cot_str_perf \
--debug --summarizer default_perf --mode perf
Test Results#
Requests |
Concur |
TTFT1 (ms) |
TTFT2 (ms) |
TTFT3 (ms) |
|---|---|---|---|---|
100 |
25 |
2322 |
739 |
948 |