The demand for self-hosted large language model (LLM) inference in Asia-Pacific has grown dramatically in 2025–2026, driven by data sovereignty requirements, latency sensitivity for real-time AI applications, and the cost economics of running high-volume inference on your own infrastructure rather than paying OpenAI or Anthropic per-token rates.
A Hong Kong VPS is an increasingly popular choice for AI API hosting targeting mainland China users — CN2 GIA routing keeps inference latency competitive even for real-time chat and code completion applications, and Hong Kong’s legal environment avoids ICP filing requirements.
This guide covers deploying Ollama (the most accessible LLM inference server) and a production-grade API wrapper on a Hong Kong VPS, along with optimisation strategies for high-throughput inference.
Why Host Your LLM in Hong Kong?
- Data sovereignty — keep user prompts and model weights on infrastructure you control, outside US cloud jurisdiction
- China latency — CN2 GIA routing gives mainland Chinese users 30–50ms to your API endpoint versus 150–300ms to US-based OpenAI endpoints
- Cost control — at scale, self-hosted inference on a well-utilised VPS costs a fraction of API-per-token pricing
- Model flexibility — run open-source models (Llama 3, Mistral, Qwen, DeepSeek) without API restrictions or moderation filtering
- No ICP filing — serve Chinese users without mainland China hosting requirements
Hardware Requirements by Model Size
| Model Size | Example Models | Minimum RAM | Recommended VPS |
|---|---|---|---|
| 1–4B parameters | Llama 3.2 1B, Phi-3 mini (3.8B) | 4 GB | 4 GB RAM VPS (CPU inference) |
| 7–8B parameters | Llama 3.1 8B, Mistral 7B, Qwen2.5 7B | 8 GB | 16 GB RAM VPS (CPU inference) |
| 13–14B parameters | Llama 2 13B, Qwen2.5 14B | 16 GB | 32 GB RAM VPS or dedicated server |
| 32–70B parameters | Llama 3.1 70B, DeepSeek 67B | 64 GB | Dedicated server with GPU |
CPU inference is feasible for models up to 13B parameters with acceptable latency for batch or low-concurrency use cases. For real-time applications requiring under 500ms first-token latency, GPU acceleration is necessary for 7B+ models.
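The RAM figures in the table follow roughly from parameter count and quantisation width. The sketch below is a rule of thumb rather than a measurement: it assumes weight size is about parameters times bits per weight divided by 8, and the function name estimate_weights_gb is hypothetical. It ignores the KV cache and runtime overhead, which is why the table's minimums sit above the raw weight size.

```shell
# Rough estimate of model weight size in GB.
# Assumption: size ~= parameters (billions) * bits per weight / 8.
# Ignores KV cache, context buffers, and OS overhead.
estimate_weights_gb() {
  params_b=$1   # parameter count in billions, e.g. 8
  bits=$2       # bits per weight, e.g. 4 for a Q4 build, 8 for Q8
  awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_weights_gb 8 4    # 8B model at 4 bits per weight, prints 4.0
estimate_weights_gb 8 8    # same model at 8 bits per weight, prints 8.0
```

Real quantised files use slightly more than the nominal bit width, so treat the result as a lower bound when sizing a plan.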
Step 1: Provision Your Hong Kong VPS
For a production AI API endpoint hosting a 7B model with CPU inference, order a VPS with:
- 8+ vCPU cores (more cores = faster CPU inference throughput)
- 16+ GB RAM
- 100+ GB NVMe SSD (model weights for a 7B model in Q4 quantisation: ~4 GB)
- Ubuntu 22.04 LTS (best compatibility with AI tooling)
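Once the server is up, the core and RAM minimums above can be checked from the shell. This is a minimal sketch for Linux only; the function names check_min and preflight are hypothetical, and the thresholds are simply the 8-core and 16 GB figures from the list.

```shell
# Compare one measured value against a required minimum and report the result.
check_min() {
  label=$1; actual=$2; required=$3
  if [ "$actual" -ge "$required" ]; then
    echo "OK   $label: $actual (need $required)"
  else
    echo "LOW  $label: $actual (need $required)"
    return 1
  fi
}

# Read core count and total memory from the running system and check them.
# Reported memory is usually slightly below the nominal plan size,
# so the value is rounded to the nearest GB.
preflight() {
  cores=$(nproc)
  ram_gb=$(awk '/^MemTotal:/ { printf "%d", $2 / 1024 / 1024 + 0.5 }' /proc/meminfo)
  status=0
  check_min "vCPU cores" "$cores" 8 || status=1
  check_min "RAM (GB)" "$ram_gb" 16 || status=1
  return "$status"
}

# Usage on the VPS:
#   preflight && echo "Server meets the 7B CPU-inference minimums"
```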
Connect via SSH after provisioning:
ssh root@YOUR_SERVER_IP
Step 2: Install Ollama
Ollama is the simplest way to run open-source LLMs with an OpenAI-compatible API — no Python dependency management required.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
Ollama installs as a systemd service and starts automatically on boot.
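To confirm the service is actually answering requests, a small polling helper can be useful after installation or a restart. The sketch below assumes curl is installed and uses Ollama's /api/version endpoint on the default local port; the function name wait_for_ollama and its default values are hypothetical.

```shell
# Poll the Ollama version endpoint until it answers or the attempts run out.
# Arguments: base URL, number of attempts, seconds to sleep between attempts.
wait_for_ollama() {
  url=${1:-http://127.0.0.1:11434}
  tries=${2:-10}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl return non-zero on HTTP errors; -s silences progress output
    if curl -fs --max-time 5 "$url/api/version" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage on the VPS:
#   systemctl is-active ollama && wait_for_ollama && echo "Ollama is up"
```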
Pull Your First Model
# Pull Llama 3.1 8B (4.7 GB download in Q4 quantisation)
ollama pull llama3.1:8b
# Pull Qwen2.5 7B (good for Chinese-language tasks)
ollama pull qwen2.5:7b
# Pull Mistral 7B (strong general reasoning)
ollama pull mistral:7b
Test Inference Locally
ollama run llama3.1:8b "Explain CN2 GIA routing in simple terms"
Step 3: Expose Ollama as an OpenAI-Compatible API
By default, Ollama only listens on localhost:11434. To expose it externally with the OpenAI-compatible endpoint:
# Edit the Ollama systemd service
sudo systemctl edit ollama.service
Add the following override:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Then reload systemd and restart Ollama:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Test the OpenAI-compatible endpoint:
curl http://YOUR_SERVER_IP:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "Hello"}]
}'
Step 4: Add Nginx Reverse Proxy with HTTPS
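Never expose your inference API directly on port 11434 without authentication. Add Nginx as a reverse proxy with SSL and API key authentication. Because Nginx reaches Ollama over 127.0.0.1, also block external access to port 11434 with a firewall once the proxy is working; otherwise the 0.0.0.0 binding from Step 3 leaves the unauthenticated port open.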
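The Nginx configuration in this step uses the placeholder YOUR_SECRET_API_KEY, which needs a real value that is hard to guess. One way to produce it, sketched here with standard tools only, is to read 32 random bytes from /dev/urandom and print them as hex. The function name generate_api_key is hypothetical.

```shell
# Print a random 64-character hex string (32 bytes from /dev/urandom).
generate_api_key() {
  od -An -tx1 -N32 /dev/urandom | tr -d ' \n'
  echo
}

# Usage: generate a key once, then paste it in place of YOUR_SECRET_API_KEY
#   generate_api_key
```

Store the key outside version control and rotate it if it ever appears in logs or client code.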
sudo apt install nginx certbot python3-certbot-nginx -y
Create an Nginx configuration at /etc/nginx/sites-available/ai-api:
server {
    listen 80;
    server_name your-api-domain.com;
    return 301 https://$host$request_uri;
}
server {
    listen 443 ssl;
    server_name your-api-domain.com;
    ssl_certificate /etc/letsencrypt/live/your-api-domain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/your-api-domain.com/privkey.pem;
    location /v1/ {
        # Simple API key authentication
        if ($http_authorization != "Bearer YOUR_SECRET_API_KEY") {
            return 401;
        }
        proxy_pass http://127.0.0.1:11434/v1/;
        proxy_set_header Host $host;
        proxy_read_timeout 300s;
        proxy_buffering off;
    }
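}
# Obtain the certificate before enabling the site: the server block above
# references certificate files, and nginx -t fails until they exist
sudo certbot certonly --nginx -d your-api-domain.com
sudo ln -s /etc/nginx/sites-available/ai-api /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
Step 5: Set Up Open WebUI (Optional Chat Interface)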
Open WebUI provides a ChatGPT-like interface for your self-hosted models, which is useful for internal teams or end-user-facing applications.
sudo apt install docker.io docker-compose -y
docker run -d \
--name open-webui \
-p 3000:8080 \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
--add-host=host.docker.internal:host-gateway \
--restart always \
ghcr.io/open-webui/open-webui:main
Access Open WebUI at http://YOUR_SERVER_IP:3000. Add Nginx proxying with SSL using the same pattern as above.
Step 6: Performance Optimisation
CPU Thread Optimisation
# Set Ollama to use all available CPU threads
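sudo systemctl edit ollama.service
[Service]
Environment="OLLAMA_NUM_THREADS=8"
Set OLLAMA_NUM_THREADS to your vCPU count. More threads generally improve tokens-per-second throughput for CPU inference, up to the number of physical cores. Check that your Ollama version honours this variable, since it may not be among the server environment variables listed by ollama serve --help. The thread control that appears in Ollama's API documentation is the num_thread option, set in a Modelfile or in the options field of an API request.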
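Thread count can also be pinned per model through Ollama's num_thread parameter in a Modelfile. The sketch below only prints the Modelfile text; the function name make_cpu_modelfile and the derived model name llama3.1-8b-cpu in the usage comment are hypothetical.

```shell
# Print a Modelfile that derives a CPU-tuned variant from a base model.
# num_thread is the Ollama model parameter that sets the inference thread count.
make_cpu_modelfile() {
  base_model=$1
  threads=$2
  printf 'FROM %s\nPARAMETER num_thread %s\n' "$base_model" "$threads"
}

# Usage on the VPS (creates a new local model with the setting baked in):
#   make_cpu_modelfile llama3.1:8b "$(nproc)" > Modelfile
#   ollama create llama3.1-8b-cpu -f Modelfile
```

Benchmark a few thread counts before settling on one, since oversubscribing shared vCPUs can reduce throughput.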
Model Quantisation Selection
Model quantisation directly affects RAM usage and inference speed:
- Q2_K — smallest size, lowest quality. Use only for RAM-constrained environments
- Q4_K_M — best balance of size and quality for most use cases. Default recommendation
- Q8_0 — near-full quality, 2× the RAM of Q4. Use when quality is critical
# Pull specific quantisation
ollama pull llama3.1:8b-instruct-q4_K_M
Concurrent Request Handling
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=2"
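Environment="OLLAMA_NUM_PARALLEL=4"
OLLAMA_MAX_LOADED_MODELS caps how many models stay loaded in memory at once, and OLLAMA_NUM_PARALLEL controls how many simultaneous inference requests each loaded model handles. The value of 4 shown here suits GPU inference, where 4–8 is reasonable. For CPU inference, set it to 1–2 to avoid thrashing.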
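A quick way to see how a given OLLAMA_NUM_PARALLEL value behaves is to send several identical requests at once and count failures. The helper below is a generic sketch, not a load-testing tool: run_concurrent is a hypothetical name, and it simply starts N copies of any command in the background and waits for them.

```shell
# run_concurrent N CMD...: start N copies of CMD in the background,
# wait for all of them, and print how many exited non-zero.
run_concurrent() {
  n=$1; shift
  pids=""
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" >/dev/null 2>&1 &
    pids="$pids $!"
    i=$((i + 1))
  done
  failed=0
  for pid in $pids; do
    wait "$pid" || failed=$((failed + 1))
  done
  echo "$failed"
}

# Usage on the VPS (4 simultaneous chat requests against the local endpoint):
#   run_concurrent 4 curl -fs http://127.0.0.1:11434/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Hello"}]}'
```

Comparing wall-clock time for 1, 2, and 4 copies gives a rough picture of where CPU inference starts to thrash.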
Monitoring Your AI API
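Monitor inference latency, request volume, and resource usage. First check whether your Ollama build serves Prometheus metrics:
# Returns 404 on builds without a metrics endpoint
curl http://localhost:11434/metrics
If the endpoint is available, scrape it with Prometheus and visualise in Grafana (see our guide on deploying Grafana and Prometheus on Hong Kong VPS) to track tokens-per-second, queue depth, and error rates. A /metrics endpoint is not part of Ollama's documented API, so on builds without it, collect the same signals from the Nginx access log and from the timing fields (eval_count, eval_duration) that Ollama's native API returns with each response, or run a third-party exporter.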
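Tokens-per-second can also be computed directly from a single response, independently of any metrics endpoint. Ollama's native /api/generate response includes eval_count (tokens generated) and eval_duration (nanoseconds). The sketch below does only the arithmetic; the function name tokens_per_second is hypothetical, and the usage comment assumes jq is installed.

```shell
# Convert Ollama's eval_count (tokens generated) and eval_duration
# (nanoseconds spent generating) into tokens per second.
tokens_per_second() {
  awk -v n="$1" -v ns="$2" 'BEGIN {
    if (ns <= 0) { print "0.0"; exit }
    printf "%.1f\n", n / (ns / 1e9)
  }'
}

tokens_per_second 120 4000000000   # 120 tokens in 4 s, prints 30.0

# Usage on the VPS:
#   resp=$(curl -s http://127.0.0.1:11434/api/generate \
#     -d '{"model": "llama3.1:8b", "prompt": "Hello", "stream": false}')
#   tokens_per_second "$(echo "$resp" | jq .eval_count)" "$(echo "$resp" | jq .eval_duration)"
```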
Cost Comparison: Self-Hosted vs OpenAI API
| Scenario | OpenAI GPT-4o | Self-hosted Llama 3.1 8B on HK VPS |
|---|---|---|
| 1M input tokens/month | ~$5.00 | Infrastructure cost only |
| 10M input tokens/month | ~$50.00 | Same infrastructure cost |
| 100M input tokens/month | ~$500.00 | Same infrastructure cost |
| Data stays in HK | No (processed in US) | Yes |
| Latency to China | 150–300ms | 30–50ms (CN2 GIA) |
Because infrastructure cost is fixed regardless of token count, self-hosting becomes cost-effective once the monthly API bill exceeds the VPS price, typically at moderate volumes of 5–10M tokens/month.
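The break-even point is the fixed monthly infrastructure cost divided by the API price per million tokens. The sketch below works through that arithmetic. The $40/month VPS price is a hypothetical figure chosen for illustration, the $5 per million tokens comes from the table above, and the function name break_even_mtokens is hypothetical.

```shell
# Monthly token volume (in millions) at which API spend equals a fixed VPS bill.
break_even_mtokens() {
  vps_monthly_usd=$1     # fixed infrastructure cost per month
  api_usd_per_mtok=$2    # API price per million tokens
  awk -v v="$vps_monthly_usd" -v p="$api_usd_per_mtok" 'BEGIN {
    if (p <= 0) { print "n/a"; exit }
    printf "%.1f\n", v / p
  }'
}

break_even_mtokens 40 5   # hypothetical $40/month VPS vs ~$5 per 1M tokens, prints 8.0
```

Substitute your actual plan price and the current API rate, since both change over time.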
Conclusion
Hosting your own LLM API endpoint on a Hong Kong VPS delivers three compounding advantages: lower latency to mainland China via CN2 GIA routing, data sovereignty without US cloud jurisdiction, and a fixed infrastructure cost that does not rise with token volume.
Ollama makes the deployment accessible: going from installation to a running inference endpoint takes under 30 minutes. Adding Nginx SSL termination and API key authentication brings it to production standard.
For teams currently paying significant monthly API bills to OpenAI or running latency-sensitive AI applications for Chinese users, a Hong Kong VPS-hosted LLM is worth evaluating seriously in 2026.
Get started: Browse Hong Kong VPS plans — select a 16 GB RAM plan for 7B model hosting with comfortable headroom.