Local LLM Deployment Guide¶
Deploy MAESTRO with local Large Language Models for complete privacy, control, and customization using VLLM, SGLang, or other inference servers.
Overview¶
Running MAESTRO with local LLMs provides:
- Complete Privacy - No data leaves your infrastructure
- Full Control - Customize models and parameters
- Cost Efficiency - No API usage fees after initial setup
- Performance Tuning - Optimize for your specific hardware
- Model Selection - Choose from thousands of open models
Why Local LLMs?¶
Advantages¶
- Data Security - Sensitive research stays on-premise
- Customization - Fine-tune models for your domain
- Availability - No dependency on external services
- Cost Predictability - Fixed infrastructure costs
Considerations¶
- Hardware Requirements - Significant GPU resources needed
- Setup Complexity - More technical configuration
- Model Management - Manual updates and optimization
- Limited Context - Potentially smaller context windows than cloud models
Recommended Inference Servers¶
VLLM (Recommended)¶
VLLM offers the best balance of performance, features, and compatibility.
Key Features:
- High-throughput batch inference
- Tensor parallelism for large models
- PagedAttention for efficient memory use
- Structured generation with xgrammar or outlines
- OpenAI-compatible API
- Speculative decoding support
Installation: See the VLLM Installation Guide for detailed setup instructions.
SGLang¶
SGLang specializes in structured generation and complex prompting.
Key Features:
- Radically faster structured generation
- Advanced caching mechanisms
- Constrained decoding
- Multi-turn conversation optimization
- JSON mode enforcement
Installation: See the SGLang Installation Guide for setup instructions.
Other Options¶
- Ollama - Simple deployment, good for beginners
- LM Studio - GUI-based, easy model management
- Text Generation WebUI - Feature-rich interface
- llama.cpp - CPU-optimized, minimal resources
- TGI (Text Generation Inference) - Production-ready by HuggingFace
Note: MAESTRO only supports inference servers that expose an OpenAI-compatible API endpoint.
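Whichever server you choose, you can quickly confirm that it exposes an OpenAI-compatible API by querying the standard models route. This is a minimal sketch assuming the server runs locally on port 5000, as in the examples below:
# Any OpenAI-compatible server should answer on this route; adjust host and port to match your deployment
curl http://localhost:5000/v1/models
If this returns a JSON list that includes your served model name, MAESTRO will be able to connect to the server.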
Hardware Requirements¶
Minimum Specifications¶
For 7B-13B models:
- GPU: 1x RTX 3090 (24GB VRAM)
- RAM: 32GB system memory
- Storage: 20GB for models
- CPU: 8+ cores
Recommended Specifications¶
For 30B-70B models:
- GPU: 2-4x A100 (40-80GB) or 4x RTX 3090
- RAM: 64-128GB system memory
- Storage: 500GB NVMe SSD
- CPU: 16+ cores
Production Specifications¶
For 70B+ models with high throughput:
- GPU: 4-8x A100 or H100
- RAM: 256GB+ system memory
- Storage: 2TB+ NVMe RAID
- CPU: 32+ cores
- Network: 10Gbps for distributed inference
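As a rough sizing rule (weights only; leave 20-30% headroom for the KV cache and runtime overhead): VRAM in GB ≈ parameters in billions × bytes per parameter.
- 70B at FP16 (2.0 bytes/param): 70 × 2.0 ≈ 140GB, multi-GPU required
- 70B at 4-bit AWQ (≈0.5 bytes/param): 70 × 0.5 ≈ 35GB, in line with the AWQ figures in the comparison table below
- 8B at FP16: 8 × 2.0 ≈ 16GB, fits a single 24GB GPU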
VLLM Deployment Examples¶
The following deployment configurations have been extensively tested with MAESTRO and work well in practice. We've generated numerous research reports using these exact setups, covering a range of research styles and complexity levels.
View Real Examples: Check out our Example Reports section to see actual research reports generated by these models, showcasing their capabilities in different research scenarios and writing styles.
Basic Setup (7B Model)¶
Simple deployment for smaller models:
python -m vllm.entrypoints.openai.api_server \
--model "meta-llama/Llama-3.1-8B-Instruct" \
--port 5000 \
--host 0.0.0.0 \
--served-model-name "localmodel"
Medium Models (30-32B)¶
Qwen 3 32B AWQ¶
python -m vllm.entrypoints.openai.api_server \
--model "/path/to/models/Qwen_Qwen3-32B-AWQ" \
--tensor-parallel-size 4 \
--port 5000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.90 \
--served-model-name "localmodel-large" \
--disable-log-requests \
--disable-custom-all-reduce \
--enable-prefix-caching \
--guided-decoding-backend "xgrammar" \
--chat-template /path/to/qwen3_nonthinking.jinja
Note: This example uses a chat template that disables thinking for this model. See the Qwen documentation for this model to obtain the chat template.
Qwen 3 30B-A3B Instruct¶
python -m vllm.entrypoints.openai.api_server \
--model "/path/to/models/cpatonn-Qwen3-30B-A3B-Instruct-AWQ" \
--tensor-parallel-size 4 \
--port 5000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.85 \
--served-model-name "localmodel-large" \
--disable-log-requests \
--disable-custom-all-reduce \
--enable-prefix-caching \
--guided-decoding-backend "xgrammar" \
--max-model-len 150000
Large Models (70B+)¶
Qwen 2.5 72B with Speculative Decoding¶
python -m vllm.entrypoints.openai.api_server \
--model "/path/to/models/Qwen_Qwen2.5-72B-Instruct-AWQ" \
--tensor-parallel-size 4 \
--port 5000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.9 \
--served-model-name "localmodel-large" \
--disable-log-requests \
--disable-custom-all-reduce \
--guided-decoding-backend "xgrammar" \
--max-model-len 120000 \
--speculative-config '{
"model": "/path/to/models/Qwen_Qwen2.5-1.5B-Instruct-AWQ",
"num_speculative_tokens": 5
}'
GPT-OSS 120B¶
python -m vllm.entrypoints.openai.api_server \
--model "/path/to/models/openai_gpt-oss-120b" \
--tensor-parallel-size 4 \
--port 5000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.9 \
--served-model-name "localmodel-large" \
--disable-log-requests \
--disable-custom-all-reduce \
--guided-decoding-backend "xgrammar"
Alternative Models¶
Gemma 3 27B¶
python -m vllm.entrypoints.openai.api_server \
--model "/path/to/models/RedHatAI_gemma-3-27b-it-FP8-dynamic" \
--tensor-parallel-size 4 \
--port 5000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.9 \
--served-model-name "localmodel-large" \
--disable-log-requests \
--disable-custom-all-reduce \
--guided-decoding-backend "xgrammar" \
--max-model-len 120000
GPT-OSS 20B¶
python -m vllm.entrypoints.openai.api_server \
--model "/path/to/models/openai_gpt-oss-20b" \
--tensor-parallel-size 4 \
--port 5000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.9 \
--served-model-name "localmodel-large" \
--disable-log-requests \
--disable-custom-all-reduce \
--guided-decoding-backend "xgrammar"
VLLM Parameters Explained¶
Critical Parameters¶
- --tensor-parallel-size: Number of GPUs to split the model across
- --gpu-memory-utilization: Fraction of GPU memory to use (0.85-0.95)
- --max-model-len: Maximum context length (adjust based on VRAM)
- --served-model-name: Name to use in API calls
Performance Optimization¶
- --enable-prefix-caching: Cache common prefixes for faster inference
- --disable-log-requests: Reduce overhead by disabling request logging
- --disable-custom-all-reduce: Use NCCL for better multi-GPU performance
- --guided-decoding-backend "xgrammar": Enable structured generation
Advanced Features¶
- --speculative-config: Enable speculative decoding with a draft model
- --chat-template: Custom chat template for specific models
- --quantization: AWQ, GPTQ, or SqueezeLLM for compressed models
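The AWQ checkpoints in the examples above are typically auto-detected by VLLM, but the quantization method can also be set explicitly. A sketch assuming a GPTQ checkpoint at a placeholder path:
python -m vllm.entrypoints.openai.api_server \
--model "/path/to/models/your-model-GPTQ" \
--quantization gptq \
--tensor-parallel-size 2 \
--port 5000 \
--host 0.0.0.0 \
--served-model-name "localmodel"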
SGLang Deployment Examples¶
SGLang offers excellent performance for structured generation and can be a great alternative to VLLM, especially for models that benefit from its optimized scheduling.
Gemma 3 27B with SGLang¶
SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN="1" python -m sglang.launch_server \
--model-path /path/to/models/gemma-3-27b-it-gptq \
--tensor-parallel-size 4 \
--port 5000 \
--host 0.0.0.0 \
--mem-fraction-static 0.80 \
--served-model-name "localmodel-large" \
--schedule-policy lpm \
--chunked-prefill-size 4096 \
--schedule-conservativeness 0.3 \
--context-length 131072 \
--grammar-backend outlines
Qwen 3 32B with SGLang¶
SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN="1" python -m sglang.launch_server \
--model-path /path/to/models/Qwen_Qwen3-32B-AWQ \
--tensor-parallel-size 2 \
--port 5001 \
--host 0.0.0.0 \
--mem-fraction-static 0.80 \
--served-model-name "localmodel-large2" \
--schedule-policy lpm \
--chunked-prefill-size 4096 \
--schedule-conservativeness 0.3 \
--context-length 131072 \
--grammar-backend outlines \
--disable-custom-all-reduce
SGLang Parameters Explained¶
- SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN: Allows using longer context than the model default
- --mem-fraction-static: Static memory allocation fraction (0.80 = 80%)
- --schedule-policy lpm: Uses Longest Prefix Match scheduling for better batching
- --chunked-prefill-size: Chunk size for processing long prompts
- --schedule-conservativeness: Controls scheduling aggressiveness (lower = more aggressive)
- --grammar-backend outlines: Enables structured generation with Outlines
Configuring MAESTRO for Local LLMs¶
Step 1: Start Your Inference Server¶
Choose an inference server and start it with an appropriate model. For example, with VLLM:
# Example: Start Qwen 2.5 72B
python -m vllm.entrypoints.openai.api_server \
--model "Qwen/Qwen2.5-72B-Instruct" \
--tensor-parallel-size 4 \
--port 5000 \
--host 0.0.0.0 \
--served-model-name "localmodel"
Step 2: Configure MAESTRO Settings¶
- Navigate to Settings → AI Config
- Select "Custom Provider"
- Configure endpoints:
Provider Configuration:
- Provider: Custom Provider
- API Key: enter any dummy API key
- Base URL: http://localhost:5000/v1 (or use the network IP of the machine running VLLM)
Model Selection:
- Fast Model: localmodel
- Mid Model: localmodel
- Intelligent Model: localmodel
- Verifier Model: localmodel
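If MAESTRO runs inside Docker, localhost in the Base URL refers to the MAESTRO container itself rather than the machine running VLLM, so use the host's network IP instead. A quick reachability check (the IP address and container name below are placeholders for your own values):
# From the machine running MAESTRO, replace the IP with the VLLM host's address
curl http://192.168.1.50:5000/v1/models
# If MAESTRO runs in a container, test from inside it (requires curl in the image)
docker exec maestro-backend curl -s http://192.168.1.50:5000/v1/models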
Step 3: Test Connection¶
- Click "Test" button
- Verify successful connection
- Try a simple query in writing mode
Step 4: Optimize Settings¶
For local models, adjust Research Configuration:
- Max Questions: 10-15 (reduce for faster processing)
- Research Rounds: 2 (balance quality vs speed)
- Writing Passes: 2 (sufficient for most models)
- Context Limit: Based on model's max length
Model Selection Guide¶
By Size and Capability¶
Small Models (7-13B)¶
Best For: Quick tasks, summaries, simple research
These may struggle to produce valid structured responses for the research workflows.
- Llama 3.1 8B
- Mistral 7B
- Gemma 2 9B
Medium Models (20-34B)¶
Best For: General research, balanced performance
- Qwen 3 32B
- GPT-OSS 20B
- Gemma 3 27B
Large Models (70B+)¶
Best For: Complex research, best quality
- Qwen 2.5 72B
- GPT-OSS 120B
By Task Type¶
Research and Analysis¶
- Qwen 2.5 72B: Excellent reasoning, long context
- GPT-OSS 120B: Comprehensive understanding, tends to make tables and add equations from relevant sources
Creative Writing¶
- Gemma 3 27B: Limited imagination but few hallucinations; diverse, concise outputs
- Qwen 3 32B: Great for scientific writing at a small size
- GPT-OSS 20B: Great for scientific writing at a small size
Technical Documentation¶
- Qwen models: Precise, technical writing
- Gemma models: Concise and dry
- GPT-OSS models: Great at summarizing and pinpointing key information from diverse sources
Model Performance Comparison¶
Quality vs Speed Trade-offs¶
Model | Quality | Speed | VRAM Required (AWQ/4-bit) | Best Use Case |
---|---|---|---|---|
GPT-OSS 120B | Excellent | Moderate | 60-65GB | Comprehensive analysis |
Qwen 2.5 72B | Excellent | Slow | 36-40GB | Complex research |
Qwen 3 32B | Very Good | Moderate | 20-24GB | Balanced performance |
Gemma 3 27B | Very Good | Fast | 20-23GB | Concise scientific writing |
GPT-OSS 20B | Good | Fast | 16GB | Quick research |
Llama 3.1 8B | Good | Very Fast | 6-8GB | Simple tasks |
Example Research Reports¶
We've tested these models extensively with the exact VLLM configurations shown above. View actual research reports generated by these models to evaluate their capabilities:
View Reports by Model¶
Large Models (70B+):
- GPT-OSS 120B Examples - Comprehensive analysis with detailed tables and equations
- Qwen 2.5 72B Examples - High quality outputs for complex research
Medium Models (20-34B):
- Qwen 3 32B Examples - Great balance of quality and speed
- Qwen 3 30B-A3B Examples - MoE variant for faster generation; fails to produce valid structured outputs but can still complete research tasks
- Gemma 3 27B Examples - Concise, factual scientific writing
- GPT-OSS 20B Examples - Excellent technical summaries
Structured Generation¶
Structured generation ensures that LLMs produce valid JSON outputs required for MAESTRO's multi-agent research workflow. Without structured generation, models may produce malformed JSON that breaks the research pipeline.
Recommended backends:
- VLLM: Use --guided-decoding-backend "xgrammar" or --guided-decoding-backend "outlines"
- SGLang: Use --grammar-backend outlines for structured generation support
- Other servers: Check the documentation for JSON mode or structured output options
Structured generation is highly recommended for reliable operation with MAESTRO's complex agent interactions.
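To confirm that guided decoding is active, you can send a request that constrains the output to a JSON schema. This is a minimal sketch using VLLM's guided_json extra parameter; the exact parameter names and supported options vary between VLLM releases, so check the documentation for your version:
curl http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "localmodel",
    "messages": [{"role": "user", "content": "Give a title and a one-sentence summary of reinforcement learning."}],
    "guided_json": {"type": "object", "properties": {"title": {"type": "string"}, "summary": {"type": "string"}}, "required": ["title", "summary"]},
    "max_tokens": 200
  }'
A response whose content parses as JSON matching the schema indicates that structured generation is working end to end.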
Troubleshooting¶
Common Issues¶
Out of Memory (OOM)¶
Solutions:
- Reduce --gpu-memory-utilization to 0.8
- Decrease --max-model-len
- Use quantized models (AWQ, GPTQ)
- Add more GPUs with tensor parallelism
Slow Generation¶
Solutions:
- Enable --enable-prefix-caching
- Use speculative decoding
- Reduce batch size
- Use faster models
Connection Refused¶
Solutions:
- Verify server is running
- Check firewall rules
- Ensure correct port
- Verify host binding (0.0.0.0)
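These checks can be run from a shell on the VLLM host; they assume the default port 5000 used throughout this guide:
# Is the server listening on the expected port?
ss -tlnp | grep 5000
# Does the API respond locally?
curl http://localhost:5000/v1/models
If the local checks pass but remote requests fail, the firewall or a missing --host 0.0.0.0 binding is usually the culprit.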
Performance Optimization¶
Multi-GPU Setup¶
For models requiring multiple GPUs:
# Set CUDA devices
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Use tensor parallelism
--tensor-parallel-size 4
# Optimize NCCL
export NCCL_P2P_DISABLE=1 # If P2P not supported
Memory Optimization¶
# Reduce memory fragmentation
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# Enable memory efficient attention
--enable-prefix-caching
--enable-chunked-prefill
Throughput Optimization¶
Best Practices¶
Model Selection¶
- Start with smaller models - Test pipeline first
- Scale up gradually - Find optimal size
- Use quantization - AWQ/GPTQ for efficiency
- Match to task - Don't oversize unnecessarily
Next Steps¶
- Choose your model based on hardware and needs
- Deploy VLLM with recommended settings
- Configure MAESTRO to use local endpoint
- Test with examples from our Example Reports
- Optimize settings based on performance
- Scale as needed with additional resources