Local LLM Deployment Guide¶
Deploy MAESTRO with local Large Language Models for complete privacy, control, and customization using VLLM, SGLang, or other inference servers.
Overview¶
Running MAESTRO with local LLMs provides:
- Complete Privacy - No data leaves your infrastructure
- Full Control - Customize models and parameters
- Cost Efficiency - No API usage fees after initial setup
- Performance Tuning - Optimize for your specific hardware
- Model Selection - Choose from thousands of open models
Why Local LLMs?¶
Advantages¶
- Data Security - Sensitive research stays on-premise
- Customization - Fine-tune models for your domain
- Availability - No dependency on external services
- Cost Predictability - Fixed infrastructure costs
Considerations¶
- Hardware Requirements - Significant GPU resources needed
- Setup Complexity - More technical configuration
- Model Management - Manual updates and optimization
- Limited Context - Potentially smaller context windows than cloud models
Recommended Inference Servers¶
VLLM (Recommended)¶
VLLM offers the best balance of performance, features, and compatibility.
Key Features:
- High-throughput batch inference
- Tensor parallelism for large models
- PagedAttention for efficient memory use
- Structured generation with xgrammar or outlines
- OpenAI-compatible API
- Speculative decoding support
Installation: See the VLLM Installation Guide for detailed setup instructions.
SGLang¶
SGLang specializes in structured generation and complex prompting.
Key Features:
- Radically faster structured generation
- Advanced caching mechanisms
- Constrained decoding
- Multi-turn conversation optimization
- JSON mode enforcement
Installation: See the SGLang Installation Guide for setup instructions.
Other Options¶
- Ollama - Simple deployment, good for beginners
- LM Studio - GUI-based, easy model management
- Text Generation WebUI - Feature-rich interface
- llama.cpp - CPU-optimized, minimal resources
- TGI (Text Generation Inference) - Production-ready by HuggingFace
Note: MAESTRO only supports inference servers that expose an OpenAI-compatible API endpoint.
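Whichever server you choose, you can quickly confirm that it exposes an OpenAI-compatible API by querying the standard models route. This is a minimal sketch assuming the server runs locally on port 5000, as in the examples below:
# Any OpenAI-compatible server should answer on this route; adjust host and port to match your deployment
curl http://localhost:5000/v1/models
If this returns a JSON list that includes your served model name, MAESTRO will be able to connect to the server.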
Hardware Requirements¶
Minimum Specifications¶
For 7B-13B models:
- GPU: 1x RTX 3090 (24GB VRAM)
- RAM: 32GB system memory
- Storage: 20GB for models
- CPU: 8+ cores
Recommended Specifications¶
For 30B-70B models:
- GPU: 2-4x A100 (40-80GB) or 4x RTX 3090
- RAM: 64-128GB system memory
- Storage: 500GB NVMe SSD
- CPU: 16+ cores
Production Specifications¶
For 70B+ models with high throughput:
- GPU: 4-8x A100 or H100
- RAM: 256GB+ system memory
- Storage: 2TB+ NVMe RAID
- CPU: 32+ cores
- Network: 10Gbps for distributed inference
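As a rough sizing rule (weights only; leave 20-30% headroom for the KV cache and runtime overhead): VRAM in GB ≈ parameters in billions × bytes per parameter.
- 70B at FP16 (2.0 bytes/param): 70 × 2.0 ≈ 140GB, multi-GPU required
- 70B at 4-bit AWQ (≈0.5 bytes/param): 70 × 0.5 ≈ 35GB, in line with the AWQ figures in the comparison table below
- 8B at FP16: 8 × 2.0 ≈ 16GB, fits a single 24GB GPU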
VLLM Deployment Examples¶
The following deployment configurations have been extensively tested with MAESTRO and work well in practice. We've generated numerous research reports using these exact setups, covering a range of research styles and complexity levels.
View Real Examples: Check out our Example Reports section to see actual research reports generated by these models, showcasing their capabilities in different research scenarios and writing styles.
Basic Setup (7B Model)¶
Simple deployment for smaller models:
python -m vllm.entrypoints.openai.api_server \
--model "meta-llama/Llama-3.1-8B-Instruct" \
--port 5000 \
--host 0.0.0.0 \
--served-model-name "localmodel"
Medium Models (30-32B)¶
Qwen 3 32B AWQ¶
python -m vllm.entrypoints.openai.api_server \
--model "/path/to/models/Qwen_Qwen3-32B-AWQ" \
--tensor-parallel-size 4 \
--port 5000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.90 \
--served-model-name "localmodel-large" \
--disable-log-requests \
--disable-custom-all-reduce \
--enable-prefix-caching \
--guided-decoding-backend "xgrammar" \
--chat-template /path/to/qwen3_nonthinking.jinja
Note: This example uses a chat template that disables thinking for this model. See the Qwen documentation for this model to obtain the chat template.
Qwen 3 30B-A3B Instruct¶
python -m vllm.entrypoints.openai.api_server \
--model "/path/to/models/cpatonn-Qwen3-30B-A3B-Instruct-AWQ" \
--tensor-parallel-size 4 \
--port 5000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.85 \
--served-model-name "localmodel-large" \
--disable-log-requests \
--disable-custom-all-reduce \
--enable-prefix-caching \
--guided-decoding-backend "xgrammar" \
--max-model-len 150000
Large Models (70B+)¶
Qwen 2.5 72B with Speculative Decoding¶
python -m vllm.entrypoints.openai.api_server \
--model "/path/to/models/Qwen_Qwen2.5-72B-Instruct-AWQ" \
--tensor-parallel-size 4 \
--port 5000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.9 \
--served-model-name "localmodel-large" \
--disable-log-requests \
--disable-custom-all-reduce \
--guided-decoding-backend "xgrammar" \
--max-model-len 120000 \
--speculative-config '{
"model": "/path/to/models/Qwen_Qwen2.5-1.5B-Instruct-AWQ",
"num_speculative_tokens": 5
}'
GPT-OSS 120B¶
python -m vllm.entrypoints.openai.api_server \
--model "/path/to/models/openai_gpt-oss-120b" \
--tensor-parallel-size 4 \
--port 5000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.9 \
--served-model-name "localmodel-large" \
--disable-log-requests \
--disable-custom-all-reduce \
--guided-decoding-backend "xgrammar"
Alternative Models¶
Gemma 3 27B¶
python -m vllm.entrypoints.openai.api_server \
--model "/path/to/models/RedHatAI_gemma-3-27b-it-FP8-dynamic" \
--tensor-parallel-size 4 \
--port 5000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.9 \
--served-model-name "localmodel-large" \
--disable-log-requests \
--disable-custom-all-reduce \
--guided-decoding-backend "xgrammar" \
--max-model-len 120000
GPT-OSS 20B¶
python -m vllm.entrypoints.openai.api_server \
--model "/path/to/models/openai_gpt-oss-20b" \
--tensor-parallel-size 4 \
--port 5000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.9 \
--served-model-name "localmodel-large" \
--disable-log-requests \
--disable-custom-all-reduce \
--guided-decoding-backend "xgrammar"
VLLM Parameters Explained¶
Critical Parameters¶
- --tensor-parallel-size: Number of GPUs to split the model across
- --gpu-memory-utilization: Fraction of GPU memory to use (0.85-0.95)
- --max-model-len: Maximum context length (adjust based on VRAM)
- --served-model-name: Name to use in API calls
Performance Optimization¶
- --enable-prefix-caching: Cache common prefixes for faster inference
- --disable-log-requests: Reduce overhead by disabling request logging
- --disable-custom-all-reduce: Use NCCL for better multi-GPU performance
- --guided-decoding-backend "xgrammar": Enable structured generation
Advanced Features¶
- --speculative-config: Enable speculative decoding with a draft model
- --chat-template: Custom chat template for specific models
- --quantization: AWQ, GPTQ, or SqueezeLLM for compressed models
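The AWQ checkpoints in the examples above are typically auto-detected by VLLM, but the quantization method can also be set explicitly. A sketch assuming a GPTQ checkpoint at a placeholder path:
python -m vllm.entrypoints.openai.api_server \
--model "/path/to/models/your-model-GPTQ" \
--quantization gptq \
--tensor-parallel-size 2 \
--port 5000 \
--host 0.0.0.0 \
--served-model-name "localmodel"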
SGLang Deployment Examples¶
SGLang offers excellent performance for structured generation and can be a great alternative to VLLM, especially for models that benefit from its optimized scheduling.
Gemma 3 27B with SGLang¶
SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN="1" python -m sglang.launch_server \
--model-path /path/to/models/gemma-3-27b-it-gptq \
--tensor-parallel-size 4 \
--port 5000 \
--host 0.0.0.0 \
--mem-fraction-static 0.80 \
--served-model-name "localmodel-large" \
--schedule-policy lpm \
--chunked-prefill-size 4096 \
--schedule-conservativeness 0.3 \
--context-length 131072 \
--grammar-backend outlines
Qwen 3 32B with SGLang¶
SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN="1" python -m sglang.launch_server \
--model-path /path/to/models/Qwen_Qwen3-32B-AWQ \
--tensor-parallel-size 2 \
--port 5001 \
--host 0.0.0.0 \
--mem-fraction-static 0.80 \
--served-model-name "localmodel-large2" \
--schedule-policy lpm \
--chunked-prefill-size 4096 \
--schedule-conservativeness 0.3 \
--context-length 131072 \
--grammar-backend outlines \
--disable-custom-all-reduce
SGLang Parameters Explained¶
- SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN: Allows using longer context than the model default
- --mem-fraction-static: Static memory allocation fraction (0.80 = 80%)
- --schedule-policy lpm: Uses Longest Prefix Match scheduling for better batching
- --chunked-prefill-size: Chunk size for processing long prompts
- --schedule-conservativeness: Controls scheduling aggressiveness (lower = more aggressive)
- --grammar-backend outlines: Enables structured generation with Outlines
Configuring MAESTRO for Local LLMs¶
Step 1: Start Your Inference Server¶
Choose an inference server and start it with an appropriate model. For example, with VLLM:
# Example: Start Qwen 2.5 72B
python -m vllm.entrypoints.openai.api_server \
--model "Qwen/Qwen2.5-72B-Instruct" \
--tensor-parallel-size 4 \
--port 5000 \
--host 0.0.0.0 \
--served-model-name "localmodel"
Step 2: Configure MAESTRO Settings¶
- Navigate to Settings → AI Config
- Select "Custom Provider"
- Configure endpoints:
Provider Configuration:
- Provider: Custom Provider
- API Key: enter any dummy API key
- Base URL: http://localhost:5000/v1 (or use the network IP of the machine running VLLM)
Model Selection:
- Fast Model: localmodel
- Mid Model: localmodel
- Intelligent Model: localmodel
- Verifier Model: localmodel
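If MAESTRO runs inside Docker, localhost in the Base URL refers to the MAESTRO container itself rather than the machine running VLLM, so use the host's network IP instead. A quick reachability check (the IP address and container name below are placeholders for your own values):
# From the machine running MAESTRO, replace the IP with the VLLM host's address
curl http://192.168.1.50:5000/v1/models
# If MAESTRO runs in a container, test from inside it (requires curl in the image)
docker exec maestro-backend curl -s http://192.168.1.50:5000/v1/models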
Step 3: Test Connection¶
- Click "Test" button
- Verify successful connection
- Try a simple query in writing mode
Step 4: Optimize Settings¶
For local models, adjust Research Configuration:
- Max Questions: 10-15 (reduce for faster processing)
- Research Rounds: 2 (balance quality vs speed)
- Writing Passes: 2 (sufficient for most models)
- Context Limit: Based on model's max length
Model Selection Guide¶
By Size and Capability¶
Small Models (7-13B)¶
Best For: Quick tasks, summaries, simple research
These may struggle to produce valid structured responses for the research workflows.
- Llama 3.1 8B
- Mistral 7B
- Gemma 2 9B
Medium Models (20-34B)¶
Best For: General research, balanced performance
- Qwen 3 32B
- GPT-OSS 20B
- Gemma 3 27B
Large Models (70B+)¶
Best For: Complex research, best quality
- Qwen 2.5 72B
- GPT-OSS 120B
By Task Type¶
Research and Analysis¶
- Qwen 2.5 72B: Excellent reasoning, long context
- GPT-OSS 120B: Comprehensive understanding, tends to make tables and add equations from relevant sources
Creative Writing¶
- Gemma 3 27B: Limited imagination but few hallucinations; diverse, concise outputs
- Qwen 3 32B: Great for scientific writing at a small size
- GPT-OSS 20B: Great for scientific writing at a small size
Technical Documentation¶
- Qwen models: Precise, technical writing
- Gemma models: Concise and dry
- GPT-OSS models: Great at summarizing and pinpointing key information from diverse sources
Model Performance Comparison¶
Quality vs Speed Trade-offs¶
Model | Quality | Speed | VRAM Required (AWQ/4-bit) | Best Use Case |
---|---|---|---|---|
GPT-OSS 120B | Excellent | Moderate | 60-65GB | Comprehensive analysis |
Qwen 2.5 72B | Excellent | Slow | 36-40GB | Complex research |
Qwen 3 32B | Very Good | Moderate | 20-24GB | Balanced performance |
Gemma 3 27B | Very Good | Fast | 20-23GB | Concise scientific writing |
GPT-OSS 20B | Good | Fast | 16GB | Quick research |
Llama 3.1 8B | Good | Very Fast | 6-8GB | Simple tasks |
Example Research Reports¶
We've tested these models extensively with the exact VLLM configurations shown above. View actual research reports generated by these models to evaluate their capabilities:
View Reports by Model¶
Large Models (70B+):
- GPT-OSS 120B Examples - Comprehensive analysis with detailed tables and equations
- Qwen 2.5 72B Examples - High quality outputs for complex research
Medium Models (20-34B):
- Qwen 3 32B Examples - Great balance of quality and speed
- Qwen 3 30B-A3B Examples - MoE variant for faster generation; fails to produce valid structured outputs but can still complete research tasks
- Gemma 3 27B Examples - Concise, factual scientific writing
- GPT-OSS 20B Examples - Excellent technical summaries
Structured Generation¶
Structured generation ensures that LLMs produce valid JSON outputs required for MAESTRO's multi-agent research workflow. Without structured generation, models may produce malformed JSON that breaks the research pipeline.
Recommended backends:
- VLLM: Use --guided-decoding-backend "xgrammar" or --guided-decoding-backend "outlines"
- SGLang: Use --grammar-backend outlines for structured generation support
- Other servers: Check the documentation for JSON mode or structured output options
Structured generation is highly recommended for reliable operation with MAESTRO's complex agent interactions.
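To confirm that guided decoding is active, you can send a request that constrains the output to a JSON schema. This is a minimal sketch using VLLM's guided_json extra parameter; the exact parameter names and supported options vary between VLLM releases, so check the documentation for your version:
curl http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "localmodel",
    "messages": [{"role": "user", "content": "Give a title and a one-sentence summary of reinforcement learning."}],
    "guided_json": {"type": "object", "properties": {"title": {"type": "string"}, "summary": {"type": "string"}}, "required": ["title", "summary"]},
    "max_tokens": 200
  }'
A response whose content parses as JSON matching the schema indicates that structured generation is working end to end.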
Troubleshooting¶
Common Issues¶
Out of Memory (OOM)¶
Solutions:
- Reduce --gpu-memory-utilization to 0.8
- Decrease --max-model-len
- Use quantized models (AWQ, GPTQ)
- Add more GPUs with tensor parallelism
Slow Generation¶
Solutions:
- Enable --enable-prefix-caching
- Use speculative decoding
- Reduce batch size
- Use faster models
Connection Refused¶
Solutions:
- Verify server is running
- Check firewall rules
- Ensure correct port
- Verify host binding (0.0.0.0)
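These checks can be run from a shell on the VLLM host; they assume the default port 5000 used throughout this guide:
# Is the server listening on the expected port?
ss -tlnp | grep 5000
# Does the API respond locally?
curl http://localhost:5000/v1/models
If the local checks pass but remote requests fail, the firewall or a missing --host 0.0.0.0 binding is usually the culprit.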
Performance Optimization¶
Multi-GPU Setup¶
For models requiring multiple GPUs:
# Set CUDA devices
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Use tensor parallelism
--tensor-parallel-size 4
# Optimize NCCL
export NCCL_P2P_DISABLE=1 # If P2P not supported
Memory Optimization¶
# Reduce memory fragmentation
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# Enable memory efficient attention
--enable-prefix-caching
--enable-chunked-prefill
Throughput Optimization¶
Best Practices¶
Model Selection¶
- Start with smaller models - Test pipeline first
- Scale up gradually - Find optimal size
- Use quantization - AWQ/GPTQ for efficiency
- Match to task - Don't oversize unnecessarily
Next Steps¶
- Choose your model based on hardware and needs
- Deploy VLLM with recommended settings
- Configure MAESTRO to use local endpoint
- Test with examples from our Example Reports
- Optimize settings based on performance
- Scale as needed with additional resources