Local LLM Deployment Guide

Deploy MAESTRO with local Large Language Models for complete privacy, control, and customization using VLLM, SGLang, or other inference servers.

Overview

Running MAESTRO with local LLMs provides:

  • Complete Privacy - No data leaves your infrastructure
  • Full Control - Customize models and parameters
  • Cost Efficiency - No API usage fees after initial setup
  • Performance Tuning - Optimize for your specific hardware
  • Model Selection - Choose from thousands of open models

Why Local LLMs?

Advantages

  • Data Security - Sensitive research stays on-premise
  • Customization - Fine-tune models for your domain
  • Availability - No dependency on external services
  • Cost Predictability - Fixed infrastructure costs

Considerations

  • Hardware Requirements - Significant GPU resources needed
  • Setup Complexity - More technical configuration
  • Model Management - Manual updates and optimization
  • Limited Context - Potentially smaller context windows than cloud models

VLLM (Recommended)

VLLM offers the best balance of performance, features, and compatibility.

Key Features:

  • High-throughput batch inference
  • Tensor parallelism for large models
  • PagedAttention for efficient memory use
  • Structured generation with xgrammar or outlines
  • OpenAI-compatible API
  • Speculative decoding support

Installation: See the VLLM Installation Guide for detailed setup instructions.

SGLang

SGLang specializes in structured generation and complex prompting.

Key Features:

  • Radically faster structured generation
  • Advanced caching mechanisms
  • Constrained decoding
  • Multi-turn conversation optimization
  • JSON mode enforcement

Installation: See the SGLang Installation Guide for setup instructions.

Other Options

  • Ollama - Simple deployment, good for beginners
  • LM Studio - GUI-based, easy model management
  • Text Generation WebUI - Feature-rich interface
  • llama.cpp - CPU-optimized, minimal resources
  • TGI (Text Generation Inference) - Production-ready by HuggingFace

Note: MAESTRO only supports OpenAI-compatible API endpoints.
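
If you are unsure whether a server qualifies, a quick check is to query the standard OpenAI-style model listing endpoint (host and port below are placeholders for your deployment):

# An OpenAI-compatible server responds with a JSON object whose "data"
# array lists the available model IDs
curl http://localhost:5000/v1/models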

Hardware Requirements

Minimum Specifications

For 7B-13B models:

  • GPU: 1x RTX 3090 (24GB VRAM)
  • RAM: 32GB system memory
  • Storage: 20GB for models
  • CPU: 8+ cores

For 30B-70B models:

  • GPU: 2-4x A100 (40-80GB) or 4x RTX 3090
  • RAM: 64-128GB system memory
  • Storage: 500GB NVMe SSD
  • CPU: 16+ cores

Production Specifications

For 70B+ models with high throughput:

  • GPU: 4-8x A100 or H100
  • RAM: 256GB+ system memory
  • Storage: 2TB+ NVMe RAID
  • CPU: 32+ cores
  • Network: 10Gbps for distributed inference

VLLM Deployment Examples

The following deployment configurations have been extensively tested and work well with MAESTRO. We've generated numerous research reports using these exact setups, demonstrating their effectiveness across various research styles and complexity levels.

View Real Examples: Check out our Example Reports section to see actual research reports generated by these models, showcasing their capabilities in different research scenarios and writing styles.

Basic Setup (7B Model)

Simple deployment for smaller models:

python -m vllm.entrypoints.openai.api_server \
    --model "meta-llama/Llama-3.1-8B-Instruct" \
    --port 5000 \
    --host 0.0.0.0 \
    --served-model-name "localmodel"
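
Once the server is running, you can verify it end to end with a plain OpenAI-style chat completion. A minimal sketch, using the port and served model name from the command above:

# Send a test chat completion to the served model
curl http://localhost:5000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "localmodel",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
    }'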

Medium Models (30-32B)

Qwen 3 32B AWQ

python -m vllm.entrypoints.openai.api_server \
    --model "/path/to/models/Qwen_Qwen3-32B-AWQ" \
    --tensor-parallel-size 4 \
    --port 5000 \
    --host 0.0.0.0 \
    --gpu-memory-utilization 0.90 \
    --served-model-name "localmodel-large" \
    --disable-log-requests \
    --disable-custom-all-reduce \
    --enable-prefix-caching \
    --guided-decoding-backend "xgrammar" \
    --chat-template /path/to/qwen3_nonthinking.jinja

Note: The example above uses a chat template that disables thinking mode for this model. See the Qwen documentation for this model to obtain the chat template.

Qwen 3 30B-A3B Instruct

python -m vllm.entrypoints.openai.api_server \
    --model "/path/to/models/cpatonn-Qwen3-30B-A3B-Instruct-AWQ" \
    --tensor-parallel-size 4 \
    --port 5000 \
    --host 0.0.0.0 \
    --gpu-memory-utilization 0.85 \
    --served-model-name "localmodel-large" \
    --disable-log-requests \
    --disable-custom-all-reduce \
    --enable-prefix-caching \
    --guided-decoding-backend "xgrammar" \
    --max-model-len 150000

Large Models (70B+)

Qwen 2.5 72B with Speculative Decoding

python -m vllm.entrypoints.openai.api_server \
    --model "/path/to/models/Qwen_Qwen2.5-72B-Instruct-AWQ" \
    --tensor-parallel-size 4 \
    --port 5000 \
    --host 0.0.0.0 \
    --gpu-memory-utilization 0.9 \
    --served-model-name "localmodel-large" \
    --disable-log-requests \
    --disable-custom-all-reduce \
    --guided-decoding-backend "xgrammar" \
    --max-model-len 120000 \
    --speculative-config '{
        "model": "/path/to/models/Qwen_Qwen2.5-1.5B-Instruct-AWQ",
        "num_speculative_tokens": 5
    }'

GPT-OSS 120B

python -m vllm.entrypoints.openai.api_server \
    --model "/path/to/models/openai_gpt-oss-120b" \
    --tensor-parallel-size 4 \
    --port 5000 \
    --host 0.0.0.0 \
    --gpu-memory-utilization 0.9 \
    --served-model-name "localmodel-large" \
    --disable-log-requests \
    --disable-custom-all-reduce \
    --guided-decoding-backend "xgrammar"

Alternative Models

Gemma 3 27B

python -m vllm.entrypoints.openai.api_server \
    --model "/path/to/models/RedHatAI_gemma-3-27b-it-FP8-dynamic" \
    --tensor-parallel-size 4 \
    --port 5000 \
    --host 0.0.0.0 \
    --gpu-memory-utilization 0.9 \
    --served-model-name "localmodel-large" \
    --disable-log-requests \
    --disable-custom-all-reduce \
    --guided-decoding-backend "xgrammar" \
    --max-model-len 120000

GPT-OSS 20B

python -m vllm.entrypoints.openai.api_server \
    --model "/path/to/models/openai_gpt-oss-20b" \
    --tensor-parallel-size 4 \
    --port 5000 \
    --host 0.0.0.0 \
    --gpu-memory-utilization 0.9 \
    --served-model-name "localmodel-large" \
    --disable-log-requests \
    --disable-custom-all-reduce \
    --guided-decoding-backend "xgrammar"

VLLM Parameters Explained

Critical Parameters

  • --tensor-parallel-size: Number of GPUs to split model across
  • --gpu-memory-utilization: Fraction of GPU memory to use (0.85-0.95)
  • --max-model-len: Maximum context length (adjust based on VRAM)
  • --served-model-name: Name to use in API calls

Performance Optimization

  • --enable-prefix-caching: Cache common prefixes for faster inference
  • --disable-log-requests: Reduce overhead by disabling request logging
  • --disable-custom-all-reduce: Disable VLLM's custom all-reduce kernel and fall back to NCCL, which can be more stable on some multi-GPU setups
  • --guided-decoding-backend "xgrammar": Enable structured generation

Advanced Features

  • --speculative-config: Enable speculative decoding with draft model
  • --chat-template: Custom chat template for specific models
  • --quantization: AWQ, GPTQ, or SqueezeLLM for compressed models
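
For example, if a quantized checkpoint is not detected automatically, the quantization method can be passed explicitly. This is a sketch with a placeholder path; recent VLLM releases usually infer AWQ/GPTQ from the model config, so the flag is often optional:

python -m vllm.entrypoints.openai.api_server \
    --model "/path/to/models/your-awq-model" \
    --quantization awq \
    --port 5000 \
    --host 0.0.0.0 \
    --served-model-name "localmodel"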

SGLang Deployment Examples

SGLang offers excellent performance for structured generation and can be a great alternative to VLLM, especially for models that benefit from its optimized scheduling.

Gemma 3 27B with SGLang

SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN="1" python -m sglang.launch_server \
    --model-path /path/to/models/gemma-3-27b-it-gptq \
    --tensor-parallel-size 4 \
    --port 5000 \
    --host 0.0.0.0 \
    --mem-fraction-static 0.80 \
    --served-model-name "localmodel-large" \
    --schedule-policy lpm \
    --chunked-prefill-size 4096 \
    --schedule-conservativeness 0.3 \
    --context-length 131072 \
    --grammar-backend outlines

Qwen 3 32B with SGLang

SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN="1" python -m sglang.launch_server \
    --model-path /path/to/models/Qwen_Qwen3-32B-AWQ \
    --tensor-parallel-size 2 \
    --port 5001 \
    --host 0.0.0.0 \
    --mem-fraction-static 0.80 \
    --served-model-name "localmodel-large2" \
    --schedule-policy lpm \
    --chunked-prefill-size 4096 \
    --schedule-conservativeness 0.3 \
    --context-length 131072 \
    --grammar-backend outlines \
    --disable-custom-all-reduce

SGLang Parameters Explained

  • SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN: Allows using a longer context than the model default
  • --mem-fraction-static: Static memory allocation fraction (0.80 = 80%)
  • --schedule-policy lpm: Uses Longest Prefix Match scheduling for better batching
  • --chunked-prefill-size: Chunk size for processing long prompts
  • --schedule-conservativeness: Controls scheduling aggressiveness (lower = more aggressive)
  • --grammar-backend outlines: Enables structured generation with Outlines

Configuring MAESTRO for Local LLMs

Step 1: Start Your Inference Server

Start your VLLM server with an appropriate model:

# Example: Start Qwen 2.5 72B
python -m vllm.entrypoints.openai.api_server \
    --model "Qwen/Qwen2.5-72B-Instruct" \
    --tensor-parallel-size 4 \
    --port 5000 \
    --host 0.0.0.0 \
    --served-model-name "localmodel"

Step 2: Configure MAESTRO Settings

  1. Navigate to Settings → AI Config
  2. Select "Custom Provider"
  3. Configure endpoints:

Provider Configuration:

  • Provider: Custom Provider
  • API Key: enter any dummy value (local servers started without an API key typically ignore it)
  • Base URL: http://localhost:5000/v1, or the network IP of the machine running VLLM

Model Selection:

  • Fast Model: localmodel
  • Mid Model: localmodel
  • Intelligent Model: localmodel
  • Verifier Model: localmodel

Step 3: Test Connection

  1. Click the "Test" button
  2. Verify the connection succeeds (see the quick check below if it fails)
  3. Try a simple query in writing mode
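
If the test fails, it can help to hit the endpoint directly from the machine running MAESTRO. A hedged sketch, assuming your server is reachable at the placeholder address below and that the dummy key is simply passed through:

# Confirm the Base URL is reachable and the served model name is listed
curl -H "Authorization: Bearer dummy-key" http://<vllm-host>:5000/v1/models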

Step 4: Optimize Settings

For local models, adjust Research Configuration:

  • Max Questions: 10-15 (reduce for faster processing)
  • Research Rounds: 2 (balance quality vs speed)
  • Writing Passes: 2 (sufficient for most models)
  • Context Limit: Based on model's max length

Model Selection Guide

By Size and Capability

Small Models (7-13B)

Best For: Quick tasks, summaries, simple research

These may struggle to produce the valid structured responses required by the research workflow.

  • Llama 3.1 8B
  • Mistral 7B
  • Gemma 2 9B

Medium Models (20-34B)

Best For: General research, balanced performance

  • Qwen 3 32B
  • GPT-OSS 20B
  • Gemma 3 27B

Large Models (70B+)

Best For: Complex research, best quality

  • Qwen 2.5 72B
  • GPT-OSS 120B

By Task Type

Research and Analysis

  • Qwen 2.5 72B: Excellent reasoning, long context
  • GPT-OSS 120B: Comprehensive understanding; tends to include tables and equations drawn from relevant sources

Creative Writing

  • Gemma 3 27B: Less imaginative but also less prone to hallucination; diverse, concise outputs
  • Qwen 3 32B: Great for scientific writing at a small size
  • GPT-OSS 20B: Great for scientific writing at a small size

Technical Documentation

  • Qwen models: Precise, technical writing
  • Gemma models: Concise and dry
  • GPT-OSS models: Great at summarizing and pinpointing key information from diverse sources

Model Performance Comparison

Quality vs Speed Trade-offs

| Model | Quality | Speed | VRAM Required (AWQ/4-bit) | Best Use Case |
|---|---|---|---|---|
| GPT-OSS 120B | Excellent | Moderate | 60-65GB | Comprehensive analysis |
| Qwen 2.5 72B | Excellent | Slow | 36-40GB | Complex research |
| Qwen 3 32B | Very Good | Moderate | 20-24GB | Balanced performance |
| Gemma 3 27B | Very Good | Fast | 20-23GB | Concise scientific writing |
| GPT-OSS 20B | Good | Fast | 16GB | Quick research |
| Llama 3.1 8B | Good | Very Fast | 6-8GB | Simple tasks |

Example Research Reports

We've tested these models extensively with the exact VLLM configurations shown above. View actual research reports generated by these models to evaluate their capabilities:

View Reports by Model

Large Models (70B+):

Medium Models (20-34B):

Structured Generation

Structured generation ensures that LLMs produce valid JSON outputs required for MAESTRO's multi-agent research workflow. Without structured generation, models may produce malformed JSON that breaks the research pipeline.

Recommended backends:

  • VLLM: Use --guided-decoding-backend "xgrammar" or --guided-decoding-backend "outlines"
  • SGLang: Use --grammar-backend outlines for structured generation support
  • Other servers: Check documentation for JSON mode or structured output options

Structured generation is highly recommended for reliable operation with MAESTRO's complex agent interactions.
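
A quick way to confirm that structured output is working is to request JSON directly through the OpenAI-compatible API. The sketch below uses the response_format JSON mode field; whether it is honored depends on your server version and the guided-decoding backend you enabled, so verify against your server's documentation:

# Ask the server to constrain the reply to valid JSON
curl http://localhost:5000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "localmodel",
        "messages": [{"role": "user", "content": "Return a JSON object with keys title and summary about solar power."}],
        "response_format": {"type": "json_object"}
    }'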

Troubleshooting

Common Issues

Out of Memory (OOM)

Solutions:

  • Reduce --gpu-memory-utilization to 0.8
  • Decrease --max-model-len
  • Use quantized models (AWQ, GPTQ)
  • Add more GPUs with tensor parallelism

Slow Generation

Solutions:

  • Enable --enable-prefix-caching
  • Use speculative decoding
  • Reduce batch size
  • Use faster models

Connection Refused

Solutions:

  • Verify server is running
  • Check firewall rules
  • Ensure correct port
  • Verify host binding (0.0.0.0)
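
A couple of standard shell checks can narrow down which of these applies (the port and host are examples, not MAESTRO-specific values):

# Is anything listening on the expected port on the server machine?
ss -ltn | grep 5000

# Can the MAESTRO host reach the endpoint over the network?
curl -v http://<vllm-host>:5000/v1/models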

Performance Optimization

Multi-GPU Setup

For models requiring multiple GPUs:

# Set CUDA devices
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Use tensor parallelism
--tensor-parallel-size 4

# Optimize NCCL
export NCCL_P2P_DISABLE=1  # If P2P not supported
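
Put together, a multi-GPU launch might look like the following sketch (model path and served name are placeholders; drop NCCL_P2P_DISABLE if your GPUs support peer-to-peer):

export CUDA_VISIBLE_DEVICES=0,1,2,3
export NCCL_P2P_DISABLE=1  # only if P2P is not supported

python -m vllm.entrypoints.openai.api_server \
    --model "/path/to/models/your-model" \
    --tensor-parallel-size 4 \
    --port 5000 \
    --host 0.0.0.0 \
    --served-model-name "localmodel-large"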

Memory Optimization

# Reduce memory fragmentation
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

# Enable prefix caching and chunked prefill to ease memory pressure
--enable-prefix-caching
--enable-chunked-prefill

Throughput Optimization

# Increase batch size for throughput
--max-num-seqs 256

# But reduce for latency
--max-num-seqs 1

Best Practices

Model Selection

  1. Start with smaller models - Test pipeline first
  2. Scale up gradually - Find optimal size
  3. Use quantization - AWQ/GPTQ for efficiency
  4. Match to task - Don't oversize unnecessarily

Next Steps

  1. Choose your model based on hardware and needs
  2. Deploy VLLM with recommended settings
  3. Configure MAESTRO to use local endpoint
  4. Test with examples from our Example Reports
  5. Optimize settings based on performance
  6. Scale as needed with additional resources

Additional Resources