Documents Overview¶
The document management system is the foundation of MAESTRO's research capabilities, allowing you to upload, organize, and search through your document library.
How Document Processing Works¶
When you upload a document, MAESTRO:
- Validates the file and checks for duplicates (SHA256)
- Stores the original file and creates a database record
- Converts the document to markdown format using:
    - PDF: Marker for extraction (GPU-accelerated when available)
    - Word: python-docx for .docx/.doc files
    - Markdown: Direct processing
- Extracts metadata (title, authors, abstract) with an LLM
- Splits content into overlapping chunks (2 paragraphs per chunk, 1 paragraph of overlap)
- Generates BGE-M3 embeddings (dense and sparse)
- Indexes for semantic search
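Duplicate detection is based on the file's SHA256 hash, so byte-identical files should be flagged regardless of filename. You can compute the same hash locally, for example to check whether two copies of a paper are identical before uploading (this assumes the standard sha256sum utility; the filenames are placeholders):
# Byte-identical files produce the same hash and will be detected as duplicates
sha256sum paper_v1.pdf paper_v2.pdf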
Supported File Formats¶
- PDF (.pdf) - Processed with Marker
- Microsoft Word (.docx, .doc) - Extracted with python-docx
- Markdown (.md, .markdown) - Direct processing
- Plain Text (.txt) - Simple text extraction
Document Library Features¶
Organization¶
- Document Groups - Create collections for projects
- Metadata - Automatic extraction of title, authors, abstract
- Duplicate Detection - SHA256 hash prevents duplicate uploads
Search Capabilities¶
- Semantic Search - Find documents by meaning using embeddings
- Hybrid Search - Combines dense and sparse embeddings
- API Endpoint - `/api/documents/search` for programmatic access (sketch below)
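A rough sketch of calling the endpoint programmatically; the host, port, HTTP method, authentication header, and JSON body shape are assumptions, not the documented schema, so consult the API reference for the actual contract:
# Hypothetical request: method, port, auth header, and body shape are assumptions
curl -X POST http://localhost:8000/api/documents/search \
  -H "Authorization: Bearer $MAESTRO_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "machine learning optimization"}'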
Processing Status¶
Documents go through these stages:
- Uploading - File being transferred
- Processing - Conversion and embedding generation
- Completed - Ready for use
- Failed - Error occurred (check the `processing_error` field)
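To see where documents currently sit in this pipeline, you can query the `status` column directly, using the same container, database, and table as in the Troubleshooting section below:
# List each document's current processing status
docker exec maestro-postgres psql -U maestro_user -d maestro_db \
  -c "SELECT id, title, status FROM documents;"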
Using the Document System¶
Web Interface¶
- Navigate to the Documents tab
- Click "Upload Documents" or drag and drop files
- Monitor processing status
- Search documents using the search bar
CLI Upload¶
Use the `maestro-cli.sh` script for bulk uploads:
# Upload documents for a user
./maestro-cli.sh ingest <username> <directory>
# Force re-embedding
./maestro-cli.sh ingest <username> <directory> --force-reembed
# Add to specific group
./maestro-cli.sh ingest <username> <directory> --group <group_id>
Creating Document Groups¶
Groups help organize documents:
# Create a group via CLI
./maestro-cli.sh create-group <username> "Group Name"
# List groups
./maestro-cli.sh list-groups
Search and Retrieval¶
Search Interface¶
Enter natural language queries to find relevant documents. The system uses:
- BGE-M3 dense embeddings (1024 dimensions)
- BGE-M3 sparse embeddings (30,000 dimensions)
- Configurable hybrid search weights (see the sketch below)
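How the two scores are combined is not spelled out here; a common hybrid formulation, which the configurable weights presumably parameterize, is a weighted sum of the dense and sparse relevance scores:

score(query, doc) = w_dense × sim_dense(query, doc) + w_sparse × sim_sparse(query, doc)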
Search Examples¶
- Conceptual: "papers about machine learning optimization"
- Specific: "CRISPR gene editing techniques"
- Author-based: "research by Smith et al"
Storage Architecture¶
Documents are stored in three locations:
- PostgreSQL Database - Metadata and document chunks
- File System - Original files and markdown conversions
    - Raw files: `/app/data/raw_files/{doc_id}_{filename}`
    - Markdown: `/app/data/markdown_files/{doc_id}.md`
- PostgreSQL with pgvector - Embeddings for search
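To inspect what is actually on disk, you can list these directories inside the container that mounts `/app/data` (the container name `maestro-backend` below is an assumption; substitute your compose service name):
# List original uploads and their markdown conversions (container name assumed)
docker exec maestro-backend ls /app/data/raw_files
docker exec maestro-backend ls /app/data/markdown_files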
Processing Performance¶
- GPU Available: Faster PDF processing with Marker
- CPU Only: Falls back to CPU (slower but functional)
- Batch Processing: Use CLI for efficient bulk uploads
- Concurrent Uploads: Multiple files processed in parallel
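To confirm that GPU acceleration is actually in effect, check that the NVIDIA driver is visible both on the host and inside the processing container (the container name `maestro-backend` is an assumption):
# On the host: verify a GPU and driver are present
nvidia-smi
# Inside the container (name assumed): verify the GPU was passed through
docker exec maestro-backend nvidia-smi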
Troubleshooting¶
Processing Failures¶
Check the `processing_error` field in the database:
docker exec maestro-postgres psql -U maestro_user -d maestro_db \
-c "SELECT id, title, processing_error FROM documents WHERE status = 'failed';"
Re-processing Documents¶
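The `--force-reembed` flag shown under CLI Upload forces re-embedding of documents that have already been ingested:
# Regenerate embeddings for a user's existing documents
./maestro-cli.sh ingest <username> <directory> --force-reembed
Whether this also re-runs the markdown conversion is not specified; for documents whose conversion itself failed, deleting and re-uploading them is the simplest fallback.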
Storage Issues¶
Monitor disk usage:
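For example (the `maestro-backend` container name is an assumption; adjust it to your deployment):
# Disk usage of original files and markdown conversions (paths from Storage Architecture above)
docker exec maestro-backend du -sh /app/data/raw_files /app/data/markdown_files
# Overall Docker disk usage, including the PostgreSQL volume
docker system df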
Best Practices¶
- File Preparation
    - Ensure PDFs have extractable text (not scanned images)
    - Keep files under 50MB for optimal processing
    - Use descriptive filenames
- Organization
    - Create groups for different projects
    - Use consistent naming conventions
    - Regularly clean up unused documents
- Performance
    - Use CLI for bulk uploads
    - Process large batches during off-hours
    - Monitor storage usage
Next Steps¶
- Uploading Documents - Detailed upload guide
- Document Groups - Organizing collections
- Search Guide - Advanced search techniques
- CLI Commands - Bulk operations