Paperless-ngx: Go Paperless with Self-Hosted Document Management
The Problem: Your Scanner is Collecting Dust
You've got a multifunction scanner gathering dust on your shelf, your filing cabinet is overflowing, and you're still searching through PDFs by filename because scanned documents have no searchable text. Paperless-ngx solves this: it's a self-hosted document management system that OCRs everything, auto-tags documents, and makes them actually searchable.
This post is for homelabbers running Docker who want to ditch cloud scanning services (hello, ScanSnap cloud sync fees) and own their document pipeline end-to-end. On my T5810 with 24GB RAM running Ubuntu 24.04, I process ~150 documents monthly with zero issues.
Prerequisites
- Docker 26.1.3 or later with docker-compose v2.24+
- Paperless-ngx 2.11.3 or later (check GitHub releases for current version)
- Minimum 4GB RAM allocated to containers; 8GB+ recommended for concurrent OCR
- A multifunction scanner (USB or network-connected) or smartphone with a scanning app
- PostgreSQL 15+ (we'll run this in Docker)
- Redis 7.0+ for task queue management
Building Your Docker Compose Stack
Paperless-ngx is really four cooperating services: the web frontend, the document consumer (OCR worker), a PostgreSQL database, and a Redis task broker. Running it properly means orchestrating all four with correct networking and persistent storage.
Here's my production stack configuration (tested on 2.11.3):
```yaml
version: '3.8'

services:
  db:
    image: postgres:15.6-alpine
    container_name: paperless-db
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - paperless-db:/var/lib/postgresql/data
    networks:
      - paperless-network
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U paperless"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  redis:
    image: redis:7.2-alpine
    container_name: paperless-redis
    networks:
      - paperless-network
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:2.11.3
    container_name: paperless-web
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    ports:
      - "8000:8000"
    environment:
      PAPERLESS_SECRET_KEY: ${SECRET_KEY}
      PAPERLESS_ALLOWED_HOSTS: ${ALLOWED_HOSTS}
      PAPERLESS_DBENGINE: postgresql
      PAPERLESS_DBHOST: db
      PAPERLESS_DBNAME: paperless
      PAPERLESS_DBUSER: paperless
      PAPERLESS_DBPASS: ${DB_PASSWORD}
      PAPERLESS_REDIS: redis://redis:6379
      PAPERLESS_ADMIN_USER: ${ADMIN_USER}
      PAPERLESS_ADMIN_PASSWORD: ${ADMIN_PASSWORD}
      PAPERLESS_TIME_ZONE: ${TIME_ZONE}
      PAPERLESS_OCR_LANGUAGE: eng
      PAPERLESS_TASK_WORKERS: 2
    volumes:
      - paperless-data:/usr/src/paperless/data
      - paperless-media:/usr/src/paperless/media
      - ./consume:/usr/src/paperless/consume
      - ./export:/usr/src/paperless/export
    networks:
      - paperless-network
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/"]
      interval: 30s
      timeout: 10s
      retries: 3

  consumer:
    image: ghcr.io/paperless-ngx/paperless-ngx:2.11.3
    container_name: paperless-consumer
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    environment:
      PAPERLESS_SECRET_KEY: ${SECRET_KEY}
      PAPERLESS_DBENGINE: postgresql
      PAPERLESS_DBHOST: db
      PAPERLESS_DBNAME: paperless
      PAPERLESS_DBUSER: paperless
      PAPERLESS_DBPASS: ${DB_PASSWORD}
      PAPERLESS_REDIS: redis://redis:6379
      PAPERLESS_OCR_LANGUAGE: eng
      PAPERLESS_TASK_WORKERS: 2
    volumes:
      - paperless-data:/usr/src/paperless/data
      - paperless-media:/usr/src/paperless/media
      - ./consume:/usr/src/paperless/consume
      - ./export:/usr/src/paperless/export
    networks:
      - paperless-network
    restart: unless-stopped
    command: document_consumer

volumes:
  paperless-db:
  paperless-data:
  paperless-media:

networks:
  paperless-network:
    driver: bridge
```

Create your `.env` file in the same directory:
```ini
# .env files don't run shell commands, so generate the key first and paste
# the literal value. For example:
#   python3 -c 'from django.core.management.utils import get_random_secret_key; print(get_random_secret_key())'
# or, without Django installed:
#   openssl rand -base64 48
SECRET_KEY=paste_the_generated_value_here
DB_PASSWORD=your_secure_password_here
ADMIN_USER=admin
ADMIN_PASSWORD=your_admin_password
ALLOWED_HOSTS=localhost,127.0.0.1,paperless.your.domain
TIME_ZONE=America/New_York
```

Launch the stack with:

```bash
docker-compose up -d
```

Verify all services are healthy:

```bash
docker-compose ps
```

Gotcha #1: The consumer container needs exactly the same environment variables as the webserver, not just a different command. If you skip `PAPERLESS_SECRET_KEY` or the database credentials on the consumer, OCR will fail silently.
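On first launch the containers can take a minute or two to pass their health checks. A small wait loop (using the port and URL from the compose file above) saves you from refreshing a dead page:

```shell
#!/bin/sh
# Poll the webserver until it responds, then report that it's ready.
until curl -fs http://localhost:8000/ > /dev/null; do
  echo "waiting for paperless-web..."
  sleep 5
done
echo "Paperless is up at http://localhost:8000"
```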
Configuring OCR and Scanner Integration
Paperless-ngx uses Tesseract for OCR. The base image ships with English; additional Tesseract language packs can be installed at container start via the `PAPERLESS_OCR_LANGUAGES` environment variable, so there's no need to extend the Docker image. For most homelabs, the default is fine.
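If you do need extra languages, the two variables work together; a sketch for an English/German setup (values are examples):

```yaml
environment:
  PAPERLESS_OCR_LANGUAGES: deu fra   # Tesseract packs installed at container start
  PAPERLESS_OCR_LANGUAGE: eng+deu    # languages OCR actually applies to documents
```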
To actually feed documents into Paperless, you have three options:
Option 1: Network Upload (Web UI)
Log in at `http://localhost:8000` with your admin credentials and drag-and-drop PDFs into the interface. Fast for occasional documents, but tedious at scale.
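The same upload can be scripted against the REST API's document endpoint, which is handy for batch imports. The token placeholder is an assumption; create one under your user profile in the web UI:

```shell
# Upload a PDF via the Paperless-ngx REST API.
# Replace YOUR_API_TOKEN with a token generated in the web UI.
curl -X POST http://localhost:8000/api/documents/post_document/ \
  -H "Authorization: Token YOUR_API_TOKEN" \
  -F "document=@/path/to/scan.pdf"
```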
Option 2: Scanner to Folder (Recommended for Physical Scanners)
Configure your scanner's "scan to network folder" feature to write directly into the `./consume` directory mapped in the compose file. Most multifunction printers support SMB/CIFS shares. On Linux:

```bash
sudo apt-get install samba samba-common
sudo mkdir -p /mnt/paperless/consume
sudo chown nobody:nogroup /mnt/paperless/consume
sudo chmod 0777 /mnt/paperless/consume
```

Note that this path must be the same directory the compose file mounts at `/usr/src/paperless/consume`, so either share the `./consume` directory itself or update the compose bind mount to point at `/mnt/paperless/consume`. Then configure `/etc/samba/smb.conf`:
```ini
[paperless]
path = /mnt/paperless/consume
browsable = yes
guest ok = yes
read only = no
create mask = 0777
directory mask = 0777
```

Restart Samba and point your scanner's network settings at your server's IP and this share. Documents appear in Paperless within seconds of the scan completing.
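After editing the config, validate and restart with the standard Samba tooling:

```shell
# Check smb.conf for syntax errors, then restart the Samba daemon
testparm -s
sudo systemctl restart smbd
```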
Option 3: Mobile App (Travel/Ad-hoc)
Install Paperless Mobile on your phone and upload documents directly. It handles image straightening and compression before uploading.
Gotcha #2: If you're using the scanner-to-folder approach and network scans fail to appear, the consume folder probably has permission issues. The paperless container runs as UID 1000 by default (configurable via `USERMAP_UID`); if your Samba share writes files owned by root, the consumer can't read them. Always verify folder ownership: `ls -la ./consume`.
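When ownership is wrong, the usual fix is to hand the folder to the UID the container runs as (1000 is the default; adjust if you've remapped it):

```shell
# Match the consume folder's ownership to the paperless container user
sudo chown -R 1000:1000 ./consume
sudo chmod -R u+rwX ./consume
```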
Fine-Tuning Auto-Tagging and Matching
Paperless shines when you configure automatic document matching. Instead of manually tagging every receipt, you can create rules that apply tags and set document types automatically.
In the web UI, navigate to Admin → Document Types and create types for your common documents:
- Invoice
- Receipt
- Bank Statement
- Medical
- Warranty
Then go to Admin → Correspondents and set up matching; in Paperless-ngx, matching rules are configured on each tag, correspondent, and document type individually. For example:
```
Document type: Invoice
Match: if document contains "invoice" or "inv-"
Tags: add "Financial"

Document type: Receipt
Match: if document contains "receipt" or filename contains "amazon"
Tags: add "Tax Deductible", "Expense"
```
These rules run as each document is consumed, so new documents are tagged automatically without manual intervention.
For serious auto-organization, switch the matching algorithm on your tags and document types to "Auto", the built-in machine-learning classifier, after you've manually tagged ~50 documents. The model learns your tagging patterns and gets better over time.
Performance Tuning for Your Homelab
By default, Paperless-ngx runs OCR single-threaded, which on slower CPUs can take 30+ seconds per document. In the compose file above, I set `PAPERLESS_TASK_WORKERS: 2`. Adjust this based on available cores:
```bash
PAPERLESS_TASK_WORKERS=4  # 4+ core systems
PAPERLESS_TASK_WORKERS=2  # 2-core systems
PAPERLESS_TASK_WORKERS=1  # Raspberry Pi or 1-core VMs
```
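Worker count multiplies with the per-worker OCR thread count, so keep the product at or below your core count; a sketch (values are examples for a 4-core host):

```yaml
environment:
  PAPERLESS_TASK_WORKERS: 2        # parallel consume/OCR tasks
  PAPERLESS_THREADS_PER_WORKER: 2  # OCR threads per task; workers x threads <= cores
```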
Monitor actual OCR performance after a week of scanning:
```bash
docker-compose logs consumer | grep "pages/min"
```

On my T5810 (Xeon E5-2630 v2, 8 cores), I consistently see 15-20 pages per minute with 4 workers. If you're seeing 3-5 ppm, increase workers or check that the database connection isn't the bottleneck.
For persistent storage, keep your volumes on reliable media. If your data lives on spinning drives, moving the PostgreSQL volume to an SSD is the safest speed-up. For maximum throughput you can go further and back the database volume with tmpfs (RAM), at the cost of durability:

```yaml
volumes:
  paperless-db:
    driver_opts:
      type: tmpfs
      device: tmpfs
      o: size=2g  # RAM-backed; contents are lost on host reboot
```

This trades durability for speed: only do it if you have reliable, frequent backups of your PostgreSQL database.
Common Issues and Troubleshooting
Consumer Crashes With "Out of Memory"
Large PDF files can spike memory usage during OCR. Check the consumer logs:

```bash
docker-compose logs consumer | tail -50
```

If you see Tesseract or Ghostscript OOM errors, reduce `PAPERLESS_TASK_WORKERS` by half and increase the container memory limit in the compose file:

```yaml
consumer:
  # ...
  mem_limit: 4g
  memswap_limit: 4g
```

Scanned Documents Not Appearing
Check three things in order:
```bash
# 1. Do files exist in the consume folder?
ls -la ./consume

# 2. Are they readable by the container?
docker-compose exec consumer ls -la /usr/src/paperless/consume

# 3. Check consumer logs for parsing errors
docker-compose logs consumer --follow
```
If files exist but aren't being processed, restart the consumer and check for stuck tasks:
```bash
docker-compose restart consumer
docker-compose logs consumer | grep "Consuming"
```

OCR Producing Gibberish
This happens with low-resolution scans or non-English text. Set the OCR language (and make sure the matching Tesseract pack is installed, e.g. via `PAPERLESS_OCR_LANGUAGES`):

```bash
PAPERLESS_OCR_LANGUAGE=deu      # German
PAPERLESS_OCR_LANGUAGE=fra      # French
PAPERLESS_OCR_LANGUAGE=eng+deu  # Multiple languages
```
Also ensure your scanner is set to 300 DPI minimum. Anything below 200 DPI produces unreliable results.
Web UI Hangs When Searching Large Databases
This tends to show up at 50,000+ documents, usually because the search index or PostgreSQL planner statistics have gone stale rather than because of missing indexes. Refresh the statistics first:

```bash
docker-compose exec db psql -U paperless -d paperless -c "ANALYZE;"
```

If searches still hang, your database host is likely undersized; give PostgreSQL more memory or faster storage before reaching for anything more exotic.
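Paperless-ngx also maintains its own full-text search index, separate from PostgreSQL; rebuilding it with the `document_index` management command often fixes slow or stale search results:

```shell
# Rebuild the full-text search index inside the webserver container
docker-compose exec webserver document_index reindex
```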
Next Steps: Integration and Backup
You now have a fully functional self-hosted document management system. What's left:
- Backup strategy: Use `docker-compose exec db pg_dump` weekly and store backups off-server.
- Reverse proxy: Put Paperless behind nginx or Traefik with SSL. See my post on Traefik reverse proxy setup for homelab SSL.
- Scanner integration: If your scanner doesn't support network shares, consider scan-to-paperless, a wrapper that watches a folder and uploads to Paperless via API.
- Export workflow: Set up monthly exports to cold storage (external drive or S3) as an extra backup tier.
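A minimal weekly backup sketch combining the `pg_dump` from the first bullet with the built-in `document_exporter` (the `backups/` directory and 8-week retention are assumptions):

```shell
#!/bin/sh
# Dump the PostgreSQL database to a dated SQL file
docker-compose exec -T db pg_dump -U paperless paperless \
  > "backups/paperless-$(date +%F).sql"

# Export original documents plus metadata to the mapped ./export directory
docker-compose exec -T webserver document_exporter ../export

# Keep only the 8 most recent weekly dumps
ls -1t backups/paperless-*.sql | tail -n +9 | xargs -r rm
```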
After a month of use, you'll have searchable, tagged documents instead of a filing cabinet. The machine learning classifier gets better monthly. Most importantly, you own the entire pipeline—no vendor lock-in, no scanning fees, no cloud storage subscriptions.
Disclosure: This post contains affiliate links. If you purchase through these links, we may earn a small commission at no extra cost to you. We only recommend services we've tested and trust.