Paperless-ngx: Go Paperless with Self-Hosted Document Management

The Problem: Your Scanner is Collecting Dust

You've got a multifunction scanner gathering dust on your shelf, your filing cabinet is overflowing, and you're still searching through PDFs by filename because search doesn't work. Paperless-ngx solves this: it's a self-hosted document management system that OCRs everything, auto-tags documents, and makes them actually searchable.

This post is for homelabbers running Docker who want to ditch cloud scanning services (hello, ScanSnap cloud sync fees) and own their document pipeline end-to-end. On my T5810 with 24GB RAM running Ubuntu 24.04, I process ~150 documents monthly with zero issues.

Prerequisites

  • Docker 26.1.3 or later with docker-compose v2.24+
  • Paperless-ngx 2.11.3 or later (check GitHub releases for current version)
  • Minimum 4GB RAM allocated to containers; 8GB+ recommended for concurrent OCR
  • A multifunction scanner (USB or network-connected) or smartphone with a scanning app
  • PostgreSQL 15+ (we'll run this in Docker)
  • Redis 7.0+ for task queue management

Building Your Docker Compose Stack

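Paperless-ngx has three tiers: the web frontend, the document processing side (a consumer that watches the consume folder plus the task workers that run OCR), and the backend services (PostgreSQL, with Redis as the task broker). Running it properly means orchestrating these with proper networking and persistent storage. Note that the official image already starts the web server, the folder consumer, and the task workers inside a single container, so the dedicated `consumer` service below is a split I've chosen rather than a requirement; if you drop it, read `webserver` wherever a later command says `consumer`.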

Here's my production stack configuration (tested on 2.11.3):

version: '3.8'

services:
  db:
    image: postgres:15.6-alpine
    container_name: paperless-db
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - paperless-db:/var/lib/postgresql/data
    networks:
      - paperless-network
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U paperless"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  redis:
    image: redis:7.2-alpine
    container_name: paperless-redis
    networks:
      - paperless-network
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:2.11.3
    container_name: paperless-web
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    ports:
      - "8000:8000"
    environment:
      PAPERLESS_SECRET_KEY: ${SECRET_KEY}
      PAPERLESS_ALLOWED_HOSTS: ${ALLOWED_HOSTS}
      PAPERLESS_DBENGINE: postgresql
      PAPERLESS_DBHOST: db
      PAPERLESS_DBNAME: paperless
      PAPERLESS_DBUSER: paperless
      PAPERLESS_DBPASS: ${DB_PASSWORD}
      PAPERLESS_REDIS: redis://redis:6379
      PAPERLESS_ADMIN_USER: ${ADMIN_USER}
      PAPERLESS_ADMIN_PASSWORD: ${ADMIN_PASSWORD}
      PAPERLESS_TIME_ZONE: ${TIME_ZONE}
      PAPERLESS_OCR_LANGUAGE: eng
      PAPERLESS_TASK_WORKERS: 2
    volumes:
      - paperless-data:/usr/src/paperless/data
      - paperless-media:/usr/src/paperless/media
      - ./consume:/usr/src/paperless/consume
      - ./export:/usr/src/paperless/export
    networks:
      - paperless-network
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/"]
      interval: 30s
      timeout: 10s
      retries: 3

  consumer:
    image: ghcr.io/paperless-ngx/paperless-ngx:2.11.3
    container_name: paperless-consumer
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    environment:
      PAPERLESS_SECRET_KEY: ${SECRET_KEY}
      PAPERLESS_DBENGINE: postgresql
      PAPERLESS_DBHOST: db
      PAPERLESS_DBNAME: paperless
      PAPERLESS_DBUSER: paperless
      PAPERLESS_DBPASS: ${DB_PASSWORD}
      PAPERLESS_REDIS: redis://redis:6379
      PAPERLESS_OCR_LANGUAGE: eng
      PAPERLESS_TASK_WORKERS: 2
    volumes:
      - paperless-data:/usr/src/paperless/data
      - paperless-media:/usr/src/paperless/media
      - ./consume:/usr/src/paperless/consume
      - ./export:/usr/src/paperless/export
    networks:
      - paperless-network
    restart: unless-stopped
    command: document_consumer

volumes:
  paperless-db:
  paperless-data:
  paperless-media:

networks:
  paperless-network:
    driver: bridge

Create your `.env` file in the same directory. Compose reads it as plain `KEY=value` text and does not run shell commands inside it, so generate the secret key separately and paste the result in:

SECRET_KEY=paste_a_long_random_string_here
DB_PASSWORD=your_secure_password_here
ADMIN_USER=admin
ADMIN_PASSWORD=your_admin_password
ALLOWED_HOSTS=localhost,127.0.0.1,paperless.your.domain
TIME_ZONE=America/New_York
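
Since the `.env` file can't run commands itself, the secret has to be generated beforehand. A minimal sketch, assuming a Linux host with `/dev/urandom` and coreutils; `gen_secret` is a hypothetical helper name, not part of Paperless-ngx:

```shell
# Print a random hex string built from N random bytes (2*N characters).
# Hex output avoids characters such as '$' or '#' that would need
# quoting or escaping inside a .env file.
gen_secret() {
  local nbytes="${1:-32}"
  head -c "$nbytes" /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Print a line that can be pasted into .env.
printf 'SECRET_KEY=%s\n' "$(gen_secret 32)"
```

Run it once, copy the printed line into `.env`, and keep the file out of version control.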

Launch the stack with:

docker-compose up -d

Verify all services are healthy:

docker-compose ps
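
`docker-compose ps` is a one-off snapshot, and the web container can take a minute or two to pass its health check on first start. If you'd rather block until a container is ready (for example in a deploy script), something like the following may help. It is a sketch that assumes the container names from the compose file above; `wait_healthy` is a hypothetical helper name:

```shell
# Poll a container's Docker health status until it reports "healthy".
# Usage: wait_healthy CONTAINER [MAX_TRIES] [DELAY_SECONDS]
wait_healthy() {
  local name="$1" tries="${2:-30}" delay="${3:-2}"
  local i=0 status=""
  while [ "$i" -lt "$tries" ]; do
    # Containers without a healthcheck make this template fail, so
    # errors are swallowed and treated as "not healthy yet".
    status="$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null || true)"
    if [ "$status" = "healthy" ]; then
      echo "$name is healthy"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$name did not become healthy (last status: ${status:-unknown})" >&2
  return 1
}
```

Typical use would be `wait_healthy paperless-db && wait_healthy paperless-redis && wait_healthy paperless-web 60 5`.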

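Gotcha #1: The consumer container needs the same `PAPERLESS_*` settings as the webserver (secret key, database credentials, Redis URL), not just a different `command`. If you skip the database settings on the consumer, it falls back to its default local SQLite database instead of PostgreSQL, and scanned files never show up in the web UI even though nothing obviously errors.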

Configuring OCR and Scanner Integration

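Paperless-ngx uses Tesseract (via OCRmyPDF) for OCR. The official image ships with English, German, French, Italian, and Spanish language data; for other languages, list the Tesseract codes in `PAPERLESS_OCR_LANGUAGES` and the container installs them at startup, so there's no need to build a custom image. For most homelabs, the default is fine.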

To actually feed documents into Paperless, you have three options:

Option 1: Network Upload (Web UI)

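Log in at `http://localhost:8000` with your admin credentials and drag and drop PDFs into the interface. Fast for occasional documents, but tedious at scale.

Option 2: Scan to Network Folder (SMB)

Configure your scanner's "scan to network folder" feature to write directly to the consume directory mapped in the compose file. Most multifunction printers support SMB/CIFS shares. The share must expose the same host directory that the container mounts: the example below uses `/mnt/paperless/consume`, so either change the compose mapping to `/mnt/paperless/consume:/usr/src/paperless/consume` or point the share at the `consume` folder next to your compose file. On Linux: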

sudo apt-get install samba samba-common
sudo mkdir -p /mnt/paperless/consume
sudo chown nobody:nogroup /mnt/paperless/consume
sudo chmod 0777 /mnt/paperless/consume

Then configure `/etc/samba/smb.conf`:

[paperless]
   path = /mnt/paperless/consume
   browsable = yes
   guest ok = yes
   read only = no
   create mask = 0777
   directory mask = 0777

Restart Samba and configure your scanner's network settings to point to your server's IP and this share. Documents appear in Paperless within seconds of the scan completing.
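
A common slip at this step is a share that points at a different directory than the one the container mounts. The sketch below reads the `path` of a share out of a Samba config so you can compare it with the compose bind mount. It assumes a simple `smb.conf` layout like the one above (one `key = value` per line); `smb_share_path` is a hypothetical helper name:

```shell
# Print the "path" setting of one share from a Samba config file.
# Usage: smb_share_path CONF_FILE SHARE_NAME
smb_share_path() {
  local conf="$1" share="$2"
  awk -v want="[$share]" '
    # A new [section] header: remember whether it is the wanted share.
    /^[ \t]*\[/ { in_share = ($1 == want); next }
    # Inside the wanted share, print the value after "path =".
    in_share && /^[ \t]*path[ \t]*=/ {
      sub(/^[^=]*=[ \t]*/, "")
      print
      exit
    }
  ' "$conf"
}
```

For example, `smb_share_path /etc/samba/smb.conf paperless` should print the directory you mounted as the consume folder. On Ubuntu, `testparm -s` validates the config and `sudo systemctl restart smbd` applies it.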

Option 3: Mobile App (Travel/Ad-hoc)

Install Paperless Mobile on your phone and upload documents directly. It handles image straightening and compression before uploading.

Gotcha #2: If you're using the scanner-to-folder approach and your network scans fail or never show up, the consume folder probably has permission issues. The Paperless container runs as UID 1000 by default; if the share or the files in it are owned by root with restrictive permissions, Paperless can't read or remove them. Always verify folder ownership: `ls -la ./consume`.
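
The ownership check can be scripted. This is a rough sketch assuming GNU `stat` (as on Ubuntu) and the default container UID of 1000; `check_consume_dir` is a hypothetical helper, and it only inspects the directory itself, not the files inside it:

```shell
# Check that a directory is owned by the expected UID or is world-writable.
# Usage: check_consume_dir DIR [EXPECTED_UID]
# Returns 0 if it looks usable, 1 if not, 2 if the directory is missing.
check_consume_dir() {
  local dir="$1" want_uid="${2:-1000}"
  local owner mode
  [ -d "$dir" ] || { echo "missing directory: $dir" >&2; return 2; }
  owner="$(stat -c '%u' "$dir")"   # numeric owner UID
  mode="$(stat -c '%a' "$dir")"    # octal permissions, e.g. 755
  # A final octal digit of 7 means "other" users have rwx.
  case "$mode" in
    *7) echo "ok: $dir is world-writable (mode $mode)"; return 0 ;;
  esac
  if [ "$owner" = "$want_uid" ]; then
    echo "ok: $dir is owned by uid $owner (mode $mode)"
    return 0
  fi
  echo "problem: $dir is owned by uid $owner with mode $mode" >&2
  return 1
}
```

Run it as `check_consume_dir ./consume` on the host; a failure usually means a `chown 1000:1000` on the folder is needed.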

Fine-Tuning Auto-Tagging and Matching

Paperless shines when you configure automatic document matching. Instead of manually tagging every receipt, you can create rules that apply tags and set document types automatically.

In the web UI, open Document Types (under Manage in the sidebar) and create types for your common documents:

  • Invoice
  • Receipt
  • Bank Statement
  • Medical
  • Warranty

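Matching rules live on the objects themselves: each document type, tag, and correspondent has a matching algorithm and a match string that are checked against the document's text. For example:

Document type: Invoice
Matching algorithm: Any
Match: invoice inv-

Tag: Financial
Matching algorithm: Any
Match: invoice inv-

Document type: Receipt
Matching algorithm: Any
Match: receipt

Tag: Expense
Matching algorithm: Any
Match: receipt amazon

These rules run during consumption, once OCR has produced the text, so new documents are tagged automatically without manual intervention. Matching looks at document content only; to act on filenames or source folders, use a Workflow with a filename filter instead.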
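
Paperless-ngx's "Any" matching algorithm fires when at least one of the listed words appears in the document text, as a whole word and (by default) ignoring case. To get a feel for which words are worth listing, you can approximate that behaviour against OCR text on the command line. This is only an illustration with GNU `grep`, not Paperless's actual implementation, and `matches_any` is a hypothetical helper:

```shell
# Return 0 if TEXT contains at least one of the given words as a
# whole word, ignoring case. Usage: matches_any TEXT WORD...
matches_any() {
  local text="$1" word
  shift
  for word in "$@"; do
    # -q quiet, -i ignore case, -w whole words only, -F literal string
    if printf '%s\n' "$text" | grep -qiwF -- "$word"; then
      return 0
    fi
  done
  return 1
}

if matches_any "INVOICE No. 4711 from ACME Corp" invoice receipt; then
  echo "would match"
fi
```

Note that whole-word matching means "reinvoiced" does not trigger a rule for "invoice".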

For serious auto-organization, set the matching algorithm of a tag, document type, or correspondent to "Auto". Paperless then trains a machine-learning classifier on the documents you've already organized by hand, so tag ~50 documents manually first. The model is retrained periodically, learns your tagging patterns, and gets better as you correct its mistakes.

Performance Tuning for Your Homelab

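By default, Paperless-ngx processes one document at a time (`PAPERLESS_TASK_WORKERS` defaults to 1), which on slower CPUs can mean 30+ seconds per document and a growing queue after a batch scan. In the compose file above, I set `PAPERLESS_TASK_WORKERS: 2`. Each worker also runs OCR on several threads (`PAPERLESS_THREADS_PER_WORKER`), so keep workers times threads at or below your core count. Adjust the worker count based on available cores: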

PAPERLESS_TASK_WORKERS=4    # 4+ core systems
PAPERLESS_TASK_WORKERS=2    # 2-core systems
PAPERLESS_TASK_WORKERS=1    # Raspberry Pi or 1-core VMs
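
The same rule of thumb can be expressed as a small helper that reads the core count from `nproc`. This simply encodes the table above, not an official recommendation; `suggest_workers` is a hypothetical name:

```shell
# Map a CPU core count to a PAPERLESS_TASK_WORKERS value using the
# rule of thumb above. Usage: suggest_workers [CORES]
suggest_workers() {
  local cores="${1:-$(nproc)}"
  if [ "$cores" -ge 4 ]; then
    echo 4
  elif [ "$cores" -ge 2 ]; then
    echo 2
  else
    echo 1
  fi
}

echo "PAPERLESS_TASK_WORKERS=$(suggest_workers)"
```

On a memory-constrained box, start one step lower than the suggestion, since every extra worker can hold a full document in RAM during OCR.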

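Monitor actual OCR performance after a week of scanning. Paperless doesn't log a pages-per-minute figure, so compare the timestamps of the start and finish messages for a few documents and divide by their page counts:

docker-compose logs -t webserver consumer | grep -E "Consuming|consumption finished"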

On my T5810 (Xeon E5-2630 v3, 8 cores), I consistently see 15-20 pages per minute with `PAPERLESS_TASK_WORKERS` raised to 4. If you're seeing 3-5 ppm, increase workers or check that the database connection isn't bottlenecked.

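For persistent storage, ensure your volumes are on reliable media. If you're using spinning drives, consider moving the database to an SSD by binding the named volume to a directory on that drive (the path below is a placeholder and must exist before you start the stack):

volumes:
  paperless-db:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /mnt/ssd/paperless-db  # placeholder: a directory on your SSD

Avoid backing the database with tmpfs for extra speed: tmpfs lives in RAM, so every restart or power loss wipes the database and leaves you restoring from your last PostgreSQL backup.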

Common Issues and Troubleshooting

Consumer Crashes With "Out of Memory"

Large PDF files can spike memory usage during OCR. Check the consumer logs:

docker-compose logs consumer | tail -50

If you see out-of-memory errors from Tesseract, OCRmyPDF, or Ghostscript (or the container is killed with exit code 137), reduce `PAPERLESS_TASK_WORKERS` by half and increase the container memory limit in compose:

consumer:
  ...
  mem_limit: 4g
  memswap_limit: 4g
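
To see how close the containers get to that limit while a batch is processing, `docker stats` can be filtered down to the ones above a threshold. A sketch assuming the container names from the compose file; `containers_over_mem` is a hypothetical helper:

```shell
# List Paperless containers whose memory usage is above LIMIT percent.
# The percentage is relative to the container's mem_limit if one is
# set, otherwise to total host memory.
# Usage: containers_over_mem [LIMIT_PERCENT]
containers_over_mem() {
  local limit="${1:-80}"
  docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' \
      paperless-web paperless-consumer |
    awk -v limit="$limit" '{
      pct = $2
      sub(/%/, "", pct)            # "91.42%" -> "91.42"
      if (pct + 0 > limit + 0) print $1 " " $2
    }'
}
```

Running `containers_over_mem 80` during a large scan job prints nothing when there is headroom, and the offending container otherwise.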

Scanned Documents Not Appearing

Check three things in order:

# 1. Do files exist in the consume folder?
ls -la ./consume

# 2. Are they readable by the container?
docker-compose exec consumer ls -la /usr/src/paperless/consume

# 3. Check consumer logs for parsing errors
docker-compose logs consumer --follow

If files exist but aren't being processed, restart the consumer and check for stuck tasks:

docker-compose restart consumer
docker-compose logs consumer | grep "Consuming"

OCR Producing Gibberish

Happens with low-resolution scans or non-English text. Set language:

PAPERLESS_OCR_LANGUAGE=deu  # German
PAPERLESS_OCR_LANGUAGE=fra  # French
PAPERLESS_OCR_LANGUAGE=eng+deu  # Multiple languages

Also ensure your scanner is set to 300 DPI minimum. Anything below 200 DPI produces unreliable results.
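
If you're not sure what resolution your scanner actually produced, `pdfimages -list` from poppler-utils reports the DPI of each embedded image. The sketch below pulls out the lowest value; it assumes the current column layout of that command (x-ppi in the 13th field after two header lines), and `min_scan_dpi` is a hypothetical helper:

```shell
# Print the lowest horizontal DPI (x-ppi) among the images in a PDF.
# Requires pdfimages from poppler-utils. Usage: min_scan_dpi FILE.pdf
min_scan_dpi() {
  pdfimages -list "$1" |
    awk 'NR > 2 && $13 ~ /^[0-9]+$/ {
           # Skip the two header lines; field 13 is x-ppi.
           if (min == "" || $13 + 0 < min) min = $13 + 0
         }
         END { if (min != "") print min }'
}
```

A result under 200 for a freshly scanned file suggests the scanner profile needs changing before you blame Tesseract.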

Web UI Hangs When Searching Large Databases

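Happens with 50,000+ documents when PostgreSQL's planner statistics are stale. Refresh them with:

docker-compose exec db psql -U paperless -d paperless -c "ANALYZE;"

If hangs continue, the database host is probably short on RAM or disk I/O. Full-text search is served from Paperless's own search index rather than from PostgreSQL, so rebuilding it with the `document_index reindex` management command is also worth a try before you consider upgrading PostgreSQL to 16+.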

Next Steps: Integration and Backup

You now have a fully functional self-hosted document management system. What's left:

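  • Backup strategy: Run `docker-compose exec -T db pg_dump -U paperless paperless > paperless-db.sql` weekly and store the dumps off-server. The database holds only metadata, so back up the `paperless-media` volume (or use the built-in `document_exporter`) as well.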
  • Reverse proxy: Put Paperless behind nginx or Traefik with SSL. See my post on Traefik reverse proxy setup for homelab SSL.
  • Scanner integration: If your scanner doesn't support network shares, consider scan-to-paperless, a wrapper that watches a folder and uploads to Paperless via API.
  • Export workflow: Set up monthly exports to cold storage (external drive or S3) as an extra backup tier.
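
The weekly database dump is easy to wrap in a script for cron. A minimal sketch, assuming the service name `db` and the `paperless` user and database from the compose file above; `backup_paperless_db` and the `COMPOSE` override are hypothetical conveniences:

```shell
# Compose command to use; set COMPOSE="docker compose" for the plugin.
COMPOSE="${COMPOSE:-docker-compose}"

# Dump the Paperless database to a timestamped file and print its path.
# Usage: backup_paperless_db [OUTPUT_DIR]
backup_paperless_db() {
  local outdir="${1:-./backups}"
  local stamp outfile
  stamp="$(date +%Y%m%d-%H%M%S)"
  outfile="$outdir/paperless-db-$stamp.sql"
  mkdir -p "$outdir" || return 1
  # -T disables TTY allocation so the dump redirects cleanly to a file.
  # Write to a temp name first so a failed dump never looks complete.
  if $COMPOSE exec -T db pg_dump -U paperless -d paperless > "$outfile.tmp"; then
    mv "$outfile.tmp" "$outfile"
    echo "$outfile"
  else
    rm -f "$outfile.tmp"
    echo "pg_dump failed; no backup written" >&2
    return 1
  fi
}
```

Call `backup_paperless_db /mnt/backups/paperless` from the compose directory, then copy the resulting file off-server. For the documents themselves, `docker-compose exec -T webserver document_exporter ../export` writes originals plus metadata into the `./export` folder mapped earlier.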

After a month of use, you'll have searchable, tagged documents instead of a filing cabinet. The machine learning classifier improves as you tag and correct more documents. Most importantly, you own the entire pipeline: no vendor lock-in, no scanning fees, no cloud storage subscriptions.

Disclosure: This post contains affiliate links. If you purchase through these links, we may earn a small commission at no extra cost to you. We only recommend services we've tested and trust.
