Tags: Document Pipelines, Compress PDF, Batch Processing, Legal Archives, S3 Storage

How to Build a Batch-Compression Pipeline That Shrinks Your Legal Archive by 70%

Compressing one PDF is a utility. Compressing 5,000 archived contracts so your AWS S3 costs stop compounding is an infrastructure problem. Here is the four-lever framework and a visual pipeline that automates it at scale.

Lyriryl
Founder & Engineer
5 min read

The direct answer: a batch-compression pipeline targeting 150 DPI image downsampling and JPEG quality 85 will reduce a typical legal archive of scanned contracts by 65–75% in total storage size — with zero visible degradation at screen resolution. The text layers, digital signatures, and document metadata are untouched. The compressor only touches embedded scan images, which constitute 60–85% of archive file size.

If you are paying AWS S3 storage costs on 5,000+ uncompressed scanned PDFs, you are paying for pixels that no human eye will ever see.

Why Archives Bloat: The Scanner Default Problem

Legal and HR document archives grow through a predictable pattern. A paralegal scans a 40-page executed contract on an office scanner set to its default 600 DPI color output. The resulting PDF is 85MB. Nobody changes the scanner default because nobody owns that decision. Over five years, the archive accumulates 5,000 documents averaging 60MB each — 300GB of storage that costs $6.90/month on S3 Standard today, and more next year when the archive doubles.

The 600 DPI scan setting is for print reproduction, not digital archiving. The court does not require 600 DPI. Your document management system does not require 600 DPI. Your lawyers reading contracts on a 2560×1440 monitor cannot physically perceive the difference between a 600 DPI and a 200 DPI scan.

The archive is overprovisioned by a factor of 3–4× because nobody built a compression step into the intake pipeline.

The Four Compression Levers (and Which Ones to Use)

A PDF is a container format. Its file size is dominated by these components:

Object Type                      | % of Archive File Size | Compression Approach
Embedded scan images (JPEG/TIFF) | 60–85%                 | Downsample + re-encode — highest impact
Embedded fonts                   | 5–15%                  | Subset to used glyphs only
Vector line art                  | 1–5%                   | Already compact — negligible gain
Text content streams             | 1–3%                   | Flate re-compression — minor gain

Lever 1: Image Downsampling (The Primary Driver)

Downsampling from 600 DPI to 200 DPI reduces pixel count by ~89% (pixel area scales with the square of DPI ratio: (200/600)² = 0.111). For a 60MB scanned contract, this single operation produces an ~8MB file before any re-encoding.
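The arithmetic is easy to sanity-check in a few lines (Python used purely for illustration; the 60 MB figure is the example contract from the text, and real files land slightly higher than the raw pixel estimate because of structural overhead):

```python
def downsample_pixel_ratio(src_dpi: int, dst_dpi: int) -> float:
    """Fraction of pixels that survive downsampling; area scales with DPI squared."""
    return (dst_dpi / src_dpi) ** 2

remaining = downsample_pixel_ratio(600, 200)
print(f"{remaining:.3f} of pixels kept, {1 - remaining:.0%} removed")
print(f"estimated image payload: {60 * remaining:.1f} MB from a 60 MB scan")
```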

The perceptual threshold for legal documents: 150 DPI is the safe floor for screen-only archives. For documents that must remain printable at full scale (signed originals, court-filed exhibits), 200–250 DPI is the correct target. At 300 DPI, the file is still 70% smaller than the 600 DPI original and indistinguishable in any practical use case.

Lever 2: JPEG Re-encoding Quality (The Risk Lever)

Most free compressors use JPEG quality 60–70 to produce impressive-looking file size metrics. At quality 70, block artifacts become visible in high-contrast areas — signatures, stamps, table borders. At quality 60, scanned text begins to look soft.

The correct value is quality 85. At this level, JPEG compression artifacts fall below the human visual detection threshold for document content. The file is 40–50% smaller than an uncompressed TIFF page with no visible degradation.
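A minimal sketch of the re-encode step using Pillow, with a synthetic gradient standing in for a decoded scan page (a real pipeline would first extract the embedded image from the PDF; the library choice and stand-in image are illustrative assumptions, not the ConvertUniverse implementation):

```python
import io

from PIL import Image

# Synthetic stand-in for one scanned page; a production pipeline would decode
# the actual embedded scan image from the PDF instead.
page = Image.radial_gradient("L").resize((850, 1100))

def reencode(img: Image.Image, quality: int) -> bytes:
    """Encode an image as JPEG at the given quality and return the bytes."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()

q85 = reencode(page, 85)   # the archival target from the text
q95 = reencode(page, 95)   # a near-lossless comparison point
print(f"quality 85: {len(q85)} bytes, quality 95: {len(q95)} bytes")
```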

Lever 3: Font Subsetting (Free Savings)

For digitally-created PDFs (Word exports, InDesign), embedded fonts include every glyph in the typeface — even characters not present in the document. A full font embedding is ~150KB; subsetting it to the 40–60 characters actually used drops it to ~15KB. Zero visual impact, free savings.

Lever 4: Flate Re-compression (Marginal, Always Safe)

Re-running zlib compression on content streams saves 5–10% on PDFs exported from Microsoft Office without optimization flags. On already-optimized PDFs, the gain is under 2%. Always safe to apply; never the primary driver.
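To see why the gain is real but marginal, here is a stdlib-only simulation of the idea; the repeated content-stream text and the level-1 starting point are assumptions mimicking an unoptimized exporter, not actual PDF stream handling:

```python
import zlib

# Simulate an exporter that deflates a content stream at a fast, low level,
# then re-compress the same bytes at the maximum level for archival storage.
stream = b"BT /F1 12 Tf 72 720 Td (WHEREAS the parties agree) Tj ET\n" * 500
fast = zlib.compress(stream, 1)   # speed-optimized, as unoptimized exports do
best = zlib.compress(stream, 9)   # archival re-compression
print(f"level 1: {len(fast)} bytes, level 9: {len(best)} bytes")
```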

Building the Batch Pipeline

Compressing 5,000 files manually — even with a fast desktop tool — takes days. A document pipeline automates the entire operation:

[Node: Cloud Storage Input]     ← Connect to S3 bucket / Google Drive folder
  → [Node: Compress PDF]        ← 200 DPI, JPEG quality 85, font subsetting
  → [Node: If/Else]             ← Route: compressed < 10MB → archive tier / else → flag for review
  → [Node: Storage Output]      ← Write back to S3 / Drive with original filename

The ConvertUniverse Compress node applies all four levers in a single operation with configurable quality presets. The pipeline runs batch jobs — 500 files in one execution — with progress logged per document. Files that fail compression (corrupt source, encrypted without owner password) are routed to a separate error folder, not silently skipped.
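The If/Else routing step can be sketched under local-folder assumptions (the directory names, threshold constant, and stand-in files are hypothetical; a production pipeline would operate on S3 objects rather than a temp directory):

```python
import shutil
import tempfile
from pathlib import Path

SIZE_THRESHOLD = 10 * 1024 * 1024  # 10 MB, the If/Else cutoff from the pipeline

def route(compressed: Path, archive_dir: Path, review_dir: Path) -> Path:
    """Small compressed files go to the archive tier; oversized ones are flagged."""
    dest = archive_dir if compressed.stat().st_size < SIZE_THRESHOLD else review_dir
    dest.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(compressed), str(dest / compressed.name)))

# Demo with two stand-in "compressed" files in a temp directory.
root = Path(tempfile.mkdtemp())
small = root / "contract_small.pdf"
small.write_bytes(b"x" * 1024)                 # 1 KB stand-in: archive tier
big = root / "contract_big.pdf"
big.write_bytes(b"x" * (12 * 1024 * 1024))     # 12 MB stand-in: flag for review

print(route(small, root / "archive", root / "review").parent.name)
print(route(big, root / "archive", root / "review").parent.name)
```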

The output: a processed archive where the average contract is 8–12MB instead of 60–85MB. A 300GB archive becomes 40–50GB. At S3 Standard pricing, that is $6.90/month → $1.15/month — before you account for transfer cost reductions on every document retrieval operation.
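The cost arithmetic behind those figures, using the S3 Standard first-tier rate the article assumes (check current AWS pricing before relying on it):

```python
S3_STANDARD_PER_GB = 0.023  # USD per GB-month, first-50TB tier, as of writing

def monthly_cost(gb: float) -> float:
    """Monthly S3 Standard storage cost for a given archive size."""
    return gb * S3_STANDARD_PER_GB

print(f"${monthly_cost(300):.2f}/month -> ${monthly_cost(50):.2f}/month")
```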

Downsampling to 150 DPI solves the single-file problem. But if you are trying to automate this across thousands of archived files using Zapier, you will hit a task-charge wall immediately — 5,000 documents × 1 compression task each = 5,000 tasks charged against your monthly plan. Read: The Zapier Task Tax breakdown →

Run a Batch Compression Test

Upload a sample of your archive to test the compression output. For a production pipeline against your full archive, join the waitlist below — the batch workflow handles S3 and Google Drive folder inputs directly.

Coming Soon

Automate Your Whole Document Pipeline

Stop doing manual tasks. Join the waitlist to get early access to our node-based visual workflow builder.
