TrueFileSize Editorial·May 6, 2025·10 min read

MIME Type Detection in Production — Lessons from Processing 10 Million Files

The Content-Type header your browser sends with an upload is the single biggest lie in web development. It's not malicious — the browser just guesses based on file extension, and it guesses wrong constantly. I've watched Chrome send application/octet-stream for a perfectly valid PDF, Firefox label a .csv as text/plain while Chrome called the same file application/vnd.ms-excel, and Safari mark a .webp as image/jpeg because macOS didn't have a file association for it.

If your upload pipeline trusts that header, you don't have a validation layer. You have a suggestion box.

We process around 10 million file uploads a month across our infrastructure. This post isn't about the theory of MIME detection — you can read about magic bytes and the three-layer approach for that. This is about what happens when you deploy those techniques at scale and discover all the ways they break.

The Evidence: Browsers Can't Agree on Anything

Here's a real table from our logs. Same 8 files, uploaded through each browser's <input type="file"> element. The Content-Type header each browser sent:

| File | Actual Type | Chrome 124 | Firefox 125 | Safari 17.4 |
|------|-------------|-----------|-------------|-------------|
| report.csv | text/csv | text/csv | text/csv | text/csv |
| data.csv (no BOM) | text/csv | application/vnd.ms-excel | text/plain | text/plain |
| photo.heic | image/heic | image/heic | application/octet-stream | image/heic |
| scan.pdf | application/pdf | application/pdf | application/pdf | application/pdf |
| archive.tar.gz | application/gzip | application/x-gzip | application/gzip | application/x-gzip |
| font.woff2 | font/woff2 | font/woff2 | application/octet-stream | application/octet-stream |
| model.glb | model/gltf-binary | application/octet-stream | application/octet-stream | application/octet-stream |
| data.parquet | application/vnd.apache.parquet | application/octet-stream | application/octet-stream | application/octet-stream |

Two files where all browsers agree on the right type. Four where there's variance. And two where every browser just shrugs and says "bytes, I guess."

That second CSV row is my favorite. Chrome sees the .csv extension, checks the Windows registry, finds Microsoft Excel's association, and sends application/vnd.ms-excel. Firefox on Linux has no such association and goes with text/plain. Same file. Same bytes. Totally different MIME type. If your backend switches behavior based on Content-Type, you've just introduced a bug that only manifests on specific OS/browser combinations.
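
One way to act on this is to treat the header as a hint and normalize it before it touches any logic. The sketch below is a minimal illustration, not our production code: `canonical_mime` and the alias table are hypothetical, and the single alias shown comes from the log table above.

```python
# Hypothetical helper: fold browser-reported aliases into one canonical type.
# The alias table is illustrative (taken from the log table), not exhaustive.
MIME_ALIASES = {
    "application/x-gzip": "application/gzip",
}

# Values that carry no information: the browser is only saying "bytes".
UNINFORMATIVE = {"", "application/octet-stream"}


def canonical_mime(header_value):
    """Return a normalized MIME hint, or None if the header says nothing useful."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    base = (header_value or "").split(";", 1)[0].strip().lower()
    if base in UNINFORMATIVE:
        return None
    return MIME_ALIASES.get(base, base)
```

Note what the table cannot fix: application/vnd.ms-excel is a legitimate type for real .xls files, so it can't simply be aliased to text/csv. That case needs the file's bytes, not a lookup.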

Magic Bytes Detection: The Library Shootout

OK, so the browser header is unreliable. You need server-side detection. Three main options:

`file-type` (Node.js)

The go-to for JavaScript backends. It reads the first 4100 bytes and matches against a built-in signature database. Pure JavaScript, no native dependencies.

import { fileTypeFromFile } from 'file-type';

async function detectMime(filePath) {
  const result = await fileTypeFromFile(filePath);

  if (!result) {
    // No magic bytes match — could be CSV, JSON, plain text, or unknown binary
    // Don't panic yet (we'll handle this below)
    return { mime: 'application/octet-stream', ext: 'bin', confident: false };
  }

  return { mime: result.mime, ext: result.ext, confident: true };
}

`python-magic` (Python wrapper around libmagic)

Thin wrapper around the system's libmagic library — the same engine behind the file command on Linux/macOS. Much broader signature database than file-type, but requires the native library installed.

import json

import magic


def detect_mime(file_path):
    # python-magic delegates to libmagic, same as running `file --mime-type`
    detected = magic.from_file(file_path, mime=True)

    # libmagic returns text/plain for a LOT of things
    # CSV, JSON, XML, YAML, .env files, SQL dumps: all "text/plain"
    if detected == 'text/plain':
        return refine_text_type(file_path, detected)

    return detected


def refine_text_type(file_path, fallback):
    """libmagic says text/plain, so let's do better."""
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        head = f.read(8192)  # First 8KB is enough to sniff

    # Order matters: check the more specific formats first
    if head.lstrip().startswith(('{', '[')):
        try:
            json.loads(head)
            return 'application/json'
        except json.JSONDecodeError:
            # Either not JSON, or a document longer than 8KB whose head was
            # cut off mid-structure. A truncated head can never parse, so
            # large JSON files fall through here and need a full parse.
            pass

    if looks_like_csv(head):
        return 'text/csv'

    # TODO: add XML detection (starts with <?xml or <root>)
    # TODO: YAML is hard to sniff reliably, skipping for now

    return fallback
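
The snippet above calls `looks_like_csv` without showing it. One plausible shape for such a helper, using only the standard library, is sketched below. The row and column thresholds are assumptions, and it only looks at the first 10 lines, so a quoted field that spans a line break right at that cutoff could be misjudged.

```python
import csv
import io


def looks_like_csv(head, min_rows=2, min_cols=2):
    """Heuristic: the first few lines parse into rows of equal width."""
    # Only the first 10 lines are inspected; that is usually enough to judge.
    sample = "\n".join(head.splitlines()[:10])
    try:
        rows = [row for row in csv.reader(io.StringIO(sample)) if row]
    except csv.Error:
        return False
    if len(rows) < min_rows:
        return False
    width = len(rows[0])
    # Single-column files are indistinguishable from plain text, so require
    # at least min_cols columns and the same count on every row.
    return width >= min_cols and all(len(row) == width for row in rows)
```

Quoted commas are handled by `csv.reader`, so a row like `"Lovelace, Ada","first, programmer"` still counts as two columns.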

`libmagic` directly (C, via FFI)

If you're using Go, Rust, or another language, you can bind to libmagic directly. Fastest option, but you own the dependency management.

Benchmarks: How They Actually Perform

We ran each library against 50,000 files from our upload logs (mix of images, documents, archives, and weird stuff):

| Library | Avg per file | P99 latency | Correct detection | Coverage (has a match) |
|---------|-------------|-------------|-------------------|----------------------|
| file-type 19.x (JS) | 0.12ms | 0.8ms | 99.2% (of matched files) | 78% of all files |
| python-magic 0.4.x | 0.08ms | 0.4ms | 98.8% | 94% of all files |
| libmagic direct (C) | 0.03ms | 0.15ms | 98.8% | 94% of all files |

The speed difference barely matters unless you're processing thousands of files per second. What does matter is coverage. file-type only knows ~300 file signatures. libmagic knows over 2,000. That gap shows up in practice: file-type returns undefined for Parquet files, .woff2 fonts, .glb 3D models, and a bunch of niche document formats that libmagic handles fine.

But here's the kicker: both libraries struggle with the same category of file. Text.

The Hard Edge Cases Nobody Warns You About

Text-based formats: CSV, JSON, plain text, YAML, XML

These are the bane of MIME detection. They have no magic bytes. A CSV file starts with... text. A JSON file starts with { or [. A YAML file starts with --- or just a key. There's nothing unique in the first bytes to identify them.
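
To make the problem concrete, here is a toy version of what signature matching does. It is a sketch with a five-entry table (real libraries carry hundreds of signatures and look at offsets beyond the first bytes), and `match_signature` is a hypothetical name.

```python
# Illustrative signature table; real libraries know hundreds of these.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"\x1f\x8b", "application/gzip"),
]


def match_signature(data):
    """Return the MIME type whose signature prefixes the data, else None."""
    for signature, mime in SIGNATURES:
        if data.startswith(signature):
            return mime
    return None
```

Feed it `b"id,name\n1,Ada\n"` or `b'{"key": "value"}'` and it returns None both times. There is no byte prefix that means "CSV", and a leading `{` is far too common to serve as one.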

Depending on its version, libmagic will call most or all of them text/plain. file-type will return undefined. Both are technically correct and both are completely useless if your application needs to distinguish between a CSV upload and a JSON config file.

You have to sniff the content yourself. Here's what we ended up building:

import { parse as csvParse } from 'csv-parse/sync';

function sniffTextFormat(buffer) {
  const text = buffer.toString('utf-8', 0, Math.min(buffer.length, 8192));
  const trimmed = text.trimStart();

  // JSON — fast check before expensive parse
  if (trimmed[0] === '{' || trimmed[0] === '[') {
    try {
      JSON.parse(text);
      return 'application/json';
    } catch {
      // Started with { but isn't valid JSON — maybe NDJSON?
      const firstLine = trimmed.split('\n')[0];
      try {
        JSON.parse(firstLine);
        return 'application/x-ndjson'; // newline-delimited JSON
      } catch {
        // Not JSON either — fall through
      }
    }
  }

  // XML: reasonably safe to check
  if (trimmed.startsWith('<')) {
    // FIXME: this also matches XHTML (it carries xmlns); need better heuristic
    // For now, require an XML declaration or a namespace attribute
    if (trimmed.startsWith('<?xml') || trimmed.includes('xmlns')) {
      return 'application/xml';
    }
  }

  // CSV — the hardest one. Our heuristic: if csv-parse can parse
  // the first 5 lines without error and finds consistent column counts
  try {
    const rows = csvParse(text, { to: 5, relax_column_count: false });
    if (rows.length >= 2 && rows[0].length >= 2) {
      return 'text/csv';
    }
  } catch {
    // Not CSV-shaped
  }

  return 'text/plain'; // Give up
}

Is this perfect? Not even close. A file that starts with [ could be JSON or it could be someone's bracket-heavy notes. A file with comma-separated values could be CSV or could be a plain text list. We've accepted about a 3% misclassification rate on text files and built our UI to let users correct the detected type when it matters.

Grab some sample CSV files and throw them at your detector. You'll quickly find the edge cases — CSVs with quoted commas, single-column CSVs (which look identical to plain text), TSV files that your CSV parser chokes on.

Polyglot files: Valid as two types simultaneously

We've covered polyglots elsewhere — the short version is that a file can be simultaneously valid as JPEG and JavaScript, or PDF and ZIP. Magic bytes detect the first type they match. A JPEG/JS polyglot starts with FF D8 FF, so every library correctly says "JPEG." The embedded JavaScript is invisible to detection.

For this post, the production-relevant lesson is: detection tells you what a file claims to be. Re-encoding (for images) or sandboxed parsing (for documents) tells you what it actually is. Don't confuse detection with validation.

Encrypted and compressed files

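Encrypted files are opaque by design. A PDF that has been AES-encrypted as a whole file (as opposed to PDF's built-in password protection, which leaves the %PDF header in place) is just random-looking bytes: no %PDF header, no recognizable structure. libmagic will say application/octet-stream. file-type will say undefined. Both correct. Both unhelpful.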

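Password-protected ZIPs are a partial exception: they still have the PK header and ZIP central directory. The contents are encrypted, but the container is recognizable. Password-protected .docx files go the other way. Office wraps the encrypted package in an OLE compound file, the same container legacy .doc uses, so the magic bytes are D0 CF 11 E0 instead of PK and detection reports a compound document rather than a ZIP.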
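
Because the ZIP central directory stays readable, you can tell a password-protected archive from a plain one without the password. A minimal sketch with Python's standard `zipfile` module follows; the function name and return labels are made up for illustration, and it assumes the archive uses per-entry encryption with an unencrypted directory.

```python
import io
import zipfile


def zip_encryption_status(data):
    """Classify bytes as 'not-zip', 'plain-zip', or 'encrypted-zip'."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # Bit 0 of each entry's general-purpose flags marks encryption.
            # Reading the central directory needs no password.
            encrypted = any(info.flag_bits & 0x1 for info in archive.infolist())
    except zipfile.BadZipFile:
        return "not-zip"
    return "encrypted-zip" if encrypted else "plain-zip"
```

A result of "encrypted-zip" is a natural candidate for the same cautious tagging as extension-only files: the container is known, the contents are not.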

Our production rule: if magic bytes return nothing and the file extension claims it's something specific, we accept the extension as a hint but tag the file as detection: extension-only in our metadata. Downstream processors know to be more cautious with those files.

The Counterargument: When Simple Detection Is Enough

I've spent this whole post making MIME detection sound terrifying. Let me push back on myself.

If your application only accepts images — JPEGs, PNGs, maybe WebPs — magic bytes detection is close to bulletproof. These formats have unambiguous headers, file-type and libmagic both handle them perfectly, and re-encoding with sharp gives you a safety net against polyglots. You don't need the text-sniffing pipeline. You don't need to worry about CSV-vs-JSON ambiguity. Just validate magic bytes + re-encode. Done.

The complexity explodes when your application accepts "documents" broadly — PDFs, spreadsheets, CSVs, JSON, XML, archives. That's when you hit every edge case I described. And frankly, most applications do accept that range eventually. The product roadmap always expands.

One more nuance: performance. I've seen teams build elaborate detection pipelines with multiple library calls, content sniffing, and deep inspection — then deploy it behind an upload endpoint that handles 50 files per day. That's over-engineering. At 50 files/day, you could validate by hand and still have time for lunch.

Our 10-million-files-a-month pipeline justifies the complexity. Yours might not. Be honest about your scale.

The Verdict: A Production Decision Tree

After two years of iteration, here's the decision tree we actually use. Not the ideal one. The one that works.

Is the file empty (0 bytes)?
├── Yes → Reject. Always.
└── No → Is the file binary (magic bytes detected)?
    ├── Yes
    │   ├── Do the magic bytes match the extension?
    │   │   ├── Yes → Accept. Confidence: high.
    │   │   └── No → Flag as suspicious.
    │   │       Extension says .pdf but bytes say JPEG?
    │   │       → Reject or quarantine. Log it.
    │   │       Extension says .jpg but bytes say PNG?
    │   │       → Probably fine (user renamed), accept but normalize extension.
    │   └── Is it an image?
    │       └── Re-encode with sharp. Strips polyglot payloads.
    │           Serve the re-encoded version, not the original.
    │
    └── No (file-type returned undefined)
        └── Is the content valid UTF-8?
            ├── Yes → Run text sniffing (JSON? CSV? XML?)
            │   ├── Detected → Use sniffed type. Confidence: medium.
            │   └── Unknown → Fall back to text/plain.
            └── No → application/octet-stream. Unknown binary.
                Accept extension as hint, tag as low-confidence.

And the simplified code version, for those who want to copy-paste:

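import path from 'node:path';
import { fileTypeFromBuffer } from 'file-type';

// Extensions users type that differ from the one file-type reports
const EXT_ALIASES = { jpeg: 'jpg', tiff: 'tif' };

async function detectWithConfidence(buffer, originalFilename) {
  // extname() returns '' when the name has no extension
  const rawExt = path.extname(originalFilename).slice(1).toLowerCase();
  const ext = EXT_ALIASES[rawExt] ?? rawExt;

  // Step 0: Empty files are rejected outright
  if (buffer.length === 0) {
    return { mime: null, ext, confidence: 'reject', extensionMismatch: false };
  }

  // Step 1: Magic bytes
  const detected = await fileTypeFromBuffer(buffer);

  if (detected) {
    // Got a binary match: cross-check extension (exact match, after aliasing,
    // so photo.jpeg is not flagged against a detected 'jpg')
    const extensionMismatch = ext !== '' && ext !== detected.ext;

    return {
      mime: detected.mime,
      ext: detected.ext,
      confidence: extensionMismatch ? 'suspicious' : 'high',
      extensionMismatch,
    };
  }

  // Step 2: No magic bytes, try text sniffing
  if (isUtf8(buffer)) {
    const sniffed = sniffTextFormat(buffer);
    return {
      mime: sniffed,
      ext: ext || 'txt',
      confidence: sniffed === 'text/plain' ? 'low' : 'medium',
      extensionMismatch: false,
    };
  }

  // Step 3: Unknown binary
  return {
    mime: 'application/octet-stream',
    ext: ext || 'bin',
    confidence: 'none',
    extensionMismatch: false,
  };
}

function isUtf8(buffer) {
  try {
    // fatal: true makes TextDecoder throw on invalid UTF-8.
    // stream: true tolerates a multi-byte character cut off at the
    // 1024-byte boundary instead of treating it as an error.
    new TextDecoder('utf-8', { fatal: true })
      .decode(buffer.subarray(0, 1024), { stream: true });
    return true;
  } catch {
    return false;
  }
}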

import magic
from pathlib import Path

def detect_with_confidence(file_path: str, original_filename: str) -> dict:
    """Production MIME detection with confidence scoring."""
    ext = Path(original_filename).suffix.lower().lstrip('.')

    # Reject empty files before asking libmagic anything
    if Path(file_path).stat().st_size == 0:
        return {'mime': None, 'ext': ext, 'confidence': 'reject', 'reason': 'empty file'}

    detected = magic.from_file(file_path, mime=True)

    # libmagic always returns *something* — check if it's useful
    if detected not in ('text/plain', 'application/octet-stream'):
        return {
            'mime': detected,
            'ext': ext,
            'confidence': 'high',
        }

    if detected == 'text/plain':
        refined = refine_text_type(file_path, detected)
        return {
            'mime': refined,
            'ext': ext,
            'confidence': 'medium' if refined != 'text/plain' else 'low',
        }

    # application/octet-stream — libmagic doesn't know either
    # Trust extension as a hint but mark confidence accordingly
    return {
        'mime': detected,
        'ext': ext,
        'confidence': 'extension-only',
    }

Three Failure Stories (Because I Learn Best from Other People's Mistakes)

The Great CSV Incident. We added CSV support to our import pipeline. Detection returned text/plain for every CSV because — no magic bytes. So we checked the extension. .csv? It's a CSV. Except a user uploaded a tab-separated file named export.csv. Our CSV parser choked. Their data import failed silently — rows got merged, columns shifted. We didn't catch it for two days because the error was in the data, not in the pipeline. Now we parse the first 10 rows and validate column consistency before accepting any CSV, regardless of what detection says.
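
A rough sketch of that kind of pre-flight check, using only Python's standard library. The names, the candidate delimiter list, and the result strings are all hypothetical, and a tab-separated file that happens to have exactly one comma on every row would still fool it.

```python
import csv
import io

CANDIDATE_DELIMITERS = [",", "\t", ";", "|"]


def consistent_delimiter(text, max_rows=10):
    """Return the delimiter that splits the first rows into equal, multi-column rows."""
    sample = "\n".join(text.splitlines()[:max_rows])
    for delimiter in CANDIDATE_DELIMITERS:
        try:
            reader = csv.reader(io.StringIO(sample), delimiter=delimiter)
            rows = [row for row in reader if row]
        except csv.Error:
            continue
        widths = {len(row) for row in rows}
        # Exactly one width, and more than one column, means a clean split.
        if len(rows) >= 2 and len(widths) == 1 and widths.pop() >= 2:
            return delimiter
    return None


def check_csv_upload(text):
    """Accept only files whose first rows are consistently comma-separated."""
    delimiter = consistent_delimiter(text)
    if delimiter == ",":
        return "accept"
    if delimiter is None:
        return "reject: no consistent columns"
    return f"reject: delimiter is {delimiter!r}, not a comma"
```

With a check like this in front of the parser, a tab-separated export.csv is rejected with a reason the user can act on instead of being imported with shifted columns.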

The HEIC That Wasn't. An iPhone user uploaded 40 .heic photos. Our file-type version at the time didn't have HEIC support (it was added in v17). Detection returned undefined. Our fallback trusted the extension. The processing pipeline tried to resize them with sharp, which also didn't support HEIC at that point. Forty broken thumbnails. The fix was upgrading file-type, adding HEIC to sharp via the heif-dec plugin, and — this is the real lesson — testing with actual iPhone photos, not just test fixtures we generated ourselves. (TrueFileSize has sample images in every format including HEIC, which would've caught this.)

The PDF That Was a ZIP. A .pdf file that was actually a ZIP archive containing a PDF. The magic bytes said PK (ZIP). Our validator flagged it as "extension mismatch: claims PDF, is ZIP." We rejected it. User complained. Turns out some government form systems distribute PDFs inside ZIP wrappers and name the outer file .pdf. We had to add an exception for "ZIP containing a single PDF" to our mismatch handler. Still not sure if that was the right call, but the user stopped complaining.
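
An exception like "ZIP containing a single PDF" can be checked mechanically. The following is a sketch under assumptions, not the handler described above: the function name is hypothetical, and it verifies the inner file by its own magic bytes rather than trusting its name.

```python
import io
import zipfile
import zlib


def is_zip_wrapping_single_pdf(data):
    """True if the bytes are a ZIP whose only file entry is a PDF."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            files = [info for info in archive.infolist() if not info.is_dir()]
            if len(files) != 1:
                return False
            # Check the member's own magic bytes, not just its name.
            with archive.open(files[0]) as member:
                return member.read(5) == b"%PDF-"
    except (zipfile.BadZipFile, RuntimeError, NotImplementedError, zlib.error):
        # Not a ZIP, password-protected, unsupported compression, or corrupt.
        return False
```

Anything that passes still deserves the same treatment as a direct PDF upload once unwrapped; the check only decides whether the mismatch is the benign kind.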

The Boring Conclusion

MIME detection isn't a solved problem. It's a spectrum of confidence. Binary files with magic bytes? High confidence. Text files with ambiguous formats? Medium at best. Encrypted files or unknown binaries? You're guessing, and you should be honest about it.

Build your pipeline to carry that confidence score forward. Don't make binary accept/reject decisions on low-confidence detection. Let the user confirm when you're unsure. And test with wrong-extension files, text edge cases, and real uploads from real devices — not just the happy path.
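
As a sketch of what "carrying the score forward" can look like, here is a small policy table that maps a confidence level to a next step. The confidence labels are the ones used in the code earlier in this post; the action names and the `needs_exact_type` switch are assumptions for illustration.

```python
# Hypothetical policy table: what to do at each confidence level.
# Unknown or missing levels fall back to quarantine.
ACTIONS = {
    "high": "accept",
    "medium": "accept",
    "low": "ask_user",
    "extension-only": "ask_user",
    "none": "ask_user",
    "suspicious": "quarantine",
    "reject": "reject",
}


def next_step(detection, needs_exact_type=False):
    """Map a detection result (a dict with a 'confidence' key) to an action."""
    confidence = detection.get("confidence")
    # When a downstream parser depends on the exact type, a sniffed
    # (medium) result deserves confirmation too.
    if needs_exact_type and confidence == "medium":
        return "ask_user"
    return ACTIONS.get(confidence, "quarantine")
```

The point of the table is that only two rows are hard decisions. Everything in between routes to a human or to a more careful processor.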

Your browser's Content-Type header is still a lie, though. That part's not nuanced at all.
