Sample Parquet File Download — Free Apache Parquet for Testing
Download free Apache Parquet example files from 100 KB to 50 MB in Snappy, GZIP, and uncompressed variants. These Parquet test files are built for data engineers and analysts working with Spark, pandas, BigQuery, Athena, DuckDB, and Snowflake. Use them to test data lake ingestion, ETL pipelines, and columnar query performance.
| File | Rows | Compression |
|---|---|---|
| sample-100kb.parquet | 1,100 | Snappy |
| sample-500kb.parquet | 3,300 | Snappy |
| sample-1mb.parquet | 5,000 | GZIP |
| sample-5mb.parquet | 22,000 | Snappy |
| sample-10mb.parquet | 44,000 | Snappy |
| sample-50mb.parquet | 150,000 | Snappy |
| sample-uncompressed.parquet | 2,000 | None |
| sample-gzip.parquet | 3,000 | GZIP |
Use cases for sample Parquet files
- Testing Parquet readers (pyarrow, DuckDB, Spark, pandas)
- Benchmarking Parquet vs CSV read performance (see the timing sketch after this list)
- Testing data lake ingestion pipelines (S3, GCS, ADLS)
- Verifying Parquet schema evolution and compatibility
- Testing BI tool Parquet import (Tableau, Power BI, Metabase)
- Validating Snappy vs GZIP compression handling
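A minimal timing sketch for the benchmarking item above, assuming pandas is installed and that `data.parquet` and an equivalent `data.csv` exist on disk (both filenames are placeholders; convert one of the sample files to CSV for a fair comparison):

```python
import time
import pandas as pd

def time_read(label, fn):
    # Time a single cold read; repeat and average for real benchmarks.
    start = time.perf_counter()
    df = fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s, {len(df)} rows")

time_read("parquet", lambda: pd.read_parquet("data.parquet"))
time_read("csv", lambda: pd.read_csv("data.csv"))
```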
Parquet vs CSV vs JSON for analytics
| Feature | Parquet | CSV | JSON |
|---|---|---|---|
| Storage layout | Columnar | Row-based | Row-based |
| File size (1M rows) | ~50 MB | ~200 MB | ~400 MB |
| Column pruning | Yes (read only needed cols) | No (read all) | No (read all) |
| Schema enforcement | Yes (typed columns) | No (all strings) | Partial |
| Predicate pushdown | Yes (row group stats) | No | No |
| Human readable | No (binary) | Yes | Yes |
| Best for | Analytics, data lakes, ML | Data exchange, imports | APIs, configs |
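The column pruning and predicate pushdown rows are where Parquet wins for analytics. A short sketch of both, assuming pandas, pyarrow, and duckdb are installed; the column names `name` and `age` are placeholders (matching the DuckDB example below):

```python
import pandas as pd
import duckdb

# Column pruning: only the listed columns are read from disk,
# which is impossible with row-based CSV or JSON.
df = pd.read_parquet("data.parquet", columns=["name", "age"])

# Predicate pushdown: DuckDB checks row-group min/max statistics
# and skips row groups that cannot match the WHERE clause.
result = duckdb.sql("SELECT name FROM 'data.parquet' WHERE age > 30").df()
```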
How to read and write Parquet files
```python
# Python (pandas + pyarrow -- the most common route)
import pandas as pd

df = pd.read_parquet('data.parquet')
df.to_parquet('output.parquet', engine='pyarrow')

# Python (polars -- a faster alternative)
import polars as pl

df = pl.read_parquet('data.parquet')

# DuckDB (SQL directly on Parquet files, no import step)
import duckdb

duckdb.sql("SELECT * FROM 'data.parquet' WHERE age > 30")
duckdb.sql("COPY (SELECT * FROM my_table) TO 'out.parquet'")

# Apache Spark (assumes an active SparkSession named `spark`)
df = spark.read.parquet("s3://bucket/data.parquet")
```

```bash
# CLI inspection (parquet-tools / pqrs)
parquet-tools schema data.parquet
parquet-tools head data.parquet
pqrs schema data.parquet
```
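To inspect schema and row-group statistics from Python instead of the CLI, pyarrow exposes the file metadata directly (a sketch using the same placeholder filename):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile('data.parquet')
print(pf.schema_arrow)                     # column names and types
meta = pf.metadata
print(meta.num_rows, meta.num_row_groups)  # file totals
print(meta.row_group(0).column(0).statistics)  # the min/max stats behind predicate pushdown
```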
Parquet compression codecs
| Codec | Ratio | Speed | When to use |
|---|---|---|---|
| Snappy | Good | Very fast | Default — best balance (Spark, DuckDB) |
| GZIP | Best | Slow | Long-term storage, bandwidth-limited |
| ZSTD | Best | Fast | Modern alternative to GZIP (Spark 3+) |
| None | 1:1 | Fastest | Testing, already-compressed data |
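To compare codecs on your own data, pandas passes the `compression` argument straight through to pyarrow, whose standard wheels ship Snappy, GZIP, and ZSTD support. A minimal sketch (the input matches a sample file above; output names are placeholders):

```python
import os
import pandas as pd

df = pd.read_parquet("sample-1mb.parquet")

# Write the same data with each codec and compare the resulting file sizes.
for codec in ["snappy", "gzip", "zstd", None]:
    out = f"out-{codec or 'none'}.parquet"
    df.to_parquet(out, compression=codec)
    print(f"{codec or 'none'}: {os.path.getsize(out):,} bytes")
```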
Technical specifications
| Property | Value |
|---|---|
| Full name | Apache Parquet |
| Extension | .parquet |
| Type | Columnar binary storage format |
| Magic bytes | PAR1 (header and footer) |
| Compression | Snappy (default), GZIP, ZSTD, LZ4, Brotli, None |
| Encoding | Dictionary, RLE, Delta, Bit-packing |
| Nested types | Dremel-style repetition/definition levels |
| Developed by | Twitter + Cloudera (2013), now an Apache project |
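Because the PAR1 magic bytes sit at both the start and the end of the file, a validity check needs only eight bytes of I/O. A quick sketch you can run against any of the sample files above:

```python
def looks_like_parquet(path: str) -> bool:
    # A valid Parquet file begins and ends with the 4-byte magic "PAR1".
    with open(path, "rb") as f:
        header = f.read(4)
        f.seek(-4, 2)  # 4 bytes before end of file
        footer = f.read(4)
    return header == b"PAR1" and footer == b"PAR1"

print(looks_like_parquet("sample-100kb.parquet"))  # expected: True
```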
Related reading
- Mocking REST APIs with JSON Fixtures: fast frontend iteration without a backend. MSW, json-server, and sample fixtures for users, products, and nested objects. Copy-paste examples.
- Sample JSON Data for API Testing and Mocking: free sample JSON files for testing REST APIs. Users, products, nested objects, GeoJSON, and API response wrappers with code examples.
- Seeding Test Databases with Sample Data (SQL, JSON, CSV): how to seed development and staging databases using sample SQL dumps, JSON files, and CSV imports from TrueFileSize. Covers PostgreSQL, MySQL, SQLite, MongoDB, and Prisma.