Sample Parquet File Download — Free Apache Parquet for Testing
Download free Apache Parquet example files from 100 KB to 50 MB in Snappy, GZIP, and uncompressed variants. These Parquet test files are built for data engineers and analysts working with Spark, pandas, BigQuery, Athena, DuckDB, and Snowflake. Use them to test data lake ingestion, ETL pipelines, and columnar query performance.
| File | Rows | Compression |
|---|---|---|
| sample-100kb.parquet | 1,100 | Snappy |
| sample-500kb.parquet | 3,300 | Snappy |
| sample-1mb.parquet | 5,000 | GZIP |
| sample-5mb.parquet | 22,000 | Snappy |
| sample-10mb.parquet | 44,000 | Snappy |
| sample-50mb.parquet | 150,000 | Snappy |
| sample-uncompressed.parquet | 2,000 | None |
| sample-gzip.parquet | 3,000 | GZIP |
Use cases for sample Parquet files
- Testing Parquet readers (pyarrow, DuckDB, Spark, pandas)
- Benchmarking Parquet vs CSV read performance (see the timing sketch after this list)
- Testing data lake ingestion pipelines (S3, GCS, ADLS)
- Verifying Parquet schema evolution and compatibility
- Testing BI tool Parquet import (Tableau, Power BI, Metabase)
- Validating Snappy vs GZIP compression handling
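A minimal timing sketch for the benchmarking item above, assuming pandas is installed and that `data.parquet` and an equivalent `data.csv` exist on disk (both filenames are placeholders; convert one of the sample files to CSV for a fair comparison):

```python
import time
import pandas as pd

def time_read(label, fn):
    # Time a single cold read; repeat and average for real benchmarks.
    start = time.perf_counter()
    df = fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s, {len(df)} rows")

time_read("parquet", lambda: pd.read_parquet("data.parquet"))
time_read("csv", lambda: pd.read_csv("data.csv"))
```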
Parquet vs CSV vs JSON for analytics
| Feature | Parquet | CSV | JSON |
|---|---|---|---|
| Storage layout | Columnar | Row-based | Row-based |
| File size (1M rows) | ~50 MB | ~200 MB | ~400 MB |
| Column pruning | Yes (read only needed cols) | No (read all) | No (read all) |
| Schema enforcement | Yes (typed columns) | No (all strings) | Partial |
| Predicate pushdown | Yes (row group stats) | No | No |
| Human readable | No (binary) | Yes | Yes |
| Best for | Analytics, data lakes, ML | Data exchange, imports | APIs, configs |
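The column pruning and predicate pushdown rows are where Parquet wins for analytics. A short sketch of both, assuming pandas, pyarrow, and duckdb are installed; the column names `name` and `age` are placeholders (matching the DuckDB example below):

```python
import pandas as pd
import duckdb

# Column pruning: only the listed columns are read from disk,
# which is impossible with row-based CSV or JSON.
df = pd.read_parquet("data.parquet", columns=["name", "age"])

# Predicate pushdown: DuckDB checks row-group min/max statistics
# and skips row groups that cannot match the WHERE clause.
result = duckdb.sql("SELECT name FROM 'data.parquet' WHERE age > 30").df()
```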
How to read and write Parquet files
```python
# Python (pandas + pyarrow -- the most common route)
import pandas as pd

df = pd.read_parquet('data.parquet')
df.to_parquet('output.parquet', engine='pyarrow')

# Python (polars -- a faster alternative)
import polars as pl

df = pl.read_parquet('data.parquet')

# DuckDB (SQL directly on Parquet files, no import step)
import duckdb

duckdb.sql("SELECT * FROM 'data.parquet' WHERE age > 30")
duckdb.sql("COPY (SELECT * FROM my_table) TO 'out.parquet'")

# Apache Spark (assumes an active SparkSession named `spark`)
df = spark.read.parquet("s3://bucket/data.parquet")
```

```bash
# CLI inspection (parquet-tools / pqrs)
parquet-tools schema data.parquet
parquet-tools head data.parquet
pqrs schema data.parquet
```
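To inspect schema and row-group statistics from Python instead of the CLI, pyarrow exposes the file metadata directly (a sketch using the same placeholder filename):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile('data.parquet')
print(pf.schema_arrow)                     # column names and types
meta = pf.metadata
print(meta.num_rows, meta.num_row_groups)  # file totals
print(meta.row_group(0).column(0).statistics)  # the min/max stats behind predicate pushdown
```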
Parquet compression codecs
| Codec | Ratio | Speed | When to use |
|---|---|---|---|
| Snappy | Good | Very fast | Default — best balance (Spark, DuckDB) |
| GZIP | Best | Slow | Long-term storage, bandwidth-limited |
| ZSTD | Best | Fast | Modern alternative to GZIP (Spark 3+) |
| None | 1:1 | Fastest | Testing, already-compressed data |
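To compare codecs on your own data, pandas passes the `compression` argument straight through to pyarrow, whose standard wheels ship Snappy, GZIP, and ZSTD support. A minimal sketch (the input matches a sample file above; output names are placeholders):

```python
import os
import pandas as pd

df = pd.read_parquet("sample-1mb.parquet")

# Write the same data with each codec and compare the resulting file sizes.
for codec in ["snappy", "gzip", "zstd", None]:
    out = f"out-{codec or 'none'}.parquet"
    df.to_parquet(out, compression=codec)
    print(f"{codec or 'none'}: {os.path.getsize(out):,} bytes")
```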
Technical specifications
| Property | Value |
|---|---|
| Full name | Apache Parquet |
| Extension | .parquet |
| Type | Columnar binary storage format |
| Magic bytes | PAR1 (header and footer) |
| Compression | Snappy (default), GZIP, ZSTD, LZ4, Brotli, None |
| Encoding | Dictionary, RLE, Delta, Bit-packing |
| Nested types | Dremel-style repetition/definition levels |
| Developed by | Twitter + Cloudera (2013), now an Apache project |
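Because the PAR1 magic bytes sit at both the start and the end of the file, a validity check needs only eight bytes of I/O. A quick sketch you can run against any of the sample files above:

```python
def looks_like_parquet(path: str) -> bool:
    # A valid Parquet file begins and ends with the 4-byte magic "PAR1".
    with open(path, "rb") as f:
        header = f.read(4)
        f.seek(-4, 2)  # 4 bytes before end of file
        footer = f.read(4)
    return header == b"PAR1" and footer == b"PAR1"

print(looks_like_parquet("sample-100kb.parquet"))  # expected: True
```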
Related reading
- Mocking REST APIs with JSON Fixtures: fast frontend iteration without a backend. MSW, json-server, and sample fixtures for users, products, and nested objects. Copy-paste examples.
- Sample JSON Data for API Testing and Mocking: free sample JSON files for testing REST APIs. Users, products, nested objects, GeoJSON, and API response wrappers with code examples.
- Seeding Test Databases with Sample Data (SQL, JSON, CSV): how to seed development and staging databases using sample SQL dumps, JSON files, and CSV imports from TrueFileSize. Covers PostgreSQL, MySQL, SQLite, MongoDB, and Prisma.