Pandas I/O Cheats¶
TL;DR
Prefer Parquet for speed & types. For CSVs, always set dtypes and parse dates explicitly.
Read¶
import pandas as pd
df_csv = pd.read_csv(
    "in.csv",
    dtype={"account_id": "string", "amount": "float64"},
    parse_dates=["txn_date", "posted_date"],
    dayfirst=False,  # US-style month/day/year
    encoding="utf-8",
)
df_xlsx = pd.read_excel("in.xlsx", sheet_name=0, dtype_backend="pyarrow")
df_parq = pd.read_parquet("in.parquet")
Write¶
df_csv.to_csv("out.csv", index=False)
df_parq.to_parquet("out.parquet", index=False)
Common Fixes¶
# normalize headers
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(r"[^a-z0-9]+", "_", regex=True)
)
# coerce numbers; unparseable values become NaN instead of raising
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
# consistent date zone-naive
for c in ["txn_date", "posted_date"]:
    df[c] = pd.to_datetime(df[c], errors="coerce").dt.tz_localize(None)
CSV essentials¶
import csv, pandas as pd
# Read with stable types
df = pd.read_csv(
    "in.csv",
    dtype_backend="pyarrow",                 # arrow-backed dtypes (pandas >= 2.0)
    na_values=["", "NA", "N/A", "null"],     # custom NA markers on top of the defaults
    keep_default_na=True,
    encoding="utf-8",
    parse_dates=["txn_date", "posted_date"], # if present
    # thousands=",",                         # uncomment if your numbers use commas
    # usecols=["id", "amount", "txn_date"],  # load only needed columns
)
# Write with sane defaults
df.to_csv(
    "out.csv",
    index=False,
    encoding="utf-8",
    quoting=csv.QUOTE_MINIMAL,
    lineterminator="\n",
)
Large files (streaming)¶
import pandas as pd
reader = pd.read_csv(
    "big.csv",
    dtype_backend="pyarrow",
    usecols=["id", "amount", "ts"],
    parse_dates=["ts"],
    chunksize=200_000,
)
for i, chunk in enumerate(reader):
    # process/clean per chunk here if needed
    chunk.to_parquet(f"parquet/part_{i:03}.parquet", index=False)  # snappy compression by default
Parquet essentials¶
# Write
df.to_parquet("out.parquet", index=False, compression="snappy") # or "gzip"
# Read
df2 = pd.read_parquet("out.parquet", dtype_backend="pyarrow")
Simple partitioning idea¶
If you already have a quarter or year column, write separate files per value (as in the chunks example) to simulate a small “dataset.”
Performance & types tips¶
# Categorical for low-cardinality text (saves memory)
df["dept"] = df["dept"].astype("category")
# Ensure true nullable booleans
for c in ["is_active", "is_recurring"]:
    if c in df.columns:
        df[c] = df[c].astype("boolean")
- Keep IDs with leading zeros as string (not int).
- Dates: store/export ISO YYYY-MM-DD; strip timezones for Excel/Power Query hand-offs.
- Prefer Parquet for back-and-forth with Python/Power BI to preserve types.
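The leading-zeros point is easy to verify; here `io.StringIO` stands in for a real file:

```python
import io
import pandas as pd

csv_data = "account_id,amount\n00042,9.99\n00007,1.50\n"

# default int inference silently drops the zeros...
bad = pd.read_csv(io.StringIO(csv_data))
print(bad["account_id"].tolist())   # [42, 7]

# ...a string dtype keeps them intact
good = pd.read_csv(io.StringIO(csv_data), dtype={"account_id": "string"})
print(good["account_id"].tolist())  # ['00042', '00007']
```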
Remote I/O (quick notes)¶
# HTTP/HTTPS
pd.read_csv("https://example.com/data.csv")
# S3 (requires s3fs and fsspec installed)
pd.read_parquet("s3://my-bucket/path/file.parquet")
Common pitfalls¶
- Thousand separators in CSV: set thousands="," on read.
- Mixed booleans like "TRUE", "False", "0"/"1": cast to the nullable boolean dtype.
- Locale surprises in Power Query: choose "Using Locale…" when changing column types on CSV import, or use Parquet instead.