Xinlong C. answered 04/04/26
Yale Data Scientist | Python, SQL & ML Tutor
Great question — this is a common pain point for SAS users switching to pandas. Here are your main options, roughly in order of ease:
1. Read in chunks (built into pandas)
Useful if you only need to process data in pieces, not hold it all in memory at once.
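A minimal sketch, assuming a file called big_file.csv and that you only need a per-group total (the file and column names are placeholders for your own data):

```python
import pandas as pd

# Read 100,000 rows at a time instead of loading the whole file
totals = {}
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    # Aggregate each chunk, then fold the partial results together
    for key, value in chunk.groupby("category")["amount"].sum().items():
        totals[key] = totals.get(key, 0) + value

print(totals)
```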
2. Reduce memory usage with dtypes
Pandas defaults to 64-bit types. Downcasting can cut memory by 50-75%.
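A rough sketch, assuming the same placeholder CSV with an integer id, a numeric amount, and a repetitive category column:

```python
import pandas as pd

# Ask for smaller types up front instead of the 64-bit defaults
df = pd.read_csv(
    "big_file.csv",
    dtype={"id": "int32", "amount": "float32", "category": "category"},
)

# Or downcast after loading and check the savings
df["amount"] = pd.to_numeric(df["amount"], downcast="float")
print(df.memory_usage(deep=True))
```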
3. Use Polars instead of pandas
Polars is a modern pandas alternative that is significantly more memory-efficient and faster. The syntax is different but worth learning if you work with large files regularly.
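A small sketch of the same read-and-aggregate in recent versions of Polars (pip install polars; file and column names are again placeholders):

```python
import polars as pl

# scan_csv builds a lazy query plan; nothing is read until .collect()
result = (
    pl.scan_csv("big_file.csv")
    .group_by("category")
    .agg(pl.col("amount").sum())
    .collect()
)
print(result)
```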
4. Use DuckDB (closest to SAS behavior)
DuckDB lets you query CSV files directly without loading them into memory, very similar to how SAS handles datasets on disk.
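A minimal sketch using the duckdb Python package (pip install duckdb); the file name and query are placeholders:

```python
import duckdb

# DuckDB scans the CSV on disk and only materializes the query result
result = duckdb.sql("""
    SELECT category, SUM(amount) AS total
    FROM read_csv_auto('big_file.csv')
    GROUP BY category
""").df()  # .df() returns the result as a pandas DataFrame

print(result)
```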
5. Use Dask
Dask mirrors the pandas API but works lazily on large files.
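A sketch with the same placeholder file; note that nothing is computed until .compute():

```python
import dask.dataframe as dd

# Looks like pandas, but reads and processes the file in partitions
df = dd.read_csv("big_file.csv")
result = df.groupby("category")["amount"].sum().compute()
print(result)
```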
6. SQLite or Parquet as your "SAS dataset"
Convert your CSVs to Parquet format once: it's compressed, columnar, and you can read only the columns you need. (A SQLite database works similarly if you'd rather pull subsets with SQL via pd.read_sql.)
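A sketch of the one-time conversion and a later column-selective read (requires pyarrow or fastparquet; file and column names are placeholders):

```python
import pandas as pd

# One-time conversion; do this in chunks if the CSV itself doesn't fit in memory
pd.read_csv("big_file.csv").to_parquet("big_file.parquet")

# Later reads pull only the columns you need, which is fast and memory-light
df = pd.read_parquet("big_file.parquet", columns=["category", "amount"])
print(df.head())
```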
Bottom line for a SAS user: DuckDB or Parquet is probably your closest equivalent to SAS datasets — data lives on disk, you query what you need. For most workflows, combining dtype optimization with chunked reading or Parquet will solve your memory problems without needing distributed computing.