Xinlong C. answered 04/04/26
Yale Data Scientist | Python, SQL & ML Tutor
Great question — this is a common pain point for SAS users switching to pandas. Here are your main options, roughly in order of ease:
1. Read in chunks (built into pandas)
Useful if you only need to process data in pieces, not hold it all in memory at once.
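A minimal sketch, assuming a file called big_file.csv and that you only need a per-group total (the file and column names are placeholders for your own data):

```python
import pandas as pd

# Read 100,000 rows at a time instead of loading the whole file
totals = {}
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    # Aggregate each chunk, then fold the partial results together
    for key, value in chunk.groupby("category")["amount"].sum().items():
        totals[key] = totals.get(key, 0) + value

print(totals)
```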
2. Reduce memory usage with dtypes
Pandas defaults to 64-bit types. Downcasting can cut memory by 50-75%.
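A rough sketch, assuming the same placeholder CSV with an integer id, a numeric amount, and a repetitive category column:

```python
import pandas as pd

# Ask for smaller types up front instead of the 64-bit defaults
df = pd.read_csv(
    "big_file.csv",
    dtype={"id": "int32", "amount": "float32", "category": "category"},
)

# Or downcast after loading and check the savings
df["amount"] = pd.to_numeric(df["amount"], downcast="float")
print(df.memory_usage(deep=True))
```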
3. Use Polars instead of pandas
Polars is a modern pandas alternative that is significantly more memory-efficient and faster. The syntax is different but worth learning if you work with large files regularly.
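A small sketch of the same read-and-aggregate in recent versions of Polars (pip install polars; file and column names are again placeholders):

```python
import polars as pl

# scan_csv builds a lazy query plan; nothing is read until .collect()
result = (
    pl.scan_csv("big_file.csv")
    .group_by("category")
    .agg(pl.col("amount").sum())
    .collect()
)
print(result)
```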
4. Use DuckDB (closest to SAS behavior)
DuckDB lets you query CSV files directly without loading them into memory, very similar to how SAS handles datasets on disk.
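A minimal sketch using the duckdb Python package (pip install duckdb); the file name and query are placeholders:

```python
import duckdb

# DuckDB scans the CSV on disk and only materializes the query result
result = duckdb.sql("""
    SELECT category, SUM(amount) AS total
    FROM read_csv_auto('big_file.csv')
    GROUP BY category
""").df()  # .df() returns the result as a pandas DataFrame

print(result)
```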
5. Use Dask
Dask mirrors the pandas API but works lazily on large files.
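A sketch with the same placeholder file; note that nothing is computed until .compute():

```python
import dask.dataframe as dd

# Looks like pandas, but reads and processes the file in partitions
df = dd.read_csv("big_file.csv")
result = df.groupby("category")["amount"].sum().compute()
print(result)
```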
6. SQLite or Parquet as your "SAS dataset"
Convert your CSVs to Parquet format once: it's compressed, columnar, and you can read only the columns you need. (A SQLite database works similarly if you'd rather pull subsets with SQL via pd.read_sql.)
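A sketch of the one-time conversion and a later column-selective read (requires pyarrow or fastparquet; file and column names are placeholders):

```python
import pandas as pd

# One-time conversion; do this in chunks if the CSV itself doesn't fit in memory
pd.read_csv("big_file.csv").to_parquet("big_file.parquet")

# Later reads pull only the columns you need, which is fast and memory-light
df = pd.read_parquet("big_file.parquet", columns=["category", "amount"])
print(df.head())
```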
Bottom line for a SAS user: DuckDB or Parquet is probably your closest equivalent to SAS datasets — data lives on disk, you query what you need. For most workflows, combining dtype optimization with chunked reading or Parquet will solve your memory problems without needing distributed computing.