Large data work flows using pandas?

Question

I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work and it is great for its out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.

One day I hope to replace my use of SAS with python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory but small enough to fit on a hard drive.

My first thought is to use `HDFStore` to hold large datasets on disk and pull only the pieces I need into dataframes for analysis. Others have mentioned MongoDB as an easier-to-use alternative. My question is this:

What are some best-practice workflows for accomplishing the following:

1. Loading flat files into a permanent, on-disk database structure

2. Querying that database to retrieve data to feed into a pandas data structure

3. Updating the database after manipulating pieces in pandas

Real-world examples would be much appreciated, especially from anyone who uses pandas on "large data".

Edit -- an example of how I would like this to work:

1. Iteratively import a large flat file and store it in a permanent, on-disk database structure. These files are typically too large to fit in memory.

2. In order to use pandas, I would like to read subsets of this data (usually just a few columns at a time) that can fit in memory.

3. I would create new columns by performing various operations on the selected columns.

4. I would then have to append these new columns into the database structure.

I am trying to find a best-practice way of performing these steps. Reading links about pandas and pytables, it seems that appending a new column could be a problem.

Edit -- Responding to Jeff's questions specifically:

1. I am building consumer credit risk models. The kinds of data include phone, SSN and address characteristics; property values; derogatory information like criminal records, bankruptcies, etc. The datasets I use every day have roughly 1,000 to 2,000 fields of mixed data types: continuous, nominal and ordinal variables of both numeric and character data. I rarely append rows, but I do perform many operations that create new columns.

2. Typical operations involve combining several columns using conditional logic into a new, compound column. For example, `if var1 > 2 then newvar = 'A' elif var2 = 4 then newvar = 'B'`. The result of these operations is a new column for every record in my dataset.

3. Finally, I would like to append these new columns into the on-disk data structure. I would repeat step 2, exploring the data with crosstabs and descriptive statistics, trying to find interesting, intuitive relationships to model.

4. A typical project file is usually about 1GB. Files are organized so that each row is one record of consumer data. Every row has the same number of columns. This will always be the case.

5. It's pretty rare that I would subset by rows when creating a new column. However, it's pretty common for me to subset on rows when creating reports or generating descriptive statistics. For example, I might want to create a simple frequency for a specific line of business, say Retail credit cards. To do this, I would select only those records where the line of business = retail, in addition to whichever columns I want to report on. When creating new columns, however, I would pull all rows of data and only the columns I need for the operations.

6. The modeling process requires that I analyze every column, look for interesting relationships with some outcome variable, and create new compound columns that describe those relationships. I usually explore the columns in small sets. For example, I will focus on a set of, say, 20 columns just dealing with property values and observe how they relate to defaulting on a loan. Once those are explored and new columns are created, I then move on to another group of columns, say college education, and repeat the process. What I'm doing is creating candidate variables that explain the relationship between my data and some outcome. At the very end of this process, I apply some learning techniques that create an equation out of those compound columns.

It is rare that I would ever add rows to the dataset. I will nearly always be creating new columns (variables or features in statistics/machine learning parlance).

Anonymous A. · Accepted Answer

For this kind of workload, I would not center the workflow around pandas HDFStore or MongoDB. A better modern approach is to use Parquet or DuckDB for persistent on-disk storage, then use pandas only for subsets of the data that fit in memory.

The general workflow would be:

1. Keep the raw flat files as immutable source files.

2. Convert the raw files once into Parquet or a DuckDB table.

3. Use DuckDB to query only the rows and columns needed for a specific analysis.

4. Load those smaller subsets into pandas when in-memory manipulation is useful.

5. Save newly created variables/features as separate Parquet feature tables.

6. Join the base data and feature tables later when creating the final modeling dataset.

This works well because Parquet is columnar, so reading a few columns out of a very wide dataset is efficient. DuckDB can query Parquet files directly, filter rows, select columns, create summary tables, and perform joins without loading the whole dataset into memory. Pandas then becomes the tool for local analysis, not the storage engine.

A useful project structure would be:

raw/
    credit_file.csv
data/
    base.parquet
    features_property.parquet
    features_address.parquet
    features_derogatory.parquet
    modeling_table.parquet
notebooks/
    01_profile_property_variables.ipynb
    02_create_property_features.ipynb
    03_modeling.ipynb
credit_project.duckdb

A typical feature-creation workflow could look like this:

1. Query only the needed columns from the base Parquet file.

2. Create the new feature using SQL or pandas.

3. Save the result as a feature table containing the row identifier and the new feature.

4. Repeat this process for each group of variables.

5. Join all feature tables into a final modeling table at the end.

Example using DuckDB:

import duckdb

# Persistent database file in the project root.
con = duckdb.connect("credit_project.duckdb")

# Load the raw CSV once; sample_size=-1 scans the whole file to infer types.
con.execute("""
    CREATE OR REPLACE TABLE base AS
    SELECT *
    FROM read_csv_auto('raw/credit_file.csv', sample_size=-1)
""")

# Export the base table to Parquet (the data/ directory must already exist).
con.execute("""
    COPY base TO 'data/base.parquet' (FORMAT PARQUET)
""")

# Build a feature table: the row identifier plus the new compound column.
con.execute("""
    COPY (
        SELECT
            consumer_id,
            CASE
                WHEN var1 > 2 THEN 'A'
                WHEN var2 = 4 THEN 'B'
                ELSE 'C'
            END AS compound_feature_1
        FROM 'data/base.parquet'
    ) TO 'data/features_property.parquet' (FORMAT PARQUET)
""")

# Crosstab of outcome by the new feature for one line of business,
# returned as a pandas DataFrame.
df = con.execute("""
    SELECT
        b.outcome,
        f.compound_feature_1,
        COUNT(*) AS n
    FROM 'data/base.parquet' b
    JOIN 'data/features_property.parquet' f USING (consumer_id)
    WHERE b.line_of_business = 'Retail'
    GROUP BY b.outcome, f.compound_feature_1
    ORDER BY f.compound_feature_1, b.outcome
""").df()

print(df)

The main idea is to avoid repeatedly mutating one huge table every time a new column is created. Instead, keep the original dataset stable and store derived variables in separate feature tables keyed by an ID such as consumer_id. Adding a feature then means writing one small new file rather than rewriting a wide table, which is usually easier and safer than trying to append columns directly to an HDFStore.
