Joshua K. answered 17h
For this kind of workload, I would not center the workflow around pandas HDFStore or MongoDB. A better modern approach is to use Parquet or DuckDB for persistent on-disk storage, then use pandas only for subsets of the data that fit in memory.
The general workflow would be:
1. Keep the raw flat files as immutable source files.
2. Convert the raw files once into Parquet or a DuckDB table.
3. Use DuckDB to query only the rows and columns needed for a specific analysis.
4. Load those smaller subsets into pandas when in-memory manipulation is useful.
5. Save newly created variables/features as separate Parquet feature tables.
6. Join the base data and feature tables later when creating the final modeling dataset.
This works well because Parquet is columnar, so reading a few columns out of a very wide dataset is efficient. DuckDB can query Parquet files directly, filter rows, select columns, create summary tables, and perform joins without loading the whole dataset into memory. Pandas then becomes the tool for local analysis, not the storage engine.
A useful project structure would be:
raw/
credit_file.csv
data/
base.parquet
features_property.parquet
features_address.parquet
features_derogatory.parquet
modeling_table.parquet
notebooks/
01_profile_property_variables.ipynb
02_create_property_features.ipynb
03_modeling.ipynb
credit_project.duckdb
A typical feature-creation workflow could look like this:
1. Query only the needed columns from the base Parquet file.
2. Create the new feature using SQL or pandas.
3. Save the result as a feature table containing the row identifier and the new feature.
4. Repeat this process for each group of variables.
5. Join all feature tables into a final modeling table at the end.
Example using DuckDB:
import duckdb
con = duckdb.connect("credit_project.duckdb")
con.execute("""
CREATE OR REPLACE TABLE base AS
SELECT *
FROM read_csv_auto('raw/credit_file.csv', sample_size=-1)
""")
con.execute("""
COPY base TO 'data/base.parquet' (FORMAT PARQUET)
""")
con.execute("""
COPY (
SELECT
consumer_id,
CASE
WHEN var1 > 2 THEN 'A'
WHEN var2 = 4 THEN 'B'
ELSE 'C'
END AS compound_feature_1
FROM 'data/base.parquet'
) TO 'data/features_property.parquet' (FORMAT PARQUET)
""")
df = con.execute("""
SELECT
b.outcome,
f.compound_feature_1,
COUNT(*) AS n
FROM 'data/base.parquet' b
JOIN 'data/features_property.parquet' f USING (consumer_id)
WHERE b.line_of_business = 'Retail'
GROUP BY b.outcome, f.compound_feature_1
ORDER BY f.compound_feature_1, b.outcome
""").df()
print(df)
The main idea is to avoid repeatedly mutating one huge table every time a new column is created. Instead, keep the original dataset stable and store derived variables in separate feature tables keyed by an ID such as consumer_id. This is usually easier, safer, and more scalable than trying to append columns directly to an HDFStore.