Why is multicollinearity an issue for some clustering methods?

Question

Imagine you have data for a group of patients in a medical study. Your features include height, blood pressure, oral temperature, forehead temperature, sex, weight and a battery of other symptoms associated with an illness you're interested in studying. You cluster the patients, but find that your clusters look odd - what about your data might be causing this issue and why?

Nathan S. · Accepted Answer

You realize that your results are dominated by temperature measurements - individuals with similar temperature measurements are clustered together to a relatively high degree, while individuals with similarities in other measurements are not. This is because when you added forehead and oral temperature, you essentially included the same concept, body temperature, twice. This is bad for two reasons - one: it over-weights body temperature as a concept (remember how it seemed like our clusters put the people with similar temperatures together but didn't necessarily do the same for other types of symptoms?) and two: it ruins the story our cluster model tells. Clustering is useful for its ability to segment data (patients, customers, etc) into intelligible and meaningful groups, so when we overweight one concept, we make it harder for our audience to read the segments in the underlying data.

Note that this assumes that both temperature readings are correlated and essentially representative of the same underlying concept. If there were some meaningful information to be gleaned from the difference between the two measurements then we might want to include both features.

Why is multicollinearity an issue for some clustering methods?

1 Expert Answer

Still looking for help? Get the right answer, fast.

OR

RELATED TOPICS

RELATED QUESTIONS

I’m stuck on a machine learning homework using scikit-learn. Can someone walk me through it?

How do I tune hyperparameters in XGBoost or Random Forest?

RECOMMENDED TUTORS

IXL

Rosetta Stone

Education.com

TPT

Vocabulary.com

ABCya

SpanishDictionary.com

Inglés.com

Emmersion

Why is multicollinearity an issue for some clustering methods?

1 Expert Answer

Still looking for help? Get the right answer, fast.

OR

RELATED TOPICS

RELATED QUESTIONS

I’m stuck on a machine learning homework using scikit-learn. Can someone walk me through it?

How do I tune hyperparameters in XGBoost or Random Forest?

RECOMMENDED TUTORS

find an online tutor