- Consider the type of classification outcome so that you know how to structure your code, whether it be in R, Python, PySpark, etc. Is it binomial or multinomial classification?
- There are many classification algorithms to choose from. For example, logistic regression is simple and its outputs (coefficient significance, log odds, etc.) are easy to explain, whereas a random forest often outperforms logistic regression on predictive accuracy but is harder to explain beyond feature importance.
- For all classification models, you'll need to compare the predicted values vs. the actual values in order to calculate the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). From there you can calculate the following metrics:
- Accuracy = (TP + TN) / (TP + FP + FN + TN)
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1 Score = 2*(Recall * Precision) / (Recall + Precision)
Note: prefer the F1 score over accuracy when your input dataset has a class imbalance problem, because a model can reach high accuracy simply by predicting the majority class.
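
The metrics above can be sketched in plain Python for the binary case. This is a minimal illustration rather than a reference implementation: the function names `confusion_counts` and `classification_metrics`, the choice of `1` as the positive label, and the sample data are all assumptions made for this example. Returning 0.0 when a denominator is zero is also a convention chosen here.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP, FN for a binary problem (hypothetical helper)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive and a != positive:
            fp += 1
        elif p != positive and a == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from the four counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * (recall * precision) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up imbalanced data: 95 negatives and 5 positives.
# The model predicts only one of the five positives correctly.
actual = [0] * 95 + [1] * 5
predicted = [0] * 95 + [1] + [0] * 4

metrics = classification_metrics(*confusion_counts(actual, predicted))
print(metrics)
```

On this made-up data the accuracy is 0.96 while recall is 0.2 and F1 is roughly 0.33, which illustrates why accuracy alone can be misleading on imbalanced classes.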