A Train / Test Split
Description
In this final collection of consolidated-loan datasets, we construct a train / test split as follows. We assume the analyst trains (i.e. fits) their models on data up to the end of Year 2014 (the first 15 years of acquisition data) and validates / tests them on data from Year 2015 onward.

The train set consists of the loans in the random sample discussed in Section [data_all_years_sample.ipynb] that were acquired prior to Year 2015. For these loans, however, the loan data are consolidated using their histories up to the end of Year 2014 only. That is, we set each dynamic field to its value as of December 2014 or as of the month the loan reached zero balance, whichever was earlier. This ensures there is no ‘peeking ahead’ in time.

The test / validation set consists of the loans in the random sample that were acquired in Year 2015 or later. These loans are consolidated using their full histories (that is, up to the end of Quarter 1 of Year 2023). The train and test sets are provided in several file formats.
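The cutoff rule for the train set can be illustrated with a minimal pandas sketch. It is not the code used to build these datasets. It assumes a hypothetical long-format history table with one row per loan per month and the illustrative column names `loan_id`, `report_month` (a datetime) and `zero_balance` (a boolean flag); the actual field names in the source data may differ.

```python
import pandas as pd


def consolidate_as_of(history: pd.DataFrame, cutoff: pd.Timestamp) -> pd.DataFrame:
    """Keep one row per loan: its state at `cutoff` or at zero balance, whichever is earlier.

    The column names (loan_id, report_month, zero_balance) are hypothetical.
    """
    h = history.copy()
    # First month in which each loan reported a zero balance (missing if it never did).
    first_zero = h.loc[h["zero_balance"]].groupby("loan_id")["report_month"].min()
    zero_month = h["loan_id"].map(first_zero)
    # Per-loan end date: the zero-balance month if it is on or before the cutoff, else the cutoff.
    end = zero_month.where(zero_month <= cutoff, cutoff)
    h = h[h["report_month"] <= end]
    # The latest surviving record carries the dynamic-field values to keep.
    return (
        h.sort_values(["loan_id", "report_month"])
        .drop_duplicates("loan_id", keep="last")
        .reset_index(drop=True)
    )
```

Calling `consolidate_as_of(history, pd.Timestamp("2014-12-01"))` would mimic the train-set rule. Loans with no record on or before the cutoff drop out of the result, which is consistent with restricting the train set to loans acquired prior to Year 2015.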
Train set
The train set contains 2,244,409 loans.
Download delimited <.csv> flat file (delimiter: “|”)
- Uncompressed (~585MB) filename: <train_sample_consolidated.csv>
- Compressed (~125MB) filename: <train_sample_consolidated_csv.zip>
Download feather <.feather> file
- Uncompressed (~285MB) filename: <train_sample_consolidated.feather>
- Compressed (~140MB) filename: <train_sample_consolidated_feather.zip>
Test set
The test set contains 1,653,349 loans.
Download delimited <.csv> flat file (delimiter: “|”)
- Uncompressed (~500MB) filename: <test_sample_consolidated.csv>
- Compressed (~100MB) filename: <test_sample_consolidated_csv.zip>
Download feather <.feather> file
- Uncompressed (~225MB) filename: <test_sample_consolidated.feather>
- Compressed (~115MB) filename: <test_sample_consolidated_feather.zip>
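As a hedged illustration of how the downloaded files might be read, the sketch below uses pandas. The helper name `load_consolidated` is hypothetical. It assumes the delimited files include a header row and, for the compressed CSV, that the zip archive holds a single file so that pandas can infer the compression from the extension. Reading feather files with pandas requires the `pyarrow` package, and the feather zip archives would need to be extracted first.

```python
from pathlib import Path

import pandas as pd


def load_consolidated(path) -> pd.DataFrame:
    """Load a consolidated-loan file, choosing the reader from the file extension."""
    path = Path(path)
    if path.suffix == ".feather":
        # Feather support in pandas relies on pyarrow being installed.
        return pd.read_feather(path)
    # Pipe-delimited flat file; a .zip extension is decompressed automatically
    # when the archive contains exactly one file (an assumption here).
    return pd.read_csv(path, sep="|", low_memory=False)
```

For example, `train = load_consolidated("train_sample_consolidated.csv")` and `test = load_consolidated("test_sample_consolidated.feather")` would load the two sets. A quick sanity check is to compare `len(train)` and `len(test)` with the loan counts stated above (2,244,409 and 1,653,349).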