Working with Large Datasets

Analytics-savvy managers and business analysts recognize, and are equipped to handle, the challenges inherent in working with large datasets. Importing, manipulating, and pre-processing data for effective analysis is typically one of the most time- and resource-consuming tasks in the analytics pipeline. It is also highly error-prone and costly to fix, because errors ‘baked’ into data cannot always be uncovered during analysis. You can expect the data you encounter professionally to keep growing in volume (records, attributes, and files), variety of formats, and velocity of updates. As a rough guideline, you can consider datasets/dataframes exceeding 1 GB in size to be “large”1, because datasets of this size can pose computing-time and memory challenges on currently available personal computers.

One learning objective of this case is to gain familiarity with working with large structured datasets. The median size of the quarterly datasets posted by Fannie Mae is approximately 400-500 MB in compressed format2. Here are a few general guidelines:

  1. Familiarize yourself with binary file formats. You may be most familiar with delimited flat files, such as comma-separated values (csv) files, or with worksheets. While these have the advantage of being human-readable (i.e., you can open the data in a text editor or spreadsheet program and inspect it directly), they are generally inefficient for processing and storage. The data science community has developed highly efficient binary file formats. These include Apache Parquet, Feather, HDF, and (specific to Python) Pickle. The good news is that Python’s pandas library (and, to some extent, R) has functions enabling you to import from and export to any of these formats. See [https://wesmckinney.com/book/accessing-data.html#io_binary]. For example, instead of read_csv and to_csv you can use read_feather and to_feather just as easily.
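
A minimal sketch of round-tripping a dataframe through a binary format (the dataframe contents and file names here are made up for illustration):

```python
import pandas as pd

# A small frame standing in for imported loan data (made-up values)
df = pd.DataFrame({"loan_id": ["A1", "A2"], "rate": [3.5, 4.0]})

# Pickle is built into pandas and needs no extra packages:
df.to_pickle("loans.pkl")
same = pd.read_pickle("loans.pkl")

# Feather and Parquet follow the same pattern but require the
# pyarrow package to be installed:
# df.to_feather("loans.feather")
# df = pd.read_feather("loans.feather")
```

The round-tripped frame is identical to the original, including column data types, which is part of what makes these formats convenient for repeated reads.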

  2. Import using defined data types. Each column of a dataframe has a defined data type (e.g., floating-point number, string)3. When importing data, it is good general practice to define the column data types if they are known. This ensures that the columns have the intended data type4 and are suitable for later processing5. For large datasets, defining data types during import is particularly important because it avoids the time-consuming parsing operation in which the import function tries to determine/guess the correct data type. Most pandas import functions have a dtype argument for that purpose. One additional advantage of binary file formats is that data types are included in the file’s metadata, saving the need to re-define the types each time the file is read.
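
A short sketch of the dtype argument in pandas’ read_csv. The column names and values are invented for illustration; note that the day-of-week codes are declared categorical rather than left for pandas to guess as integers:

```python
import io
import pandas as pd

# A small in-memory CSV standing in for a large quarterly file
# (made-up columns and values)
csv_data = io.StringIO(
    "loan_id,day_of_week,balance\n"
    "A1,3,150000\n"
    "A2,7,98000\n"
)

# Declare the intended type of each column up front; day_of_week
# is a code, not a quantity, so it is read as categorical
df = pd.read_csv(
    csv_data,
    dtype={"loan_id": str, "day_of_week": "category", "balance": "float64"},
)
```

With the dtype mapping supplied, pandas skips type inference for these columns and the resulting dataframe is guaranteed to carry the intended types.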

  3. Process data in chunks where possible. Often you can reduce memory requirements and processing times by processing data in chunks. This is possible using the ‘chunksize’ parameter in the case of pandas’ read_csv function. To take a simple example, suppose you want to sum the values of a numerical column in a dataframe. You can limit memory use by reading the dataframe in chunks, summing the values in each chunk, and then adding the subtotals. Processing data in chunks can also reduce processing times because the processing times of many operations grow faster than linearly6. When summarizing loan-level data in the Fannie Mae dataset, it is generally more efficient to process loan data one quarter at a time and then concatenate the results than to combine the quarters first and then process the larger dataset.
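
The running-sum example above can be sketched as follows. A small in-memory CSV (made-up values) stands in for a large file on disk; with chunksize set, read_csv returns an iterator of dataframes rather than one large dataframe:

```python
import io
import pandas as pd

# In-memory CSV standing in for a large file (made-up values)
csv_data = io.StringIO("balance\n100\n200\n300\n400\n")

# Read two rows at a time, sum each chunk, then add the subtotals;
# only one small chunk is held in memory at any moment
total = 0.0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["balance"].sum()
# total is now 1000.0
```

For a real file you would pass a path instead of the StringIO object and choose a chunksize (in rows) that fits comfortably in memory.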

Footnotes

  1. ‘Big data’, whose size exceeds, say, 50 GB, requires specialized and cloud-based tools beyond the scope of this case

  2. The compression ratio is quite high: 15:1 or higher

  3. Different columns can have different data types

  4. For example, if a column representing day of the week, say, is coded 1-7, you would need to define this column as a string or categorical data type for analysis purposes. Python or R cannot infer a column’s meaning

  5. Date types are especially useful when analyzing time series

  6. Think of sorting, for example