Cleaning and Feature Creation

Prepping a dataset for analysis

After finishing data collection, we moved on to data cleaning and feature engineering. Because our scraped dataset had some missing values, we first used the available information to create the necessary columns and new features, and then dropped the rows whose missing data was significant enough to warrant removal once they were no longer needed.

More interestingly, using the scraped columns and new indicators created at this stage (such as whether the team with the ball was the home team), we generated many new columns that were crucial for our models. These include the proportion of plays that were passes in the previous season, in the current season, and in the current game; the scoring margin; seconds elapsed in the half; and timeouts remaining in the half for each team. We also created categorical and indicator features, such as distance to the first-down marker (discretized into "short," "medium," "long," and "very long"), whether the team was in field goal range, whether the team was in the red zone, whether fewer than 3 minutes remained in the half, and whether the team's previous play from scrimmage was a pass. Many of these features would prove crucial to the success of our modeling: by engineering features that we know, as football fans, to be important in the run/pass decision, we gave our models a form of domain knowledge.
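To make a few of these features concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not our exact pipeline. The key names (game_id, posteam, home_team, is_pass, posteam_score, defteam_score, yards_to_go, yardline_100, half_seconds_remaining) are hypothetical, as are the cut points for the distance buckets and the 1,800-second half. The sketch also assumes that the in-game pass proportion and the previous-play indicator are computed only from a team's earlier plays in the same game.

```python
from collections import defaultdict

HALF_SECONDS = 1800  # assumption: a half is two 15-minute quarters


def distance_bucket(yards_to_go):
    """Discretize distance to the first-down marker (cut points are assumed)."""
    if yards_to_go <= 3:
        return "short"
    if yards_to_go <= 7:
        return "medium"
    if yards_to_go <= 10:
        return "long"
    return "very long"


def add_features(plays):
    """Return copies of the play dicts with engineered features added.

    `plays` is a list of dicts in chronological order, each with the
    hypothetical keys: game_id, posteam, home_team, is_pass (0 or 1),
    posteam_score, defteam_score, yards_to_go, yardline_100 (yards to the
    opponent's end zone), and half_seconds_remaining.
    """
    passes = defaultdict(int)  # passes so far, per (game, offense)
    snaps = defaultdict(int)   # plays so far, per (game, offense)
    last_pass = {}             # previous play type, per (game, offense)
    out = []
    for play in plays:
        key = (play["game_id"], play["posteam"])
        row = dict(play)  # copy so the input is left untouched
        row["is_home"] = int(play["posteam"] == play["home_team"])
        row["score_margin"] = play["posteam_score"] - play["defteam_score"]
        row["seconds_elapsed_half"] = HALF_SECONDS - play["half_seconds_remaining"]
        row["distance_bucket"] = distance_bucket(play["yards_to_go"])
        row["red_zone"] = int(play["yardline_100"] <= 20)
        row["under_3_min"] = int(play["half_seconds_remaining"] < 180)
        # Use only earlier plays so the current play's outcome does not leak
        # into its own features; None marks a team's first play of the game.
        row["prev_play_pass"] = last_pass.get(key)
        row["game_pass_rate"] = passes[key] / snaps[key] if snaps[key] else None
        out.append(row)
        # Update the running tallies after the features are recorded.
        passes[key] += play["is_pass"]
        snaps[key] += 1
        last_pass[key] = play["is_pass"]
    return out
```

Updating the running tallies only after each row is built keeps the play being predicted out of its own inputs, which matters because the run/pass label is the modeling target.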