Data Collection

Creating the PFR library


The data collection process was an interesting problem in itself. Most of the code involved was written as part of the pfr Python package, which can be installed via pip and was written by team member Matt Goldberg. The package, which currently contains roughly 1,800 lines of Python code and remains a work in progress, provides a framework that makes scraping NFL-related data simple and stress-free. It works by modeling each aspect of pro-football-reference.com as a Python class that can be constructed from the unique identifier assigned to each player, team, and game; these classes then expose simple functions for scraping data from the relevant pages of the site.

In this project we relied primarily on the package's teams and boxscores modules. We used the teams module to iterate over the teams in the league and to retrieve the identifiers of the games they played each season (referred to on Pro Football Reference as "boxscores"). After compiling a list of every NFL game played from 2002 to 2014, we used the boxscores module to scrape the play-by-play data from each game. This yields a raw dataset of about 450,000 plays with 115 features describing each play, including the time left in the game, whether the play was a run or a pass, and specifics such as how many yards it gained, which players were involved, and its result (e.g., touchdown, interception, tackle). The scraping code itself, where the boxscore play-by-play table is parsed into a DataFrame, lives in the parseTable function of pfr's utils module.
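
To make the workflow concrete, the sketch below shows how the collection loop might look. The function and class names used here (pfr.teams.listTeams, Team.getBoxScoreIDs, BoxScore.getPBP) are hypothetical stand-ins rather than the package's actual API; the two-step structure, though, mirrors the process described above: first compile the unique game identifiers, then scrape each game once.

```python
# A minimal sketch of the collection loop; the pfr names below
# (pfr.teams.listTeams, Team.getBoxScoreIDs, BoxScore.getPBP) are
# hypothetical stand-ins for the package's actual API.
import pandas as pd
import pfr

seasons = range(2002, 2015)

# Step 1: compile the set of unique game (boxscore) IDs across all
# teams and seasons; each game would otherwise appear once per team.
game_ids = set()
for team_id in pfr.teams.listTeams():
    team = pfr.teams.Team(team_id)
    for year in seasons:
        game_ids.update(team.getBoxScoreIDs(year))

# Step 2: scrape the play-by-play table for each game and stack the
# per-game DataFrames into one raw dataset (~450,000 plays).
plays = pd.concat(
    pfr.boxscores.BoxScore(bid).getPBP() for bid in sorted(game_ids)
)
```

As a rough illustration of the kind of work a parseTable-style helper does, the following is a generic HTML-table parser built with BeautifulSoup and pandas. It is not the pfr implementation; it assumes each cell carries a data-stat attribute naming its column, which is the convention used in Sports Reference table markup.

```python
from bs4 import BeautifulSoup
import pandas as pd

def parse_stats_table(html, table_id):
    """Parse one stats table from a page into a DataFrame.

    A generic illustration of a parseTable-style helper, not the actual
    pfr implementation. Assumes each cell has a 'data-stat' attribute
    naming its column, as on Sports Reference pages.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    body = table.find("tbody") or table
    rows = []
    for tr in body.find_all("tr"):
        # Skip repeated header rows embedded in the table body.
        if tr.get("class") and "thead" in tr.get("class"):
            continue
        row = {cell.get("data-stat"): cell.get_text(strip=True)
               for cell in tr.find_all(["th", "td"])
               if cell.get("data-stat") is not None}
        if row:
            rows.append(row)
    return pd.DataFrame(rows)
```

Building each row as a dictionary keyed by the column name and letting pandas infer the columns keeps a parser like this agnostic to which table it is handed, which is convenient when the same helper is reused for boxscore, roster, and play-by-play tables.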