Andrew Gallant edited this page Sep 11, 2013 · 19 revisions

One of the goals of nfldb is to be simple to use, and that requires a simple data model.

There are only 7 tables in the database. Here is a brief description of each:

  • meta stores information about the database or about the state of the world. For example, it keeps track of the version of the database and the current week of the current NFL season.
  • team stores a row for each team in the league. There is also a row that corresponds to an unknown team called UNK. This is used for players that are not on any current roster.
  • player stores ephemeral data about players. Namely, it is the most current information about each player known by nfldb. The data is nearly a total copy of the data in nflgame's JSON player database.
  • game stores a row for each NFL game in the preseason, regular season and postseason dating back to 2009. This includes games that are scheduled in the future but have not been played.
  • drive stores a row for each drive in a single game.
  • play stores a row for each play in a single drive.
  • play_player stores a row for each player statistic in a single play.

You can get an overview of the entire database and the relationships between each table with the Entity-Relationship (ER) diagrams section of this page.

What kind of player metadata is stored?

The data in the player table corresponds to information scraped from roster and player profile pages on NFL.com. (In fact, this is the only data in nfldb that is scraped.) NFL.com pages are used so that players can be matched with their statistical data via unique identifiers, rather than relying on a fuzzy name matching algorithm.

This data includes players who are no longer playing. In this case, their team is UNK. This leads to a nice property of the data in the player table: any player with a team not equal to UNK is currently on that team's roster. Therefore, the roster of a team as known by nfldb can easily be accessed with nfldb's query interface:

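import nfldb

# Connect to the database with nfldb's default configuration.
db = nfldb.connect()

# Restrict the query to players whose current team is NE, sorted by status.
query = nfldb.Query(db).player(team='NE')
for p in query.sort(('status', 'asc')).as_players():
    print('%s %s %s' % (p.full_name, p.position, p.status))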

Whether the data in the player table is current or not depends on how quickly NFL.com updates their data and whether you're updating your database frequently. In my experience, NFL.com's data can be slow to update during the offseason, but is relatively quick during the season.

Finally, it is important to note that most of the columns in the player table can be NULL. This means that not all data is available for all players. (We are at the mercy of the consistency of NFL.com's roster and player profile pages.) In my experience, the data is usually very complete for active players.
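
The two properties above (a team other than UNK means the player is on that roster, and most columns may be NULL) can be sketched without a database. This is only an illustration in plain Python: the rows, the `roster` and `describe` helpers, and the "?" placeholder are all hypothetical and are not part of nfldb; `None` stands in for NULL.

```python
# Hypothetical rows standing in for the player table. The names and values are
# invented for illustration; None plays the role of a NULL column.
players = [
    {"full_name": "Player A", "team": "NE", "position": "QB", "status": "Active"},
    {"full_name": "Player B", "team": "NE", "position": None, "status": "Active"},
    {"full_name": "Player C", "team": "UNK", "position": None, "status": None},
]


def roster(rows, team):
    """Return the players on `team`. UNK is not a real team, so it has no roster."""
    if team == "UNK":
        return []
    return [row for row in rows if row["team"] == team]


def describe(row):
    """Format one player, printing '?' for any column that is NULL (None)."""
    position = row["position"] if row["position"] is not None else "?"
    status = row["status"] if row["status"] is not None else "?"
    return "%s %s %s" % (row["full_name"], position, status)


for player in roster(players, "NE"):
    print(describe(player))
```

Code that consumes player rows should check for missing values in this way rather than assume every column is populated.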

What is the play_player table?

The play_player table is arguably the most important table in the entire database: it is the only table that stores player statistics. Each row in this table corresponds to the statistics recorded by a single player in a single play.

Let's look at a fairly complex play that occurred in the Eagles/Redskins game in the first week of the 2013 regular season:

(13:55) (No Huddle, Shotgun) M.Vick pass short right to J.Avant to PHI 33 
for 6 yards (J.Wilson). FUMBLES (J.Wilson), RECOVERED by WAS-P.Riley at
PHI 35. P.Riley to PHI 29 for 6 yards (J.Avant).

This particular play has a gsis_id of 2013090900, a drive_id of 21 and a play_id of 3717. We can then use that information to see the statistics recorded by each player in that play. (Note that I've restricted the SELECT fields in the query below to make the output readable here. You may want to try SELECT * ....)

SELECT
    full_name, passing_yds, receiving_rec, receiving_yds,
    fumbles_forced, defense_ffum, defense_frec_yds
FROM play_player
LEFT JOIN player ON player.player_id = play_player.player_id
WHERE (gsis_id, drive_id, play_id) = ('2013090900', 21, 3717);

  full_name   | passing_yds | receiving_rec | receiving_yds | fumbles_forced | defense_ffum | defense_frec_yds 
--------------+-------------+---------------+---------------+----------------+--------------+------------------
 Michael Vick |           6 |             0 |             0 |              0 |            0 |                0
 Jason Avant  |           0 |             1 |             6 |              1 |            0 |                0
 Josh Wilson  |           0 |             0 |             0 |              0 |            1 |                0
 Perry Riley  |           0 |             0 |             0 |              0 |            0 |                6
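
To make the one-row-per-player idea concrete, here is a small sketch in plain Python that takes the four rows from the output above and keeps only the categories each player actually recorded. The `COLUMNS`, `rows`, `nonzero_stats` and `summary` names are illustrative and not part of nfldb; only the numbers come from the query output.

```python
# Column order matches the SELECT list in the query above.
COLUMNS = ("passing_yds", "receiving_rec", "receiving_yds",
           "fumbles_forced", "defense_ffum", "defense_frec_yds")

# The four play_player rows from the output above, keyed by player name.
rows = {
    "Michael Vick": (6, 0, 0, 0, 0, 0),
    "Jason Avant": (0, 1, 6, 1, 0, 0),
    "Josh Wilson": (0, 0, 0, 0, 1, 0),
    "Perry Riley": (0, 0, 0, 0, 0, 6),
}


def nonzero_stats(values):
    """Keep only the categories in which the player recorded something."""
    return {col: val for col, val in zip(COLUMNS, values) if val != 0}


summary = {name: nonzero_stats(values) for name, values in rows.items()}
for name in rows:
    print("%s: %s" % (name, summary[name]))
```

Note that the same 6-yard gain appears twice: once as passing yards for Vick and once as receiving yards for Avant. Each row describes the play only from one player's point of view.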

ER diagrams

Entity-Relationship (ER) diagrams are used to graphically represent the schema of a database. They show each entity, its attributes and the relationships between each entity. In the ER diagrams for nfldb, entities correspond to tables and attributes correspond to columns in a table.

An example of a relationship would be one-to-many between games and drives. Namely, for each game, there can be zero or more drives associated with that game and for each drive, there must be exactly one game associated with that drive.
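
The one-to-many relationship can be sketched with a toy schema. This is an assumption-laden illustration, not nfldb's actual schema: it uses an in-memory SQLite database rather than nfldb's real database, keeps only the key columns, and invents the second game identifier and the extra drive. Only gsis_id 2013090900 and drive_id 21 come from the play discussed earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE game (gsis_id TEXT PRIMARY KEY);
    CREATE TABLE drive (
        gsis_id  TEXT    NOT NULL REFERENCES game (gsis_id),
        drive_id INTEGER NOT NULL,
        PRIMARY KEY (gsis_id, drive_id)
    );
""")

conn.execute("INSERT INTO game VALUES ('2013090900')")
conn.execute("INSERT INTO game VALUES ('hypothetical-future-game')")  # scheduled, no drives yet
conn.executemany("INSERT INTO drive VALUES ('2013090900', ?)", [(20,), (21,)])


def drive_count(gsis_id):
    """Number of drives recorded for a game; zero is allowed."""
    cur = conn.execute("SELECT COUNT(*) FROM drive WHERE gsis_id = ?", (gsis_id,))
    return cur.fetchone()[0]


def orphan_drive_rejected():
    """A drive must belong to exactly one existing game."""
    try:
        conn.execute("INSERT INTO drive VALUES ('no-such-game', 1)")
    except sqlite3.IntegrityError:
        return True
    return False


print(drive_count("2013090900"))
print(drive_count("hypothetical-future-game"))
print(orphan_drive_rejected())
```

A game with no drives is valid (for example, one that has not been played yet), while a drive that points at a missing game is refused by the foreign key.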

Note that the ER diagrams do not contain derived fields. You can see documentation for each statistical category (including derived fields) on the statistical categories page.

There are two ER diagrams. The first is a condensed version that omits many of the statistical categories for plays and players. (Click on the image to get a full PDF of the ER diagram.)

Shortened ER diagram for nfldb

The second is a full ER diagram, with all of the statistical categories:

Full ER diagram for nfldb

If you're curious, the above ER diagrams are automatically generated with the nfldb-write-erwiz script using erwiz. If you'd like to use erwiz for your own projects, I host its documentation with examples since I was unable to find it elsewhere.