Behind the Numbers: Building a Virtual NFL Statistician

A project 10+ years in the making

Without a doubt, one of the more common questions we field here is: “Where do you get your data?”

It’s a valid question on a number of fronts. The first comes from a legal perspective. Are we entitled to distribute this data? Does the NFL not hold the rights to data from it’s games? Beyond the legal rights issue, some people are simply wondering how accurate the data is in comparison with official league records.

The best place to start is with the following:

pbp_csv

This, my friends, is what the sporting-world refers to as a “play-by-play”. It doesn’t just appear out of thin air. League statisticians from every NFL stadium are tasked with watching plays as they unfold and marking down the details of each and every one. This information then makes it’s way from individual stadiums and within seconds of play completion, can be found on dozens of sites worldwide.

This small snippet of information is all we need to perform our magic. Nothing more, nothing less. A game might feature 160 such snippets. A week might total something in the neighbourhood of 2,500. Numbers aside, it’s what we do with these freely-available snippets of text that really set’s our site apart from the other unofficial sources of NFL data.

Relying on nothing more than play-by-play text means we don’t need to illegally scrape sites for data or rely on any other service to provide us with what we need. While sports leagues have tried to copyright basic written accounts of sporting events, they haven’t had the best of luck. Reporting on the number of times Brett Favre passed to Antonio Freeman in his career based on game accounts is no more of an infringement on copyright than reporting on the number of times a character spoke Harry Potter’s name in a J.K. Rowling book (16,612 if you are interested).

The real question is, how do we get from this:

pbp_csv

To this:

{
  "data": {
    "gid": 3990,
    "pid": 651831,
    "off": "PIT",
    "def": "NE",
    "type": "RUSH",
    "dseq": 1,
    "len": 34,
    "qtr": 1,
    "min": 7,
    "sec": 41,
    "ptso": 0,
    "ptsd": 0,
    "timo": 3,
    "timd": 3,
    "dwn": 1,
    "ytg": 10,
    "yfog": 7,
    "zone": 1,
    "fd": 0,
    "sg": 0,
    "nh": 0,
    "pts": 0,
    "tck": [
      {
        "uid": 430825,
        "tck": "GG-0475",
        "value": "0.5"
      },
      {
        "uid": 430826,
        "tck": "AB-2700",
        "value": "0.5"
      }
    ],
    "sk": 0,
    "pen": 0,
    "ints": 0,
    "fum": 0,
    "saf": 0,
    "blk": 0,
    "olid": {
      "lt": "KB-0750",
      "lg": "AV-0350",
      "c": "RF-0900",
      "rg": "CH-4550",
      "rt": "CW-0400"
    },
    "rush": {
      "bc": "DW-3600",
      "dir": "LG",
      "yds": 6,
      "succ": 1,
      "kne": 0
    }
  }
}

And even this:

2016 NFL Database Field Listing

Going from point A (play-by-play text) to points B, C, D and beyond required us to do 2 things:

  1. Build a ‘virtual’ NFL statistician able to digest those simple play-by-play accounts to create player and team statistics and everything else in-between.
  2. Hire our own real live ‘Statisticians’ to analyze game tape to fill in the holes and also expand on the basic play-by-play data.

The first task has taken over a decade (of off and on work) and probably in the neighborhood of 15,000 lines of code to achieve. This is not an exaggeration. Here is what’s required to properly parse the effect of Fumbles:

Sub FUMBLES()

If S1.Cells(ROW1, 6) = "PASS" Or S1.Cells(ROW1, 6) = "RUSH" Then
   X8 = InStr(X7, BOOK.Cells(X2, 2), "FUMBLES") + 8: X11 = 2
   If S1.Cells(ROW1, 26) <> "" Then S1.Cells(ROW1, 61) = S1.Cells(ROW1, 26)
   If S1.Cells(ROW1, 35) <> "" And InStr(X7, BOOK.Cells(X2, 2), "FUMBLES") > InStr(X7, BOOK.Cells(X2, 2), "pass") Then S1.Cells(ROW1, 61) = S1.Cells(ROW1, 35)
   If InStr(X7, BOOK.Cells(X2, 2), "FUMBLES (Aborted)") > 0 Then
      For X3 = 1 To 6
      If QBNAME(X3, X6 + 1) = "" Then Exit For
      If InStr(Mid(BOOK.Cells(X2, 2), X7, 40), QBNAME(X3, X6 + 1)) > 0 Then S1.Cells(ROW1, 61) = "QB " & QBNAME(X3, X6 + 1): Exit For
      Next X3
      If S1.Cells(ROW1, 61) = "" Then
         For X3 = 1 To 75
         If ALLNAME(X3, X6 + 1) = "" Then Exit For
         If InStr(Mid(BOOK.Cells(X2, 2), X7, 40), ALLNAME(X3, X6 + 1)) > 0 Then S1.Cells(ROW1, 61) = ALLPOS(X3, X6 + 1) & " " & ALLNAME(X3, X6 + 1): Exit For
         Next X3
      End If
   End If
   If InStr(X7, BOOK.Cells(X2, 2), "Aborted") > 0 And InStr(X7, BOOK.Cells(X2, 2), "FUMBLES (Aborted)") = 0 Then
      For X3 = 1 To 75
      If ALLNAME(X3, X6 + 1) = "" Then Exit For
      If InStr(Mid(BOOK.Cells(X2, 2), InStr(X7, BOOK.Cells(X2, 2), "Aborted") + 8, 25), ALLNAME(X3, X6 + 1)) > 0 Then S1.Cells(ROW1, 61) = ALLPOS(X3, X6 + 1) & " " & ALLNAME(X3, X6 + 1): Exit For
      Next X3
   End If
   If InStr(X7, BOOK.Cells(X2, 2), "Lateral to") > 0 And InStr(X7, BOOK.Cells(X2, 2), "FUMBLES") > InStr(X7, BOOK.Cells(X2, 2), "Lateral to") Then
      For X3 = 1 To 75
      If ALLNAME(X3, X6 + 1) = "" Then Exit For
      If InStr(Mid(BOOK.Cells(X2, 2), InStr(X7, BOOK.Cells(X2, 2), "Lateral to") + 11, 25), ALLNAME(X3, X6 + 1)) > 0 Then S1.Cells(ROW1, 61) = ALLPOS(X3, X6 + 1) & " " & ALLNAME(X3, X6 + 1): Exit For
      Next X3
   End If
   If S1.Cells(ROW1, 61) = "" Then S1.Cells(ROW1, 61) = S1.Cells(ROW1, 31)
End If

If S1.Cells(ROW1, 6) = "KOFF" Or S1.Cells(ROW1, 6) = "PUNT" Or S1.Cells(ROW1, 6) = "ONSD" Then
   If S1.Cells(ROW1, 6) = "PUNT" Then X11 = 2 Else X11 = 1
   If InStr(X7, BOOK.Cells(X2, X11), "MUFFS") > 0 Then
      X8 = InStr(X7, BOOK.Cells(X2, X11), "MUFFS") + 6
   Else
      X8 = InStr(X7, BOOK.Cells(X2, X11), "FUMBLES") + 8
   End If
   If InStr(X7, BOOK.Cells(X2, X11), "FUMBLES (Aborted)") > 0 Then
      If S1.Cells(ROW1, 72) <> "" Then S1.Cells(ROW1, 61) = S1.Cells(ROW1, 72)
      If S1.Cells(ROW1, 79) <> "" Then S1.Cells(ROW1, 61) = S1.Cells(ROW1, 79)
   End If
   If InStr(X7, BOOK.Cells(X2, X11), "Aborted") > 0 And InStr(X7, BOOK.Cells(X2, X11), "FUMBLES (Aborted)") = 0 Then
      If S1.Cells(ROW1, 6) = "PUNT" Then X10 = X6 + 1 Else X10 = 2 - X6
      For X3 = 1 To 75
      If ALLNAME(X3, X10) = "" Then Exit For
      If InStr(Mid(BOOK.Cells(X2, X11), InStr(X7, BOOK.Cells(X2, X11), "Aborted") + 8, 25), ALLNAME(X3, X10)) > 0 Then S1.Cells(ROW1, 61) = ALLPOS(X3, X10) & " " & ALLNAME(X3, X10): Exit For
      Next X3
   End If
   If InStr(X7, BOOK.Cells(X2, X11), "Lateral to") > 0 And InStr(X7, BOOK.Cells(X2, X11), "FUMBLES") > InStr(X7, BOOK.Cells(X2, X11), "Lateral to") Then
      If S1.Cells(ROW1, 6) = "PUNT" Then X10 = 2 - X6 Else X10 = X6 + 1
      For X3 = 1 To 75
      If ALLNAME(X3, X10) = "" Then Exit For
      If InStr(Mid(BOOK.Cells(X2, X11), InStr(X7, BOOK.Cells(X2, X11), "Lateral to") + 11, 25), ALLNAME(X3, X10)) > 0 Then S1.Cells(ROW1, 61) = ALLPOS(X3, X10) & " " & ALLNAME(X3, X10): Exit For
      Next X3
   End If
   If InStr(X7, BOOK.Cells(X2, X11), "punt is BLOCKED") > 0 And InStr(X7, BOOK.Cells(X2, X11), "FUMBLES") > InStr(X7, BOOK.Cells(X2, X11), "punt is BLOCKED") Then S1.Cells(ROW1, 61) = S1.Cells(ROW1, 67)
   If S1.Cells(ROW1, 61) = "" Then
      If S1.Cells(ROW1, 6) = "PUNT" Then S1.Cells(ROW1, 61) = S1.Cells(ROW1, 76) Else S1.Cells(ROW1, 61) = S1.Cells(ROW1, 83)
   End If
End If

If InStr(X8, BOOK.Cells(X2, X11), "recovers at") > 0 Then S1.Cells(ROW1, 62) = S1.Cells(ROW1, 61)
If InStr(X8, BOOK.Cells(X2, X11), "recovered by") > 0 Then
   If S1.Cells(ROW1, 6) = "PUNT" Then X10 = 2 - X6 Else X10 = X6 + 1
   For X3 = 1 To 75
   If ALLNAME(X3, X10) = "" Then Exit For
   If InStr(Mid(BOOK.Cells(X2, X11), InStr(X8, BOOK.Cells(X2, X11), "recovered by") + 13, 25), ALLNAME(X3, X10)) > 0 Then S1.Cells(ROW1, 62) = ALLPOS(X3, X10) & " " & ALLNAME(X3, X10): Exit For
   Next X3
End If

If InStr(X8, BOOK.Cells(X2, X11), "RECOVERED by") > 0 Then
   If S1.Cells(ROW1, 6) = "PUNT" Then X10 = X6 + 1 Else X10 = 2 - X6
   For X3 = 1 To 75
   If ALLNAME(X3, X10) = "" Then Exit For
   If InStr(Mid(BOOK.Cells(X2, X11), InStr(X8, BOOK.Cells(X2, X11), "RECOVERED by") + 13, 25), ALLNAME(X3, X10)) > 0 Then S1.Cells(ROW1, 62) = ALLPOS(X3, X10) & " " & ALLNAME(X3, X10): Exit For
   Next X3
   If S1.Cells(ROW1, 62) <> "" Then
      X9 = InStr(X7, BOOK.Cells(X2, X11), "RECOVERED by") + 13
      If InStr(Mid(BOOK.Cells(X2, X11), X9, 90), "for no gain") + InStr(X9, BOOK.Cells(X2, X11), "Touchback") > 0 Or InStr(Mid(BOOK.Cells(X2, X11), X9, 90), " for ") = 0 Then S1.Cells(ROW1, 63) = 0
      If S1.Cells(ROW1, 63) = "" Then
         If IsNumeric(Mid(BOOK.Cells(X2, X11), InStr(X9, BOOK.Cells(X2, X11), " for ") + 7, 1)) = True Then S1.Cells(ROW1, 63) = Mid(BOOK.Cells(X2, X11), InStr(X9, BOOK.Cells(X2, X11), " for ") + 5, 3) Else S1.Cells(ROW1, 63) = Mid(BOOK.Cells(X2, X11), InStr(X9, BOOK.Cells(X2, X11), " for ") + 5, 2)
      End If
   End If
End If

End Sub

Building a customized ‘rules engine’ that produces player and team data with 100% accuracy means accounting for every possible event on a Rush; Pass; Punt; Kickoff; Field Goal; Extra Point or 2pt Conversion — no easy task.

The news is good however: we are now close to 99.9% accuracy on plays from 2000 – 2014 and can say with confidence that we are at 100% for 2015. By the end of the 2015 season, we have plans to be able to mirror official league records with 100% accuracy from 2001 onwards.

You might wonder though: why on Earth would we take the time to build a complex system that can construct player and team statistics from nothing more than play-by-play text?

Firstly, we feel there should be no free lunch. Stealing significant portions of player and team stats from other sites is not something we condone and it’s not a viable business model long-term. The raw play-by-plays we work with would be largely useless to businesses and fantasy football players that need highly organized data in a relational database and this is where our “Sweat of the Brow” comes in.

Secondly, diving down to the play level gives us the flexibility to analyze the data in thousands, if not hundreds of thousands of ways. It’s a level of detail you will simply not find anywhere else.

So, next time you come across a site pushing unofficial data, be sure to ask the important question, “Where do you get your data”.

7 thoughts on “Behind the Numbers: Building a Virtual NFL Statistician

  1. You wrote everything in VBA?

    Seems like not the ideal language for this type of work. Any reasoning why? Or is the code sample just an example of what needs to happen for parsing?

    Do you have data scientists on staff to do any type of mining with your data?

    1. Ha, yes — very astute of you to pick up on the VBA.

      Much of the work I do involves the use of Excel hence the legacy VBA code. This project was actually started 15 years ago when the landscape of the web was obviously quite different. It does it’s job, never-the-less 😉

      As far as data scientists go – that would be me.

  2. I got started with VBA, it’s cool to see it being used for something at this scale.

    I would think Python would be a good way to handle most of these tasks. If you start porting code and are looking for help let me know 🙂

  3. I am working on my MIS masters which is based on big data. I needed a place to practice SQL. I am using your site for data for fun.

    I had a question. I built a postgres database off of your CSV files using the schema you give for the casual user.

    I did the query below:
    SELECT lname, fname, team, pos1, sum(fp) AS NFL_COM, sum(fp2) AS FANDUEL, sum(fp3) AS DRAFTKINGS,
    RANK() OVER (ORDER BY sum(fp) DESC) AS NFL_COM_RANK,
    RANK() OVER (ORDER BY sum(fp2) DESC) AS FANDUEL,
    RANK() OVER (ORDER BY sum(fp3) DESC) AS DRAFTKINGS
    FROM offense JOIN player ON offense.player = player.player
    WHERE year = ‘2015’
    GROUP BY lname, fname, team, pos1;

    By using the query I am able to produce a ranking of players by position for 2015. When I did this for years before 2015 I got data for NFL.COM (fp), FanDuel(fp2) and DraftKings(fp3).

    Question: I got data for years 2000-2014 but FanDuel and DraftKings were established after 2009 and 2012. How did you come up with stats before 2009? Did you just compare player stats with each sites point systems?

    Why? I was doing analysis on the best position and if the position has changed over the years. I.E. QB looks to be fantasy gold for 2015. But before 2015 it may have been the RB.

    Just curious so I can continue to analyze.

    LOVE YOUR SITE!!!!

    1. Hey Robert,

      You are correct: I’ve basically just applied the FD/DK scoring systems to all years since 2000.

      I’m glad to hear you are leveraging the data for your Masters Degree. Our data is used extensively at educational institutions right across America so you are in good company.

  4. I started using VBA for football as well, and it actually helped me in my career. Glad I found the site as I’ve been looking for affordable & reliable data sources for years. I’ve been looking at a way to compile game stats for players without having to pay thousands of dollars for it, especially since it’s for my own personal use. I even looked at doing what you ultimately did, going through the play by play logs and parsing it. I can imagined the code you had to write to make sure it handled everything correctly.

    This will make a nice addition to my Fantasy Football Draft Workbook this year.

    1. Hi Brent,

      Glad to hear you see some potential with the dataset and I look forward to having you come on board in 2016! The algorithms to parse the play-by-plays have evolved over the past decade. Standard plays are fairly straight-forward to parse – it’s all the weird edge cases that have required the most time…

      Dennis

Leave a Reply

Your email address will not be published. Required fields are marked *