All posts by derny

Behind the Numbers: Building a Virtual NFL Statistician

A project 10+ years in the making

Without a doubt, one of the more common questions we field here is: “Where do you get your data?”

It’s a valid question on a number of fronts. The first comes from a legal perspective. Are we entitled to distribute this data? Does the NFL not hold the rights to data from it’s games? Beyond the legal rights issue, some people are simply wondering how accurate the data is in comparison with official league records.

The best place to start is with the following:

pbp_csv

This, my friends, is what the sporting-world refers to as a “play-by-play”. It doesn’t just appear out of thin air. League statisticians from every NFL stadium are tasked with watching plays as they unfold and marking down the details of each and every one. This information then makes it’s way from individual stadiums and within seconds of play completion, can be found on dozens of sites worldwide.

This small snippet of information is all we need to perform our magic. Nothing more, nothing less. A game might feature 160 such snippets. A week might total something in the neighbourhood of 2,500. Numbers aside, it’s what we do with these freely-available snippets of text that really set’s our site apart from the other unofficial sources of NFL data.

Relying on nothing more than play-by-play text means we don’t need to illegally scrape sites for data or rely on any other service to provide us with what we need. While sports leagues have tried to copyright basic written accounts of sporting events, they haven’t had the best of luck. Reporting on the number of times Brett Favre passed to Antonio Freeman in his career based on game accounts is no more of an infringement on copyright than reporting on the number of times a character spoke Harry Potter’s name in a J.K. Rowling book (16,612 if you are interested).

The real question is, how do we get from this:

pbp_csv

To this:

{
  "data": {
    "gid": 3990,
    "pid": 651831,
    "off": "PIT",
    "def": "NE",
    "type": "RUSH",
    "dseq": 1,
    "len": 34,
    "qtr": 1,
    "min": 7,
    "sec": 41,
    "ptso": 0,
    "ptsd": 0,
    "timo": 3,
    "timd": 3,
    "dwn": 1,
    "ytg": 10,
    "yfog": 7,
    "zone": 1,
    "fd": 0,
    "sg": 0,
    "nh": 0,
    "pts": 0,
    "tck": [
      {
        "uid": 430825,
        "tck": "GG-0475",
        "value": "0.5"
      },
      {
        "uid": 430826,
        "tck": "AB-2700",
        "value": "0.5"
      }
    ],
    "sk": 0,
    "pen": 0,
    "ints": 0,
    "fum": 0,
    "saf": 0,
    "blk": 0,
    "olid": {
      "lt": "KB-0750",
      "lg": "AV-0350",
      "c": "RF-0900",
      "rg": "CH-4550",
      "rt": "CW-0400"
    },
    "rush": {
      "bc": "DW-3600",
      "dir": "LG",
      "yds": 6,
      "succ": 1,
      "kne": 0
    }
  }
}

And even this:

2016 NFL Database Field Listing

Going from point A (play-by-play text) to points B, C, D and beyond required us to do 2 things:

  1. Build a ‘virtual’ NFL statistician able to digest those simple play-by-play accounts to create player and team statistics and everything else in-between.
  2. Hire our own real live ‘Statisticians’ to analyze game tape to fill in the holes and also expand on the basic play-by-play data.

The first task has taken over a decade (of off and on work) and probably in the neighborhood of 15,000 lines of code to achieve. This is not an exaggeration. Here is what’s required to properly parse the effect of Fumbles:

Sub FUMBLES()

If S1.Cells(ROW1, 6) = "PASS" Or S1.Cells(ROW1, 6) = "RUSH" Then
   X8 = InStr(X7, BOOK.Cells(X2, 2), "FUMBLES") + 8: X11 = 2
   If S1.Cells(ROW1, 26) <> "" Then S1.Cells(ROW1, 61) = S1.Cells(ROW1, 26)
   If S1.Cells(ROW1, 35) <> "" And InStr(X7, BOOK.Cells(X2, 2), "FUMBLES") > InStr(X7, BOOK.Cells(X2, 2), "pass") Then S1.Cells(ROW1, 61) = S1.Cells(ROW1, 35)
   If InStr(X7, BOOK.Cells(X2, 2), "FUMBLES (Aborted)") > 0 Then
      For X3 = 1 To 6
      If QBNAME(X3, X6 + 1) = "" Then Exit For
      If InStr(Mid(BOOK.Cells(X2, 2), X7, 40), QBNAME(X3, X6 + 1)) > 0 Then S1.Cells(ROW1, 61) = "QB " & QBNAME(X3, X6 + 1): Exit For
      Next X3
      If S1.Cells(ROW1, 61) = "" Then
         For X3 = 1 To 75
         If ALLNAME(X3, X6 + 1) = "" Then Exit For
         If InStr(Mid(BOOK.Cells(X2, 2), X7, 40), ALLNAME(X3, X6 + 1)) > 0 Then S1.Cells(ROW1, 61) = ALLPOS(X3, X6 + 1) & " " & ALLNAME(X3, X6 + 1): Exit For
         Next X3
      End If
   End If
   If InStr(X7, BOOK.Cells(X2, 2), "Aborted") > 0 And InStr(X7, BOOK.Cells(X2, 2), "FUMBLES (Aborted)") = 0 Then
      For X3 = 1 To 75
      If ALLNAME(X3, X6 + 1) = "" Then Exit For
      If InStr(Mid(BOOK.Cells(X2, 2), InStr(X7, BOOK.Cells(X2, 2), "Aborted") + 8, 25), ALLNAME(X3, X6 + 1)) > 0 Then S1.Cells(ROW1, 61) = ALLPOS(X3, X6 + 1) & " " & ALLNAME(X3, X6 + 1): Exit For
      Next X3
   End If
   If InStr(X7, BOOK.Cells(X2, 2), "Lateral to") > 0 And InStr(X7, BOOK.Cells(X2, 2), "FUMBLES") > InStr(X7, BOOK.Cells(X2, 2), "Lateral to") Then
      For X3 = 1 To 75
      If ALLNAME(X3, X6 + 1) = "" Then Exit For
      If InStr(Mid(BOOK.Cells(X2, 2), InStr(X7, BOOK.Cells(X2, 2), "Lateral to") + 11, 25), ALLNAME(X3, X6 + 1)) > 0 Then S1.Cells(ROW1, 61) = ALLPOS(X3, X6 + 1) & " " & ALLNAME(X3, X6 + 1): Exit For
      Next X3
   End If
   If S1.Cells(ROW1, 61) = "" Then S1.Cells(ROW1, 61) = S1.Cells(ROW1, 31)
End If

If S1.Cells(ROW1, 6) = "KOFF" Or S1.Cells(ROW1, 6) = "PUNT" Or S1.Cells(ROW1, 6) = "ONSD" Then
   If S1.Cells(ROW1, 6) = "PUNT" Then X11 = 2 Else X11 = 1
   If InStr(X7, BOOK.Cells(X2, X11), "MUFFS") > 0 Then
      X8 = InStr(X7, BOOK.Cells(X2, X11), "MUFFS") + 6
   Else
      X8 = InStr(X7, BOOK.Cells(X2, X11), "FUMBLES") + 8
   End If
   If InStr(X7, BOOK.Cells(X2, X11), "FUMBLES (Aborted)") > 0 Then
      If S1.Cells(ROW1, 72) <> "" Then S1.Cells(ROW1, 61) = S1.Cells(ROW1, 72)
      If S1.Cells(ROW1, 79) <> "" Then S1.Cells(ROW1, 61) = S1.Cells(ROW1, 79)
   End If
   If InStr(X7, BOOK.Cells(X2, X11), "Aborted") > 0 And InStr(X7, BOOK.Cells(X2, X11), "FUMBLES (Aborted)") = 0 Then
      If S1.Cells(ROW1, 6) = "PUNT" Then X10 = X6 + 1 Else X10 = 2 - X6
      For X3 = 1 To 75
      If ALLNAME(X3, X10) = "" Then Exit For
      If InStr(Mid(BOOK.Cells(X2, X11), InStr(X7, BOOK.Cells(X2, X11), "Aborted") + 8, 25), ALLNAME(X3, X10)) > 0 Then S1.Cells(ROW1, 61) = ALLPOS(X3, X10) & " " & ALLNAME(X3, X10): Exit For
      Next X3
   End If
   If InStr(X7, BOOK.Cells(X2, X11), "Lateral to") > 0 And InStr(X7, BOOK.Cells(X2, X11), "FUMBLES") > InStr(X7, BOOK.Cells(X2, X11), "Lateral to") Then
      If S1.Cells(ROW1, 6) = "PUNT" Then X10 = 2 - X6 Else X10 = X6 + 1
      For X3 = 1 To 75
      If ALLNAME(X3, X10) = "" Then Exit For
      If InStr(Mid(BOOK.Cells(X2, X11), InStr(X7, BOOK.Cells(X2, X11), "Lateral to") + 11, 25), ALLNAME(X3, X10)) > 0 Then S1.Cells(ROW1, 61) = ALLPOS(X3, X10) & " " & ALLNAME(X3, X10): Exit For
      Next X3
   End If
   If InStr(X7, BOOK.Cells(X2, X11), "punt is BLOCKED") > 0 And InStr(X7, BOOK.Cells(X2, X11), "FUMBLES") > InStr(X7, BOOK.Cells(X2, X11), "punt is BLOCKED") Then S1.Cells(ROW1, 61) = S1.Cells(ROW1, 67)
   If S1.Cells(ROW1, 61) = "" Then
      If S1.Cells(ROW1, 6) = "PUNT" Then S1.Cells(ROW1, 61) = S1.Cells(ROW1, 76) Else S1.Cells(ROW1, 61) = S1.Cells(ROW1, 83)
   End If
End If

If InStr(X8, BOOK.Cells(X2, X11), "recovers at") > 0 Then S1.Cells(ROW1, 62) = S1.Cells(ROW1, 61)
If InStr(X8, BOOK.Cells(X2, X11), "recovered by") > 0 Then
   If S1.Cells(ROW1, 6) = "PUNT" Then X10 = 2 - X6 Else X10 = X6 + 1
   For X3 = 1 To 75
   If ALLNAME(X3, X10) = "" Then Exit For
   If InStr(Mid(BOOK.Cells(X2, X11), InStr(X8, BOOK.Cells(X2, X11), "recovered by") + 13, 25), ALLNAME(X3, X10)) > 0 Then S1.Cells(ROW1, 62) = ALLPOS(X3, X10) & " " & ALLNAME(X3, X10): Exit For
   Next X3
End If

If InStr(X8, BOOK.Cells(X2, X11), "RECOVERED by") > 0 Then
   If S1.Cells(ROW1, 6) = "PUNT" Then X10 = X6 + 1 Else X10 = 2 - X6
   For X3 = 1 To 75
   If ALLNAME(X3, X10) = "" Then Exit For
   If InStr(Mid(BOOK.Cells(X2, X11), InStr(X8, BOOK.Cells(X2, X11), "RECOVERED by") + 13, 25), ALLNAME(X3, X10)) > 0 Then S1.Cells(ROW1, 62) = ALLPOS(X3, X10) & " " & ALLNAME(X3, X10): Exit For
   Next X3
   If S1.Cells(ROW1, 62) <> "" Then
      X9 = InStr(X7, BOOK.Cells(X2, X11), "RECOVERED by") + 13
      If InStr(Mid(BOOK.Cells(X2, X11), X9, 90), "for no gain") + InStr(X9, BOOK.Cells(X2, X11), "Touchback") > 0 Or InStr(Mid(BOOK.Cells(X2, X11), X9, 90), " for ") = 0 Then S1.Cells(ROW1, 63) = 0
      If S1.Cells(ROW1, 63) = "" Then
         If IsNumeric(Mid(BOOK.Cells(X2, X11), InStr(X9, BOOK.Cells(X2, X11), " for ") + 7, 1)) = True Then S1.Cells(ROW1, 63) = Mid(BOOK.Cells(X2, X11), InStr(X9, BOOK.Cells(X2, X11), " for ") + 5, 3) Else S1.Cells(ROW1, 63) = Mid(BOOK.Cells(X2, X11), InStr(X9, BOOK.Cells(X2, X11), " for ") + 5, 2)
      End If
   End If
End If

End Sub

Building a customized ‘rules engine’ that produces player and team data with 100% accuracy means accounting for every possible event on a Rush; Pass; Punt; Kickoff; Field Goal; Extra Point or 2pt Conversion — no easy task.

The news is good however: we are now close to 99.9% accuracy on plays from 2000 – 2014 and can say with confidence that we are at 100% for 2015. By the end of the 2015 season, we have plans to be able to mirror official league records with 100% accuracy from 2001 onwards.

You might wonder though: why on Earth would we take the time to build a complex system that can construct player and team statistics from nothing more than play-by-play text?

Firstly, we feel there should be no free lunch. Stealing significant portions of player and team stats from other sites is not something we condone and it’s not a viable business model long-term. The raw play-by-plays we work with would be largely useless to businesses and fantasy football players that need highly organized data in a relational database and this is where our “Sweat of the Brow” comes in.

Secondly, diving down to the play level gives us the flexibility to analyze the data in thousands, if not hundreds of thousands of ways. It’s a level of detail you will simply not find anywhere else.

So, next time you come across a site pushing unofficial data, be sure to ask the important question, “Where do you get your data”.

Not Your Dad’s NFL Stats

If you want to build something different it starts with innovative data

We don’t scrape and regurgitate readily available NFL data because that’s just plain boring. Our goal at ArmchairAnalysis.com has always been to provide cutting-edge NFL stats that can be used to build innovative NFL-related applications that push the boundaries of grid-iron analysis.

Take our breakdown of NFL penalties as one example.

Here’s a fun little fact: Between the 2000 and 2013 there were roughly 50,064 penalties called in the NFL with a total of over 352,500 penalty yards being assessed in the process.

That’s no small number and it adds up to a huge impact on hundreds of game outcomes.

All penalties are not created equal; however, and our goal is to provide you with the metrics that will help you determine what does, and doesn’t, matter when it comes to game outcomes; fantasy point totals — you name it.

Over the past few seasons, we’ve come up with the following 7 categories that account for 99.6% of all flags thrown in the NFL.

#1 False Start Penalties (Offense)

This is one of the most common of all penalty calls and deserves it’s very own category

#2 Offensive Holding Penalties

This category consists mostly of holding calls on the Offensive Line.

Most common flags: Offensive Holding, Offensive Pass Interference & Illegal Use of Hands.

#3 Playbook Execution Penalties

This is my favourite category. These tend to occur more often at the start of the season, and, against teams that try and skimp on headsets for the coaching staff (had you going there, right?).

Most common flags: Illegal Formations, Passes, Substitutions, Procedures and Motion; Intentional Grounding, 12 On Field, Delay of Game.

#4 Defensive Line Penalties

It’s all about the timing. And, not drinking excessive amounts of caffeine too close to game time.

Most common flags: Defensive Offside, Encroachment, Neutral Zone Infractions.

#5 Defensive Secondary Penalties

Cornerbacks and Safeties have a tough job, but, some just do it better than others.

Most common flags: Defensive Holding, Defensive Pass Interference, Illegal Contact & Use of Hands.

#6 Dumb Penalties

Teams that are overly aggressive on defense (I’m looking at you Ravens) are more likely to draw these types of flags. Emotion aside, some of these really are just plain stupid.

Most common flags: Taunting, Roughing the Passer/Kicker, Face Masks, Unsportsmanlike Conduct.

#7 Poor Fundamentals

This is a category of penalties where the quality of a team’s defensive coach and staff can be measured. Some like tripping, border on dumb. But, if you need to trip someone in the first place it’s usually because you’re not doing your job very well.

Most common flags: Illegal Blocks, Crackbacks, Tripping, Horsecollars, Clipping.

Here is our list shown in a handy table, revealing the actual number of penalties for each category and the average yards-per-call.

Category #Flags Yards Avg
False Starts 9,363 45,890 4.90
Offensive Holding 10,842 85,514 7.89
Playbook Execution 6,652 28,477 4.28
Defensive Line 5,662 22,126 3.91
Defensive Secondary 7,506 61,916 8.25
Dumb/Aggressive 7,343 85,067 11.58
Poor Fundamentals 2,493 22,442 9.00

This is just one example of our quest to innovate and present NFL data that is just plain fun to explore. Stay tuned for many more examples to come!

Data For Fantasy Football Players Gets an Update

The Biggest just got Bigger

Exciting news people! Data in our Player, Offense and Defense tables just got a whole lot more interesting, especially for those of you who use our data to engage in Fantasy Football related pursuits.

Starting in the 2015 season, we will now be providing the following 3 pieces of information in the aforementioned tables:

  • Position Detailed (POSD) – More granular than POS1 and POS2 (i.e., Linebackers are broken down into MLB, LILB, ROLB and so on).
  • Jersey Number – 1 thru 99.
  • Depth Chart Position – Value of 1-3 (1-5 in the pre-season).

What this means for application developers is that you can now accurately build a team’s complete roster within your app and easily link to the wealth of information we provide across all of our other data tables.

For NFL Fantasy Football players, the implications are clear: Point totals can now be analyzed within the context of where a player sat on the depth chart and their specific role on offense or defense prior to the game.

In order to achieve this level of detail, we now update the PLAYER table on a daily basis so the information that is recorded in the OFFENSE and DEFENSE tables will be based on the latest information provided by teams on game day.

Stay tuned for news on more exciting changes coming in the next couple of weeks!

Daily NFL Roster Updates in 2015!

We are happy to announce that we will be starting to offer daily roster updates for the 2015 season. Any changes to the 90-man pre-season rosters as well as the regular season 53-man roster (plus practice squads) will reflected in our database within 24 hours of any transaction.

In addition, player jersey numbers will now be part of the Players table.

Look for these changes to be implemented by the start of this year’s pre-season (Week 1), Thursday, August 13th, 2015. Daily updates will occur at approximately 9am EST and both our API and static files (CSV and MySQL) will be updated at the same time.

Are you ready for some FOOTBALL?

The Wait for a Decent NFL API is Over

Great news NFL fans: our NFL API is now ready for consumption.

The API is built from the ground up to power complex NFL client-side apps and was designed by developers who also happen to love the game of Pro Football.

It gives you instant access to all the custom statistics that users of our SQL-based databases have come to know and love.

The best part is: it won’t stop here. We are working hard on a process that will allow us to release our game-charted statistics on a weekly basis this coming September. Stay tuned for more details in the next couple of months.

Are you ready for some (statistically nerdy) Football?

2014 NFL Data Is Now Available

Great news NFL fans: our historical NFL data-set has been updated to include all plays from the 2014 season bringing the grand total to 651,765 spanning the past 15 NFL seasons.

Also new in this update: JSON formatted data that can easily be imported into MongoDB or any other NoSQL database of your choice.

Stay tuned for more exciting changes to come in what is sure to be a busy off-season at ArmchairAnalysis.com!