MLB's Contact Crisis
This is completely a first draft / work-in-progress. Please point out typos, confusing bits, or outright mistakes in the comments! -EMV
A spectre is haunting North America: the spectre of declining baseball attendance. MLB attendance peaked in 2007 at 32,696 per game; last year it was down 12% from that high, at 28,768. This year to date, it’s 27,640.
That the audience for baseball may be aging is a natural fear. And the notion that Americans might eventually lose interest in the sport in favor of less subtle and more lurid and violent ones seems reasonable as well. Theodore Sturgeon’s 1964 “How to Forget Baseball” remains the only science fiction story ever published in Sports Illustrated, and posits a future where baseball is played only in Amish-like enclaves. The underrated 1969 novel The Last Man is Out is set in a future where MLB has contracted significantly and moved to smaller cities.
The good news is that there’s no evidence that either the aging of the audience or the imagined growing irrelevance of the game itself is behind the attendance drop.
In fact, it’s possible to model annual attendance in the post-PED era with terrific accuracy (adjusted r-squared = .92) with just two broad on-the-field factors. 
The first of these factors is unsurprising: competitive imbalance. It’s best measured by the kurtosis of the distribution of winning percentages in a given year.  Competitive imbalance constitutes about 40% of this model overall, although the current contribution is quite a bit smaller. The CI factor itself is composed of 57% of the CI in the present year and 43% of the previous year’s CI, which makes perfect sense. Here’s the model holding CI fixed at the 2004-2018 average:
The two great years for competitive balance were 2007 and 2014; 2012 and 2015 were good. The unimaginably awful year of 2003 shows up hugely the next year (which was perfectly ordinary); 2005, 2006, 2013, and 2018 were bad. The good-bad-good sandwich from 2012 to 2014 flattens the model curve in that stretch in a way that’s misleading.
There was no tendency at all for worsening CI through 2017, which sported a better-than-average kurtosis of -0.72. But that was the year the notorious ex-tankers the Astros won the WS, and the trend since then (-0.42, and +0.13 so far this year) has an 11% chance of being random even though the sample size is 3. Hint: it’s not random. This is a real problem that needs to be fixed in the next CBA; the model projects a 1,708 drop in average attendance this year relative to the average CI from 2003 to 2017. (This figure is highly volatile; a week ago kurtosis was -0.25 and the projected impact on next year was 1,030.)
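As a rough illustration of the CI measure described above, here is a minimal Python sketch. It assumes the plain population formula for excess kurtosis (fourth central moment over squared variance, minus 3); the figures quoted in this piece may come from a sample-corrected estimator instead, so small differences are to be expected. The function names are hypothetical, and only the 57% / 43% weights and the two kurtosis values come from the text.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: fourth central moment / variance^2, minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / (m2 ** 2) - 3.0


def ci_factor(current_kurtosis, previous_kurtosis):
    """Blend this year's and last year's kurtosis using the 57% / 43% weights from the text."""
    return 0.57 * current_kurtosis + 0.43 * previous_kurtosis


# Kurtosis figures quoted in the text: -0.42 last year, +0.13 so far this year.
blended = ci_factor(0.13, -0.42)
print(round(blended, 4))
```

To use it on real standings, pass the list of every team's winning percentage for a season to excess_kurtosis, then blend two consecutive seasons with ci_factor.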
You can see that even controlling for CI, the model shows an attendance decline. What’s behind that?
It’s not pace of play. While the average time between pitches rose fairly steadily from 35.1 seconds in 2003-4 to 38.7 in 2014, adding that (and its subsequent decline by about a second) to the model adds nothing (the correlation to the model’s error is -.027). The other independent factor for game length, PA/G, is even less significant.
The other factor is style of play.
There are three related factors that create the dotted / red line on the graph:
· The percentage of PAs that end with the batter making contact (Contact% or Con%)
· The percentage of batter contacts that result in a home run (HR/Contact or HRC)
· The percentage of non-contact PAs, other than HBP, that result in a strikeout. (Which is to say, SO/BB ratio, but not expressed as a ratio.) This turns out to be highly correlated to the first factor; the decline in Contact% and the rise in this strikeout share are both the product of improved pitching.
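The three definitions above can be sketched from ordinary counting stats. This is only an illustration: it assumes that a "contact" PA is any PA that is not a strikeout, walk, or HBP (ignoring rarities such as catcher's interference), and the function name and the input totals are hypothetical, not real league data.

```python
def style_of_play_rates(pa, so, bb, hbp, hr):
    """Return the three style-of-play rates as fractions (not percentages)."""
    # Assumption: every PA that is not a strikeout, walk, or HBP ends in contact.
    contact_pa = pa - so - bb - hbp
    # Non-contact PAs other than HBP are just strikeouts plus walks.
    non_contact_ex_hbp = so + bb
    return {
        "con_pct": contact_pa / pa,           # Contact%
        "hrc": hr / contact_pa,               # HR per contact
        "k_share": so / non_contact_ex_hbp,   # strikeout share of SO + BB
    }


# Made-up totals, chosen only to make the arithmetic easy to follow.
rates = style_of_play_rates(pa=1000, so=200, bb=80, hbp=20, hr=30)
for name, value in rates.items():
    print(name, round(value, 3))
```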
For all of these factors, it is the previous year’s figure that is predictive. Adding the current year’s figure doesn’t improve the model at all. Since these figures (unlike Competitive Imbalance) shift gradually over time, there’s little difference between them in successive years. If the previous year’s style of play has a bigger effect on attendance than the present year’s, only the former will show up in the model. And while fans can recognize the wide year-to-year changes in CI early in a given year, the gradual shifts in style of play from one year to the next don’t seem to become apparent until after the fact.
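To make the "previous year's figure" idea concrete, here is a small Python sketch of how a lagged predictor might be lined up against attendance years. The helper name is hypothetical. The HRC values are the ones quoted elsewhere in this piece, with 2015 entered at its pre-All-Star-Break rate.

```python
def lag_one_year(values_by_year):
    """Return {attendance_year: value from the year before}."""
    return {year + 1: value for year, value in values_by_year.items()}


# HRC by season as quoted in the text (2015 is the pre-ASB rate).
hrc_by_year = {2015: 0.035, 2016: 0.044, 2017: 0.048, 2018: 0.044}

lagged_hrc = lag_one_year(hrc_by_year)
for attendance_year in sorted(lagged_hrc):
    print(attendance_year, lagged_hrc[attendance_year])
```

The first season in the data has no lagged value, which is why a model built this way has to start one year after the stats do.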
Let’s look at the three factors separately.
Here’s Contact% plotted on a second axis alongside the CI-free attendance model:
Post-steroid era Con% peaked at 74.4% in 2005. By 2018, it had fallen 8.4% to 68.2% (which won’t show up until we do the chart at the end of this year). That’s 4.7 fewer potential fielding plays per game. You see from the chart that the decline in Con% is the major driving force behind the attendance decline. The second half of this paper will track down the cause of that.
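The arithmetic behind those figures can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the two Con% values come from the text, but the PA-per-game figure (about 76 for both teams combined) is an assumption of mine and is not stated anywhere in this piece. With rounded inputs the results land close to, though not exactly on, the 8.4% and 4.7 quoted above.

```python
peak_con_pct = 74.4   # 2005 peak, from the text
con_pct_2018 = 68.2   # 2018 value, from the text

# Drop in percentage points, and the same drop relative to the peak.
point_drop = peak_con_pct - con_pct_2018
relative_drop = point_drop / peak_con_pct

# ASSUMPTION: roughly 76 plate appearances per game, both teams combined.
ASSUMED_PA_PER_GAME = 76.0
fewer_contacts_per_game = point_drop / 100 * ASSUMED_PA_PER_GAME

print(f"drop: {point_drop:.1f} points ({relative_drop:.1%} relative)")
print(f"fewer contact PAs per game: {fewer_contacts_per_game:.1f}")
```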
(I apologize for not having charts for the last two, less important factors. It’s trickier to figure out just what to show. I hope to include them in the eventual published version.)
You can also see that CI-adjusted attendance the last two seasons has been quite a bit better than expected given the further decline in Con%, which indicates an offsetting factor is in play. I’m sure you can guess what it is. The data makes it clear that fans love home runs, and without their recent surge, the Contact Crisis would be much worse.
From 2003 to the All-Star Break of 2015, HRC actually had a mild but significant (p < .02, r-squared = .41) downward trend, with a high of .039 in 2005 and 2007 and a low of .032 in 2014. The trendline runs from .038 to .034. (The cause of this decline is clearly not the removal of calendar pages; more on the causes in Part II.) As has been widely discussed, HRC leapt upwards from .035 to .040 after the 2015 All-Star Break and has subsequently been .044, .048, and .044, and is .053 so far this year. Relative to the prevailing downward trend, extra homers increased attendance by 1,194 fans per game in 2017 and 1,764 last year.
Even though home runs represent a very high percentage of online game highlights, I think it’s a mistake to think that the home run itself is what puts more fans in the seats. The possibility of home runs makes games more engaging and exciting because they make every game closer. If your team is down three runs late in a game but has two runners on base, your ratio of optimism to despair is entirely driven by how likely you think a home run might be. (In fact, as homers proliferate, the Leverage Index tables should be recalculated, and I’m not sure that they have been.)
Surprisingly, even though since 2005 it has become steadily more difficult to erase a lead by stringing a bunch of singles and doubles together, the value of this home run “hope factor” has remained constant. (Which is to say, there is no statistical interaction between HRC and Con% in the attendance model.)
Strikeouts as a share of non-contact PAs (expressed as a percentage)
This one’s really counter-intuitive but makes perfect sense upon analysis. Excess strikeouts are the overall problem here, and yet, if given a forced choice between a strikeout and a walk, the strikeout is good for attendance.
Strikeouts as a percentage of non-contact PA (excluding HBP, which just add noise) rose steadily from 2003 (66.0%) to 2014-2015 (72.7%), which is just more evidence of the pitchers getting better. It fell back to 71.7% two years ago and has been at 72.4% - 72.5% since, which is either the batters making an adjustment and the pitchers responding, or random. In any case, this factor is strongly correlated to Contact % (r-squared = .73, p < .000025) and should be regarded as a second aspect of improved pitching.
Taken in isolation, the K factor is not a small one. From 2012 to 2016 it halved the damage caused by decreased contact. Last year and this, it’s still reducing it by 25%.
So, why are more strikeouts and fewer walks (if their total is a given) good for attendance? Well, here’s a related question: what’s the one event that fans in the stands will keep count of as the game progresses?
There is a fundamental asymmetry to fan reactions to strikeouts, one that derives from the essence of the game as reinvented by Babe Ruth. Strikeouts are the price that great hitters willingly pay for their power. A pitcher’s job is to make the batter pay that cost. It doesn’t matter how a batter makes an out, but there’s nothing more important to a pitcher’s success than to get strikeouts.
When a hitter on your team fans, it’s obviously a disappointment, but it does happen frequently, and you know that there will be a “next time” that will pay dividends; in the long run his hard hits will be easily worth their strikeout cost. But a strikeout of the opposing hitter? That is much more exciting to fans of the pitcher’s team than it is disappointing to fans of the hitter’s. You have dodged the bullet, and averted an unlikely but potentially devastating disaster.
The positive outcome for the batter here is the walk, and that is much less exciting for fans of the hitter’s team than a strikeout is for fans of the pitcher. The walk by the pitcher is about as distressing for his fans as it is pleasant for the fans of the hitter; there’s no asymmetry at all.
The bottom line is this: have you ever seen a fist pump (or bat flip) for drawing a walk? Have you ever seen a walk (other than one that drives in the winning run) as a highlight clip? But strikeouts are probably third after homers and fielding gems on highlight reels. And they are about the glory of the pitcher, not the fault of the batter. They are exciting and dramatic and walks are not.
So one key upshot of this finding (OK, probably the only one): you cannot address the Contact Crisis by shrinking the strike zone. More walks are not what we want or need.
Let’s combine the annual effect on attendance of declining Contact% and increasing K/W and call it the Pitching factor of the Contact Crisis. We’ll calculate the impact on attendance relative to what we saw on average from 2004 to 2007.
For homers, we’ll calculate the impact on attendance relative to the average from 2004 to 2016 (which reflects homers from 2003 to the 2015 ASB). And for competitive imbalance, we’ll calculate the annual impact compared to the average through 2017.
It would be perfectly reasonable to think that the explanation for the Contact Crisis might be as complicated as the one for the decline in MLB attendance. Thankfully, this isn’t true at all.
What we’re concerned with here is entirely the decline in Contact %. What is driving that? I’ve already asserted that it’s better pitching, but how do we know that’s the whole of the story?
Because growing FB velocity and shrinking FB frequency (or growing off-speed frequency) explain 96% of the Con% decline, that’s why (adjusted r-squared; p = 0.000007).
A wealth of extra pitch data is available, and it's possible that the model can be refined further.
We also have a separate source of pitch data, from BIS, that begins in 2003; from 2008 to 2015 (as far as I looked initially) it consistently reports more fastballs thrown with a lower velocity, so it is likely classifying some cutters as fastballs (even though they break in the opposite direction, a very simple discriminator). This data shows the same trends, however, and I intend to create a translation into the PITCHf/x / Statcast data used for the above chart, and dive deeper still.
Another project for the finished version of the paper is an explanation for the change in HRC. The decline in rate from 2003 to mid-2015 correlates a bit better to year (p < .02, r-squared = .41) than it does to SO% (.043, .32) or BIS FB velocity (.049, .31). I haven't yet found any factors that I can add to either of the last two to explain it, but there are plenty of possibilities left to explore.
I haven't yet looked at all into the rise in HRC since the introduction of the new smoother ball. It seems to be driven by the "launch angle revolution," and it would be great to get a full explanation of what's going on here, so that the HRC rate can be optimized when we succeed in returning Con% to a sane, productive, and above all exciting level.
 A methodological note: any study of general patterns of baseball play needs to start in 2003, when baseball began monitoring PED use. For example, since then there’s an almost perfect correlation (r^2 = .985) between Pitches/PA and SO%, BB%, and HB%. But that model projects a P/PA of 3.78 for 2001-2002, when the actual figure is 3.74; for 1999-2000 it’s 3.82 predicted and 3.75 actual. Why? That’s a question I’m resisting the urge to explore. (Could this fact be used to identify PED users by their changes in this relationship?)
Since the factors that determine attendance include last year’s style of play, the model starts in 2004.
 The kurtosis of a distribution is more or less its flatness compared to a normal distribution. In the four years with the lowest kurtosis, 97% of teams won between 64 and 98 games and none won fewer than 54 or more than 103. In the four years with the highest, the figures are 85% and 3%, respectively.
 The model uses the 2015 pre-ASB rate as the Year-1 figure, which doubles its statistical significance, from p = .000030 to p = .000015. That tells you that fans unconsciously regarded the late homer surge as a fluke.