Figuring Out Pitch Modeling Metrics (Stuff+)
The flavor of the week is Stuff+. A year ago, very few people had heard of it. But now, everybody is talking about it. Including me!
The problem I see is that everybody is talking about it and using it, but as far as I can see nobody has ever really studied what these stats mean for real-life results (and therefore, fantasy results).
What I’m talking about are questions like:
Does high Stuff+ result in more strikeouts?
Does low Stuff+ result in fewer strikeouts?
What is more descriptive and predictive of ERA - Stuff+ or Pitching+?
How should we use these stats in making fantasy baseball decisions?
So I’m here to answer these questions. As I’m writing this sentence right now, I don’t know the answers. Most of the reason I’m writing this is for me to figure stuff out, and you’re being taken along for the ride.
So let’s go category by category and see the relationships.
SwStr%
We want strikeouts from pitchers, and to get strikeouts, you want to get whiffs. So the question is - how do these pitch modeling metrics relate to getting whiffs? To find out, I’ve taken the 2022-2023 data and rounded all the pitch modeling metrics (Stuff+ to the nearest 5, Pitching+ and Location+ to the nearest whole number). So a pitch with a 117.3 Stuff+ goes to 115, and a pitch with a 102.8 Pitching+ goes to 103. That helps us group pitches together and then compare each group to how often it gets a whiff. The reason for the difference in rounding is that the spread of Stuff+ is much wider than Pitching+ and Location+. Most Pitching+ marks are pretty close to 100, so we just want to round them to the nearest whole number.
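Here is a minimal sketch of that rounding-and-grouping step. It is not the actual code behind the plots; the field names (stuff_plus, pitching_plus, whiff) and the sample pitches are made up for illustration, and I'm assuming SwStr% is counted as whiffs per pitch thrown.

```python
from collections import defaultdict

def round_to(value, step):
    """Round value to the nearest multiple of step (117.3 -> 115 when step is 5).

    Note: Python's round() sends exact ties to the nearest even number.
    """
    return step * round(value / step)

def whiff_rate_by_bin(pitches, metric, step):
    """Group pitches by the rounded metric and return {bin: whiffs / pitches}."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [whiffs, total pitches]
    for pitch in pitches:
        bin_value = round_to(pitch[metric], step)
        counts[bin_value][0] += 1 if pitch["whiff"] else 0
        counts[bin_value][1] += 1
    return {b: whiffs / total for b, (whiffs, total) in sorted(counts.items())}

# Made-up pitches, for illustration only (hypothetical field names).
sample_pitches = [
    {"stuff_plus": 117.3, "pitching_plus": 102.8, "whiff": True},
    {"stuff_plus": 113.9, "pitching_plus": 103.4, "whiff": False},
    {"stuff_plus": 96.0, "pitching_plus": 99.6, "whiff": False},
    {"stuff_plus": 98.2, "pitching_plus": 100.1, "whiff": False},
]

# Stuff+ is grouped in steps of 5, Pitching+ in steps of 1.
stuff_bins = whiff_rate_by_bin(sample_pitches, "stuff_plus", step=5)
pitching_bins = whiff_rate_by_bin(sample_pitches, "pitching_plus", step=1)
print(stuff_bins)
print(pitching_bins)
```

Each bin then becomes one dot on the plot: the rounded metric on one axis and the whiff rate for every pitch in that bin on the other.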
Thinking about this a little more, I think the models are trained on actual results. I could be wrong about this, but I think I’m right. So what that means is that the model looks at all of the pitches and all of the data about them and then sees which pitches tend to perform better and worse. I’m sure getting a whiff is part of that, so it is not surprising to see a really strong relationship here.
Basically, I’m just being redundant by showing you this, if I understand the models somewhat accurately. A high Stuff+ mark on a pitch would be that way because similar pitches have gotten a lot of whiffs in the past, so yeah it’s like “no duh” that high Stuff+ means high SwStr%.
But I don’t think many people even understood the model that far - so this is still quite useful to see.
Let’s look at Location+ and Pitching+:
More of the same there. All of these metrics are highly correlated with SwStr%, with Stuff+ and Pitching+ being the strongest.
Correlation Coefficients for SwStr%
Stuff+: .98
Location+: .84
Pitching+: .94
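For anyone who wants to check numbers like these, here is a rough sketch of a Pearson correlation computed on binned data. I'm assuming the coefficients above come from comparing each rounded metric value with the whiff rate of its group; the bin values and whiff rates below are invented, not the real data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented bins: rounded Stuff+ value and the whiff rate of pitches in that bin.
stuff_bins = [80, 90, 100, 110, 120]
whiff_rates = [0.06, 0.08, 0.11, 0.13, 0.17]

r = pearson(stuff_bins, whiff_rates)
print(round(r, 2))
```

One thing to keep in mind: correlating group averages smooths out the pitch-to-pitch noise, so coefficients computed this way will usually come out much higher than they would on individual pitches.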
Zone% / Swing%
Location is in the models as well. The best Stuff+ pitches are not in the strike zone. I’m guessing they are just barely outside of it, since those would be the most effective pitches:
As for Location+, the worst pitches are not in the zone (obviously), and the best ones certainly are, so more of the same there:
Pitching+ looks almost exactly the same:
As for generating swings:
Little bit of wavy action going on here. I guess that just means that a very good pitch will still get a lot of swings because it will be in the strike zone, and in-zone pitches get more swings. But the best, most elite pitches won’t get as many swings because the hitter just knows there’s nothing he can do with them - or something like that. Again, I’m learning along with you here.
xwOBACON
Expected wOBA on Contact
Since we now know about swings and whiffs, we can just see what happens in the events that the ball is put into play.
Not surprising here, much in the same way as SwStr%. I’m guessing that quality of contact is also an input into the Stuff+ model, which would mean that if you just reverse it like we’re doing now, we were bound to see this. But yes, the better the Stuff+, the worse the quality of contact.
As for location:
Widens out a bit at the bottom. Probably because a lot of these horribly located pitches are way out of the strike zone, so if you put one in play it’s not going to go very well for you.
Same thing with Pitching+.
If you’re throwing a pitch with a 130+ Pitching+, it’s very unlikely to get hit hard.
Fantasy Points
This is probably the most important one. You’re reading this blog, which means you probably play fantasy baseball. If you play fantasy baseball, your goal is to score fantasy points. And yes, even if you’re in a roto league - obviously points scoring is highly correlated with roto success, so don’t give me that!
What I did here is look at every pitcher outing of the last two seasons (where the pitcher faced at least 15 batters) and compared each of the pitch modeling metrics with their FANTASY POINTS SCORED PER BATTER FACED. Doing it that way gives us a rate so the guys that went 8 innings don’t have a huge advantage over the guys who only went four or something.
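A quick sketch of that filter-and-rate step, in case it helps. The outing records and their field names (bf, fpts) are hypothetical; the only things taken from above are the 15-batter minimum and dividing fantasy points by batters faced.

```python
def fpts_per_bf(outings, min_bf=15):
    """Keep outings with at least min_bf batters faced and add a per-batter rate."""
    rated = []
    for outing in outings:
        if outing["bf"] < min_bf:
            continue  # too short an outing to say much
        rated.append({**outing, "fpts_per_bf": outing["fpts"] / outing["bf"]})
    return rated

# Made-up outings, for illustration only.
sample_outings = [
    {"pitcher": "A", "bf": 30, "fpts": 24.0},  # long outing
    {"pitcher": "B", "bf": 16, "fpts": 12.8},  # short outing, same rate
    {"pitcher": "C", "bf": 9, "fpts": 2.5},    # dropped: fewer than 15 batters
]

rated_outings = fpts_per_bf(sample_outings)
for outing in rated_outings:
    print(outing["pitcher"], round(outing["fpts_per_bf"], 2))
```

Pitchers A and B land on the same 0.8 points per batter even though A scored almost twice as many raw points, which is the whole reason for using a rate.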
Stuff+
Correlation: .29
Not a strong relationship here. Having high Stuff+ in an outing does not mean you’re going to perform well in the box score.
I think the relationship gets fuzzied by the fact that box score results are pretty random. When we’re talking about fantasy point scoring, a bloop double is just as good as a barreled double, so that variance comes into play here. What you can see is a slight general upward trend. You don’t see any dots in the bottom right or top left of the plot, meaning that yes, if you had to choose, you want the high Stuff+, but it just doesn’t guarantee anything - not at all.
Location+
Correlation: .17
Even less of a relationship here. In fact, I would say this relationship just doesn’t exist.
Pitching+
Correlation: .33
This is the strongest correlation of the bunch, but it’s still not highly predictive.
Let’s take an example.
If we look at each outing where the pitcher put up a 105+ Pitching+, the average FPts/BF was 0.72. This would result in 13 fantasy points (all of this is using the DraftKings scoring system, by the way) if you faced 18 batters (facing the lineup twice).
If we go down to a 95+ pitching+, the average is .54, which would result in 9.7 fantasy points, so that’s significant.
Doing it that way with Stuff+, it’s pretty similar. An average Stuff+ score of 120 results in 13.9 fantasy points over 18 batters faced, and if we drop it to 80 that goes down to 10.2.
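The arithmetic behind those projections is just the per-batter rate times 18 batters faced. Here it is spelled out, using only the numbers quoted above (the function name is mine):

```python
BATTERS_FACED = 18  # facing a nine-man lineup twice

def projected_points(fpts_per_bf, batters_faced=BATTERS_FACED):
    """Project fantasy points from a per-batter rate, rounded to one decimal."""
    return round(fpts_per_bf * batters_faced, 1)

high_pitching = projected_points(0.72)  # the 105+ Pitching+ outings
low_pitching = projected_points(0.54)   # the lower Pitching+ group
gap = round(high_pitching - low_pitching, 1)

print(high_pitching, low_pitching, gap)

# Going the other way for Stuff+: points over 18 batters back to a per-batter rate.
print(round(13.9 / BATTERS_FACED, 2), round(10.2 / BATTERS_FACED, 2))
```

So the gap between the two Pitching+ groups is a little over three fantasy points across two trips through the lineup, and the Stuff+ gap (13.9 versus 10.2) is in the same neighborhood.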
So there’s something here. Stuff+ and Pitching+ are both useful in this regard.
Now let’s compare all of this with my favorite pitching metric, strikeout-to-walk ratio (K:BB).
K:BB beats all of those with a correlation of .58.
That should be expected because a strikeout is actually a strikeout and a walk is actually a walk. Those two things directly raise or lower fantasy points, while the pitch modeling metrics only function to predict strikeouts or walks - so it’s another step removed so obviously the relationship will be weaker.
But how correlated is K:BB with the pitch modeling numbers?
Here are the correlations:
Stuff+: .57
Location+: .15
Pitching+: .66
Another win for Pitching+, but it’s relatively close.
One final thing to talk about is how sticky these metrics are. A statistic doesn’t do us any good if it doesn’t tell us something about the future, so if your Stuff+ one day is random and doesn’t say anything about what your Stuff+ will be next week, it’s useless to us.
That doesn’t turn out to be the case, of course. I took the dataset (data from 2022 to present day) and divided it in half and then measured each pitcher’s pitch modeling numbers in both samples to see how predictive the first number is of the second. Here are the correlation coefficients:
Stuff+: .94
Location+: .72
Pitching+: .79
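Here is one way a split-half check like this could be set up. To be clear, this is a sketch under my own assumptions, not the exact method used for the numbers above: I'm splitting each pitcher's outings into an earlier half and a later half by date, averaging the metric in each half, and then correlating the two sets of averages. The rows and field names are invented.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def split_half_means(rows, metric):
    """Return (first-half means, second-half means), one entry per pitcher."""
    by_pitcher = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["date"]):  # ISO dates sort correctly
        by_pitcher[row["pitcher"]].append(row[metric])
    first, second = [], []
    for values in by_pitcher.values():
        if len(values) < 2:
            continue  # need at least one outing in each half
        mid = len(values) // 2
        first.append(sum(values[:mid]) / mid)
        second.append(sum(values[mid:]) / (len(values) - mid))
    return first, second

# Made-up outings for three hypothetical pitchers.
dates = ["2022-04-01", "2022-06-01", "2023-04-01", "2023-06-01"]
stuff = {"A": [120, 118, 122, 120], "B": [100, 102, 98, 100], "C": [85, 87, 86, 84]}
rows = [
    {"pitcher": name, "date": f"{date}-{name}", "stuff_plus": value}
    for name, values in stuff.items()
    for date, value in zip(dates, values)
]

first_half, second_half = split_half_means(rows, "stuff_plus")
stickiness = pearson(first_half, second_half)
print(first_half, second_half, round(stickiness, 3))
```

The closer that last number is to 1, the more a pitcher's first-half mark tells you about his second-half mark.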
So Stuff+ is the most steady. It’s very unlikely for a pitcher to have good Stuff+ for a couple of starts and then have bad Stuff+ in the future. The other two are pretty sticky as well (anything above .60 or so will be pretty sticky, and it gets stickier and stickier as it approaches 1), but not to the same level as Stuff+.
So my conclusion, I think, is that Stuff+ and Pitching+ are about equally useful.
Pitching+ is a little bit better at predicting box score results, but it’s less predictive of its future self, and that brings Stuff+ closer to it since we can trust Stuff+ more. I hope that makes sense. We want pitchers with high marks in Stuff+ and Pitching+, and we should indeed prefer to not play hitters against pitchers with good marks in those categories.