Number crunching and its uses

bankshot1

Member
SoSH Member
Feb 12, 2003
24,798
where I was last at
I'm 63 and grew up when BA-HR-RBI, W/L, Ks, BBs, and ERA were the royalty of baseball stats. We generally knew who was a good player/pitcher by looking at the back of a baseball card. I think that's generally true today, but I have my doubts. How many other levels of analysis are out there, are they relevant, and do they really add that much more to the understanding of the game?

I'm a #s guy. I like the innovative use of statistics, professionally and in other pursuits.

I read my first Bill James book in the late 80s and was hooked, loved Moneyball in '03, and strongly suggested it be required reading for junior analysts I was managing. I'm pretty open minded about ways to analyze things.

But there is only so much time (now) I want to spend "relearning" so as to better relate to baseball. I'm all in for OBP, WHIP, WAR, etc., and I read some of the more quant threads here to get a clue. But at times my eyes quit.
 
Just the way it is.
 
In any case, I saw this today in Cafardo's Notes column:
 
“The average distance of the balls hit into the air last season against Jon Lester was 238 feet, the shortest of any pitcher in 2014.”
 
and four things leapt to mind:

1) kids at MIT have too much free time on their hands
2) they should be getting laid more often
3) I should be getting laid more often
4) Will Joe Maddon adopt a "Lester shift"?
 
And the last observation is serious.
 
So is the 3rd.
 
And if he doesn't play his OF in more when Lester pitches, shouldn't he?
 
So feel free to dump on the old guy, or tell me about useful statistical analysis that won't melt my eyeballs, or whatever.
 
 
 
 
 

Al Zarilla

Member
SoSH Member
Dec 8, 2005
59,307
San Andreas Fault
bankshot1 said:
...in any case I saw this today in Cafardo's Notes column:

“The average distance of the balls hit into the air last season against Jon Lester was 238 feet, the shortest of any pitcher in 2014.”

And if he doesn't play his OF in more when Lester pitches, shouldn't he?
 
Maddon would first have to look at Lester's line drive percentage, wouldn't he? Or is Cafardo not including line drives as balls hit into the air? And what's the cutoff between a line drive and a hard hit fly ball anyway?
 

Snodgrass'Muff

oppresses WARmongers
SoSH Member
Mar 11, 2008
27,644
Roanoke, VA
Al Zarilla said:
Maddon would first have to look at Lester's line drive percentage, wouldn't he? Or is Cafardo not including line drives as balls hit into the air? And what's the cutoff between a line drive and a hard hit fly ball anyway?
 
From Fangraphs:
 


Our batted ball data goes back to 2002, but it’s important to remember that there is no perfect way to define each type of batted ball, so some balls that you might consider fly balls might get classified as line drives and vice versa. In reality, batted balls exist on a continuous distribution from rolling perfectly on the ground to being launched straight up in the air. The cut points between the three classifications are somewhat arbitrary and imprecise, so do not treat the data as infallible.
 
http://www.fangraphs.com/library/offense/batted-ball/
 
So it's a bit subjective. I imagine teams have more precise criteria involving the speed off the bat, trajectory and distance the ball travels in the air. That kind of data is tough to get to, though.
 

Al Zarilla

Member
SoSH Member
Dec 8, 2005
59,307
San Andreas Fault
Snodgrass'Muff said:
 
From Fangraphs:
 
 
 
 
http://www.fangraphs.com/library/offense/batted-ball/
 
So it's a bit subjective. I imagine teams have more precise criteria involving the speed off the bat, trajectory and distance the ball travels in the air. That kind of data is tough to get to, though.
Thanks. I could have looked that up. Lazy day Sunday here. I Googled to try to find Cafardo's column and this thread came up second. Google is fast and all, or extremely inclusive.
 
https://www.google.com/search?q=Carfado%27s+Notes+column&rlz=1C1CHMO_enUS569US569&oq=Carfado%27s+Notes+column&aqs=chrome..69i57.731j0j4&sourceid=chrome&es_sm=122&ie=UTF-8 
 

Rice4HOF

Member
SoSH Member
Jan 21, 2002
1,900
Calgary, Canada
bankshot1 said:
...in any case I saw this today in Cafardo's Notes column:
 
“The average distance of the balls hit into the air last season against Jon Lester was 238 feet, the shortest of any pitcher in 2014.”
 
And if he doesn't play his OF in more when Lester pitches, shouldn't he?
 
My answer isn't baseball related and doesn't help answer your questions, but just a general statistics warning:  ALWAYS be careful when using "averages".  An average distance of 238 could mean that most balls were hit between 237 and 239 feet.  OR it could mean that about half were hit around 120 feet and the other half around 360 feet.  Without other numbers (distribution type, standard deviation, etc.), that number by itself is close to useless for decision-making purposes.  Something like a spray chart showing where all the balls were hit would be much more useful for defensive shifts.
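
To make the point above concrete, here is a minimal Python sketch. The distances are invented for illustration (they are not Lester's actual batted-ball data), and the "split" sample is nudged slightly from the 120/360 example so that both samples average exactly 238 feet.

```python
from statistics import mean, pstdev

# Made-up fly-ball distances in feet (NOT real data), chosen so that
# both samples have the same average of 238.
clustered = [237, 238, 239, 238]
split = [116, 360, 116, 360]


def summarize(distances):
    """Return (average, population standard deviation) for a sample."""
    return mean(distances), pstdev(distances)


for label, sample in (("clustered", clustered), ("split", split)):
    avg, spread = summarize(sample)
    print(f"{label}: mean={avg:.1f} ft, std dev={spread:.1f} ft")
```

Both samples report the same mean, but the spread differs by two orders of magnitude, which is exactly the information an outfield-positioning decision would need.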
 

ivanvamp

captain obvious
Jul 18, 2005
6,104
If you wanted a quick and easy (i.e., simple to understand) set of numbers to look at and learn something useful about the quality of, say, a hitter, what would the best set of three be?
 
For me, the three pieces of information I want to know are:  (1) How good is he at getting on base?  (2) Is he a singles, gap, or power hitter?  And (3) Is he a run producer?  I admit that last piece may be a bit antiquated, but so be it.
 
- OBP: percentage of the time this batter gets on base, the most fundamental aspect of hitting, I think.  I know if a guy has an OBP of .311, it's not very good, but if it's .391, it's terrific.

- Runs Produced: runs + RBI - HR.  You score 90, drive in 85, and have 20 homers, that's 155 runs produced, which is pretty solid.  200 is tremendous.  60 is not so good.  I know this stat is dependent on your teammates, but baseball is a team sport after all, and while so much can be broken down individually, the reality is that your teammates DO affect what you do.  For example, you have a slow runner in front of you.  You hit a shot down the left field line.  Your slow teammate takes second, but because he's slow, he is held there, limiting you to a single.  If he weren't on base, that would have been an easy double for you.  So I'm ok with a teammate-dependent stat being part of a player's evaluation (like an assist in basketball).  
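
The Runs Produced arithmetic above can be sketched as a one-line Python function. The function and parameter names are just illustrative choices.

```python
def runs_produced(runs, rbi, home_runs):
    """Runs Produced = runs scored + RBI - home runs.

    A home run credits the batter with both a run and an RBI for the
    same trip around the bases, so it is subtracted once to avoid
    counting that run twice.
    """
    return runs + rbi - home_runs


# The example from the post: 90 runs, 85 RBI, 20 homers.
print(runs_produced(90, 85, 20))
```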
 
The last one is a tricky one for me.  Part of me wants to go with slugging percentage, but that really doesn't tell me if a guy hits a lot of homers or a lot of doubles.  And it's hard for me to know off the top of my head what a really good SLG is.  I also like ops+ because it takes into account the era in which a player played, and his home park effects. But ops is kind of a strange stat; it's not at all clear to me why obp and slg should be weighted equally.  Why isn't it O(2)PS?  It's easy to look at and tell who is good at this stat, but it's not clear to me why that stat is a good stat to use.  Any of the RC or WAR stats are formulas that most fans cannot possibly wrap their heads around.  So I'm simply going to go with home runs.  Power numbers are still a huge part of the game, and a player's HR number, combined with these other two stats, is pretty helpful for me, anyway.
 
So a player with these numbers:  .367 OBP, 31 HR, 177 RP is really good at getting on base, has terrific power, and is a great run producer. 
 
A player with these numbers:  .315 OBP, 7 HR, 59 RP isn't really good at any of those things.
 
A player with these numbers:  .381 OBP, 11 HR, 150 RP is very good at getting on base, doesn't hit for much power, but is good at producing runs, and I would wager based on the power number that most of the runs produced are in the runs category more than RBI.
 
A player with these numbers:  .320 OBP, 29 HR, 131 RP isn't great at getting on base, hits with very good power, and is a fairly good run producer, especially given his low OBP.
 
I know this is a very incomplete list, and there will, of course, be players who don't fit neat little categories.  But if I'm devising a set of back-of-the-baseball-card stats, these three would be on there for me.
 

rotundlio

Member
SoSH Member
Jul 8, 2014
323
I've been going full bore on a fantasy league the past few weeks. I mocked up a cursor with minuscule baseball laces, outfitted the home page with CSS and a flash game, Photoshopped this when we got Moncada. I've got two spreadsheets' worth of personal projections and notes as compared to FantasyPros and ESPN rankings. There's a sticky note on my monitor displaying league average rates for o-swing, whiffs, first pitch strikes, etc. My opinion is still worthless, but I feel almost as though it's my turn to bat here.
 
Ascertaining the tangible value of a quarterback in football is impossible. Take Ryan Tannehill. Fifth in completion percentage, fifth in success rate. 14th in QBR, though; 18th in expected points added per play; 28th in yards per attempt. His receiving corps (tenth-dropsiest) and run game (second in yards per carry) must be taken into account. Moreover, it took him months to fully acquaint himself with Bill Lazor's schematic. From Week 7 onward Tannehill's QB rating was among the best in football. To top it off, PFF graded the Dolphins O-line as being worst in the league for a second straight year. Tony Romo will attest, a good line is paramount.
 
So is Tannehill top-ten? Who the hell knows? Football is so inherently complex that we might as well judge quarterbacks by wins and rings.
 
(As an aside, I think now's the worst time to be investing heavily in middling QBs. I'm anticipating an influx of real-life J.D. McCoys real soon who will all have been running some variant of the spread offense since they were six. I expect the floor for QB play to rise like it has for relief pitching in baseball. They'd be running back fungible!)
 
Six-hour NASCAR races complete with pit stops are determined by fractions of a second. If you dive to the middle line and nobody follows you there, you'll plummet like a stone. And if a teenager eight rows ahead clips another driver, your day and the days of ten others will be ruined through no fault of your own. It's impossible to win in NASCAR without a healthy dosage of luck, almost comically so.
 
Baseball isn't like these games. Baseball is intensely individual. Pitchers pitch, batters bat, repeat ad infinitum, and it's super awesome. It lends itself to statistical scrutiny like no other. National elections, binary as they are, can be extremely easy to prognosticate. Baseball is much further toward this side of the spectrum. Statistical analysis is my spirit animal. I wouldn't love the sport half as much without it.
 
Baseball analysts can control for every variable in precisely the way football analysts cannot. If Pedroia doubles with none on and two down, historically that would have been worth about a fifth of a run. One out, runners on second and third, home team trails by a run in their half of the seventh. A strikeout, and their win probability (derived the same way, from mathematical permutations and thousands of previous outcomes) would fall from 56% to 40%. We know this because baseball and rigorous analytics of said baseball are two sides to the same coin. Baseball is math, more readily so than the other team sports.
 
So I put a lot of stock into weighted runs created plus (wRC+) and skill-interactive earned run average (SIERA). They're the two best freely available metrics that consider component parts under this microscope. There were 8,137 doubles in 2014 worth on average this much. So Pedey's, which bounced over the right field wall in this imagining, would have this concurrent impact on wRC+. We've discerned that pitchers cannot control certain things: batting average on balls in play for one, homerun-to-flyball ratio for two. So SIERA approaches that which they can influence by case (K rate, walk rate, groundball-to-flyball ratio) and spits out an exacting number scaled to match league average ERA. SIERA's a much better predictor of future ERA than is present-day ERA itself! Park and league adjustments are applied to each.
 
Incidentally, here are the scoring settings for that league I mentioned.
 
In lieu of the hyperkinetic NASA stats over which Cherington pores in my imagined Sox front office—batted ball velocity, infielder range factor in meters, flyball trajectory, bat-to-ball proximity on sliders while behind in the count—these are my best bet. I look at hitters' wRC+ and pitchers' SIERAs and try to divine from context which direction they're likely to head. If a hitter sees an inordinate number of first pitch strikes, for instance, that would explain a low walk rate. Because sabermetrics excel in another facet: controlling for luck.
 
If pitchers have no dominion over BABIP and HR/FB%, and in a huge majority of cases they won't, then it stands to reason that any pitcher with a worse BABIP than league average experienced some bad luck. Strand rate, the percentage of baserunners allowed who fail to come around and score, is similar. It's been shown to hover within a few ticks of 74%. If a pitcher fails to strand three quarters of baserunners, he's unlucky and his ERA is substantially inflated.
 
Infield hit percentage variance can be impactful. Giancarlo Stanton muscled his way to a .288 average this season, partly on the strength of 13 infield hits. Somehow Bigfoot legged out a proportionally higher percentage of grounders than did Kinsler, Adam Jones, or Ben Revere. Control for his career figure of 6.2% or so (I've since cleared the calculator) and his average reads ".280."
 
I've even devised three of these things on my own which I hadn't seen elsewhere, all for pitchers. The first is leadoff OBP versus OBP overall. Leadoff hitters don't always, but typically will carry the most sway in the course of an inning. (Hearkening back to Pedey's double: if it had occurred with none out, it would have been worth .6 runs.) If you perform worse than usual against them, you're probably experiencing some degree of bad luck with sequencing. The second is home start percentage. One in ten pitchers making 35 starts will have 22 or more take place on the road. This is significant because pitchers across the league will perform appreciably worse during road games. The third is RBI/HR. I don't know the league average figure, but I'm certain RBI/HR would be useful if I did. Carlos Carrasco allowed seven home runs last season, five solo, two two-run.
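
The "one in ten" figure for road-heavy schedules above can be sanity-checked with a binomial tail probability. This is only a sketch under a simplifying assumption that is not in the post: each of the 35 starts is treated as an independent coin flip between home and road, which real schedules and rotations do not strictly follow. The function name is illustrative.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Chance that 22 or more of 35 starts land on the road,
# ASSUMING each start is an independent 50/50 home/road draw.
p_road_heavy = prob_at_least(22, 35)
print(f"{p_road_heavy:.3f}")
```

Under that assumption the tail probability comes out a little under 9%, which is in line with the rough "one in ten" claim.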
 
So you see, advanced stats are most definitely illuminating.
 
Case in point: Johnny Cueto. He sucks and the Reds are idiots for not having traded him. Motherfucker fell into their lap and they dealt Latos instead. Cueto posted a 2.25 ERA last season and the conventional wisdom even round these parts is that he's a bona fide ace. He is not. He posted the second-highest strand rate in baseball and the lowest BABIP. His SIERA, for its part, is a full run higher—an outlier career best, 3.15. But the SIERA is heavily influenced by a top-ten K rate, easily his best ever, and an above-average walk rate—which are ludicrously unsustainable for a guy who throws the sixth-fewest strikes of anyone, with average chase and whiff rates. Cueto isn't great(o). I want him least. We have enough number three types. If Cueto's markedly above average next season I won't post on this site again
 
 
Edit:  Pitch framing metrics are new and momentous. This is what 97 looks like, and turning balls into strikes is an immensely valuable trait. Baseball Prospectus posits that Christian Vazquez is worth five wins (48 runs) above average over the course of a 10,000 frame season.
 

kieckeredinthehead

Member
SoSH Member
Jun 26, 2006
8,635
As with most stats, the most important question is whether it's descriptive or predictive. That's why RBI falls down as a stat. It tells you what happened, but it doesn't do a great job predicting how many runs the player will drive in the following year. That average distance stat would have me worried if I were Theo because it could very well have been a fluke.
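
One common way to probe the descriptive-versus-predictive question is to correlate a stat in one season with the same players' values the next season. The sketch below uses invented numbers purely to show the mechanics; the two series are hypothetical and say nothing about any real stat.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical values for the same four players in consecutive seasons.
stable_year1, stable_year2 = [0.310, 0.340, 0.360, 0.390], [0.315, 0.335, 0.365, 0.380]
noisy_year1, noisy_year2 = [60, 85, 100, 110], [90, 70, 105, 80]

print(f"stable stat: r={pearson(stable_year1, stable_year2):.2f}")
print(f"noisy stat:  r={pearson(noisy_year1, noisy_year2):.2f}")
```

A stat whose year-to-year correlation is near 1 carries forward; one near 0 mostly describes what already happened.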
 

charlieoscar

Member
Sep 28, 2014
1,339
Re: rotundlio's comment on Stanton's 13 infield hits "inflating" his batting average. Actually, I have him with 14 using Retrosheet data, 8 of which were handled by the third baseman.

While using the advanced stats for present-day baseball helps one make better assessments of players, a major problem is that many of them can't be applied historically, as full play-by-play data isn't available for almost all seasons prior to 1974, pitch-by-pitch data only goes back about 20 years, and PITCHf/x data didn't come into full play until 2008. What has been developed to measure fielding in recent years is far better than the once-standard fielding average, but until the new StatCast is up and running, it still is not a strong estimator.
 
The Hidden Game of Baseball by John Thorn and Pete Palmer, ca. 1985, is a history of baseball stats and introduces Palmer's Linear Weights formula but it also ranks various stats on their predictive value. It's dated but well worth reading if you can find a copy.
 

nvalvo

Member
SoSH Member
Jul 16, 2005
21,680
Rogers Park
Al Zarilla said:
Thanks. I could have looked that up. Lazy day Sunday here. I Googled to try to find Cafardo's column and this thread came up second. Google is fast and all, or extremely inclusive.
 
https://www.google.com/search?q=Carfado%27s+Notes+column&rlz=1C1CHMO_enUS569US569&oq=Carfado%27s+Notes+column&aqs=chrome..69i57.731j0j4&sourceid=chrome&es_sm=122&ie=UTF-8
 
Even more interestingly, I've heard it suggested that confirmation bias may make those "fliners" that fall as hits more likely to be described as line drives, and those caught more likely to be categorized as fly balls by whichever STATS, Inc. interns put those numbers together.  
 
If that's true, it's a serious source of error for xBABIP and similar stats. 
 

alannathan

Well-Known Member
Lifetime Member
SoSH Member
Jan 10, 2007
216
Champaign, IL
If we had access to batted ball data, either from HITf/x or from TrackMan (the latter being part of the new StatCast system), we could come up with precise definitions of line drive, fly ball, popup, etc. based on the combination of batted ball speed and vertical launch angle.  Unfortunately, we don't have access to such data.
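
As a rough illustration of what such a definition might look like, here is a Python sketch that classifies a batted ball by vertical launch angle alone. The cut points are hypothetical placeholders, not values from HITf/x, TrackMan, or any published standard, and batted ball speed is left out for brevity even though a real definition would presumably use it as well.

```python
def classify_batted_ball(launch_angle_deg, ground_max=10.0, liner_max=25.0, fly_max=50.0):
    """Label a batted ball from its vertical launch angle in degrees.

    The default thresholds are HYPOTHETICAL and exist only to show the
    idea of explicit, repeatable cut points.
    """
    if launch_angle_deg < ground_max:
        return "ground ball"
    if launch_angle_deg < liner_max:
        return "line drive"
    if launch_angle_deg < fly_max:
        return "fly ball"
    return "popup"


for angle in (5, 15, 30, 60):
    print(angle, classify_batted_ball(angle))
```

Making the thresholds parameters keeps the arbitrary part of the definition visible, so anyone can see how sensitive the line drive and fly ball counts are to where the lines are drawn.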