Assorted Basketball Analytics Musings and Findings

bowiac

Caveat: I know nothing about what I speak
Lifetime Member
SoSH Member
Dec 18, 2003
12,945
New York, NY
Ever since I got a real job, I haven't had time to do much work in this field, but I thought a thread where people post/discuss assorted basketball statistics findings/topics would be of interest. The one thing I saw today which caught my eye was this by Jeremias Engelmann, the creator of ESPN's RPM stat:
 
 

 
Essentially, what he's doing is taking existing data about player quality and then breaking it down by how many consecutive minutes a player has played without leaving the game. What he finds is that players take a couple of minutes to "warm up", during which time they perform significantly worse than normal. Quality of play continues to improve until 8 minutes in offensively, and 6 minutes in defensively, at which point it starts to sag off.
 
By using existing player quality data, he's adjusting for the selection bias of which players are selected to play (i.e. the Chris Smiths of the world playing 2 minutes in garbage time aren't having an outsized impact here, since the model already knows that they stink).
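For anyone who wants to play with the idea, here is a minimal sketch of that kind of breakdown. It is not Engelmann's actual method or data: the `baseline` ratings and the `observations` list are made-up numbers, and the function name is my own. The point is only to show the shape of the calculation, which is to subtract what you'd expect from each player and then average what is left by consecutive minute on the court.

```python
from collections import defaultdict

# Hypothetical baseline ratings: net points per minute each player is expected to add.
baseline = {"starter": 0.10, "bench_guy": -0.15}

# Hypothetical observations: (player, consecutive minute on court, observed net points that minute).
observations = [
    ("starter", 1, -0.10), ("starter", 2, 0.05), ("starter", 7, 0.30),
    ("bench_guy", 1, -0.40), ("bench_guy", 2, -0.20), ("bench_guy", 7, 0.00),
]


def residual_by_stint_minute(observations, baseline):
    """Average (observed - expected) performance for each consecutive minute played."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for player, minute, observed in observations:
        # Subtracting the baseline is what removes the "who gets to play" bias.
        totals[minute] += observed - baseline[player]
        counts[minute] += 1
    return {minute: totals[minute] / counts[minute] for minute in sorted(totals)}


curve = residual_by_stint_minute(observations, baseline)
for minute, value in curve.items():
    print(f"minute {minute}: {value:+.3f} vs. expectation")
```

Because each player is compared to his own baseline, a bad player logging garbage time only drags the curve down if he is worse than his own usual level.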
 

Brickowski

Banned
Feb 15, 2011
3,755
These numbers just seem to validate what most good coaches know by the seat of their pants. The 7-8 minute mark of a quarter is when coaches start to filter in the bench players.

I would be interested to see these charts for stars who are used to playing 40+ minutes a game vs players who are not. It's a conditioning issue, isn't it?

I'd also love to see a chart like this on someone like Havlicek, who was bionic.
 

bowiac

Caveat: I know nothing about what I speak
Lifetime Member
SoSH Member
Dec 18, 2003
12,945
New York, NY
Devizier said:
My guess is they use the 5 man lineups posted in the box scores.
They're using box score data, and play-by-play data to see who was on the court at a given time. They are not tracking every game by watching it, no.
 

luckiestman

Son of the Harpy
SoSH Member
Jul 15, 2005
32,776
That is what I thought by looking at the csv files of pbp data
 
I don't know how good the stat is for a guy like Bradley or any individual really. Seems like a good team stat
 

bowiac

Caveat: I know nothing about what I speak
Lifetime Member
SoSH Member
Dec 18, 2003
12,945
New York, NY
luckiestman said:
That is what I thought by looking at the csv files of pbp data
 
I don't know how good the stat is for a guy like Bradley or any individual really. Seems like a good team stat
It's not an ideal individual defensive stat, no. My "go-to" stuff for defense tends to be the xRAPM/RPM stats, which are +/- based. I then check them against dRtg and a couple other similar box score stats (ASPM in particular) to see what they think. If they're in agreement, then I think it's a pretty good guess. If they're not, then it's a sign to look deeper. Bradley is one of the "look deeper" guys, as he actually graded out well by xRAPM this year, but I'm skeptical of his utility on the court for most teams.
 

wutang112878

Member
SoSH Member
Nov 5, 2007
6,066
Is there anywhere to download player stats?  Basically I'd like to set up a database that has all of the stats on basketball reference for each player, instead of having to look player by player
 

Devizier

Member
SoSH Member
Jul 3, 2000
19,566
Somewhere
wutang112878 said:
Is there anywhere to download player stats?  Basically I'd like to set up a database that has all of the stats on basketball reference for each player, instead of having to look player by player
 
Sort of. You can do a search on bb-ref and then click csv on the upper right of the table that pops up. They also have tab format (which transfers to excel better).
 

bowiac

Caveat: I know nothing about what I speak
Lifetime Member
SoSH Member
Dec 18, 2003
12,945
New York, NY
There are a couple places- what format are you looking for? Sorted by player, by year, etc...
 

wutang112878

Member
SoSH Member
Nov 5, 2007
6,066
Any easily readable format like excel or csv would be perfect.  As for what specifically.  If you go to Avery Bradley's basketball reference page they have a table for Totals, Per Game, Per 36, Per 100 POS, Advanced and Shooting.  I want that table for every player for say the last 20 years, so I can put it into a database to write a query to find things like 'there are 1,000 SGs who played for 10+ years and 50 improved their 2pt FG% and lifted their career average after their 4th year'.
 
I didn't even think to use the bbref player finder. The only issue is that to get 20 years of data I have to run each year, which has like 500 players, and you can only download 100 players at a time. I also need to download all 4 tables they have available (and they don't have shooting :( ), then do that 20 more times and hopefully not make a copy and paste error that corrupts a year.  I try to save all my menial work patience for my day job, so if I can avoid that tedious process that would be pretty awesome.
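Once the CSVs exist, the database half of this is not much work. Below is a rough sketch using only Python's standard library. Everything in it is an assumption for illustration: the column names (`player`, `pos`, `fg2`, `fg2a`), the table name, the two tiny fake seasons, and the simplification of comparing "before vs. after a cutoff season" rather than "before vs. after a player's 4th year".

```python
import csv
import io
import sqlite3

# Hypothetical per-season exports; in practice each string would be one downloaded CSV file.
season_csvs = {
    2001: "player,pos,fg2,fg2a\nGuard A,SG,200,450\nGuard B,SG,150,300\n",
    2002: "player,pos,fg2,fg2a\nGuard A,SG,260,500\nGuard B,SG,140,300\n",
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE player_seasons "
    "(player TEXT, season INTEGER, pos TEXT, fg2 INTEGER, fg2a INTEGER)"
)

# Load every season into one table, tagging each row with its season.
for season, text in season_csvs.items():
    for row in csv.DictReader(io.StringIO(text)):
        conn.execute(
            "INSERT INTO player_seasons VALUES (?, ?, ?, ?, ?)",
            (row["player"], season, row["pos"], int(row["fg2"]), int(row["fg2a"])),
        )

# Shooting guards whose 2P% after the cutoff season beat their 2P% through it.
CUTOFF = 2001
query = """
SELECT player
FROM player_seasons
WHERE pos = 'SG'
GROUP BY player
HAVING 1.0 * SUM(CASE WHEN season > :cutoff THEN fg2 ELSE 0 END)
           / SUM(CASE WHEN season > :cutoff THEN fg2a ELSE 0 END)
     > 1.0 * SUM(CASE WHEN season <= :cutoff THEN fg2 ELSE 0 END)
           / SUM(CASE WHEN season <= :cutoff THEN fg2a ELSE 0 END)
ORDER BY player
"""
improved = [name for (name,) in conn.execute(query, {"cutoff": CUTOFF})]
print(improved)
```

Loading from files instead of strings is a matter of swapping `io.StringIO(text)` for an opened file, which also removes the copy and paste step where a year could get corrupted.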
 

bowiac

Caveat: I know nothing about what I speak
Lifetime Member
SoSH Member
Dec 18, 2003
12,945
New York, NY
Would this work for what you're looking for? Just need to do 20 copy/pastes into an excel sheet, and you'll have 20 years of data? I can think of a couple slightly annoying ways to use that for what you want to do.
 
Database Basketball also has some good data.
 
Finally, I've had very good experiences just asking Sean Forman for custom datasets. He's super helpful. If you want a CSV of every free throw Dwight Howard has ever taken based on time left in the game, opponent, and score in order to run some "clutch" theories, he'll oblige, etc...
 

luckiestman

Son of the Harpy
SoSH Member
Jul 15, 2005
32,776
Anyone looking at game to game consistency over time to see what mean players revert to if at all?

What I mean by this is you have guys that get better but how do they do it. Let's think about 2 cases let's say you play bad games and good games; the average would be medium or you just always play medium.

Getting better could mean you eliminate bad games or that you just improve your medium game.

If you were scouting would you want a guy with high peaks and valleys and hope to eliminate the valleys or would you want a guy that you know what you're going to get and think it can get better over time.
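To make the two cases concrete, here is a small sketch with invented game scores (any single-number game rating would do). The two players have the same average but very different spreads, and the two helper functions, which are my own names, mimic the two ways of "getting better" described above.

```python
from statistics import mean, pstdev

# Hypothetical game ratings for two players with the same average.
streaky = [4, 22, 6, 20, 8]
steady = [12, 12, 11, 13, 12]


def drop_valleys(games, floor):
    """Improvement route 1: every game below `floor` is raised to `floor`."""
    return [max(g, floor) for g in games]


def shift_all(games, bump):
    """Improvement route 2: every game gets better by the same amount."""
    return [g + bump for g in games]


for name, games in (("streaky", streaky), ("steady", steady)):
    print(f"{name}: mean={mean(games):.1f}, game-to-game sd={pstdev(games):.2f}")

print("streaky without valleys:", mean(drop_valleys(streaky, 12)))
print("steady with a better medium game:", mean(shift_all(steady, 2)))
```

Tracking the standard deviation alongside the mean from season to season would show which route a given player actually took.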
 

bowiac

Caveat: I know nothing about what I speak
Lifetime Member
SoSH Member
Dec 18, 2003
12,945
New York, NY
The one thing I've seen on that was for a specific player (Durant), but I haven't seen a broader examination.
 
Adjusting for defensive quality is going to be the biggest issue with this sort of thing.
 

wutang112878

Member
SoSH Member
Nov 5, 2007
6,066
bowiac said:
Would this work for what you're looking for? Just need to do 20 copy/pastes into an excel sheet, and you'll have 20 years of data? I can think of a couple slightly annoying ways to use that for what you want to do.
 
Database Basketball also has some good data.
 
Finally, I've had very good experiences just asking Sean Forman for custom datasets. He's super helpful. If you want a CSV of every free throw Dwight Howard has ever taken based on time left in the game, opponent, and score in order to run some "clutch" theories, he'll oblige, etc...
 
 

bowiac

Caveat: I know nothing about what I speak
Lifetime Member
SoSH Member
Dec 18, 2003
12,945
New York, NY
In line with the request to refocus some of the broader statistical discussions taking place across threads,  I thought I'd post this link here. That is a paper about a variant of xRAPM called SPR, but is essentially the same thing. (It is a ridge regressed plus/minus stat that also incorporates box score data to further increase predictive accuracy). The authors of that paper show that SPR produces results that beat Las Vegas point spreads for out of sample data, no easy feat as you can guess.
 
While SPR isn't publicly available to my knowledge, it does not appear to differ materially from the RPM/xRAPM stats out there. (I don't see any mention of a height prior in SPR is all). Insofar as people are curious if these plus/minus stats are meaningful, besides the fact that NBA teams consider them extremely important, that's a pretty good sign that they're doing something right.
 

bowiac

Caveat: I know nothing about what I speak
Lifetime Member
SoSH Member
Dec 18, 2003
12,945
New York, NY
I'm responding to this post here, just to reduce clutter:
PedroKsBambino said:
To be clear, no one has suggested perfection is the goal, or the expectation. The concern expressed by two of us was specific, and likely material (though we can't be sure of course) if it hasn't been addressed. It may have been; you don't seem to know which is fine---I surely don't either.
A poster actually did suggest a lack of perfection as an issue ("That is a reasonable approach; however, it is far from a perfectly accurate one, too"), but given we all seem to be in agreement that it's not, I'm happy to move on.
 
What are your concerns specifically? I still honestly don't understand them. luckiestman's issues I can't really deal with, since he hasn't explained how it relates to something like this (he's posted interesting links, but I'm not smart enough to see how they relate here).
 
Your concerns I'm closer to grasping, but we seem to be talking past each other a bit, so maybe you can explain your concerns again?
 
1. Is your issue simply that performance is non-linear? (i.e. on good teams with plenty of outside shooting, Rajon Rondo is a +2 player, while on a bad team, he's neutral?). Is this what you were getting at with the "slow" player example? Phrased another way, "fit" is a real phenomenon, and players are not simply a collection of plusses and minuses.
 
If so, this is both obviously true and important, and no, xRAPM isn't doing much to deal with this issue. This isn't the sort of thing xRAPM can fudge - this would require a radical makeover of xRAPM. There are three parts of xRAPM that mildly (but really only mildly) deal with this. 1) They use box score data - insofar as a player is putting up good/bad numbers, but the rest of his team is totally concealing their effect, this will be corrected for slightly; 2) There is data about height and position included (which hints at "fit"); 3) The regression takes into account that teams with a big lead tend to ease up, and that teams way behind tend to catch up - insofar as certain players are only coming in during certain game situations, that will be captured.
 
But generally, no, "fit" is not accounted for. If the Mavericks assembled Patrick Beverley, Kyle Korver, Andre Iguodala, David West, and Joakim Noah, that would rate as one of the best starting 5s in the NBA (maybe the best). But potentially, they'd never get a shot off, and lose every game 60-55 or some other bizarrely low score.
 
In practice, this isn't such a big problem. Teams take fit into account, and to my knowledge, so does everyone using xRAPM as a player assessment technique. Furthermore, even in the abstract, it's hard to know how much "fit" matters when the talent is there. That starting 5 above might be great for all I know because the talent is so good - they might win every game 70-60 alternatively in other words.
 
Either way, as I posted above, the proof is in the pudding. xRAPM techniques yield extremely accurate forward looking results. If this "fit" issue were especially important, that would not be true. But no, "fit" is not dealt with.
 
2. You posted another issue about "we need to recognize that outliers (such as Faverani's bad-ness) will tend to get washed out to a larger degree than they probably should" - what did this mean? I thought you were referencing regression to the mean, but that seems not to have been the case. What did you mean?
 
What other questions do you have? I'm pretty familiar with just about everything in xRAPM (although I can't replicate it - I can only do RAPM, without the box score/height data), so I'm happy to help clear up some questions.
 

luckiestman

Son of the Harpy
SoSH Member
Jul 15, 2005
32,776
bowiac said:
In line with the request to refocus some of the broader statistical discussions taking place across threads,  I thought I'd post this link here. That is a paper about a variant of xRAPM called SPR, but is essentially the same thing. (It is a ridge regressed plus/minus stat that also incorporates box score data to further increase predictive accuracy). The authors of that paper show that SPR produces results that beat Las Vegas point spreads for out of sample data, no easy feat as you can guess.
 
While SPR isn't publicly available to my knowledge, it does not appear to differ materially from the RPM/xRAPM stats out there. (I don't see any mention of a height prior in SPR is all). Insofar as people are curious if these plus/minus stats are meaningful, besides the fact that NBA teams consider them extremely important, that's a pretty good sign that they're doing something right.
It is interesting. I will say that using metrics for team performance is less worrisome than trying to rate individual players. Why? because even if we are biased on the individuals, as long as we are biased in a consistent way, things should be ok. Sort of like survey data. people might lie on surveys but we can still get good info from those surveys because over time people probably lie the same way.
 
A way more important point: If I had a system that could beat vegas you can bet your ass I wouldn't post it online. 
 

luckiestman

Son of the Harpy
SoSH Member
Jul 15, 2005
32,776
I don't know xRAPM. Can you just write the basic version of the equation? 
 

bowiac

Caveat: I know nothing about what I speak
Lifetime Member
SoSH Member
Dec 18, 2003
12,945
New York, NY
luckiestman said:
I don't know xRAPM. Can you just write the basic version of the equation? 
It's much too complicated for that. For me to run RAPM I need to use excel & R, and even that takes a surprisingly long time. However, the basic idea is to run a regression on every lineup combination that exists.
 
So for instance,  Game 6 of Heat-Spurs opened with an 8-0 run by the Heat. That sequence gets coded as Duncan+Leonard+Parker+Green+Diaw - (LeBron+Bosh+Allen+Chalmers+Lewis) = -8. Then when a substitution happens (Manu comes in for Danny Green), we get a new equation that covers until Splitter comes in for Duncan, which reflects a 10-5 Heat run, so that gets coded as Duncan+Leonard+Parker+Ginobili+Diaw - (LeBron+Bosh+Allen+Chalmers+Lewis) = -5. Etc... Run that for the entire NBA season, and get something like 40,000 combinations over the course of the season, each reflecting a different set of 10 players and a different score result. Then run a linear regression solving for each player's value in the above equations, and you have APM.
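The coding step described above can be sketched in a few lines. This is only the bookkeeping, under my own assumptions: the function name `build_design` and the `team_a`/`team_b` labels are made up, and a real implementation would typically also account for how many possessions each stint covered, which is left out here. The two stints are the ones from the example.

```python
# Each stint: (team_a five, team_b five, team_a margin over the stint), from the example above.
stints = [
    (["Duncan", "Leonard", "Parker", "Green", "Diaw"],
     ["LeBron", "Bosh", "Allen", "Chalmers", "Lewis"], -8),
    (["Duncan", "Leonard", "Parker", "Ginobili", "Diaw"],
     ["LeBron", "Bosh", "Allen", "Chalmers", "Lewis"], -5),
]


def build_design(stints):
    """Turn stints into regression rows: +1 for team_a players, -1 for team_b, 0 otherwise."""
    players = sorted({p for team_a, team_b, _ in stints for p in team_a + team_b})
    index = {p: i for i, p in enumerate(players)}
    rows, margins = [], []
    for team_a, team_b, margin in stints:
        row = [0] * len(players)
        for p in team_a:
            row[index[p]] = 1
        for p in team_b:
            row[index[p]] = -1
        rows.append(row)
        margins.append(margin)
    return players, rows, margins


players, rows, margins = build_design(stints)
for row, margin in zip(rows, margins):
    on_court = {p: v for p, v in zip(players, row) if v != 0}
    print(on_court, "=", margin)
```

Each row is one of the equations from the post: one column per player in the league, ten nonzero entries per row, and the stint margin as the value being explained.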
 
APM isn't great. The linked paper looks at it, and it does not do so hot. Sebastian Telfair was the 5th best APM player this year. DeMarre Carroll was 6th. Metta World Peace was 12th. It's not total garbage however - just that very simple process does grade out the top 3 players as being LeBron, Paul, and Durant for instance. The key problem is co-linearity, where some guys almost always play together, and so you don't get a good dataset. Put another way, if Bradley is never on the court without Rondo, then APM can't tell them apart. They're really just one variable. This is rare of course because of injuries, fatigue, whatever, but it can lead to small samples for some players, which gives some weird results.
 
RAPM is Regularized APM. That is basically APM, except with a ridge regression. Basically, instead of the standard residuals you're left with when doing a regression, you subtract out a beta factor from each residual to incorporate essentially a Bayesian prior. RAPM uses a beta of the last year's data. This is basically just increasing the sample so as to decrease co-linearity. With 2 years of data, you have more lineup changes available, so not too many players are "bundled" like the Rondo/Bradley example above. RAPM has strong predictive accuracy - beating any other model other than xRAPM in predicting team results for instance.
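For the Rondo/Bradley problem specifically, a toy version shows what the regularization buys you. This is textbook ridge regression written out by hand, not the actual RAPM code: it minimizes the squared error plus `lam` times the squared distance of each player's rating from a `prior` (zero by default, or something like last year's rating). The data, the `lam` values, and the function names are all assumptions for illustration.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ridge(rows, y, lam, prior=None):
    """Minimize sum((y - X b)^2) + lam * sum((b - prior)^2) via the normal equations."""
    k = len(rows[0])
    prior = prior or [0.0] * k
    xtx = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) + lam * prior[i] for i in range(k)]
    return solve(xtx, xty)


# Columns are [Rondo, Bradley]; they are on the court together in every stint.
rows = [[1, 1], [1, 1], [1, 1]]
margins = [4, 6, 5]

no_prior = ridge(rows, margins, lam=1.0)
heavy = ridge(rows, margins, lam=10.0)
with_prior = ridge(rows, margins, lam=1.0, prior=[3.0, 0.0])
print("lam=1, no prior:  ", [round(v, 3) for v in no_prior])
print("lam=10, no prior: ", [round(v, 3) for v in heavy])
print("lam=1, with prior:", [round(v, 3) for v in with_prior])
```

With `lam=0` the system has no unique solution because the two columns are identical, which is the APM problem. With a penalty the credit gets split evenly and shrunk toward zero, and with a prior the two players are pulled apart toward their earlier ratings.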
 
xRAPM is Expected RAPM, which is what SPR is. This is basically just RAPM, except the data is further adjusted to reflect non-plus/minus things we know about the players, so as to further increase predictive accuracy. So if Rondo and Bradley only play together for 2 years, but Rondo shoots great, gets a ton of assists, while Bradley stinks it up in the box score, xRAPM can further differentiate between them. Other information used in xRAPM is height data (taller players are better defenders), as well as game-state (teams coast with big leads). These factors are thrown in there to the degree to which they increase predictive accuracy for out of sample data. Other models like SPR add other information, like salary data it seems.
 
In this way, xRAPM is barely even a plus minus stat. It's more of a giant stew of everything we know about a basketball player that can be quantified, and added together in the proportions that have the greatest predictive accuracy. Plus/minus is the base and it has a ~.85 correlation with RAPM, but that's just because the box score and other data doesn't seem to change things all that much. To compare, the best box score metric I know (ASPM) has a ~.45 correlation with RAPM, and it is explicitly designed with 1 goal - to mimic RAPM.
 

luckiestman

Son of the Harpy
SoSH Member
Jul 15, 2005
32,776
Thanks for the thorough breakdown, really. I have some objections (based on epistemology), but it seems fun. I still don't know what the error band is; I think you have quite a few estimates going on and then they are compounded. So if you're saying something like Joe's expected whatever is X, ok, what is the confidence interval on that? When we rank order players by this stat, when is one player statistically different from the other? 
 

wutang112878

Member
SoSH Member
Nov 5, 2007
6,066
Are there any projection/estimated stats you are comfortable with?  Because the alternative to not using these is to rely on a more subjective approach which probably has a higher error rate than this.
 

bowiac

Caveat: I know nothing about what I speak
Lifetime Member
SoSH Member
Dec 18, 2003
12,945
New York, NY
luckiestman said:
Thanks for the thorough breakdown, really. I have some objections (based on epistemology), but it seems fun. I still don't know what the error band is; I think you have quite a few estimates going on and then they are compounded. So if you're saying something like Joe's expected whatever is X, ok, what is the confidence interval on that? When we rank order players by this stat, when is one player statistically different from the other?
As near as I can tell, these concepts are not very helpful in the ridge regressed world. The regularization appears to undermine them:
 
The reason for this is that standard errors are not very meaningful for strongly biased estimates such as arise from penalized estimation methods. Penalized estimation is a procedure that reduces the variance of estimators by introducing substantial bias. The bias of each estimator is therefore a major component of its mean squared error, whereas its variance may contribute only a small part.
Furthermore, because of the projection involved through box score stats, height, and whatever else, the idea of standard errors becomes even stranger, although I suppose you could plausibly add them.
 
But as I said before, this is a bit over my head. I've taken a few stats classes, but I'm far from a statistician. Almost all my knowledge in this field is self-taught, and while I know a good deal about how these models are constructed, I can't explain the math behind ridge regression to you.
 

Grin&MartyBarret

Member
SoSH Member
Oct 2, 2007
4,932
East Village, NYC
This is as good a place as any for this, but I thought it would be a good idea for folks here to share their NBA analytics resources. I used to consider myself reasonably up-to-date on such things, but haven't kept up with it much in the last year or so. My go-to resources tend to be NBA.com/stats and basketball-reference.com, but there are obviously a ton of other alternatives out there. Where do other folks go?
 

wutang112878

Member
SoSH Member
Nov 5, 2007
6,066
ishmael said:
Here is one amazing chart for Antoine from 2002-2003 (his last full season in green):
 
This is stating the obvious, but it can't be stressed enough:  There is an amazing amount of blue on that chart, especially when you consider that he was taking 20 shots a game, 20 shots a game.  It's really not a surprise that one of Danny's first goals was to ship him out of town.
 

bowiac

Caveat: I know nothing about what I speak
Lifetime Member
SoSH Member
Dec 18, 2003
12,945
New York, NY
Grin&MartyBarret said:
This is as good a place as any for this, but I thought it would be a good idea for folks here to share their NBA analytics resources. I used to consider myself reasonably up-to-date on such things, but haven't kept up with it much in the last year or so. My go-to resources tend to be NBA.com/stats and basketball-reference.com, but there are obviously a ton of other alternatives out there. Where do other folks go?
I don't use 82games as much as I used to, but it's still a good resource for lineup data, and stuff like FG% at various points in the shot clock (which I think is super important).
 
After previously being familiar with adjusted plus/minus stats, I've recently dug into the nitty gritty of how exactly they're calculated, and have become much more comfortable with all the "adjustments" going on. The best sources of this data are ESPN's RPM site, which is really xRAPM, and the RAPM available from Shut Up and Jam.
 
Shut Up and Jam also has Estimated Impact, which is similar to ASPM. These are box score stats like win shares or PER, not plus/minus stats. However, while win shares, PER, and wins produced are "bottom up", in that they attempt to value the effect of every box score stat on winning and losing, Estimated Impact and ASPM are "top down" stats that try to replicate RAPM in a box score metric. The result is a box score stat that almost matches RAPM in predicting the impact of players changing teams. In my testing Estimated Impact is significantly better than ASPM actually, but the formula is not public, so I tend to use ASPM, as you can tell the exact weights used.
 
There is a large potpourri of stats available here, at Stats for the NBA, but this site is a bit cryptic at times and you need to consult the APBRMetrics forum to figure out what's going on.
 
I get my International, D-League, Preseason, and Summer League data at RealGM. College data I mostly get from Sports-reference.com. I highly recommend downloading the NCAA ASPM spreadsheet however if you want to parse college data across all players, as the Sports-reference format limits you to league specific setups.
 

Devizier

Member
SoSH Member
Jul 3, 2000
19,566
Somewhere
Deadspin/Regressing has some highlights of ESPN's (insider) stat guys talking to first and second-year players at summer league about how the numbers rate their games:
 
Here's a sample with Giannis Antetokounmpo:
 
 
 
Even something as simple as Giannis's ridiculous turnover rate—he "only" averages 1.6 per game, but plays 24.6 minutes and had a usage rate of 14.6, meaning that 1.6 is a TON—has some depth:
 
 
6: Turnover percentage: 19.4 percent, 10th highest in NBA.
BD: Last year, turnovers were an issue for you at times. This summer, it seems like you've improved your ballhandling. How much has that been a focus for you this summer?
GA: I'm focusing on decision-making. But coaches tell me to stay aggressive. If I screw up on offense, they tell me to stay aggressive. I can draw two guys on me, three guys. Coach tells me last night I had three guys on me and told me to find the open pass. I just try to throw a good pass.
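The "1.6 is a TON" point is easier to see once minutes are taken out of it. The per-36 number below uses the figures quoted above. The turnover percentage function uses the usual box score formula, 100 * TOV / (FGA + 0.44 * FTA + TOV), but the season totals fed into it are made up purely to show the call, since the real shot totals aren't in this thread.

```python
def per_36(per_game_value, minutes_per_game):
    """Scale a per-game number to a 36-minute workload."""
    return per_game_value * 36.0 / minutes_per_game


def turnover_pct(tov, fga, fta):
    """Turnovers per 100 plays, using the common 0.44 weight on free throw attempts."""
    return 100.0 * tov / (fga + 0.44 * fta + tov)


# Figures quoted above: 1.6 turnovers in 24.6 minutes per game.
giannis_tov_per_36 = per_36(1.6, 24.6)
print(f"turnovers per 36 minutes: {giannis_tov_per_36:.2f}")

# Hypothetical totals, NOT real data, just to show the calculation.
example_pct = turnover_pct(tov=50, fga=200, fta=100)
print(f"example turnover percentage: {example_pct:.1f}")
```

The denominator only counts plays the player himself used, which is why a low usage player can post a modest per-game turnover number and still have one of the highest turnover percentages in the league.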
 

bowiac

Caveat: I know nothing about what I speak
Lifetime Member
SoSH Member
Dec 18, 2003
12,945
New York, NY
This is a kind of exciting development for those who are looking for an easily accessible "all-in-one" stat better than WS/48. For those too lazy to read the link, basketball reference is now providing a box score version of the adjusted plus/minus metrics I described in post #20. It's been renamed BPM, because names.
 
There's not too much reason to use BPM for current players/current seasons, apart from ease of use. It's a good stat, but its ceiling, by definition, is RAPM/RPM, which are both available freely on a variety of sites. Where BPM is useful however is to compare to former players and former seasons. The play-by-play data needed to create adjusted plus-minus stats isn't readily accessible before 1999, but since BPM is box score based, we can get decent comparisons of MJ to LeBron to Kobe:
 
https://twitter.com/bbstats/status/527528828084961280
 
This is mostly a win for "ease of use" type stuff, as while this stat and its ilk have been floating out there (I have my own version which I created, there are many others), the convenience of being able to get it at a glance at Basketball-reference is a big win.
 

luckiestman

Son of the Harpy
SoSH Member
Jul 15, 2005
32,776
Well on its face that chart seems about right to me. LeBron slightly better than MJ and both better than Mamba.