Introducing...[Name TBD] - NBA Box Score Projection Project

bowiac

I've been living a lie.
Lifetime Member
SoSH Member
Dec 18, 2003
12,633
New York, NY
As I've alluded to elsewhere, and as I've referenced on Twitter in this thread, I've been working on a project to generate NBA box score projections. This is by orders of magnitude the most significant private NBA project I've worked on. As the season is upon us, I wanted to post here to let people know about what these are, a bit about how they work, and their strengths/weaknesses.

Essentially, I have attempted to create something resembling Steamer/PECOTA/ZiPS for the NBA. This was driven by the fact that similar projection models don't seem to exist in the public domain for the NBA, despite the existence of fantasy basketball (while there are products like Basketball Monster, those are "hand curated" projections). Some features of the model:

  • Daily Updating and Recency Weighting. This is really the core advance of the model. In addition to creating a preseason projection for each player (e.g., something which tells you that Kawhi Leonard is projected for a 45.7% FG% this year), the model I've built constantly updates in response to new information, weighting more recent performance more heavily. This applies both to new games this year and to games from prior seasons, dealing with the issue of "Player X struggled overall last year, but closed the year extremely strong". In principle, that closing-the-year-strong is accounted for in this model. I do this without any arbitrary endpoints of looking at the last X games. Every game a player has ever played is part of their projection. This lets the model spot "breakouts" and declines, and update projections to account for new data on a day-by-day level.
  • Seasonality. Offensive efficiency to start the season is extremely low, and increases throughout the year. This applies to essentially every element of offense. Turnovers are high. Shooting percentages are low. Assist rates are low. Here's a chart of 2pt FG% over the last 19 years, with the black dots being observations, and the blue line being the trend. As you can see, within each season, the trend is for 2pt FG% to rise:
[Chart: 2pt FG% over the last 19 years; black dots are observations, the blue line is the trend]

The model accounts for such seasonality, and likewise accounts for the small dip you see there in 2pt FG% about two thirds of the way through the year (which coincides with the all-star break). This is done on a component by component level, with different effects modeled for each component.
  • Rest/travel/home court effects. This is mostly what it sounds like, and the model accounts for these effects, again on a component by component level. On a technical level, I have gone to great lengths not to overfit these effects, and the projections include projected rest/travel/home court effects, themselves accounting for recency, and changes in the league environment (home court for instance is decreasing overall).
  • Opponent Adjustments. This is similar to the above - I have adjusted these projections for who each team is playing on a given night, accounting for the projected defensive strength of that team's influence on every individual component. This aspect is done at a team-level only for now, although I will move this to a position-level adjustment in time (e.g., the Celtics defend opposing guards with different effectiveness than opposing bigs).
  • Tracking data. Similar to 538's debut of RAPTOR, these projections account for some of the next-gen tracking data available from stats.nba.com. That means that something like a player's average speed, or their hustle stats affect their projections. This data mostly does not have a big impact on box-score projections, although I am continuing to do further work here. This is a major aspect of future improvement to the model.
  • Interaction Effects. The model accounts for the interaction between various box-score components, so if a player improves their free throw shooting, their projected three point shooting will likewise improve. The same goes for interactions between other stats.
  • Free Agency. The model regresses players to the mean when they change teams. Changing teams has a big impact on some box-score components, and the model accounts for that. It also gets less confident in a player's overall projection in response to changing teams - meaning the model will update more quickly in response to new data after a player has changed teams.
  • Preseason and Summer League Data (Coming Soon). This is actually not yet in the model, but I wanted to call it out as I may yet be able to get it in there in a week or so. Right now, every rookie has the same base projection (with slight age differences). In order to address that, I'm going to be adding preseason and summer league data to their projections. This data will impact different players differently, and should help address the issue of Zion having roughly the same projection as Romeo right now. As this is not yet included however, this is currently a major weakness of the model (all rookies have functionally a replacement level projection). I may in time also add NCAA and foreign data, but that's a massive undertaking and is a 2020-2021 project at best.
  • Minutes. I project minutes per game for each player, in a similar approach to everything else. However, these projections predictably end up being pretty rough, as it's very difficult to project PT just from past data. These projections are provided, but there is no "hand adjustment", so make of this what you will. Feel free to update the minutes projections as you like and adjust the stats up/down as necessary. For example, right now I have Anthony Davis projected to play only 28.3 minutes a game, as he changed teams, and barely played down the stretch last year. As he plays more minutes, the model will rapidly update him to the mid 30s of course, but I wanted to call out this caveat. I've put in a lot of work on the minutes, but I'm pretty sure someone doing this by hand could improve on them.
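For anyone who does want to override the minutes by hand, here is a minimal sketch of the "adjust the stats up/down" step. It is not the model's own logic: it assumes per-minute production stays constant when minutes change, the stat names are hypothetical, and every number other than the 28.3 minutes is made up for illustration. Rate stats such as FG% are left alone.

```python
# Hypothetical stat names: counting stats scale with minutes, rate stats do not.
COUNTING_STATS = ("pts", "reb", "ast", "stl", "blk", "tov")


def rescale_for_minutes(projection, new_minutes):
    """Return a copy of `projection` with counting stats scaled to `new_minutes`.

    Assumes constant per-minute production, which is a simplification.
    """
    old_minutes = projection["min"]
    if old_minutes <= 0:
        raise ValueError("projected minutes must be positive to rescale")
    factor = new_minutes / old_minutes
    adjusted = dict(projection)
    adjusted["min"] = new_minutes
    for stat in COUNTING_STATS:
        if stat in adjusted:
            adjusted[stat] = projection[stat] * factor
    return adjusted


# Illustrative line only: 28.3 minutes comes from the post, the rest is invented.
example = {"min": 28.3, "pts": 20.0, "reb": 8.0, "fg_pct": 0.51}
bumped = rescale_for_minutes(example, 35.0)
```

In practice a player's per-minute rates may shift with a bigger role, so treat the scaled line as a rough starting point.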
Now for the results. I have tested these projections against three years of daily projections from Basketball Monster, and two years of data from another leading DFS site (which provided me with data for benchmarking free of charge, so I'd rather not call them out here). The model beats both sites in every stat, some by a substantial margin (shooting percentages in particular). That includes the minutes projections, which as noted, have zero hand adjustments, and could likely be improved with a hand adjustment as well. The model does especially well relative to those two sites on low-minute projected players (which makes sense, as those sites don't spend much time on them), but also beats them on players projected for more minutes.

I think that describes the bulk of what I've done here. A cut of the projections for opening night games (literally just 4 games) can be found here: https://docs.google.com/spreadsheets/d/1mhwOLqPu2F9026EQiVxFPIN1t9RGafGpl-dokaIsm9c/edit#gid=0


This will be expanded, hopefully by tomorrow, to include:
  • Base level projections for each player.
  • Daily projections for each player for each day of the season.
I will be updating all of those projections daily, hopefully by around 9am (it depends how long it takes to scrape the various inputs, and then it takes an hour to refit the model daily).

I am also working to generate team-win projections using this data, which I know some of you had asked for. It's coming I promise, but it may end up being a Tuesday afternoon thing unfortunately.

This project is very much a work-in-progress, and will be updated throughout the season, so you may see big changes in some players' projections in response to model changes, as opposed to new data. I'd bet you'll also see some bugs, and obvious errors, and I'm working hard to correct those, but given the scope of this project, it's inevitable you'll find some of them. Please feel free to DM me here, or on Twitter to alert me to any errors. I may at some point build out a better front-end for this than Google Sheets, but I have zero expertise in something like that, so it'd be a ways away.

Enjoy!
 

bowiac

I've been living a lie.
Lifetime Member
SoSH Member
Dec 18, 2003
12,633
New York, NY
Oh, and I should add, I am very actively soliciting naming ideas for this, as every sports projection system needs to have a witty/dumb backronym around it.
 

DJnVa

Dorito Dawg
SoSH Member
Dec 16, 2010
39,705
Oh, and I should add, I am very actively soliciting naming ideas for this, as every sports projection system needs to have a witty/dumb backronym around it.
This is very cool. I may throw in a DraftKings lineup for opening night using it.

As far as names--who's your favorite role player? Tacko? Carsen?
 

bowiac

I've been living a lie.
Lifetime Member
SoSH Member
Dec 18, 2003
12,633
New York, NY
This is very cool. I may throw in a DraftKings lineup for opening night using it.
Yeah - this wasn't intended to be a DFS product really, but rather something more akin to season-long projections, but due to some design decisions, it accidentally ended up taking on a distinctly DFS-focused flavor. I have no DFS experience, so I haven't done any true testing for how the projections would fare there.

For DFS purposes, especially for the first game, I would recommend playing with the minutes projections to account for the fact that a bunch of guys will play more/less than I project above. It's just very hard for an algorithm to do that accurately, especially with 40% of the league having changed teams.
 

bowiac

I've been living a lie.
Lifetime Member
SoSH Member
Dec 18, 2003
12,633
New York, NY
Not a big Tacko guy, so not gonna try to force my way into a naming scheme around him. A couple of concepts which feature heavily here are 1) Exponential decay (or just decay); 2) Gradient; 3) Boosting; 4) Optimization; 5) Bayesian.

Working Tracking in there would be good too, to emphasize I'm using next-gen data.
 

Phil Plantier

Member
SoSH Member
Mar 7, 2002
2,584
For naming purposes:

1. When did you start watching basketball?
2. Who does your system rank more highly than we would expect?

Edit: just so I don't spend all afternoon trying to figure out a way to name it DUEROD
 

DJnVa

Dorito Dawg
SoSH Member
Dec 16, 2010
39,705
Yeah - this wasn't intended to be a DFS product really, but rather something more akin to season-long projections, but due to some design decisions, it accidentally ended up taking on a distinctly DFS-focused flavor. I have no DFS experience, so I haven't done any true testing for how the projections would fare there.

For DFS purposes, especially for the first game, I would recommend playing with the minutes projections to account for the fact that a bunch of guys will play more/less than I project above. It's just very hard for an algorithm to do that accurately, especially with 40% of the league having changed teams.
Oh sure, more just for fun. Early season DFS can be beatable in the sense that some guys aren't priced correctly, and if your model has hit on some of them, then it helps. I was able to work LBJ and Leonard into my lineup.
 

Sam Ray Not

Member
SoSH Member
Jul 19, 2005
6,237
NYC
Bayesian Opponent-adjusted Boosting / Calculation & Optimization Updated Since Yesterday

(BOBCOUSY)
 

wade boggs chicken dinner

Member
SoSH Member
Mar 26, 2005
18,673
Not a big Tacko guy, so not gonna try to force my way into a naming scheme around him. A couple of concepts which feature heavily here are 1) Exponential decay (or just decay); 2) Gradient; 3) Boosting; 4) Optimization; 5) Bayesian.

Working Tracking in there would be good too, to emphasize I'm using next-gen data.
Bayesian Optimization Of Gradient Exponentially-decayed Ranking.

Do I win?
 

tmracht

Well-Known Member
Silver Supporter
SoSH Member
Aug 19, 2009
448
Boxscore
Offensive
Winshare
Influenced
Adjusted
Calculator
 

ElUno20

Member
SoSH Member
Jul 19, 2005
3,819
Don't apologize. This is great work.


Also, this is telling me LeBron's not getting a double double tonight?
 

bowiac

I've been living a lie.
Lifetime Member
SoSH Member
Dec 18, 2003
12,633
New York, NY
Don't apologize. This is great work.


Also, this is telling me LeBron's not getting a double double tonight?
The minutes side is always going to be a bit wonky, since it's just hard to predict coaching decisions, and incorporate a ton of outside information. It's going to be especially regressed to start the season as the system learns what the rotations are for each team. The minutes projections perform well overall (better than paid fantasy sites), but you can almost certainly do better through a combination of human input and computer analysis there.

I have no real front-end experience, but I hope to eventually be able to build a tool that'll let you update the minutes projections and the other box-score projections will update accordingly.
 

benhogan

Baynes Hogan (pending trade)
SoSH Member
Nov 2, 2007
8,598
Santa Monica
The minutes side is always going to be a bit wonky, since it's just hard to predict coaching decisions, and incorporate a ton of outside information. It's going to be especially regressed to start the season as the system learns what the rotations are for each team. The minutes projections perform well overall (better than paid fantasy sites), but you can almost certainly do better through a combination of human input and computer analysis there.

I have no real front-end experience, but I hope to eventually be able to build a tool that'll let you update the minutes projections and the other box-score projections will update accordingly.
I'm probably not adding anything. BUT during the season wouldn't a player's minutes slide up/down with his performance VS. the performance from teammates that play a similar position?

For example, when Paul George returns I'd expect Harkless's minutes to not be as impacted as PatPat's minutes since I expect Harkless's performance to be superior. But if PatPat outplays Harkless I could see Doc changing how he doles out minutes.
 

bowiac

I've been living a lie.
Lifetime Member
SoSH Member
Dec 18, 2003
12,633
New York, NY
The minutes model is cognizant of the overall quality of the player, and will dole out more minutes to better players, yes. Insofar as Patrick Patterson is projected better by the time George returns, his projection will be less impacted.

That's why AD was projected for as many minutes as he was. The model sees he's an excellent player, so doesn't totally believe that he's a fringe guy on the edges of the rotation (as his usage down the stretch last year suggested).
 

DJnVa

Dorito Dawg
SoSH Member
Dec 16, 2010
39,705
Does it use rosters from end of last season? For instance I see RJ Hunter listed for Celtics, with a set number of minutes.
 

bowiac

I've been living a lie.
Lifetime Member
SoSH Member
Dec 18, 2003
12,633
New York, NY
Does it use rosters from end of last season? For instance I see RJ Hunter listed for Celtics, with a set number of minutes.
It's pulling supposedly current roster data from the Sportsdata API. As you note, I've found this turns out to have limited reliability, so I'm working to replace it right now. It'll be fixed today.

(This issue did not affect the win projections, where I used a different datasource).
 

DJnVa

Dorito Dawg
SoSH Member
Dec 16, 2010
39,705
What about no data for rookies and how that could affect the win projections?
 

djbayko

Member
SoSH Member
Jul 18, 2005
11,108
Waltham, MA
Awesome stuff, @bowiac ! The first tab of the Google sheet is player projections for today, correct? Do you have full season projections at a player level? I assume you do.
 

bowiac

I've been living a lie.
Lifetime Member
SoSH Member
Dec 18, 2003
12,633
New York, NY
What about no data for rookies and how that could affect the win projections?
For the win projections, I cheated and used RAPTOR's projections for the rookies for this reason. Once I integrate summer league and preseason data (next year), I will drop that as I think this model is more sound overall, but for now it was necessary.
 

bowiac

I've been living a lie.
Lifetime Member
SoSH Member
Dec 18, 2003
12,633
New York, NY
Awesome stuff, @bowiac ! The first tab of the Google sheet is player projections for today, correct? Do you have full season projections at a player level? I assume you do.
The first tab is today's projections, that's right. I actually do not yet have full season projections available. It's an iterative process to generate these projections, meaning each day's results depend in part on the results from the day before (including affecting players who didn't play today, as it signals changes in the league environment overall). The in-season seasonality also means I can't just multiply each player's projections by 82 to get full season numbers.
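To make the "can't just multiply by 82" point concrete, here is a toy sketch. The linear ramp is purely hypothetical (the real seasonality is modeled per component and is not linear), and it ignores the day-to-day dependence described above. It only shows why summing game-by-game projections differs from scaling a single game.

```python
GAMES = 82


def projected_value(base, game_index, ramp=0.05):
    """Hypothetical seasonality: value rises linearly from `base` on opening
    night to `base * (1 + ramp)` in the final game of the season."""
    return base * (1.0 + ramp * game_index / (GAMES - 1))


# One projection per game, then summed for a season total.
per_game = [projected_value(20.0, g) for g in range(GAMES)]
season_sum = sum(per_game)

# The shortcut: take the opening-night projection and multiply by 82.
naive = per_game[0] * GAMES
```

With these made-up inputs the summed total comes out higher than the naive one, because the early-season number understates the rest of the year.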

Full season numbers will eventually be coming, but it's looking like it may take a week or so.
 

DJnVa

Dorito Dawg
SoSH Member
Dec 16, 2010
39,705
You likely don't care, but since I'm pretty bad at DraftKings I'm using some of these numbers when I make my lineups. I make one lineup my usual way and one using some of this info. Last night my "normal" lineup was pretty bad. The one where I used some of these numbers didn't win anything, but did much better. I didn't place because Kemba was a no-show and Embiid was maybe 10-15 points under projections.

The guys at the margin are where I usually need help--Dwayne Bacon and Goran Dragic--did really well for me.
 

bowiac

I've been living a lie.
Lifetime Member
SoSH Member
Dec 18, 2003
12,633
New York, NY
Interesting that it has Giannis with 21 minutes tonight?
It takes a few games for the minutes aspect to "warm up" on these guys it seems; the issue is that Giannis played high 20s in minutes down the stretch last year, including 24 minutes in the second to last game. I'm surprised it's docking him all the way to 21, but essentially that's what's going on - it's getting confused by stars resting for the playoffs at the end of last year, and thinking they're on the fringes of the rotation.

I'll need to work on something for that, but it's a complex problem to solve in a principled/rigorous manner.

I will be debuting a projection for whether a player will Start, come off the bench, or be a DNP, along with probabilities for each. This sort of segmentation I've found generates a material upgrade in the minutes projections (mean absolute error drops from 4.9 mpg to 4.5). It also has the benefit of making the output more interpretable, so you can see what's driving the minutes projection. I did the modeling on this last night, but it'll take a few days to work into the pipeline daily.
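As a rough illustration of why the segmentation makes the output more interpretable, here is one plausible way to turn role probabilities into a minutes number. This is a sketch under assumptions, not necessarily how the pipeline combines them: the probabilities and the minutes for each role are made up, and the function name is hypothetical.

```python
def expected_minutes(role_probs, minutes_by_role):
    """Probability-weighted minutes across the start / bench / DNP roles."""
    total_prob = sum(role_probs.values())
    if abs(total_prob - 1.0) > 1e-6:
        raise ValueError("role probabilities must sum to 1")
    return sum(p * minutes_by_role[role] for role, p in role_probs.items())


# Made-up example: a likely starter with a small chance of sitting out.
probs = {"start": 0.70, "bench": 0.25, "dnp": 0.05}
minutes = {"start": 32.0, "bench": 18.0, "dnp": 0.0}
proj = expected_minutes(probs, minutes)
```

Seeing the three pieces separately tells you whether a low minutes number comes from doubt about the role or from low minutes within the role.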
 

bowiac

I've been living a lie.
Lifetime Member
SoSH Member
Dec 18, 2003
12,633
New York, NY
You likely don't care, but since I'm pretty bad at DraftKings I'm using some of these numbers when I make my lineups. I make one lineup my usual way and one using some of this info. Last night my "normal" lineup was pretty bad. The one where I used some of these numbers didn't win anything, but did much better. I didn't place because Kemba was a no-show and Embiid was maybe 10-15 points under projections.

The guys at the margin are where I usually need help--Dwayne Bacon and Goran Dragic--did really well for me.
No, I love to hear this. Thanks. As I mentioned, I've tested these projections, including the wonky stuff against two DFS sites which sell projections for the last 2 years (scraping their daily projections and comparing), and the model beats both sites in every stat, including minutes per game. But it's good to hear it's translating off of paper and into reality.

And your observation that the guys at the margins are where the benefits are is also correct. The DFS sites selling projections mostly do them by hand, and can do a good job by focusing on the stars, but it's tough to put in that effort for 200 guys a night. The model isn't perfect, but it doesn't struggle for lack of effort on the fringier players.
 

bowiac

I've been living a lie.
Lifetime Member
SoSH Member
Dec 18, 2003
12,633
New York, NY
How will your model account for injured players or players returning from injury? Old veterans getting a maintenance day?
I don't have any explicit injury accounting, although the longer a player goes without playing, the more their projection "decays". In a technical sense, while there are a lot of bells and whistles, the engine of this model is an exponential decay function which takes every game a player has ever played, and decays it by X^DaysAgo, where X is a number between 0 and 1 (and X varies by stat). This means that a player projected for 25 points per game will have their projection go down simply by not playing, because all of their past performance is now more DaysAgo. So if X is 0.9 for instance, that means a game yesterday gets a weight of 0.9, while a game 10 days ago gets a weight of 0.35. The longer the player sits out, the more their projection decays.

This decay function isn't actually the projection; it's just a feature in a stacked neural network/gradient-boosted model (to throw some jargon out there), but it drives ~80% of the projection.
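A minimal sketch of the X^DaysAgo weighting, for the curious. The function names and game values are made up, and returning the total weight alongside the average is an assumption for illustration, not a description of the actual feature set.

```python
def decay_weight(x, days_ago):
    """Weight for a game played `days_ago` days ago, with 0 < x <= 1."""
    if not 0.0 < x <= 1.0:
        raise ValueError("x must be in (0, 1]")
    return x ** days_ago


def decayed_feature(games, x):
    """games: iterable of (days_ago, value) pairs.

    Returns (weighted_average, total_weight).
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for days_ago, value in games:
        w = decay_weight(x, days_ago)
        total_weight += w
        weighted_sum += w * value
    if total_weight == 0.0:
        raise ValueError("need at least one game with non-zero weight")
    return weighted_sum / total_weight, total_weight


# Made-up points totals: (days ago, points scored).
games = [(1, 30.0), (3, 22.0), (10, 18.0)]
avg_now, weight_now = decayed_feature(games, 0.9)

# Same player after sitting out a further week: every game is 7 days older.
rested = [(d + 7, v) for d, v in games]
avg_rested, weight_rested = decayed_feature(rested, 0.9)
```

One thing the sketch makes visible: a normalized weighted average does not move when every game ages by the same amount, but the total weight shrinks. So the "decays by not playing" behavior presumably comes from something like the total weight, or regression toward a prior, feeding the downstream model. That part is a guess.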
 

DJnVa

Dorito Dawg
SoSH Member
Dec 16, 2010
39,705
And your observation that the guys at the margins are where the benefits are is also correct. The DFS sites selling projections mostly do them by hand, and can do a good job by focusing on the stars, but it's tough to put in that effort for 200 guys a night. The model isn't perfect, but it doesn't struggle for lack of effort on the fringier players.
I feel like I can give a decent prediction on how guys like Harden, Walker, Tatum, Lillard, Irving, Davis, James, etc. will play--but finding which cheap guys to play is where, for DFS, you make a difference.
 

DJnVa

Dorito Dawg
SoSH Member
Dec 16, 2010
39,705
I entered an inexpensive contest last night using your numbers with "crowns" and won a few bucks. I was up to 2000th out of like 36000 before the late game, and I had Draymond and Damion Lee ($3000 player)--if Green had an average game I would have finished around 750th and had he been a bit better than average I would have been in top 250, but Draymond got injured and while he returned, he didn't do much.

The upshot though, is that Green and Lee were the only 2 players I picked to not outperform their salary. In Lee's case it hardly mattered.

Tonight, it has Kemba, Hayward, and Tatum all as good values based on your projections and the DraftKings salaries.

Best value on the board based on your projections is old friend Marcus Morris.
 

gingerbreadmann

Member
SoSH Member
Mar 11, 2008
740
I've spent an inordinate amount of time messing around with these projections over the last week or so, and while I don't have any grand findings to share, I just want to share my kudos and thanks to the work behind them. Lots of fun to sift through and play with.

Lacking a more robust end goal at the moment, I too have been using them for daily fantasy lineup creation. I haven't played DFS in years so it's a bit of a refresher for me. None of my lineups using the projections have hit so far, but I'm sure that's a SSS issue as well as a me issue. I just wanted to think out loud a bit on the thought processes I'm stuck on.

-These projections are better suited for 50/50 games than tournaments, or am I wrong about that? The value will be on the margins no matter what game you play, but taking the numbers and translating them to an expected DFS value (relating to this, I've spent way too long fine-tuning a sigmoid function to estimate double-double and triple-double probabilities) ignores upside. Having, say, a 75th percentile outcome along with expected value would be very insightful, and is something I'm trying to produce, even in bare-bones fashion.
-On that note, I have been playing a lot with the spreadsheet you linked in the original Twitter thread @bowiac, of the game-by-game results from last year, to identify what factors are behind someone exceeding their projection. The upshot is that nothing I've found can even hold a candle to Minutes. Error in the MP projection accounts for almost half of the difference between actual and expected DraftKings points. There are a few factors that correlate, but really nothing I've tried has any value without knowing how the minutes will pan out. (In the 2018-19 sheet you have a column for whether the player started, which looks like the most valuable forecast input if available.)
-Lineup construction is also something I'm a bit of a novice at. So far I have strictly been using expected points, salary, and position inside a linear optimization function to create the lineup with the most expected points. Am I missing other important factors here? I have seen articles about football DFS that estimate the % drafted by player, which I suspect could be a useful tool to find players who are truly overlooked, but I have done zero work on estimating that. I've also tinkered with evaluating each player's value based on what position they're used at (obviously this only applies to DraftKings), but haven't decided how to approach that strategically.

Excited to hear more about the new minutes projection you have in the works. It's already impressive but any small accuracy improvements here would have a massive payoff. Any thoughts welcome towards anything I just mentioned.
 

bowiac

I've been living a lie.
Lifetime Member
SoSH Member
Dec 18, 2003
12,633
New York, NY
Good questions. Some thoughts:

1) Better suited for 50/50 games than for tournaments. That's absolutely correct. I don't have much here in the way of 75th percentile upside projections. I have ideas for how to handle that, but it's a long ways from being developed.

2) Minutes. As you note, the minutes are the key, and the downside to doing minutes via algorithm is that they make somewhat strange errors from time to time. The minutes projections I have here perform well, and beat pay-DFS sites in accuracy overall, but they do make somewhat head scratching errors. I hope to have some significant refinements to that aspect this weekend, telling you a start/bench projection, and associated minutes, although as noted, that's still not a silver bullet (minor accuracy improvement).

One other thing I've tried is blending these minutes projections with minutes projections from Basketball Monster, and that has led to a moderate improvement in my private testing. Those Basketball Monster projections are a paid-for product, so I can't make them public here unfortunately.

3) Lineup construction. I likewise use a linear optimizer (via PuLP) to select the best lineup. I presume there are other aspects which would be helpful, but I don't know enough about DFS strategy to help there.
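To show the shape of the lineup problem without requiring a solver, here is a toy version. Note that this is a brute-force stand-in and not a linear program: it enumerates every combination, which only works for a tiny pool. The player names, salaries, projections, and cap are all made up, and position constraints are left out.

```python
from itertools import combinations


def best_lineup(players, salary_cap, size):
    """Return (lineup, points) maximizing projected points under the cap.

    Exhaustive search, so only suitable for toy-sized player pools.
    """
    best, best_points = None, float("-inf")
    for combo in combinations(players, size):
        salary = sum(p["salary"] for p in combo)
        if salary > salary_cap:
            continue  # over the cap, not a legal lineup
        points = sum(p["proj"] for p in combo)
        if points > best_points:
            best, best_points = combo, points
    return best, best_points


# Entirely hypothetical pool.
pool = [
    {"name": "A", "salary": 9000, "proj": 48.0},
    {"name": "B", "salary": 7000, "proj": 40.0},
    {"name": "C", "salary": 5000, "proj": 30.0},
    {"name": "D", "salary": 3000, "proj": 21.0},
    {"name": "E", "salary": 3000, "proj": 15.0},
]
lineup, points = best_lineup(pool, salary_cap=15000, size=3)
```

On a real slate the number of combinations explodes, which is why an integer/linear programming solver such as PuLP is the practical choice. The objective and salary constraint stay the same; the position slots become extra constraints.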

Anyway, this project remains in very active development right now. I've recently started selling the minutes projections I generate to a DFS site (non-exclusive access only), so I've been a bit busy building that pipeline, but I will have further updates to the model in the coming weeks and months.
 

slamminsammya

Member
SoSH Member
Jul 31, 2006
3,152
Palo Alto
I'm curious whether you used gbm or a neural net for the minutes model. In the case of gbm what were the features and did you consider weighting the data points to make the model better for higher minutes players?
 

bowiac

I've been living a lie.
Lifetime Member
SoSH Member
Dec 18, 2003
12,633
New York, NY
I'm curious whether you used gbm or a neural net for the minutes model. In the case of gbm what were the features and did you consider weighting the data points to make the model better for higher minutes players?
I tried both, but settled on a gradient boosted model (LightGBM, having also tried XGBoost and CatBoost) as my base estimator. I say base estimator, because I do plan on putting together a stacked model eventually, but right now LightGBM is doing everything.

As I mentioned, the core features of the model come from an exponential decay function, which is very expensive to calculate and takes the form of X^DaysAgo, where X is between 0 and 1, and varies by stat. At lower values of X, more recent performance is weighted more heavily, while at higher values, older games take on more weight. At the extreme, X == 1, the feature is just an average of every game the player has played. I apply this exponential decay function to every stat, which itself can be thought of as giving me a projection for every statistic. In other words, I have a weighted average for Marcus Smart's minutes (30.07 entering the next game), which is assembled by that X^DaysAgo weighting.
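One plausible way to choose X for a given stat is a grid search that scores each candidate by how well it predicts the next game from all earlier ones. This is a sketch under assumptions: the grid, the toy series, the use of mean absolute error, and the function names are all hypothetical, and the real fitting procedure may differ.

```python
def decayed_average(history, today, x):
    """Weighted average of (day, value) pairs, each weighted x ** (today - day)."""
    weights = [x ** (today - day) for day, _ in history]
    weighted = sum(w * v for w, (_, v) in zip(weights, history))
    return weighted / sum(weights)


def next_game_mae(games, x):
    """games: chronologically sorted (day, value) pairs.

    Predict each game from all earlier games and return the mean absolute error.
    """
    errors = []
    for i in range(1, len(games)):
        day, actual = games[i]
        pred = decayed_average(games[:i], day, x)
        errors.append(abs(actual - pred))
    return sum(errors) / len(errors)


def fit_decay(games, grid):
    """Pick the X in `grid` with the lowest next-game error."""
    return min(grid, key=lambda x: next_game_mae(games, x))


# Toy series with a steady upward trend, one game every other day.
trend = [(0, 10.0), (2, 12.0), (4, 14.0), (6, 16.0), (8, 18.0), (10, 20.0)]
best_x = fit_decay(trend, [0.5, 0.9, 1.0])
```

On a steadily improving player like this toy series, the search prefers a low X (heavy recency weighting), while X == 1 just predicts the career average and lags badly.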

Separately, I also use a modified Kalman filter to generate projections for every statistic, which is likewise fit to minimize error in projecting the next game. This, together with the exponential decay function, gives me interim projections for every stat.
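For reference, a textbook scalar Kalman filter (a "local level" model) looks like the sketch below. This is the plain version, not the modified one described above, and all of the variances and the prior are hypothetical inputs that would have to be tuned.

```python
def kalman_next_game(observations, prior_mean, prior_var, process_var, obs_var):
    """Run a scalar Kalman filter over per-game observations.

    Returns (mean, var) after the last game; under a random-walk skill model
    the mean doubles as the projection for the next game.
    """
    mean, var = prior_mean, prior_var
    for z in observations:
        # Predict: true skill drifts a little between games.
        var += process_var
        # Update: move toward the observed game in proportion to the gain.
        gain = var / (var + obs_var)
        mean += gain * (z - mean)
        var *= 1.0 - gain
    return mean, var


# Made-up inputs: prior of 10 with variance 4, one observed game of 20.
mean, var = kalman_next_game([20.0], prior_mean=10.0, prior_var=4.0,
                             process_var=0.0, obs_var=4.0)
```

A larger process variance makes the filter forget old games faster, which plays a role similar to a lower X in the decay function.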

These two sets of features (the exponential decay features, and the Kalman filter features) make up the bulk of the features for the model. That's 46 features in all right now, i.e., your interim projections in every stat impact every other stat's final projections.

For minutes, I also add a feature for how many minutes a player played last game. In principle, this shouldn't be necessary, as that should be captured by the features above, but I've found it helpful. Finally, I also add a team feature to help make the minutes add up to something close to 240 per night.

In terms of weighting, I'm not sure what you mean. What would the data points be weighted by exactly?
 

bowiac

Minor addition: I have added a tab to the spreadsheet which outputs the top 50 DraftKings lineups using these projections.
 

slamminsammya

Member
SoSH Member
Jul 31, 2006
3,152
Palo Alto
I am not sure I fully understand your features, because I am not totally sure I understand what the target variable of the GBM is. Do you have one GBM per statistic? Or is there a GBM purely for minutes? If the latter, are you predicting minutes per game over the course of the season per player, or minutes in the next game?

Supposing I wanted a model to predict, for a given player, the number of minutes they would play in the next game, I would think you would do very well with only a few features: career MPG (or maybe your exponentially decayed MPG), age, team W-L, game #, whether it's a SEGABABA, team conference standing, draft position. Maybe you have already tried this. Those features should account for this weirdness with good players on good teams getting rested towards the end of the season.

If that were your output variable you have a lot of data points - one per player-game.

The weighting comes in when the GBM computes the loss function it is trying to optimize: you can assign weights to the data that tell the GBM that minimizing loss for those data points counts "more" than minimizing loss for other data. So an easy first guess at how you would weight those is by minutes! If a player played 40 minutes in a game, you weight that single data point by 40, etc. Maybe that weighting is too aggressive. But as it is, the GBM will treat the error on a scrub who never plays the same as the error on Giannis's minutes at the end of the season.

I just looked in the documentation for LightGBM and they have the weights column option: https://lightgbm.readthedocs.io/en/latest/Parameters.html go to weight_column.
 

bowiac

I am not sure I fully understand your features, because I am not totally sure I understand what the target variable of the GBM is. Do you have one GBM per statistic? Or is there a GBM purely for minutes? If the latter, are you predicting minutes per game over the course of the season per player, or minutes in the next game?
I think I'm currently projecting 19 stats (including minutes), so I have 19 separate models. Each model has 38 core features, consisting of 19 exponential decay projections (one per stat) and 19 Kalman filter projections (one per stat), plus a handful of other features, such as age in days, rest, back-to-back, air miles traveled, etc.

All projections are for the next game. I don't project any season-level results.

Supposing I wanted a model to predict, for a given player, the number of minutes they would play in the next game, I would think you would do very well with only a few features: career MPG (or maybe your exponentially decayed MPG), age, team W-L, game #, whether it's a SEGABABA, team conference standing, draft position. Maybe you have already tried this. Those features should account for this weirdness with good players on good teams getting rested towards the end of the season.
I've tried some of these (age, game number), but not others (e.g., draft position). They mostly don't improve model performance, but I've included them anyway. The exponential decay and Kalman filter features are really driving the projections here. I will add draft spot and some biometric data later (height, weight, maybe reach and wingspan). I doubt they'll matter much for minutes.

The reason none of these features do much is that the exponential decay and the Kalman filter already serve as golden features. Working out the math on those and calculating those parameters was the vast majority of the work on this project, and just outputting those features would be pretty strong projections by themselves.

If that were your output variable you have a lot of data points - one per player-game.
Yup - I have roughly 600,000 rows.

The weighting comes in when the GBM computes the loss function it is trying to optimize: you can assign weights to the data that tell the GBM that minimizing loss for those data points counts "more" than minimizing loss for other data. So an easy first guess at how you would weight those is by minutes! If a player played 40 minutes in a game, you weight that single data point by 40, etc. Maybe that weighting is too aggressive. But as it is, the GBM will treat the error on a scrub who never plays the same as the error on Giannis's minutes at the end of the season.

I just looked in the documentation for LightGBM and they have the weights column option: https://lightgbm.readthedocs.io/en/latest/Parameters.html go to weight_column.
Yeah, I know how to do sample weights in LightGBM and they're key to most of the projections I do (e.g., the three point percentage projection is weighted by three point attempts). The question/confusion I had is what the relevant weight would be for minutes. I think it would be "games", so it just ends up being a weight of 1 for each player.

You're proposing to use minutes as the weight, but that's also the target variable. I've never heard of using the target variable as a sample weight, and I'm not sure such a weighting is sensible. Consider a model fit that projects Carsen Edwards to play 30 minutes when he actually plays 2, and projects Kemba Walker to play 38 minutes when he actually plays 36. Weighting each absolute error by actual minutes, that would be a weighted error of (2 × 28 + 36 × 2) / 38 ≈ 3.37. That's actually a wonderful result (my MAE overall is ~4.6 right now), but is pretty obviously nonsense, since I've projected a player out of the rotation to play 30 minutes! That's a useless result for DFS, fantasy, gambling, etc.

As a comparison, that would grade as better than an unweighted model which projects Kemba at 40 minutes and Edwards at 6 minutes. For almost any purpose, I would prefer the unweighted result there.
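To make that arithmetic easy to check, here is a small sketch that scores both sets of projections with and without the minutes weighting. The function name and variable names are just for illustration; the player numbers are the ones from the example above.

```python
def weighted_mae(projected, actual, weights):
    """Mean absolute error where each row's error counts in proportion to its weight."""
    total = sum(weights)
    return sum(w * abs(p - a) for p, a, w in zip(projected, actual, weights)) / total


actual = [2, 36]         # Edwards, Kemba: minutes actually played
bad_fit = [30, 38]       # puts a non-rotation player at 30 minutes
sensible_fit = [6, 40]   # off by 4 minutes on each player
equal = [1, 1]           # one weight per player-game, i.e. plain MAE

# Weighting by actual minutes (the target variable) prefers the bad fit.
bad_weighted = weighted_mae(bad_fit, actual, actual)            # 128 / 38, about 3.37
sensible_weighted = weighted_mae(sensible_fit, actual, actual)  # 152 / 38 = 4.0

# Equal weights rank them the other way around.
bad_unweighted = weighted_mae(bad_fit, actual, equal)            # (28 + 2) / 2 = 15.0
sensible_unweighted = weighted_mae(sensible_fit, actual, equal)  # (4 + 4) / 2 = 4.0
```

With minutes as the weight, the 28-minute miss on Edwards barely registers because he only played 2 minutes, which is exactly why the weighted score flatters the bad fit.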