The distribution the confidence interval would come from is Sandoval's performance, not the performance of all ball players. His UZR is not a measurement of all players' performance, just his own. His actual performance is just a sample from a total distribution, but it is the total distribution of his own possible performance. Just as you wouldn't mix the rolls of dice with different numbers of sides into one distribution, you shouldn't do that for multiple players; there may be considerable differences between them. I can't prove it with the numbers I have, but given the uncertainty about whether all players have equivalent variance in performance, you can't equate them and smoosh all the variances into one number. But again, I appreciate your direction of thought.
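To make the dice analogy concrete, here's a tiny simulation (Python; the dice are just stand-ins for players with different spreads):

```python
import numpy as np

rng = np.random.default_rng(42)
n_rolls = 100_000

# Three "players" modeled as dice with different numbers of sides.
d6 = rng.integers(1, 7, n_rolls)
d10 = rng.integers(1, 11, n_rolls)
d20 = rng.integers(1, 21, n_rolls)

for name, rolls in [("d6", d6), ("d10", d10), ("d20", d20)]:
    print(f"{name}: var = {rolls.var():.2f}")

# Pooling all rolls into one distribution inflates the variance:
# it describes the mixture, not any individual die.
pooled = np.concatenate([d6, d10, d20])
print(f"pooled: var = {pooled.var():.2f}")
```

The pooled variance describes the mixture, not any single die, and the same objection applies to smooshing players with different skills and opportunity counts into one variance number.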
And also, I am sure he sucked in the field last year; it's just the degree of it and the predictive value of it that I am unsure of. Too much uncertainty.
I don't think anyone can prove it with the numbers we have, given the metrics of interest (UZR/DRS). Statistically, we may be underpowered to derive such an estimate of performance, and if so, we can't trust the metric.
One could, theoretically, estimate confidence intervals for an individual player using bootstrapping techniques; the bootstrap distribution of the mean should be approximately normal if the input data has enough samples. However, once you do so, you run into a severe problem that you alluded to in your post: the input data may not have sufficient samples. Practically speaking, measures like UZR and DRS (as noted by many, including an article on SoSH Baseball, though I'm having trouble finding the link right now) depend on the contributions of every other player, and are really attempts to estimate how many runs a player would save relative to the average player at the same position. When bootstrapping a dataset, the amount of data to draw in each resample becomes critical: should you resample over 500 innings, 1,000 innings, the entire dataset, etc.? One way to identify such a threshold is to examine the distribution of bootstraps across a range of resampling sizes and individual players, and measure the bias between the bootstrapped central tendency (e.g., the mean across all bootstraps) and the observed value per player (e.g., mean bias and variance). Once you start doing this, you realize that an individual player may require as much as 10 years of data before the intervals can be estimated without introducing bias into the estimate of central tendency. In other words, even theoretically, one may not know the "true" UZR until, generally speaking, the player is past his prime.
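For the curious, here's a minimal sketch of that bias/width check (Python; the play-level run values are entirely made up, since the play-by-play inputs behind UZR aren't public):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical play-level run values for one player-season.
# Purely illustrative: most plays worth roughly nothing,
# with some spread around zero.
plays = rng.normal(loc=0.01, scale=0.25, size=1200)
observed_mean = plays.mean()

n_boot = 5000
for m in (100, 300, 600, len(plays)):  # resample sizes, in plays
    boot_means = np.array([
        rng.choice(plays, size=m, replace=True).mean()
        for _ in range(n_boot)
    ])
    # For a plain mean the bias is ~0 by construction; for a statistic
    # with UZR-style adjustments baked in, it need not be.
    bias = boot_means.mean() - observed_mean
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    print(f"m={m:4d}  bias={bias:+.4f}  95% CI width={hi - lo:.4f}")
```

The interval width at the small resample sizes is the real takeaway: it's a rough picture of how uncertain a partial-season sample is.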
That paragraph was dumb, so I deleted it.
At this point (at least in my thinking), I've come to believe that UZR/DRS shouldn't really be used to evaluate defense at all. Outside of Statcast, the eye test and crowd-sourcing are probably more valid estimates of defense.
Of course, Hanley/Panda were still bad defenders last year; you'd have to be blind to argue otherwise. But I'm not sure any of us has a quantitative idea of how Hanley will handle 1B or whether Panda will improve. I don't even think it makes sense to use UZR/DRS at 1B at all: by far the most important ability of a good 1B is catching poorly thrown balls cleanly, and the second most important is fielding grounders well. Throwing and range can be important, but I suspect those opportunities are much rarer.
Personally, I care more about the hitting. Hanley had his worst offensive season ever (.717 OPS) after posting an .800+ OPS in every other season except 2011/2012. Pablo hasn't even turned 30 yet. He had a .658 OPS last year, but over the three seasons prior he was roughly a .750 OPS hitter. If Panda (.765 OPS/104 wRC+) and Hanley (.820 OPS/120 wRC+) hit according to their Steamer projections, I'll be happy.
EDIT: I'm going to preempt a response here. One could argue that you can simply assume the underlying individual-player distribution. One way to frame it: on each attempt, a player either makes the play or fails to make it. Therefore, you could assume the attempts follow a binomial distribution and calculate the variance from that assumption. This is what Crystalline is referring to when he mentions the variance being +/- 15 runs. One problem is that this number comes from a single player, and not all players have the same number of opportunities (and therefore not the same variance, as shaggydog mentioned). Furthermore, UZR makes lots of adjustments, which transform the data dramatically (see tangotiger's response #9 in the BBTF link upthread). I'm not sure anyone knows what the underlying distribution should look like. That's why bootstrapping is an effective alternative for estimating the variance of the individual player's underlying distribution.
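For illustration, here's the binomial arithmetic sketched out (Python; the opportunity count, conversion rate, and runs-per-play value are my assumptions, not anything from the actual UZR model):

```python
import math

# Assumed inputs -- illustrative only, not real UZR parameters.
n = 400          # fielding opportunities in a season
p = 0.85         # probability of converting an opportunity
run_value = 0.8  # assumed runs swung per play made vs. missed

# Binomial standard deviation of plays made: sqrt(n * p * (1 - p)).
sd_plays = math.sqrt(n * p * (1 - p))

# Rough 95% interval on the runs scale.
ci_runs = 1.96 * sd_plays * run_value
print(f"sd(plays) = {sd_plays:.1f}, ~95% CI = +/- {ci_runs:.1f} runs")

# The interval depends directly on the number of opportunities:
for n2 in (100, 200, 400):
    ci = 1.96 * math.sqrt(n2 * p * (1 - p)) * run_value
    print(f"n = {n2:3d}: +/- {ci:.1f} runs")
```

The specific numbers don't matter; the point is that the width of the interval scales with the opportunity count, so one player's +/- figure doesn't transfer to a player with a different workload.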