A career in computer network administration resulted in me becoming somewhat proficient at writing computer scripts to manipulate text data.
Ha, definitely not. I’m curious what he’s looking at here. I’d like a little more explanation of this data.
I wrote a script to download a series of web pages like the example links below, with which I have no doubt you are already familiar:
https://www.baseball-reference.com/players/split.fcgi?id=duranja01&year=2024&t=b
https://www.baseball-reference.com/teams/BOS/2024-batting.shtml
Over the past few months I have been downloading the HTML pages of BBRef’s Batting Splits and Team Baserunning/Misc data. Typically, I feed a list of all MLB players in a given year to my script, which generates the web link for each one. Below are Devers’ splits for 2024. If you look closely, the link differs from Duran’s above only in the unique player ID.
https://www.baseball-reference.com/players/split.fcgi?id=deverra01&year=2024&t=b
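The link-generation step can be sketched roughly as follows. This is only an illustration in Python, not my actual script: the function names `split_url` and `split_urls` are made up for the example, and the query parameters are simply copied from the two links above.

```python
from urllib.parse import urlencode

# Base address copied from the example links above.
BASE_URL = "https://www.baseball-reference.com/players/split.fcgi"


def split_url(player_id, year):
    """Build the Batting Splits link for one player ID and season."""
    # Only the "id" value changes from player to player.
    query = urlencode({"id": player_id, "year": year, "t": "b"})
    return f"{BASE_URL}?{query}"


def split_urls(player_ids, year):
    """Build one link per player ID (hypothetical helper)."""
    return [split_url(player_id, year) for player_id in player_ids]


if __name__ == "__main__":
    for url in split_urls(["duranja01", "deverra01"], 2024):
        print(url)
```

Feeding it the two player IDs above reproduces the Duran and Devers links.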
I put a “sleep” function of a few seconds in the code, between one download and the next, so it does not slam the crap out of the BBRef web site. Then I let it run slowly all night.
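A minimal sketch of that slow download loop, assuming Python's standard library. The names `fetch_html` and `download_all`, the five-second default, the User-Agent string and the skip-if-already-saved check are all assumptions made for the example, not a description of my real code.

```python
import time
import urllib.request
from pathlib import Path


def fetch_html(url):
    """Download one page and return its raw bytes."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "splits-downloader-example"}  # made-up name
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def download_all(pages, out_dir, delay_seconds=5.0, fetch=fetch_html, sleep=time.sleep):
    """Save each (file_name, url) pair into out_dir, pausing after every request."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for file_name, url in pages:
        target = out_dir / file_name
        if target.exists():
            continue  # already saved on an earlier run, so no request is made
        target.write_bytes(fetch(url))
        saved.append(target)
        sleep(delay_seconds)  # the pause that keeps the load on the site low
    return saved
```

Passing `fetch` and `sleep` in as parameters is just a convenience so the loop can be tried out without touching the real site.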
I have done this for all players and all teams since 1988, when Count/Balls-Strikes data became available (my initial, unfruitful project was related to pitch counts), and also for selected years/teams before 1988.
For example, 1950 interests me because the 1950 Red Sox scored 1,027 runs, the most since Lou Gehrig got sick. The 1999 Indians were the other team to score over 1,000 runs (1,009) post-Gehrig, but the Sox had no DH, nor much in the way of applied chemistry, and Ted Williams played only 89 games because he broke his elbow during the All-Star Game. So, I study that team a bit.
I have downloaded a lot of data from BBRef: the Batting Splits alone are close to 59,000 files and 27 GB of data. The base running data is smaller. I have also downloaded some Plate Discipline data from Baseball Savant, but it is a good bit trickier to parse and it is only available from 2015 on.
I then create scripts to parse those HTML files, extract the data I want and write it to .CSV files that a spreadsheet can open.
For example, the 2024 Batting Splits CSV file has 290 columns with headings like:
Year,Team,Player,Season Games Played,Season Games Started,Season Plate Appearances,Season At Bats,Season Runs,Season Hits,Season Doubles,Season Triples,Season Home Runs,Season Walks,Season Strike Outs...
Bases Empty Plate Appearances,Bases Empty At Bats,Bases Empty Runs,Bases Empty Hits,Bases Empty Doubles,Bases Empty Triples,Bases Empty Home Runs...
Man On First Plate Appearances,Man On First At Bats,Man On First Hits,Man On First Doubles,Man On First Triples,Man On First Home Runs…
That 2024 Batting Splits file has 727 rows, one for each MLB player who had at least one Plate Appearance this past season.
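The parse-and-write step could look something like the sketch below. It is a deliberately simplified illustration, assuming the data sits in ordinary HTML table rows; the class `TableRowParser` and the function `html_table_to_csv` are hypothetical names, and the real pages need more work than this to end up as one 290-column row per player.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row, self._cell = [], None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None


def html_table_to_csv(html_text):
    """Return the table rows found in html_text as CSV text."""
    parser = TableRowParser()
    parser.feed(html_text)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()
```

Calling `html_table_to_csv` on the text of a saved page gives one CSV line per table row, which can then be written to a file.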
The BBRef base running data gets the same treatment: I parse the downloaded HTML files and put the relevant data into .CSV files.
Then I write more scripts to work through the data in the .CSV files and generate the kinds of numbers and relationships I posted above.
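As a small illustration of that last step, here is a sketch that reads a CSV file with Python's `csv` module and derives one number per player. The column names `Player`, `Season Hits` and `Season At Bats` come from the heading list above; the function name `batting_averages` and the choice of batting average as the statistic are assumptions made only for the example.

```python
import csv


def batting_averages(csv_file):
    """Return (player, hits / at bats) pairs from an open Batting Splits CSV file."""
    results = []
    for row in csv.DictReader(csv_file):
        at_bats = int(row["Season At Bats"])
        if at_bats == 0:
            continue  # skip players with no at bats to avoid dividing by zero
        hits = int(row["Season Hits"])
        results.append((row["Player"], round(hits / at_bats, 3)))
    return results
```

It would be called with an open file, for example `with open(path, newline="") as f: batting_averages(f)`, where `path` points at one of the generated .CSV files.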