Adjusted Sim Scores
During a Hall of Fame discussion in this thread on Baseball Think Factory, similarity scores were cited as a piece of evidence comparing a few players for worthiness for election to the HOF.
For those of you who aren't familiar with sim scores, they were developed by Bill James as sort of a toy for showing who a player is most comparable to. Sean Forman, who runs baseball-reference.com, describes the system here.
The largest problem with sim scores as conceived by James (and generally used) is the lack of adjustments - both for park and for era. I had done some work after the 2002 season to try to include those adjustments and get a more realistic sim score. I posted some of those results on the thread above and one poster asked for more information. Since my data was 4 years old (based on 2001 stats), I offered to re-run the scores with more up-to-date information (stats through 2004). So far, I've only done the work for batters.
Here's a brief overview of my methodology:
To begin with, I worked from the following list of assumptions:
1. Sim scores, as they are calculated today, are accurate beyond era and park concerns.
In other words, the values that James devised are the correct values to measure differences between players. This includes limiting the comparison to offensive productivity. Because defensive measurement is so widely debated, and because the stats needed to calculate a reliable defensive measure tend not to be available throughout the entire history of baseball, I chose to ignore the defensive value a player contributed.
2. Teams played a balanced schedule with no interleague play.
I made this assumption to simplify calculating park adjustments. Assuming a balanced schedule allowed me to calculate a park multiplier using the formula (BPF + LBPF)/200 where BPF equals the batters’ park factor for the home park and LBPF equals the batters’ park factor for the remainder of the league. There are some concerns with using the one-year park factors with no regression, but it seemed that the effects would be rather minimal since I’m interested in actual performance rather than prediction.
3. Park effects are even across all events.
Again, this was a simplifying assumption. It was much simpler computationally to apply the single multiplier (discussed above) to all the stats, rather than try to figure out component effects – which probably would not even be possible for most of the older parks.
4. The sample size for any single position’s stats for a single year is too small to be an effective measure of the average performance.
Ideally, I would have liked to add a real positional adjustment to my calculation of the sim scores, rather than maintaining the one proposed by James and used on Baseball Reference. Since my methodology is based on normalizing stats against an average player, that would mean creating an average player for each position for every year. I think the sample size for, say, American League shortstops in 1964 is too small to use as a population to measure against, but that’s really just personal opinion – I could be convinced otherwise. This choice made things easier computationally as well. I was able to ignore multiple positions in a year, rather than having to weight a player’s stats by his time spent at each position.
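As a rough illustration of the park multiplier from assumptions 2 and 3, here is a minimal Python sketch. The function names are made up for this example, and the direction of the adjustment (dividing raw stats by the multiplier, so that a hitter-friendly park deflates the totals) is an assumption: the description above only says the single multiplier is applied to all the stats.

```python
def park_multiplier(bpf: float, lbpf: float) -> float:
    """Park multiplier (BPF + LBPF) / 200, with both factors on a 100 = neutral scale."""
    return (bpf + lbpf) / 200.0


def park_adjust(stats: dict, bpf: float, lbpf: float) -> dict:
    """Apply one multiplier to every counting stat passed in (assumption 3).

    ASSUMPTION: raw totals are divided by the multiplier. Pass in only the
    stats that should be adjusted (e.g. leave plate appearances out).
    """
    m = park_multiplier(bpf, lbpf)
    return {stat: value / m for stat, value in stats.items()}
```

For example, a home park factor of 110 and a rest-of-league factor of 99 give a multiplier of 209/200 = 1.045, so every adjusted total would be scaled down by about 4.3 percent under this reading.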
Taking those four points as givens, I used the following method to calculate my Adjusted Similarity Scores.
1. Park adjusted yearly stats for all major leaguers through 2001
This is pretty self-explanatory. I used the multiplier discussed above (BPF + LBPF)/200 to account for the park effects.
2. Calculated rate stats (occurrence per plate appearance) for all major leagues through 2001
I broke up the rate stats by league. For example, in 1914, I had rates for the major offensive stats for the National League, the American League and the Federal League. This was a simple matter of dividing the number of times a particular event occurred by the total number of plate appearances for the league in that season.
3. Determined an average year for each player-season.
I did this by multiplying the league rate stats by the player's actual plate appearances in the season to determine what an average player would have done if he had the same number of chances.
4. Summed real yearly stats (park adjusted) and average player stats to create career stats.
Simply added the stats from each year to create two different careers for each player – the real (park adjusted) one and the “average” career. As part of this step I also figured the career batting average and slugging percentage by using the component stats.
5. Normalized career stats by subtracting “average” from real
Subtracting the “average” stats from the real stats gives us a value for how much a certain player exceeded or fell short of the average for his career. By comparing the actuals to the average (both determined on a yearly basis) we can account for the offensive level of the era – thus providing the era adjustment missing from James’ sim scores.
6. Calculated weighted positional value
I did basically what Sean Forman does on Baseball Reference. We both calculated a positional value using James’ position scores and a weighted average of the positions played. Where we might differ is that Sean uses “primary position” and I use any appearance at the position in my calculation.
7. Ran the sim scores algorithm.
I used the same numbers as James and Sean so there’s no need to elaborate on this step.
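Steps 2 through 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the spreadsheet work behind the scores below: the function names and data layout are hypothetical, the player stats are assumed to be park adjusted already (step 1), and the position values passed to `weighted_position_value` are supplied by the caller rather than hard-coded.

```python
from collections import defaultdict


def league_rates(league_totals: dict, league_pa: float) -> dict:
    """Step 2: occurrences per plate appearance for one league-season."""
    return {stat: total / league_pa for stat, total in league_totals.items()}


def average_season(rates: dict, player_pa: float) -> dict:
    """Step 3: what an average hitter would do in the player's plate appearances."""
    return {stat: rate * player_pa for stat, rate in rates.items()}


def career_above_average(seasons: list) -> dict:
    """Steps 4 and 5: sum real and average seasons, then subtract.

    Each season is a tuple (hypothetical layout):
    (player_stats, player_pa, league_totals, league_pa).
    """
    real = defaultdict(float)
    average = defaultdict(float)
    for player_stats, player_pa, league_totals, league_pa in seasons:
        expected = average_season(league_rates(league_totals, league_pa), player_pa)
        for stat, value in player_stats.items():
            real[stat] += value
            average[stat] += expected[stat]
    return {stat: real[stat] - average[stat] for stat in real}


def weighted_position_value(games_by_position: dict, position_values: dict) -> float:
    """Step 6: average of position values, weighted by games at each position."""
    total_games = sum(games_by_position.values())
    weighted = sum(games * position_values[pos]
                   for pos, games in games_by_position.items())
    return weighted / total_games
```

As a made-up example, a hitter with 30 home runs in 600 plate appearances in a league that homered in 2 percent of plate appearances, followed by 20 in 500 plate appearances in a 3 percent league, has 50 real home runs against 12 + 15 = 27 for the average player, so he ends up 23 above average for his career.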
Now let me get to the interesting part, the scores. Unfortunately, I can't figure out how to post the entire spreadsheet of all the scores onto Blogger. What I will do is post the scores for the major HOF candidates from this year's ballot. Anyone who wants the full list, just send me an email and I will send it out.
Clicking on the players' names will take you to their baseball-reference pages.
Belle, Albert
935 - Manny Ramirez
927 - Juan Gonzalez
926 - Wally Berger
924 - Frank Howard
924 - Chuck Klein
911 - George Foster
910 - Rocky Colavito
909 - Rudy York
906 - Bob Johnson
898 - Duke Snider
Clark, Will
930 - Edgar Martinez
915 - Cecil Cooper
915 - Jack Fournier
913 - Bob Watson
911 - Keith Hernandez
911 - Bill Terry
910 - Jim Bottomley
909 - Ted Kluszewski
905 - Rico Carty
901 - Roger Connor
Concepcion, Dave
912 - Bill Russell
903 - Garry Templeton
901 - Leo Cardenas
899 - Alan Trammell
897 - Johnny Logan
897 - Marty Marion
897 - Cookie Rojas
895 - Bill Mazeroski
893 - Rick Burleson
891 - Rafael Ramirez
Dawson, Andre
899 - Billy Williams
888 - George Foster
884 - Dave Parker
883 - Tony Perez
882 - Goose Goslin
871 - Rafael Palmeiro
871 - Jim Rice
870 - Juan Gonzalez
869 - Ernie Banks
865 - Duke Snider
DiSarcina, Gary
954 - Rey Ordonez
947 - Felix Fermin
947 - Kevin Stocker
946 - Chris Gomez
945 - Pat Meares
945 - Jose Uribe
944 - Gene Michael
939 - Bucky Dent
937 - Tom Veryzer
935 - Buddy Kerr
Gaetti, Gary
922 - Tim Wallach
900 - Robin Ventura
894 - Sal Bando
889 - Doug DeCinces
886 - Larry Parrish
883 - Graig Nettles
875 - Dean Palmer
872 - Frank Thomas
871 - Tom Brunansky
871 - Deron Johnson
Garvey, Steve
909 - Cecil Cooper
901 - Al Oliver
898 - Dave Parker
893 - Jake Beckley
886 - Will Clark
884 - Hal Chase
884 - Bob Watson
882 - Ted Kluszewski
881 - Andres Galarraga
878 - Frank McCormick
Guillen, Ozzie
951 - Alfredo Griffin
924 - Don Kessinger
922 - Larry Bowa
906 - Rey Sanchez
902 - Tim Foli
900 - Omar Vizquel
894 - Bill Russell
892 - Roy McMillan
889 - Jose Vizcaino
888 - Mike Bordick
Jefferies, Gregg
966 - Lee Lacy
952 - Pete Fox
947 - Buddy Lewis
945 - Roberto Kelly
943 - Tony Gonzalez
942 - Lee Maye
940 - Lew Fonseca
939 - Amos Strunk
935 - Cleon Jones
935 - Jerry Mumphrey
Mattingly, Don
941 - George Burns
938 - Cecil Cooper
938 - Frank McCormick
923 - Bob Watson
922 - Hal Chase
916 - Carl Furillo
909 - Hal McRae
904 - Rico Carty
904 - Paul Hines
901 - Harry Davis
McGee, Willie
913 - Sam Rice
912 - Mickey Rivers
907 - Curt Flood
905 - Matty Alou
894 - Lloyd Waner
892 - Enos Cabell
885 - Vic Davalillo
884 - Doc Cramer
882 - Jerry Mumphrey
878 - Jose Cardenal
Morris, Hal
942 - Sean Casey
936 - Joe Start
935 - Danny Cater
935 - David Segui
930 - Dick Hoblitzell
929 - Joe Cunningham
926 - Dick Siebert
924 - Wes Parker
923 - Bob Boyd
922 - Warren Cromartie
Murphy, Dale
924 - Rocky Colavito
923 - Gil Hodges
915 - Jack Clark
914 - George Foster
911 - Darryl Strawberry
906 - Rudy York
901 - Bill Nicholson
899 - Roy Sievers
897 - Jose Canseco
895 - Eric Davis
Parker, Dave
914 - Tony Perez
912 - Goose Goslin
909 - Del Ennis
906 - Jim Rice
898 - Steve Garvey
897 - Andres Galarraga
887 - George Hendrick
886 - Al Kaline
884 - Andre Dawson
884 - Zack Wheat
Rice, Jim
922 - Andres Galarraga
917 - George Foster
914 - Willie Horton
908 - Ellis Burks
906 - Dave Parker
904 - Reggie Smith
903 - Billy Williams
902 - Goose Goslin
901 - Joe Adcock
901 - Frank Howard
Trammell, Alan
921 - Jack Glasscock
918 - Jay Bell
907 - Barry Larkin
903 - Luke Appling
902 - Jim Fregosi
901 - Leo Cardenas
901 - Alvin Dark
900 - Joe Sewell
899 - Dave Concepcion
898 - Dick Bartell
Weiss, Walt
960 - Bud Harrelson
923 - Spike Owen
921 - Ivan DeJesus
921 - Roy McMillan
920 - Mark Belanger
918 - Dick Schofield
917 - Scott Fletcher
916 - Mike Bordick
916 - Rey Sanchez
914 - Roger Metzger