Friday, December 30, 2005

Adjusted Sim Scores

During a Hall of Fame discussion in this thread on Baseball Think Factory, similarity scores were cited as a piece of evidence comparing a few players for worthiness for election to the HOF.

For those of you who aren't familiar with sim scores, they were developed by Bill James as sort of a toy for showing who a player is most comparable to. Sean Forman, who runs baseball-reference.com, describes the system here.

The largest problem with sim scores as conceived by James (and generally used) is the lack of adjustments - both for park and for era. I had done some work after the 2002 season to try to include those adjustments and get a more realistic sim score. I posted some of those results on the thread above and one poster asked for more information. Since my data was 4 years old (based on 2001 stats), I offered to re-run the scores with more up-to-date information (stats through 2004). So far, I've only done the work for batters.

Here's a brief overview of my methodology:

To begin with, I worked from the following list of assumptions:

1. Sim scores, as they are calculated today, are accurate beyond era and park concerns.

In other words, the values that James devised are the correct values to measure differences between players. This includes limiting the comparison to offensive productivity. Because defensive measurement is so widely debated, and because the stats needed to calculate a reliable defensive measure tend not to be available throughout the entire history of baseball, I chose to ignore the defensive value a player contributed.

2. Teams played a balanced schedule with no interleague play.

I made this assumption to simplify calculating park adjustments. Assuming a balanced schedule allowed me to calculate a park multiplier using the formula (BPF + LBPF)/200 where BPF equals the batters’ park factor for the home park and LBPF equals the batters’ park factor for the remainder of the league. There are some concerns with using the one-year park factors with no regression, but it seemed that the effects would be rather minimal since I’m interested in actual performance rather than prediction.

3. Park effects are even across all events.

Again, this was a simplifying assumption. It was much simpler computationally to apply the single multiplier (discussed above) to all the stats, rather than try to figure out component effects – which probably would not even be possible for most of the older parks.

4. The sample size for any single position’s stats for a single year is too small to be an effective measure of the average performance.

Ideally, I would have liked to add a real positional adjustment to my calculation of the sim scores, rather than keeping the one proposed by James and used on Baseball Reference. Since my methodology is based on normalizing stats against an average player, that would mean creating an average player for each position for every year. I think the sample size for, say, American League shortstops in 1964 is too small to use as a population to measure against, but that’s really just personal opinion – I could be convinced otherwise. This choice made things easier computationally as well. I was able to ignore multiple positions in a year, rather than having to weight a player’s stats by his time spent at each position.

Taking those four points as givens, I used the following method to calculate my Adjusted Similarity Scores.

1. Park adjusted yearly stats for all major leaguers through 2001

This is pretty self-explanatory. I used the multiplier discussed above (BPF + LBPF)/200 to account for the park effects.
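A minimal sketch of that adjustment in Python. The function names and the sample numbers are made up for illustration, and dividing each stat by the multiplier is an assumption: the formula above gives the multiplier but not the direction in which it is applied.

```python
def park_multiplier(bpf, lbpf):
    """(BPF + LBPF) / 200, with park factors on the usual 100 = neutral scale."""
    return (bpf + lbpf) / 200.0


def park_adjust(stats, bpf, lbpf):
    """Apply one multiplier to every counting stat (park effects assumed even).

    Assumption: stats are divided by the multiplier, so a hitter-friendly
    park (multiplier above 1) shrinks the raw totals.
    """
    m = park_multiplier(bpf, lbpf)
    return {name: value / m for name, value in stats.items()}


# Made-up season line and park factors, for illustration only.
season = {"H": 180, "HR": 30, "BB": 60}
adjusted = park_adjust(season, bpf=106, lbpf=98)
```

With these invented inputs the multiplier is 1.02, so every total is scaled down by about 2 percent.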

2. Calculated rate stats (occurrence per plate appearance) for all major leagues through 2001

I broke up the rate stats by league. For example, in 1914, I had rates for the major offensive stats for the National League, the American League and the Federal League. This was a simple matter of dividing the number of times a particular event occurred by the total number of plate appearances for the league in that season.

3. Determined an average year for each player-season.

I did this by multiplying the rate stats times the actual plate appearances for a player in the season to determine what an average player would have done if he had the same number of chances.
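Steps 2 and 3 together can be sketched roughly like this. The function names, the stat keys and the league totals are hypothetical, chosen only to show the arithmetic.

```python
def league_rates(league_totals):
    """Per-plate-appearance rate for each event in one league-season."""
    pa = league_totals["PA"]
    return {
        event: count / pa
        for event, count in league_totals.items()
        if event != "PA"
    }


def average_season(rates, player_pa):
    """What a league-average hitter would produce in the player's own PA."""
    return {event: rate * player_pa for event, rate in rates.items()}


# Made-up league totals, for illustration only.
league = {"PA": 50000, "H": 12000, "HR": 1000, "BB": 4500}
rates = league_rates(league)
avg = average_season(rates, player_pa=600)
```

Here a hitter with 600 plate appearances gets an "average" line of about 144 hits, 12 home runs and 54 walks.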

4. Summed real yearly stats (park adjusted) and average player stats to create career stats.

Simply added the stats from each year to create two different careers for each player – the real (park adjusted) one and the “average” career. As part of this step I also figured the career batting average and slugging percentage by using the component stats.

5. Normalized career stats by subtracting “average” from real

Subtracting the “average” stats from the real stats gives us a value for how much a certain player exceeded or fell short of the average for his career. By comparing the actuals to the average (both determined on a yearly basis) we can account for the offensive level of the era – thus providing the era adjustment missing from James’ sim scores.
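Steps 4 and 5 amount to summing and subtracting. A small sketch, with hypothetical function names and invented season lines, and counting stats only (career batting average and slugging would be recomputed from the summed components):

```python
def career_totals(seasons):
    """Add up each stat across a list of season dicts."""
    totals = {}
    for season in seasons:
        for stat, value in season.items():
            totals[stat] = totals.get(stat, 0) + value
    return totals


def above_average(real, average):
    """Real career minus the matching 'average' career, stat by stat."""
    return {stat: real[stat] - average[stat] for stat in real}


# Made-up park-adjusted seasons and their league-average counterparts.
real_seasons = [{"H": 180, "HR": 30}, {"H": 150, "HR": 20}]
avg_seasons = [{"H": 144, "HR": 12}, {"H": 140, "HR": 11}]

normalized = above_average(
    career_totals(real_seasons), career_totals(avg_seasons)
)
```

This invented player finishes 46 hits and 27 home runs above an average hitter given the same chances in the same leagues.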

6. Calculated weighted positional value

I did basically what Sean Forman does on Baseball Reference. We both calculated a positional value using James’ position scores and a weighted average of the positions played. Where we might differ is that Sean uses “primary position” and I use any appearance at the position in my calculation.
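A sketch of a weighted positional value, under two assumptions worth flagging: the position values are the ones given in Baseball-Reference's description of the system as commonly cited (verify them before relying on them), and weighting by games played at each position is a guess at what "weighted average" means here. The career split is made up.

```python
# Position values as commonly cited from Baseball-Reference's description
# of James' system. Treated as an assumption here.
POSITION_VALUE = {
    "C": 240, "SS": 168, "2B": 132, "3B": 84, "OF": 48, "1B": 12, "DH": 0,
}


def weighted_position_value(games_by_position, values=POSITION_VALUE):
    """Average the position values, weighted by games at each position."""
    total_games = sum(games_by_position.values())
    if total_games == 0:
        return 0.0
    weighted = sum(values[pos] * g for pos, g in games_by_position.items())
    return weighted / total_games


# Made-up career: mostly shortstop, some third base.
career = {"SS": 1500, "3B": 500}
value = weighted_position_value(career)
```

For this invented split the value lands at 147, between the shortstop and third base figures and closer to shortstop.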

7. Ran the sim scores algorithm.

I used the same numbers as James and Sean so there’s no need to elaborate on this step.
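For readers who want to see the shape of the calculation anyway, here is a rough sketch. The divisors follow the batter rules as commonly cited from Baseball-Reference's description (one point per 20 games, 75 at bats, and so on); treat them, the fractional (unrounded) deductions, and the function name as assumptions rather than a faithful copy of the spreadsheet behind these scores. The inputs are invented.

```python
# One point is deducted for each difference of this size (assumed values).
DIVISORS = {
    "G": 20, "AB": 75, "R": 10, "H": 15, "2B": 5, "3B": 4, "HR": 2,
    "RBI": 10, "BB": 25, "SO": 150, "SB": 20, "BA": 0.001, "SLG": 0.002,
}


def sim_score(a, b, pos_a, pos_b, divisors=DIVISORS):
    """Start at 1000 and subtract points for every statistical difference.

    a and b are career stat dicts (here, the normalized careers);
    pos_a and pos_b are the weighted positional values.
    """
    score = 1000.0
    for stat, per_point in divisors.items():
        score -= abs(a[stat] - b[stat]) / per_point
    score -= abs(pos_a - pos_b)
    return score


# Two made-up normalized careers that differ only in hits and home runs.
base = dict.fromkeys(DIVISORS, 0.0)
player_a = {**base, "H": 46.0, "HR": 27.0}
player_b = {**base, "H": 16.0, "HR": 17.0}
score = sim_score(player_a, player_b, pos_a=147.0, pos_b=168.0)
```

In this toy case the 30-hit gap costs 2 points, the 10-homer gap costs 5, and the positional gap costs 21, leaving 972.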

Now let me get to the interesting part, the scores. Unfortunately, I can't figure out how to post the entire spreadsheet of all the scores onto Blogger. What I will do is post the scores for the major HOF candidates from this year's ballot. Anyone who wants the full list, just send me an email and I will send it out.

Clicking on the players' names will take you to their baseball-reference pages.

Belle, Albert
935 - Manny Ramirez
927 - Juan Gonzalez
926 - Wally Berger
924 - Frank Howard
924 - Chuck Klein
911 - George Foster
910 - Rocky Colavito
909 - Rudy York
906 - Bob Johnson
898 - Duke Snider

Clark, Will
930 - Edgar Martinez
915 - Cecil Cooper
915 - Jack Fournier
913 - Bob Watson
911 - Keith Hernandez
911 - Bill Terry
910 - Jim Bottomley
909 - Ted Kluszewski
905 - Rico Carty
901 - Roger Connor

Concepcion, Dave
912 - Bill Russell
903 - Garry Templeton
901 - Leo Cardenas
899 - Alan Trammell
897 - Johnny Logan
897 - Marty Marion
897 - Cookie Rojas
895 - Bill Mazeroski
893 - Rick Burleson
891 - Rafael Ramirez

Dawson, Andre
899 - Billy Williams
888 - George Foster
884 - Dave Parker
883 - Tony Perez
882 - Goose Goslin
871 - Rafael Palmeiro
871 - Jim Rice
870 - Juan Gonzalez
869 - Ernie Banks
865 - Duke Snider

DiSarcina, Gary
954 - Rey Ordonez
947 - Felix Fermin
947 - Kevin Stocker
946 - Chris Gomez
945 - Pat Meares
945 - Jose Uribe
944 - Gene Michael
939 - Bucky Dent
937 - Tom Veryzer
935 - Buddy Kerr

Gaetti, Gary
922 - Tim Wallach
900 - Robin Ventura
894 - Sal Bando
889 - Doug DeCinces
886 - Larry Parrish
883 - Graig Nettles
875 - Dean Palmer
872 - Frank Thomas
871 - Tom Brunansky
871 - Deron Johnson

Garvey, Steve
909 - Cecil Cooper
901 - Al Oliver
898 - Dave Parker
893 - Jake Beckley
886 - Will Clark
884 - Hal Chase
884 - Bob Watson
882 - Ted Kluszewski
881 - Andres Galarraga
878 - Frank McCormick

Guillen, Ozzie
951 - Alfredo Griffin
924 - Don Kessinger
922 - Larry Bowa
906 - Rey Sanchez
902 - Tim Foli
900 - Omar Vizquel
894 - Bill Russell
892 - Roy McMillan
889 - Jose Vizcaino
888 - Mike Bordick

Jefferies, Gregg
966 - Lee Lacy
952 - Pete Fox
947 - Buddy Lewis
945 - Roberto Kelly
943 - Tony Gonzalez
942 - Lee Maye
940 - Lew Fonseca
939 - Amos Strunk
935 - Cleon Jones
935 - Jerry Mumphrey

Mattingly, Don
941 - George Burns
938 - Cecil Cooper
938 - Frank McCormick
923 - Bob Watson
922 - Hal Chase
916 - Carl Furillo
909 - Hal McRae
904 - Rico Carty
904 - Paul Hines
901 - Harry Davis

McGee, Willie
913 - Sam Rice
912 - Mickey Rivers
907 - Curt Flood
905 - Matty Alou
894 - Lloyd Waner
892 - Enos Cabell
885 - Vic Davalillo
884 - Doc Cramer
882 - Jerry Mumphrey
878 - Jose Cardenal

Morris, Hal
942 - Sean Casey
936 - Joe Start
935 - Danny Cater
935 - David Segui
930 - Dick Hoblitzel
929 - Joe Cunningham
926 - Dick Siebert
924 - Wes Parker
923 - Bob Boyd
922 - Warren Cromartie

Murphy, Dale
924 - Rocky Colavito
923 - Gil Hodges
915 - Jack Clark
914 - George Foster
911 - Darryl Strawberry
906 - Rudy York
901 - Bill Nicholson
899 - Roy Sievers
897 - Jose Canseco
895 - Eric Davis

Parker, Dave
914 - Tony Perez
912 - Goose Goslin
909 - Del Ennis
906 - Jim Rice
898 - Steve Garvey
897 - Andres Galarraga
887 - George Hendrick
886 - Al Kaline
884 - Andre Dawson
884 - Zack Wheat

Rice, Jim
922 - Andres Galarraga
917 - George Foster
914 - Willie Horton
908 - Ellis Burks
906 - Dave Parker
904 - Reggie Smith
903 - Billy Williams
902 - Goose Goslin
901 - Joe Adcock
901 - Frank Howard

Trammell, Alan
921 - Jack Glasscock
918 - Jay Bell
907 - Barry Larkin
903 - Luke Appling
902 - Jim Fregosi
901 - Leo Cardenas
901 - Alvin Dark
900 - Joe Sewell
899 - Dave Concepcion
898 - Dick Bartell

Weiss, Walt
960 - Bud Harrelson
923 - Spike Owen
921 - Ivan DeJesus
921 - Roy McMillan
920 - Mark Belanger
918 - Dick Schofield
917 - Scott Fletcher
916 - Mike Bordick
916 - Rey Sanchez
914 - Roger Metzger

Monday, February 07, 2005

Currently listening to...

The Shins. Chutes Too Narrow

I've been really into this band since I saw Garden State (quickly becoming one of my favorite movies by the way). They have two songs on the soundtrack, both off the first album Oh, Inverted World. This album engendered comparisons to early Beach Boys, which I didn't really get. Not that I disliked the music - I just didn't see any similarities to Pet Sounds.

With Chutes Too Narrow, I want to level-jump on the rock-and-roll icon comparison scale. Beatles comparisons are generally over-used and under-realized, but the feeling I get when I listen to this album is a lot like the one I get from early Beatles albums. The CD is chock full of three-minute poppy songs with a tremendous variety of sound.

Obviously writing about music isn't my forte, so I'll stop trying. It suffices to say that I recommend you pick up this CD at once. Check out the Pitchfork review.

Monday, January 31, 2005

Currently on the night stand

I'm currently reading The Metaphysical Club by Louis Menand. It's an exploration of the major American intellectual themes from the second half of the 19th Century. The main characters are Oliver Wendell Holmes, William James, Charles Peirce and John Dewey. So far it's been a very interesting look at the philosophical leanings of America and especially Boston after the Civil War. James, et al., are generally known for espousing the doctrine of pragmatism, but the book also touches on natural selection, legal theory, and the ascension of statistics and the law of errors into popular use. I'd recommend it to any student of intellectual history. Oh, and the Pulitzer people liked it too.