Back in the day (or 2005), the hedge fund I worked at ran a pretty serious NCAA bracket pool. There were also side bets, much like a Vegas shop would take prop bets, such as Kansas -150 to make the Final 4.

At this point, someone decided we needed to take a more academic approach to the tournament — what better way than simulation? The first year we tried, it ran heinously slow. I didn’t know a whole lot about VBA, and whoever programmed the thing forgot to turn off screen updating, so it took an entire day to play 10,000 simulations because each game’s winner literally advanced on the screen. The next year a former programmer wrote the simulation in R and it took a few minutes to run 10,000.

**Why simulation.** Simply advancing the best team (not necessarily seed) at each point seems like the 100% logical thing to do if you thought you had no skill in predicting upsets. The problem with that strategy is that it ignores path dependency.

A simplified example: The Final Four consists of four teams. The top two teams on one side of the bracket and the 3rd best team and a Cinderella 11 seed. The 3rd best team could have the best odds of winning because it’ll be in the championship game 80-90% of the time. Even if it only has a 40% chance of winning once it gets there, it’s got the best odds of anyone in the Final Four.

In the above case, you could just run the probabilities. But starting from the beginning of the tournament, there are 2^63 possible outcomes from the games, so if you wanted to know discrete probabilities, you’d need 9 quintillion calculations. Yuck. Simulation would allow you to do far fewer calculations and have a reasonable understanding of the distribution of likely outcomes.

**The issues.** First, you need to figure out a methodology of saying Team X is 60% to beat Team Y. For this you need a power ranking system. Luckily two systems are available. Kenpom and Dolphinsim. Each provides a ranking, and when any two team’s rankings are mathematically manipulated, you get a probability of winning. If team X is 60%, draw a random number uniformly distributed between 0 and 1 and if it’s less than .6, team X advances, otherwise, team Y does. Repeat a lot.

Second issue is home court advantage. Tournament home court advantage is a bit different than regular season because fans don’t control as many of the tickets. I made the somewhat arbitrary call that if a team is playing within 90 miles of campus, it gets half credit for typical home court advantage (when Cal played in San Jose in 2013, they only got a quarter, because it’s not a great fan base). There’s probably a more sophisticated way to do this, but not worth the brain damage, since the scenario generally only applies to large favorites in early rounds.

Lastly, is a momentum factor. This isn’t the largely debunked hot-hand theory. This is: 2010-11 Virginia Commonwealth played a crappy schedule through little fault of their own, so if they start knocking off some good teams, the simulator had better start adjusting their power rating up as they advance. Without this adjustment, lower seeds would make the Final 4 much less often than observed. So, I could have made a correction to bump team’s power ranking depending on who they beat. This might be a 2014 feature.

**Why simulation matters.** For one, it’ll give you an idea of who is most likely to win given their strength and path to the finals. You are really happy if this is not the team that everyone is picking. The biggest mistake people consistently make is not getting equity for their winner. If you choose the best team, but so does half of your pool, you got screwed. Try to find the team that has a decent chance and few will take.

The simulator is also good for prop bets. Thinking about whether or not the Pac 10 is good for over 4.5 wins or if taking the SEC under 9.5 makes my head hurt. I’d rather just let the objective data do its path-dependent thing and see how the results compare to the bookmakers.

I initially set up the simulations in a hybrid Excel / VBA environment, then Python and R. R turned out to be quick enough (1.5 minutes for 100,000 simulations of just the first module) and operationally easier than Python, which is a total pain in the ass to get running smoothly on a Mac. Results tend to converge after 10,000 or so simulations.

Here’s the setup — an array with four levels that passes through values as games are simulated. Tournament results are stored in different arrays (matrices) depending which outcome is being tested.

Inputs specific to 2013, the code in R:

# create the log5 function which determines game winner log5 <- function(power1, power2){ (power1 - power1*power2) / (power1 + power2 - 2*power1*power2) } # create a 3D array to store the data myarray <- array(0, dim = c(64,7,4)) # slots the 64 teams with unique identifiers myarray[,1,1] = 1:64 # read pythagoreans from text on the web myarray[,1,3] <- data.matrix(read.table(url("http://www.murphyspunch.com/pythag_2013.txt"))) # home court adjustments myarray[1,2:3,4] <- .01 myarray[11,2:3,4] <- .01 myarray[23,2:3,4] <- .01 myarray[31,2:3,4] <- .01 myarray[33,2:3,4] <- .01 myarray[39,2:3,4] <- .01 myarray[49,4:5,4] <- .005 myarray[54,2:3,4] <- .005 # matrix to convert unique IDs to seed numbers convert_seed <- matrix(c(1,1,2,16,3,8,4,9,5,5,6,12,7,4,8,13,9,6,10,11,11,3,12,14,13,7,14,10,15,2,0,15), nrow=16, ncol=2, byrow=T) # sets total simulations max_reps = 10000 # Module 1 holders: survival analysis # array to store wins cumwins <- array(0, dim = c(64,6)) # Module 2 holders: seeds advancing total_seeds <- matrix(0, max_reps, 6) one_seeds <- matrix(0, max_reps, 6) two_seeds <- matrix(0, max_reps, 6) three_seeds <- matrix(0, max_reps, 6) four_seeds <- matrix(0, max_reps, 6) five_seeds <- matrix(0, max_reps, 6) # Module 3 holders: Conferences advancing acc <- matrix(0, max_reps, 6) atlantic_ten <- matrix(0, max_reps, 6) big_ten <- matrix(0, max_reps, 6) big_east <- matrix(0, max_reps, 6) big_twelve <- matrix(0, max_reps, 6) mvc <- matrix(0, max_reps, 6) mountain_west <- matrix(0, max_reps, 6) pac_twelve <- matrix(0, max_reps, 6) sec <- matrix(0, max_reps, 6) wcc <- matrix(0, max_reps, 6) # Module 4 holders: total wins for prop teams louisville <- matrix(0, max_reps, 6) indiana <- matrix(0, max_reps, 6) duke <- matrix(0, max_reps, 6) florida <- matrix(0, max_reps, 6) # start the master loop reps <- 1 while (reps <= max_reps) { # Play a tournament j <- 1 while (j <= 6) { i <-1 while (i <= 2^(6-j)) { play_game = runif(1) # pass along winning team myarray[i, j+1, 1] <- if(log5(myarray[2*i-1,j,3] + myarray[2*i-1,j+1,4], myarray[2*i,j,3] + myarray[2*i,j+1,4]) > play_game) { myarray[2*i-1,j,1] } else { myarray[2*i,j,1] } # pass along that team's seed myarray[i, j+1, 2] <- convert_seed[,2][match(myarray[i,j+1,1]%%16,convert_seed[,1])] # pass along that team's home court adjusted pythag myarray[i, j+1, 3] <- if(log5(myarray[2*i-1,j,3]+myarray[2*i-1,j+1,4],myarray[2*i,j,3] + myarray[2*i,j+1,4]) > play_game) { myarray[2*i-1,j,3] + myarray[2*i-1,j+1,4] - myarray[2*i-1,j,4] } else { myarray[2*i,j,3] + myarray[2*i,j+1,4] - myarray[2*i,j,4] } i <- i+1 } j <- j+1 } # Module 1: Count the wins j <- 1 while (j <= 6) { i <-1 while (i <= 64) { cumwins[i,j] <- sum(myarray[,j+1,1]==i) + cumwins[i,j] i <- i+1 } j <- j+1 } # Module 2: Store seed info # Count the sum of seeds total_seeds[reps,] <- apply(myarray[,2:7,2], 2, sum) # Count number of seeds advancing by seed one_seeds[reps,] <- apply(myarray[,2:7,2], 2, function(x) sum(x==1)) two_seeds[reps,] <- apply(myarray[,2:7,2], 2, function(x) sum(x==2)) three_seeds[reps,] <- apply(myarray[,2:7,2], 2, function(x) sum(x==3)) four_seeds[reps,] <- apply(myarray[,2:7,2], 2, function(x) sum(x==4)) five_seeds[reps,] <- apply(myarray[,2:7,2], 2, function(x) sum(x==5)) # Module 3: Store conference info acc[reps,] <- apply(myarray[,2:7,1], 2, function(x) sum(x==15) + sum (x==35)+ sum (x==51) + sum (x==63)) atlantic_ten[reps,] <- apply(myarray[,2:7,1], 2, function(x) sum(x==7) + sum (x==24) + sum (x==37) + sum (x==52) + sum (x==57)) big_ten[reps,] <- apply(myarray[,2:7,1], 2, function(x) sum(x==11) + sum (x==21) + sum (x==31) + sum (x==39) + sum (x==42)+ sum (x==49) + sum (x==61)) big_east[reps,] <- apply(myarray[,2:7,1], 2, function(x) sum(x==1) + sum (x==14) + sum (x==19) + sum (x==29)+ sum (x==36) + sum (x==47) + sum (x==55) + sum (x==59)) big_twelve[reps,] <- apply(myarray[,2:7,1], 2, function(x) sum(x==5) + sum (x==23) + sum (x==30) + sum (x==33) + sum (x==46)) mvc[reps,] <- apply(myarray[,2:7,1], 2, function(x) sum(x==13) + sum (x==20)) mountain_west[reps,] <- apply(myarray[,2:7,1], 2, function(x) sum(x==3) + sum (x==27) + sum (x==45) + sum (x==53)) pac_twelve[reps,] <- apply(myarray[,2:7,1], 2, function(x) sum(x==6) + sum (x==25) + sum (x==41) + sum (x==54) + sum (x==62)) sec[reps,] <- apply(myarray[,2:7,1], 2, function(x) sum(x==4) + sum (x==22) + sum (x==43)) wcc[reps,] <- apply(myarray[,2:7,1], 2, function(x) sum(x==10) + sum (x==17)) # Module 4: Store individual team info louisville[reps,] <- apply(myarray[,2:7,1], 2, function(x) sum(x==1)) indiana[reps,] <- apply(myarray[,2:7,1], 2, function(x) sum(x==49)) duke[reps,] <- apply(myarray[,2:7,1], 2, function(x) sum(x==15)) florida[reps,] <- apply(myarray[,2:7,1], 2, function(x) sum(x==43)) # next tourney reps <- reps + 1 } # Some data outputs # Sum of seeds in Final 4 hist(total_seeds[,4], breaks=36, density=50, col="darkorange3", xlim = range(4:40), main="Sum of Seeds in the Final Four (10,000 simulations)") # Odds of x one seeds advancing sum(one_seeds[,4]==0)/ max_reps sum(one_seeds[,4]==1)/ max_reps sum(one_seeds[,4]==2)/ max_reps sum(one_seeds[,4]==3)/ max_reps sum(one_seeds[,4]==4)/ max_reps # Find total wins by 1 seeds compared to over/under one_seeds_condensed <- apply(one_seeds,1,sum) one_seeds_ou <- 12 sum(one_seeds_condensed < one_seeds_ou)/ max_reps sum(one_seeds_condensed > one_seeds_ou)/ max_reps sum(one_seeds_condensed == one_seeds_ou)/ max_reps # Find total wins by 2 seeds compared to over/under two_seeds_condensed <- apply(two_seeds,1,sum) two_seeds_ou <- 9 sum(two_seeds_condensed < two_seeds_ou)/ max_reps sum(two_seeds_condensed > two_seeds_ou)/ max_reps sum(two_seeds_condensed == two_seeds_ou)/ max_reps # 15 Seed advances sum(two_seeds[,1] < 4)/max_reps # 13 or 14 Seed advances sum(three_seeds[,1] + four_seeds[,1] < 8)/max_reps # 14 or 15 Seed advances sum(three_seeds[,1] + two_seeds[,1] < 8)/max_reps # 12 Seeds to advance (1 minus 5 Seeds) sum(five_seeds[,1]==0)/ max_reps sum(five_seeds[,1]==1)/ max_reps sum(five_seeds[,1]==2)/ max_reps sum(five_seeds[,1]==3)/ max_reps sum(five_seeds[,1]==4)/ max_reps # conference template condensed <- apply(acc,1,sum) ou <- 6.5 under <- sum(condensed < ou)/max_reps over <- sum(condensed > ou)/max_reps push <- sum(condensed == ou)/max_reps under over push summary(condensed) # team template team_temp <- apply(indiana,1,sum) ou <- 3.5 under <- sum(team_temp < ou)/max_reps over <- sum(team_temp > ou)/max_reps push <- sum(team_temp == ou)/max_reps under over push summary(team_temp)