### From German Tanks to stochastic fortune cookies

I've ignored plenty of fortune cookies in my time, but after lunch the other day I noticed the "lucky numbers" on the back of each fortune cookie and wondered how they were chosen. Here are my group's lucky numbers:

While the lowest number is presumably 1 (which we observed), it was unclear what the maximum lucky number could be. Given the limited number of data points on hand compared to the sample space, it would be hard to address the distribution question directly. But quantifying uncertainty about the maximum number $N$ is extremely similar to the Bayesian German Tank Problem tackled in the last post.

## Starting with small data¶

Let's set up the problem and see what range we can narrow $N$ down to using only these four fortunes.

In [1]:
from __future__ import print_function

%matplotlib inline
import numpy as np
import pymc3 as pm
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

np.random.seed(12345)

# our data
y = np.array([[35, 26, 56, 10, 32, 52],
[26, 38, 2, 23, 27, 50],
[11, 34, 24, 49, 2, 45],
[14, 1, 55, 19, 32, 45]])

def sample_upper_bound(data, upper=10000, burn_in=10000, n=100000):
""" set up a model for the upper bound N based on the data and sample """
model = pm.Model()
with model:
# the upper bound
N = pm.DiscreteUniform("N", lower=0, upper=upper)

# likelihood - P(D|N): y ~ uniform(1, N)
y_obs = pm.DiscreteUniform("y_obs", lower=1, upper=N, observed=data)

# Metropolis-Hastings for discrete variables
step = pm.Metropolis()

# we'll use four chains, and parallelize to four cores
trace = pm.sample(n, step, chain=4, njobs=4, progressbar=False)

# discard a chunk of samples from the beginning for burn-in
trace = trace[burn_in:]

return trace

def show_output(trace):
""" display a summary of the sampling results """
print(pm.summary(trace))

# plot the trace
pm.traceplot(trace)
plt.show()

trace = sample_upper_bound(y.ravel())
show_output(trace)

N:

Mean             SD               MC Error         95% HPD interval
-------------------------------------------------------------------

58.079           2.679            0.016            [56.000, 63.000]

Posterior quantiles:
2.5            25             50             75             97.5
|--------------|==============|==============|--------------|

56.000         56.000         57.000         59.000         65.000

None


According to the sampler, 95% of the posterior density for $N$ lies in $[56,63]$. We have narrowed down the range considerably. We can also "ask" our sample about $P(N > 56 \mid \mathcal D)$, or what the probability is that we haven't yet observed the true maximum in the data:

In [2]:
def probability_n_greater_than_max_observed(trace):
greater = trace.get_values('N') > y.max()
return greater.sum() / float(greater.size)

print("{:0.4%}".format(probability_n_greater_than_max_observed(trace)))

66.6681%


## A digression¶

Apparently, lots of people use these fortune cookie numbers to pick their lottery numbers. We know this because in 2005 there were 110 second place Powerball winners for exactly this reason:

The Powerball drawing on March 30, 2005 produced 110 second-prize winners. The total payout to these winners was \$19.4 million, with 89 winners receiving \$100,000 each, while the other 21 winners received \$500,000 each as they were Power Play selections. [Multi-State Lottery Association] officials initially suspected fraud or a reporting error. However, all 110 winners had played numbers from fortune cookies made by Wonton Food Inc. of Long Island City, New York. The factory had printed the numbers "22, 28, 32, 33, 39, 40" on thousands of fortunes. The New Yorker expanded on this strange incident, interviewing a VP at Wonton Food, Inc which produced the fortune cookies in question: Each day, Wonton’s factory churns out four million Golden Bowl-brand cookies, which are sold to several hundred venders, who, in turn, sell them to most of the forty thousand Chinese restaurants across the country. Wonton’s primacy in the industry and, for that matter, in the gambler’s imagination is such that when, in March, five of six lucky numbers printed on a fortune happened to coincide with the winning picks for the Powerball lottery, a hundred and ten people, instead of the usual handful, came forward to claim prizes of around a hundred thousand dollars. Lottery officials suspected a scam until they traced the sequence to a fortune printed with the digits "22-28-32-33-39-40" and Donald Lau’s prediction: "All the preparation you’ve done will finally be paying off." "We’ve had winners before, but never this many," Lau said the other day, in his East Williamsburg office, which is furnished with stacks of financial reports and “A Dictionary of American Proverbs.” "A computer picks the numbers, not me. If only a computer could also write the fortunes." Well the good news for Mr. Lau is that a Markov Chain apparently can write semi-plausible fortunes. But as the largest fortune cookie manufacturer in the United States, the way that Wonton Food picks Lucky Numbers has a small but amusing impact on lottery odds. ## Gathering more data: the plot thickens¶ How can we gather more lucky numbers in order to infer more about how they are chosen? The problem is, we can't assume that every fortune cookie company uses the same method for choosing numbers, so in order to keep digging it will be necessary to establish some fortune cookie provenance: that is, which company produced these fortunes, and how we can observe more numbers. We do have one clue here that prevents the fortunes we saw from being completely generic. As you may have noticed, there is a message printed above the lucky numbers we saw: How about another Fortune? SecondFortune.com  This site is run by Wonton Food, Inc, so apparently our fortunes came from them; not surprising since they are the largest supplier of fortune cookies to Chinese restaurants in the United States. With this knowledge in hand, I took a closer look at the company's web presence and found their Twitter feed, where they often post pictures of their fortune cookies just like this one: If you look closely, you can see a few lucky numbers on the other side of the paper. Same thing here: Sometimes the numbers are even right out front: So with some very painstaking social media stalking, we could maybe double the amount of the data. But look at these fortunes for a moment — while they are very inconsistently styled, you may have detected that they all have similar blue corner artifacts. According to a recent "99% Invisible" podcast about the history of fortune cookies, these blue corners are a telltale hallmark of Wonton Food fortune cookies. Armed with this knowledge, the use of Google image search unlocks quite a bit more data. For example, search terms like "fortune cookie lucky numbers" result in many relevant data points: Here are 85 more fortunes that look to be made by Wonton Food, Inc: In [3]: more = """ 1 15 18 20 30 48 # http://cdn.smosh.com/sites/default/files/legacy.images/smosh-pit/092010/cookie-28.jpg 2 17 28 34 8 22 # http://4.bp.blogspot.com/_ZgdRrxdmW8c/S4VoEESgGCI/AAAAAAAAAnc/2bJJnwlJ6Xo/s400/bday+fortune+cookie.jpg 2 50 29 13 9 19 # http://www.blog.themacgroups.com/wp-content/uploads/2014/05/0507141400d.jpg 2 51 55 25 31 50 # http://farm6.static.flickr.com/5294/5579768060_0954248504.jpg 4 24 17 30 39 43 # http://3.bp.blogspot.com/-AR5mie8G25M/Ucer_nCzj_I/AAAAAAAAKr4/QOaOBS_LHX4/s1600/photo+5.JPG 4 25 50 11 15 40 # http://2.bp.blogspot.com/_cWYMJRWCfs8/S7tAmnfCDdI/AAAAAAAABFA/IyQ-qugqZ4s/s1600/030.JPG 4 43 3 32 37 6 # https://secure.cabot.net/images/emails/CWA/Screenshot/Cookie-back.png 5 8 16 24 32 46 # http://3.bp.blogspot.com/-rvrZeRdobgw/UwpByfWQeuI/AAAAAAAAAJk/EdAys8exDIQ/s1600/fortune+cookie.JPG 5 22 34 11 2 17 # http://j-walkblog.com/images/fortune1.jpg 5 31 2 47 39 18 # http://cdn.acidcow.com/pics/20101209/hilarious_fortune_19.jpg 7 10 17 23 31 40 # http://2.bp.blogspot.com/-LjaA2JqnMcc/TfBCTgLONVI/AAAAAAAAAAY/wya8xFVHnWw/s1600/021.JPG 7 22 31 43 5 30 # http://3.bp.blogspot.com/-0ay1RDi8_U8/T85T_nJSN7I/AAAAAAAAAKY/YzdFtLOPqa0/s1600/cookie-12.jpg 8 11 17 23 25 45 # http://www.dumpaday.com/wp-content/uploads/2011/05/fortune-cookies-fortunes32-640x426.jpg 9 13 18 24 33 46 # http://anunmarriedwoman.com/wp-content/uploads/2010/10/photo3-e1287272247993-225x300.jpg 9 26 1 50 23 39 # http://blog.waygoapp.com/wp-content/uploads/2014/05/3183015279_9459e96875_z.jpg 10 13 15 20 25 28 # http://laynewong.com/wp-content/uploads/2013/09/fortune-cookie.jpg 10 13 18 31 35 36 # https://melissajanda.files.wordpress.com/2013/04/fortune1-e1367003950889.jpg 10 15 19 23 33 34 # http://kinetic.org/fortune.jpg 10 22 38 49 5 47 # http://static.fjcdn.com/pictures/Troll_07b744_1477660.jpg 11 7 25 38 5 17 # http://1.bp.blogspot.com/-VWsMoh42x2w/T4blcYLYYrI/AAAAAAAAaxU/0l7-Ojb98tQ/s1600/565%2312yoda.jpg 12 8 56 37 45 26 # https://scontent.cdninstagram.com/hphotos-xaf1/l/t51.2885-15/s640x640/sh0.08/e35/12530997_1123086017703550_1469065506_n.jpg 12 15 19 26 35 40 # http://www.lifeofanarchitect.com/wp-content/uploads/2014/05/Fortune-Cookie-600x450.jpg 14 15 16 17 26 38 # https://247wallst.files.wordpress.com/2015/09/fortune-cookie.png 14 15 20 23 35 50 # http://www.sportsbet.com.au/blog/wp-content/uploads/melbourne-cup-lucky-numbers.jpg 14 21 16 42 32 11 # http://blog.waygoapp.com/wp-content/uploads/2014/02/dscf3302-1.jpg 15 3 42 22 14 25 # http://data.whicdn.com/images/11510916/large.jpg 16 18 22 56 17 38 # http://blog.waygoapp.com/wp-content/uploads/2014/05/3183015279_9459e96875_z.jpg 16 47 30 51 44 14 # http://i.imgur.com/FhfYlTd.jpg 17 19 4 50 18 27 # https://s-media-cache-ak0.pinimg.com/236x/f6/95/14/f695140d8c53d1e731598120ca97e483.jpg 17 28 38 37 44 2 # http://ide-a.net/images/fortune006.jpg 18 12 11 33 52 55 # http://i.imgur.com/bun1qOh.jpg?1 18 23 32 34 39 41 # http://anunmarriedwoman.com/wp-content/uploads/2010/10/photo3-e1287272247993-225x300.jpg 19 8 24 22 52 53 # https://pbs.twimg.com/media/CNHlEc_WEAAIAB0.jpg 19 21 23 25 27 29 # https://www.chaossoftware.com/blog/wp-content/uploads/2013/06/fortunecookie-300x101.jpg 20 11 31 47 26 8 # https://s-media-cache-ak0.pinimg.com/736x/39/38/45/39384572487bf20c088439b9bed0e153.jpg 20 13 1 9 10 16 # http://blog.waygoapp.com/wp-content/uploads/2014/05/3183015279_9459e96875_z.jpg 20 37 5 17 3 18 # https://keenanevans.files.wordpress.com/2010/01/fortune-cookies1.jpg 21 17 16 12 27 30 # http://i.imgur.com/B9kD7Ff.png 21 44 28 33 14 8 # http://www.nsmbl.nl/wp-content/uploads/2011/11/345157670_8f57c10e6a_z1-610x250.jpg 22 38 42 55 33 18 # http://blog.waygoapp.com/wp-content/uploads/2014/05/3183015279_9459e96875_z.jpg 23 22 18 36 38 47 # http://www.dandelionchocolate.com/wp-content/uploads/2011/08/fortune-500x377.png 24 35 36 17 8 39 # http://www.tiptoptens.com/wp-content/uploads/2011/02/funny-fortune-cookies-8.jpg 26 36 46 17 48 6 # http://www.tootimidandsqueamish.com/wp-content/uploads/2012/03/Learn-Chinese-1024x629.jpg 27 5 16 34 14 8 # https://www.playhugelottos.com/uploads/assets//Pictures/fortune_cookie_lottery_numbers.JPG 28 13 19 56 10 23 # http://1.bp.blogspot.com/_xoUjXFG3_sI/SwnCqR206SI/AAAAAAAAACw/aGF-6fri3TU/s1600/cheese+003.jpg 28 24 19 1 36 4 # http://farm4.static.flickr.com/3295/2996524954_9bc76a8cc8.jpg 28 29 16 52 38 14 # https://s-media-cache-ak0.pinimg.com/236x/64/83/00/64830093931ea68f7557b1540854c96d.jpg 28 33 46 5 15 2 # http://www.asian-central.com/stuffasianpeoplelike/wp-content/uploads/HLIC/48816c7764a842c9bebf9ce302a6812a.gif 28 54 22 52 47 9 # http://cdn3.geckoandfly.com/wp-content/uploads/2014/08/Blog-quote-fortune-cookie.jpg 29 27 46 54 28 2 # http://www.blog.themacgroups.com/wp-content/uploads/2014/05/0507141400d.jpg 30 1 46 14 44 26 # http://1.bp.blogspot.com/_1ChRCAfQNM4/SFZzGqNmhSI/AAAAAAAAAQI/R6sf9fe3qac/s400/Fortune+001.jpg 30 25 17 38 46 2 # http://i.imgur.com/0NQDGRJ.jpg 32 1 19 53 13 22 # http://static.fjcdn.com/pictures/When_8a02ca_1502019.jpg 32 31 46 27 38 2 # https://enlustered.files.wordpress.com/2014/09/img_1093.jpg 32 41 12 11 54 45 # https://s-media-cache-ak0.pinimg.com/236x/da/4d/72/da4d7233fc19791f245d96f0e90ccd91.jpg 32 45 15 17 52 11 # https://s-media-cache-ak0.pinimg.com/236x/75/cf/08/75cf082199cb1d61632173bff59d05a7.jpg 33 40 7 25 4 28 # http://www.barnorama.com/wp-content/galleries/02/cookie/10.jpg 34 11 27 42 5 37 # http://www.andrewlove.org/blog/blogpics/fortune.jpg 34 17 22 28 36 5 # http://images.baklol.com/2_jpeg2d02ed04e752e850e7bab0895197b888.jpeg 34 41 52 35 42 51 # https://s-media-cache-ak0.pinimg.com/236x/5d/b6/bb/5db6bb1a37b590e1b385ab8a09519864.jpg 34 56 8 17 7 1 # http://img.ifcdn.com/images/89d23a5ceb47420e268615daf45cdb817ad7a5d3d342f874ee54fff2524e2c34_1.jpg 35 32 39 21 38 48 # https://s-media-cache-ak0.pinimg.com/236x/33/0b/90/330b900102b6a82879f1bdc8631f5e73.jpg 37 7 42 35 22 49 # http://i2.cdn.turner.com/cnnnext/dam/assets/140306221521-dnt-womans-lucky-fortune-cookie-wins-millions-00012928-story-top.jpg 37 18 10 17 26 47 # http://blog.waygoapp.com/wp-content/uploads/2014/05/3183015279_9459e96875_z.jpg 37 50 30 33 53 52 # http://i.imgur.com/FhfYlTd.jpg 39 24 15 44 21 43 # http://www.barnorama.com/wp-content/galleries/02/cookie/15.jpg 39 53 42 48 31 55 # https://s-media-cache-ak0.pinimg.com/236x/6c/4e/24/6c4e24eb8b53f3a6f4e7def63c2d34e1.jpg 41 20 18 9 22 17 # https://s-media-cache-ak0.pinimg.com/600x315/16/1a/62/161a627822d288cbfd38403c7b56f83c.jpg 41 25 27 52 11 24 # http://3.bp.blogspot.com/-SFfBdkDLk60/Tg1MB3mt9BI/AAAAAAAAAD4/bbvtCwFDPTo/s1600/photo%255B1%255D.PNG 41 35 15 16 29 9 # http://25.media.tumblr.com/tumblr_le6571kvjv1qb87jao1_500.jpg 41 49 46 16 51 12 # http://i.imgur.com/i353DZL.jpg 42 6 47 26 24 46 # http://i.imgur.com/bun1qOh.jpg?1 43 54 38 42 20 18 # https://s-media-cache-ak0.pinimg.com/236x/d4/e9/a5/d4e9a54a4c3c0da773fa72c56d7ef92c.jpg 43 55 26 45 7 10 # https://iamemilyanne.files.wordpress.com/2012/09/fortune-cookie_new-house.jpg 44 1 24 36 47 32 # http://3.bp.blogspot.com/-AR5mie8G25M/Ucer_nCzj_I/AAAAAAAAKr4/QOaOBS_LHX4/s1600/photo+5.JPG 45 21 33 17 3 20 # https://creativejamie.files.wordpress.com/2010/07/you-will-receive-a-fortune-cookie.jpg 46 2 50 10 36 53 # https://c1.staticflickr.com/9/8174/8035925042_13d272f86d_z.jpg 46 19 28 4 27 3 # https://twitter.com/wontonfood/status/737657278740987904 47 37 4 25 44 15 # http://blog.waygoapp.com/wp-content/uploads/2014/05/3183015279_9459e96875_z.jpg 48 3 28 43 24 16 # https://pbs.twimg.com/media/BEn61IgCEAE7tkp.jpg 49 32 28 38 7 43 # http://static.fjcdn.com/pictures/When_8a02ca_1502019.jpg 50 26 14 27 36 30 # http://img.ifcdn.com/images/89d23a5ceb47420e268615daf45cdb817ad7a5d3d342f874ee54fff2524e2c34_1.jpg 50 47 22 1 35 20 # https://www.tomslatin.com/wp-content/uploads/Learn-Chinese-Dry-Cleaning.jpg 53 37 1 17 32 42 # https://webb63.files.wordpress.com/2013/02/fortune-learn-chinese.jpg 55 26 25 51 11 27 # https://ashleyburnette.files.wordpress.com/2011/09/011.jpg""" # split this text block into numbers more_numbers = np.array([[int(n) for n in line.split('#')[0].strip().split()] for line in more.split('\n') if line]) # combine the numbers together y_all = np.vstack([y, more_numbers]) y_all.shape  Out[3]: (89, 6) ## Sampling again with more data¶ Now that we have assembled all of this extra data, we expect that the variance will be reduced in our posterior distribution over the lucky number upper bound$N$. Let's run the sampler again. In [4]: trace = sample_upper_bound(y_all.ravel()) show_output(trace)  N: Mean SD MC Error 95% HPD interval ------------------------------------------------------------------- 56.000 0.010 0.000 [56.000, 56.000] Posterior quantiles: 2.5 25 50 75 97.5 |--------------|==============|==============|--------------| 56.000 56.000 56.000 56.000 56.000 None  As we can see, the posterior distribution over$N$has tightened up considerably. We are now quite sure that$N=56$, implying that all Wonton Food lucky numbers fall in the interval$[1,56]$. Looking at the sampling trace, we can see that virtually any other proposed value for$N$was rejected except for a handful of times the value 57 was accepted. What I find really interesting is how close we got using only the first four fortune cookies. In fact, here is how the uncertainty diminishes as we open more and more fortune cookies: In [5]: %%time from tqdm import tqdm # shuffle the fortunes to random order shuffled = y_all.copy() np.random.shuffle(shuffled) # create an empty matrix to hold all the results m, n = len(y_all), (50000 - 10000) * 4 # n_samples minus burn-in times number of chains results = np.zeros(shape=(m, n), dtype=np.int) # now run the sampler with one additional fortune in each iteration and keep track of results for i in tqdm(range(m)): data_so_far = shuffled[:i+1] trace = sample_upper_bound(data_so_far.ravel(), n=50000, burn_in=10000) results[i, :] = trace.get_values('N')  100%|██████████| 89/89 [12:22<00:00, 8.81s/it] CPU times: user 2min 40s, sys: 11.2 s, total: 2min 51s Wall time: 12min 22s  In [9]: from scipy import stats xs = np.arange(1, m + 1) highs = np.percentile(results, 95.0, axis=1) mode = stats.mode(results, axis=1).mode.ravel() fig, ax = plt.subplots(figsize=(12, 4)) plt.fill_between(xs, mode, highs.ravel(), alpha=0.5, color="gray", label='95% highest posterior density') plt.plot(xs, mode, label="Maximum a posteriori estimate") plt.ylim(45, 80) plt.xlabel('fortune seen') plt.ylabel('estimated$N$') plt.legend() plt.show()  Unlike the probabilistic higher bound, the lower bound on$N$is just based on the random order of the fortunes, in that it takes on the value of the highest number we happen to have seen so far. But as we can see, it didn't take very many fortunes before we had extremely high certainty that$N=56$. ## A second digression¶ At the time of writing, if you visit SecondFortune.com you will be greeted with a set of lucky numbers like the following: As it turns out, they have a snippet of Javascript here choosing numbers from 1 to 99: for (var i = 0; i < 6; i++) { var luckyNumber = Math.floor(Math.random() * 99) + 1;$('.lucky-numbers > span').eq(i).html(luckyNumber);
};


Come on guys, get it together.

## The distribution question¶

Now that we feel confident about the sample space, we might wonder about the number selection methodology. One natural question is whether all numbers have an equal probability of being chosen. Here are overall frequency counts for all of the numbers:

In [10]:
fig, ax = plt.subplots(figsize=(12, 4))
counts, edges = np.histogram(y_all.ravel(), bins=np.arange(1, 58))
plt.bar(edges[:-1], counts)
plt.xlim(1, 56)
plt.xlabel('value')
plt.ylabel('count')
plt.show()


We can also see how the counts break down by placement within the ordered lucky number vector:

In [11]:
fig, ax = plt.subplots(figsize=(12, 4))
number_by_position = pd.melt(pd.DataFrame(y_all))
number_by_position = number_by_position.rename(columns={'variable': 'position'})
sns.swarmplot(x='position', y='value', data=number_by_position)
plt.show()


It's hard to tell just by inspection whether certain numbers are favored or not. For example, 17 looks overrepresented, but that could have happened randomly. One thing that is more clear now is that these numbers are being sampled without replacement:

In [12]:
def count_occurrences(fortune):
values, counts = np.unique(fortune, return_counts=True)
return counts

def all_unique(fortune):
counts = count_occurrences(fortune)
return (counts == 1).all()

unique_within_fortunes = all(all_unique(fortune) for fortune in y_all)
print('Numbers are unique within every fortune:', unique_within_fortunes)

Numbers are unique within every fortune: True


## "Future work" (not really though)¶

Although everyone will rightfully be running out patience with this problem by now, it is interesting to think about how we might model the selection process to learn more about the underlying probabilities (i.e. whether there is bias in how numbers are chosen, and exactly what that bias is).

For example, one approach would be to ignore some complexity by glossing over the sampling without replacement in each individual fortune, and to treat the numbers as 534 rolls of a possibly unbalanced 56-sided die. The distribution involved is the multinomial, so we might try to fit a Dirichlet-multinomial model and draw inferences about the different probabilities by the resulting Dirichlet parameter.

For sampling without replacement, Wikipedia gives the following "urn interpretation" of the multivariate hypergeometric distribution:

The model of an urn with black and white marbles can be extended to the case where there are more than two colors of marbles. If there are $K_i$ marbles of color $i$ in the urn and you take $n$ marbles at random without replacement, then the number of marbles of each color in the sample ($k_1$,$k_2$,...,$k_c$) has the multivariate hypergeometric distribution. This has the same relationship to the multinomial distribution that the hypergeometric distribution has to the binomial distribution—the multinomial distribution is the "with-replacement" distribution and the multivariate hypergeometric is the "without-replacement" distribution.

That seems like a promising place to start, though we will likely end up looking at a version that models bias if we want to allow for the possibility that probabilities are unequal while still requiring only one of each number. It even looks like there is an R library called BiasedUrn which does exactly that.

If anyone tries it out, I'd be interested to hear how it goes.

Any comments or suggestions? Let me know.