
The Elusive Lighthouse: Part 1

Stuart Burrows · 16 Dec 2016 · 7 minute read

A Light Under a Bushel

You’re walking along a straight beach at night and you happen to know there’s a lighthouse somewhere out to sea, shrouded in darkness on a rocky island. But it is no ordinary lighthouse: it emits flashes in random directions, favouring no angle over any other. Fortuitously, you have a set of sensors rigged up along the coastline that register the incident locations of any flashes that hit the beach. Where is the lighthouse?

It turns out that this little problem holds some salutary lessons for data analysts. The geometry of the scenario is simple: the beach runs along the x-axis, and the lighthouse sits at the point (a,b), a distance b out to sea from the shore position x = a. As a running example we generate samples of flash records randomly from the case (a,b) = (7,5); our length units are kilometres. The task is to infer a and b from the flashes.
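To make the generative model concrete, here is a minimal Python sketch (the helper name sample_flashes is ours) that draws each flash angle uniformly and records where the ray meets the beach:

```python
import numpy as np

def sample_flashes(a, b, n, rng):
    """Draw n flashpoints for a lighthouse at (a, b).

    Each flash leaves the lighthouse at an angle theta drawn uniformly
    from (0, pi); from the geometry, tan(theta) = b / (a - x), so the
    ray meets the beach at x = a - b / tan(theta).
    """
    theta = rng.uniform(0.0, np.pi, size=n)
    return a - b / np.tan(theta)

rng = np.random.default_rng(0)
print(sample_flashes(7.0, 5.0, 10, rng))  # ten flash records for (a, b) = (7, 5)
```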

The Lighthouse Sets Sail

Let us first try the obvious thing and estimate a by averaging the x-values of the flashes:

(1)   \begin{equation*}  \hat{a}(\underset{\sim }{x},N)=\frac{1}{N}\sum _{n=1}^N x_n \end{equation*}

where the x-values of the flashes are \left\{x_n\right\}_{n=1}^N. Let’s experiment with this a little. If we set (a,b) = (7,5) and generate 20 sets of ten flashes, and apply the estimator \hat{a}(\underset{\sim }{x},10) to each set, then we get the following estimates of a:

{\footnotesize\begin{array}{llllllllll} 15.4225 & 14.6187 & 3.33113 & 6.74264 & 19.9288 & 5.04957 & 14.5059 & 13.0493 & 3.64369 & -3.69191 \\ 1.60463 & 11.256 & 0.46392 & 4.23715 & 8.61815 & 8.143 & 19.241 & 16.5687 & -7.83116 & 7.22419 \\ \end{array}}

Some of the estimates are close to 7, but others (e.g. 19.9, -7.8) are very different. Very well; this is what happens in random sampling: you get some random variation. Surely we can improve the estimate by increasing the sample size—the number of flashes per estimate. Let’s now try N = 1000 flashes for each estimate:

{\footnotesize\begin{array}{llllllllll} -23.3759 & 10.7517 & 4.44569 & -0.574101 & 5.06378 & 0.0604402 & 10.3572 & -1983.75 & 22.8081 & 8.34198 \\ 19.2871 & 11.7708 & 10.1388 & 38.9687 & 6.36427 & -8.78411 & 6.3699 & 5.31658 & 11.1447 & 2.53911 \\ \end{array}}

This time we have a -1983! Okay, it’s time to deploy the nukes: perhaps we can finally swamp these annoying variations with a sample size of a million!

{\footnotesize\begin{array}{llllllllll} 8.70474 & -6.68166 & 75.0219 & -0.772723 & 39.7443 & 5.22276 & -13.9028 & 28.0768 & -7.32828 & 37.5783 \\ 0.67937 & 8.75211 & -102.242 & 0.967948 & 9.76076 & 49.6395 & 23.4652 & -11.7712 & -11.5722 & 9.37723 \\ \end{array}}

Yet now we have seven results that differ from the true value by more than 20km! Frankly, we still have no clue about the lighthouse’s position along the beach.
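All three experiments are easy to reproduce; a sketch, reusing the sample_flashes helper above (your particular values will differ, but the spread will not):

```python
# Twenty estimates of a at each sample size, for (a, b) = (7, 5).
# Note that the spread of the estimates does not shrink as n grows.
for n in (10, 1_000, 1_000_000):
    estimates = [sample_flashes(7.0, 5.0, n, rng).mean() for _ in range(20)]
    print(n, np.round(estimates, 2))
```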

And Now For Something Completely Different

Shall we try thinking it through? Lurking somewhere deep beneath our futile attempt was an implicit assumption that the distribution for \hat{a}(\underset{\sim}{x},N) ought to concentrate about a as the number of flashes N increases. Let’s make that intuition more precise. Let I stand for the information in the problem description, before any flashes were registered. Now if in addition the location of the lighthouse were known to be (a,b), then each flashpoint x_n would be subject to uncertainty, which we shall describe by the probability density p_F(x_n | a,b,I). The form of this distribution follows from the lighthouse’s uniform habits and the geometry of the scene; we will say more about it later. Our estimator \hat{a}(\underset{\sim }{x},N)=\frac{1}{N}\sum _{n=1}^N x_n is a function of N and the flashpoints \left\{x_n\right\}_{n=1}^N, so its density is determined by N, a, b, and I; call it p_E(\hat{a} | N,a,b,I). Now, our implicit hope was that as we increased N, p_E(\hat{a} | N,a,b,I) would pile up more and more narrowly about a, so that when we came to take a variate from p_E(\hat{a} | 10^6,a,b,I), it would be pretty close to a. We can make the idea explicit by naming what we consider to be a small deviation from a as \epsilon, and considering the probability

(2)   \begin{equation*}  P\left(\left| \hat{a}-a\right| >\epsilon | N,a,b,I\right)=1-\int_{a-\epsilon }^{a+\epsilon } p_E\left(\hat{a} | N,a,b,I\right) \, d\hat{a}=1-\int_{-\epsilon }^{\epsilon } p_E(x+a | N,a,b,I) \, dx \end{equation*}

which we’d like to be a (generally) decreasing function of N. Moreover, we’d like it to decrease steeply enough that it is close to 0 for feasible values of N. In order for this estimator to have some modest utility, we might specify a deviation of \epsilon = 1km, and hope that P(\left| \hat{a} - a\right| > \epsilon | N,a,b,I) falls to 0.1 for a cost-effective number of flash records N = 100. This means that our estimate \hat{a}(\underset{\sim }{x},100) would have a 90\% chance of falling within a kilometre of a.
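We can estimate this probability for our estimator directly by simulation; a rough Monte Carlo sketch, again reusing sample_flashes (the helper name p_bad_estimate is ours):

```python
def p_bad_estimate(a, b, n_flashes, eps, trials, rng):
    """Monte Carlo estimate of P(|a_hat - a| > eps | N, a, b, I)."""
    means = np.array([sample_flashes(a, b, n_flashes, rng).mean()
                      for _ in range(trials)])
    return np.mean(np.abs(means - a) > eps)

for n in (10, 100, 1000):
    print(n, p_bad_estimate(7.0, 5.0, n, eps=1.0, trials=2000, rng=rng))
```

As the next section explains, the printed probabilities show no sign of falling with N.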

The Ship Runs Aground

As you no doubt noticed from our earlier simulations, our estimator achieves nothing of the sort. The reason lies in the form of the flashpoint distribution p_F(x_n | a,b,I). Consider first what would have happened if the flashpoint distribution had been the Gaussian distribution with mean a and standard deviation b. Here is a random set of flashpoints from that distribution for our example case (a,b) = (7,5):

[Figure: flashpoints drawn from the Gaussian distribution]

We need to calculate p_E\left(\hat{a} | N,a,b,I\right), the density for our estimator \hat{a}. The estimator is a sum of N independent Gaussian variates with mean a and standard deviation b, divided by N. The density of a sum of independent variates is the convolution of their densities, which is the inverse Fourier transform of the product of their Fourier transforms. Performing this for the sum of Gaussians, and following with a change of variables to account for the division by N, we get a Gaussian with mean a and standard deviation \frac{b}{\sqrt{N}}. (In brief: the Fourier transform of a single Gaussian flashpoint density is e^{i a k-\frac{1}{2}b^2k^2}; raising it to the N-th power and rescaling k\to \frac{k}{N} for the division by N gives e^{i a k-\frac{1}{2}\frac{b^2}{N}k^2}, the transform of the claimed Gaussian.) Following equation (2) with \epsilon =1, our probability-of-bad-estimate as a function of N is

(3)   \begin{equation*} P\left(\left| \hat{a}-a\right| >1 | N,a,b,I\right)=1-\frac{\sqrt{N}}{b\sqrt{2\pi }}\int_{-1}^1 e^{-\frac{N}{2}\left(\frac{x}{b}\right)^2} \, dx \end{equation*}

which is plotted below for b=5.

[Figure: probability of a bad estimate versus number of flashes N, Gaussian case with b = 5]

Thus, if the flashpoint distribution were the given Gaussian, and the lighthouse were in fact 5km from shore, we could achieve the desired 90\% chance of a good estimate (one with a-1<\hat{a}<a+1) by collecting just 68 flashes.
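The figure of 68 follows from equation (3), whose integral has the closed form 1-\text{erf}\left(\epsilon \sqrt{N}/(b\sqrt{2})\right). A quick check in Python, assuming SciPy is available:

```python
import numpy as np
from scipy.special import erf

def p_bad_gaussian(n, b, eps=1.0):
    """Equation (3) in closed form: P(|a_hat - a| > eps) for Gaussian flashpoints."""
    return 1.0 - erf(np.sqrt(n) * eps / (b * np.sqrt(2.0)))

b = 5.0
n_needed = next(n for n in range(1, 1000) if p_bad_gaussian(n, b) <= 0.1)
print(n_needed, p_bad_gaussian(n_needed, b))  # 68, ~0.099
```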

Now let’s return to the case at hand and figure out the form of the flashpoint distribution p_F\left(x_n | a,b,I\right). A flash which is eventually recorded must depart the lighthouse at an angle 0<\theta <\pi, as shown below.

[Figure: a flash departing the lighthouse at angle θ to the beach]

Before we observe the flash, \theta is distributed uniformly on (0,\pi ):

(4)   \begin{equation*} p_{\theta }(\theta | a,b,I)=\frac{1}{\pi }\;\;\; ,\;\;\; 0<\theta <\pi \;\;\;\text{(else 0)} \end{equation*}

From the geometry we have

(5)   \begin{equation*} \tan \theta =\frac{b}{a-x_n} \end{equation*}

and the change of variables from \theta to x_n, using p_F\left(x_n | a,b,I\right)=p_{\theta }(\theta | a,b,I)\left|\frac{d\theta }{dx_n}\right| with \left|\frac{d\theta }{dx_n}\right|=\frac{b}{\left(x_n-a\right){}^2+b^2}, yields

(6)   \begin{equation*} p_F\left(x_n | a,b,I\right)=\frac{b}{\pi \left[\left(x_n-a\right){}^2+b^2\right]} \end{equation*}

This is the density, known as a Cauchy or Lorentz distribution, governing each flashpoint. If we superimpose it on our original set of flashpoints, it looks like this (scaled up for visibility):

[Figure: the Cauchy density (scaled up) superimposed on the flashpoints]
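The change of variables is easy to sanity-check numerically: by equation (6), the fraction of flashes landing within b of a should be \frac{2}{\pi }\arctan (1)=\frac{1}{2}. A sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = 7.0, 5.0

# Flashpoints from uniformly random angles, as in equations (4)-(5):
theta = rng.uniform(0.0, np.pi, size=200_000)
x = a - b / np.tan(theta)

# Equation (6) integrates to (2/pi) * arctan(1) = 1/2 over (a - b, a + b):
print(np.mean(np.abs(x - a) < b))  # ~0.5
```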

Here’s the rub. When we calculate the density p_E(\cdot | N,a,b,I) using a Cauchy flashpoint distribution, we get p_F(\cdot | a,b,I)! That is, the density for the estimator is just the same as the density for the flashpoints, which appears above, and does not depend on N. So, collecting many flashpoints and taking the mean is mathematically equivalent to estimating a by a single flashpoint.
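The same Fourier-transform argument as in the Gaussian case explains why. The transform of the Cauchy density (6) is e^{i a k-b\left| k\right| }, so the transform of the density of the N-flash mean is

\begin{equation*} \varphi _{\hat{a}}(k)=\left[e^{i a \frac{k}{N}-b\frac{\left| k\right| }{N}}\right]^N=e^{i a k-b\left| k\right| } \end{equation*}

identical to that of a single flashpoint. Here is a numerical confirmation, assuming SciPy is available (the Kolmogorov-Smirnov test asks whether our sample means are consistent with single draws from Cauchy(a,b); a large p-value says they are):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a, b, N = 7.0, 5.0, 1000

# 5000 sample means, each over N Cauchy(a, b) flashpoints.
flashes = stats.cauchy(loc=a, scale=b).rvs(size=(5000, N), random_state=rng)
means = flashes.mean(axis=1)

# Compare the means against the single-flash Cauchy(a, b) distribution.
print(stats.kstest(means, stats.cauchy(loc=a, scale=b).cdf))
```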

For the same reason, the probability-of-bad-estimate is not a function of N and doesn’t decrease with N:

(7)   \begin{equation*} P\left(\left| \hat{a}-a\right| >1 | N,a,b,I\right)=1-\frac{1}{\pi }\int_{-1}^1 \frac{b}{x^2+b^2} \, dx=1-\frac{2}{\pi }\arctan \left(\frac{1}{b}\right) \end{equation*}

which for b=5 evaluates to 0.874. Our modest goal of a 90\% chance of a good estimate has been all but turned on its head. The Cauchy distribution also has heavy tails, which means many of the 87.4\% of estimates in error by more than a kilometre will be out by much more than a kilometre, such as the -1983km we saw earlier. Our only hope is that b is small.

However, in order to bring the probability-of-bad-estimate down to p, we need to bring b down to \epsilon \tan \left(\frac{\pi }{2}p\right), which, for our p=0.1 and \epsilon = 1km, evaluates to 158 metres! In other words, the method begins to provide a somewhat-reliable, somewhat-accurate estimate of the lighthouse’s position along the beach only when the lighthouse is almost literally a stone’s throw from the shore.
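Both numbers are one-line checks:

```python
import numpy as np

b, eps, p = 5.0, 1.0, 0.1
print(1.0 - (2.0 / np.pi) * np.arctan(eps / b))  # 0.874...: probability of a bad estimate
print(1000.0 * eps * np.tan(np.pi * p / 2.0))    # 158.4...: largest b (in metres) achieving p = 0.1
```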

And we haven’t even tried to estimate b!

In the next post, we will demonstrate how to remedy our signal failure by seeking safer shores in the sound principles of probability theory.

References

Sivia, DS & Skilling, J 2006. Data Analysis: A Bayesian Tutorial. Oxford University Press, Oxford, United Kingdom.