Early Birds and Sleepy Heads

Background

I have been making records of the birds seen in my back garden, together with the time of each sighting. Here I am trying to display any relationship between the species of bird seen and the time of the sighting. Are some birds earlier risers than others?

I have a small back garden in North Wiltshire, in a housing estate at the very edge of a market town. It is separated from an estate road by a Leylandii hedge, and contains a Manchurian Cherry Tree (Prunus Mackii). There is another house and garden to each side. I maintain two sunflower seed feeders, and a peanut feeder. I typically keep an eye out for birds from the breakfast table, and for about two hours combined reading and birdwatching on Saturday mornings.

Because my times are not random samples of all possible viewing times, all I can hope to describe are the times I see birds, not the times that birds are present. However, I hope that many of the confounding factors will affect different species in much the same way, so that comparisons between species will be reasonably reliable. One unfortunate confounding factor is between the time after sunrise a bird is seen and the time of year when it is seen - because my schedule is driven by clock time, but the birds (I presume) by solar time. The BTO's garden birdwatch shows that the reporting rate of birds varies throughout the year, even for non-migratory birds, so this is definately a worry. (See the GBW results at Garden BirdWatch Species Trends). I could well be missing some bird that rises with the sun, if it is most common in summer, when I rise some time after the sun.

What to plot

One way to display any relationship between time and sighting would be to draw a timeline showing time after sunrise and bird sighted. I worry that any pattern in the data would in this case be hidden by the inherent patchiness of such a graph. Another way would be to note down the numbers of sightings in some period, but I don't know how to trade off the noise seen in short periods against the lack of detail in long periods. Instead, I have decided to consider, for each time, the proportion of sightings from each species seen at or before the given time, thus giving a sort of cumulative probability distribution for each species.

That's not quite what I plot, though. To show up any deviation from an overall pattern, I will plot, for each species, the difference between the proportion of sightings of that species at or before each given time and the proportion of all sightings at or before each given time. In this way, a period of time in which a species is relatively infrequent should correspond to a downward slope, and a period of time in which a species is relatively frequent should correspond to an upward slope.

Am I seeing behaviour or coincidence?

I want to be able to look at the graph and say 'species A here plots lower than species B so A is relatively less likely to be seen before this time than B'. So I want to put some indication on the graph of the reliability of the figures. Note that, although I plot species proportion minus overall proportion, when I compare two lines the contribution due to the overall proportion cancels out, so I am only interested in the uncertainty in the species proportions. I will produce confidence intervals such that the probability of producing a confidence interval not containing the true value is at most 0.05/N, where there are N species in the graph. If I then find two non-overlapping confidence intervals, I will mistake the true relative order of the proportions for those species only if at least one of my N confidence intervals for that time is wrong. With N species this will happen with probability at most 0.05, or 1 in 20 - which is the weakest of the common statistical measures of significance. Note that this (Bonferonni) correction does not require the figures for the various different species to be independent of each other.

I do not in fact plot raw proportions. Some days bring more cats than birds. Other days bring flocks of half a dozen Greenfinches. I don't want more populated days to influence the figures too much. So I limit each day's contribution for any time to the fraction of birds of the species in question seen before that time on that day. The figure plotted for each time is therefore the sum of all these fractions, divided by the number of days yielding any sighting of the relevant species. Some days will contribute mostly 0s and 1s to the sum. Other, more populated days, may yield 0.1, 0.25, 0.333, and similar terms. I will assume that individual day's figures are independent of each other.

Normal Approximation

So I want to set confidence intervals for the mean of a distribution which is pretty non-normal, but which we know is confined between 0 and 1. If we denote our unknown mean, the expectation of the distribution, by x=EX, we have 0 ≤ x ≤ 1. Because X is confined between 0 and 1, we have 0 ≤ X2 ≤ X ≤ 1 so EX2 ≤ EX, so Var(X) ≤ x(1-x). This limit is reached by distributions with P(X=1) = p and P(X=0) = (1-p). Because we have a sum of independent identically distributed variables with finite variance, the central limit theorem tells us that the distribution of the sum is asymptotically normal. We can set an approximate confidence interval by finding the two values of x which yield the desired sigmage for the deviation between x and the observed mean, given the worst case variance under x. Given an observed mean m and a sigmage s we take t = s/n and have

(m-x)2 = tx(1-x)
m2 - 2mx + x2 = tx - tx2
(t+1)x2 - (t + 2m)x + m2 = 0
x = [(t + 2m) +/- sqrt((t + 2m)2 - 4m2(t+1))] /
    [2(t+1)]
x = [(t + 2m) +/- sqrt(t2 + 4tm(1-m))] /
    [2(t + 1)]
We can apply a 'continuity correction' by adding +/- n/2 to the value of m before our calculation.

As an example, I will take s=2.5758, which gives a confidence (for a normal distribution) of 99%: that is, the probability that the confidence interval you get does not contain the true value is 1%. As an example, here are the confidence intervals for various values of n and m (labelled 'Quadratic Interval'):
MeanObservations Quadratic Interval Continuity Correction Binomial
0.51000.42080.5792 0.41590.5841 0.36890.6311
0.5100.27370.7263 0.23490.7651 0.12830.8717
0.21000.14370.2714 0.13940.2768 0.10840.3212
0.2100.07030.4526 0.04490.5038 0.01090.6482
0.11000.06150.1586 0.05760.1645 0.03200.2020
0.1100.02340.3405 0.00720.3985 0.00050.5443
0.01000.00.0251 0.00.0341 0.00.0450
0.0100.00.2048 0.00.2772 0.00.3690

Note that the behaviour close to zero is entirely sensible: if m = 0, the equation (m-x)2 = tx(1-x) has one root at 0 and one at t/(1-t). This contrasts sharply with another possibility: the confidence interval based on the t test. If this was given 10 (or 100) observations of 0, its estimated standard deviation would be zero, leaving it unable to give any sensible answer. The table above can also be considered as a normal approximation for the confidence interval of a binomial distribution, since the binomial distribution has exactly the mean and variance we have taken here. Furthermore, a binomial distribution is an allowable case of the distributions considered here, so if the intervals given above do not include the intervals we would get for the binomial distribution, our normal approximation has led us into error. The corresponding confidence intervals using the binomial distribution are as in the right hand column above, so you can see that the intervals are indeed too narrow. The binomial confidence intervals in fact strictly include both the intervals based on the normal approximation. That would not be a practical problem in itself, since binomial confidence intervals are readily available (at least when n times the mean is an integer). We could overcome the problem of non-integer sums by rounding up to produce the upper confidence limit, and down to produce the lower confidence limit. The worry is that even those intervals may not be wide enough.

A notional example where binomial intervals are too narrow

Suppose that we observe the sum of 10,000 samples from a [0,1] distribution, and that this sum is 9998. To make this come out, I have to choose very unrealistic parameters, but they will just about produce a 2-sided confidence interval. If we set p=0.999731, we find that the probability of producing 9998 or more successes out of 10,000 is 0.4961. Since this probability is < 0.5, it could be produced by taking a two-sided confidence interval, admittedly at a very low level of confidence, in fact a 0.78% confidence interval. Since 0.999731 < 0.9998, this would be the lower bound of a two-sided confidence interval. Now suppose that instead of a success scoring 1, it scored 9998/9999. Then to reach a total of 9998 we would need to succeed 9999 times. However, the probability corresponding to the same mean as our previous setting is 0.999731 * 9999 / 9998 > 0.99983099. The probability of having 9999 or more successes with this probability is 0.4963 which is greater than the previous probability. So a two-sided confidence interval of just less than 0.78% based on the binomial distribution would exclude a mean as low as 0.999731, but by using a distribution with weight only on 0 and 9998/9999 we can achieve the observed sum with higher probability than is required to get the mean within the confidence interval.

Chernoff Bounds

One version of this appears as exercise 22 of section 1.2.10 in Knuth Volume I (The Art of Computer Programming, that is). Where we have a sum of binomial distributions with mean m, P(X ≥ mr) ≤ (r-rer-1)m when r ≥ 1. The proof there uses generating functions, but you can think of this as considering E(tX), so it works for continuous distributions as well. In fact, as you might guess from the result, the proof quickly ends up considering E(etX), in very lightly disguised form, and so is really a proof using the characteristic function. Now, since ex is a convex (u-shaped) function, the distribution with a given mean in our class of distributions with the largest value of E(ex) is the binomial distribution, so this result can be used to bound the probability p(X ≥ mr) regardless of whether the probability distribution is binomial, as long as it is confined to [0, 1].

The problem with this is that the bound is not very good. Suppose that n=100 so if the individual distributions have p(X=1) = 0.5 we have m=50, with a variance of 5. Then a 2-sigma deviation is to a value of 60, with r=1.2. By the Chernoff bound, the probability of this is at most (1.2-1.2e0.2)50, which is 0.391. However, the probability of a 2-sigma deviation in the normal distribution is < 0.023, and the tail probability for the binomial distribution in question is in fact 0.0284.

A Randomised Test

One way to apply the binomial distribution to the situation is to take the observed fractions and use them to generate 0s and 1s, where the value generated from a given fraction is 1 with probability that fraction. The result of this is a sequence of 0s and 1s with underlying probability according to the mean of the original distribution. So we can generate a confidence interval on this according to the binomial distribution. One problem with this (apart from the fact that we are obviously losing information) is that applying the same test to the same data at a different time will produce a different interval. I am inclined to take this as evidence that applying a binomial confidence interval to the original sum is pretty safe, and I can put some sort of figure to this.

Let us suppose that, given the raw data, we work out all the possible confidence intervals the randomised test might come up with. These are the binomial confidence intervals from 0, 1, 2... n successes in n attempts. Because e.g. the upper bound for a confidence interval is the probability satisfying p(SUMiXi ≤ S) = c, for some c, we know that the upper and lower bounds of the confidence interval increase with the number of successes. We now chose the median interval. What is the probability that the data are such that this interval does not include the true value? Well if it is too low (high), then so are all the intervals to its left (right). Since we chose the median confidence interval, this includes at least half of the intervals, weighted by probability. So choosing the median interval provides a probability of getting a bad interval at most twice the probability from the randomised test.

How many tests?

Running 100 statistical tests and then publishing only the most significant result without comment is cheating - at random, we expect one result significant at the 1% level from 100 tests. What about looking at a graph to find out where two confidence intervals separate? For our bird data, we could end up plotting one point on the graph for each observation time. If each point corresponds to a statistical test, this is a huge number of different tests.

The standard correction for multiple tests follows from the Bonferonni inequality: p(X | Y) = p(X) + p(Y & ¬X) ≤ P(X) + P(Y). If we make N tests and consider the most significant tail probability found, we should multiply this by N first. For an extension, see Holm's Sequentially Selective Bonferroni Method. These methods are safe even if the individual tests are not independent, but they can become very conservative. In our case, the individual tests are very closely related. The fraction of birds of a species seen before 10:00 will be very similar to the fraction seen before 09:59, and before 10:01. We cannot afford to treat the data from N observations as N different tests - but we need to provide a credible confidence interval. What should we do?

Quantile-Based Bonferroni corrections

The statistic we are plotting should change slowly over time, so we don't need to believe short blips of significance: we expect any effect to show itself for reasonable portions of time. Suppose that we write down all the significance levels produced, and then take the median significance level, or, more generally, the 90th or 99th percentile significance level. If the data are random, what is the probability that we reach some given significance level, or greater? We know that if we took, at random, a single significance level from truly random data, that the probability of achieving a tail probability of p or smaller is at most p (could be less if we have a conservative test). If the probability of achieving a given significance level at the median is q, then the probability of achieving this with our single randomly chosen point is at least q/2, because if the median achieves this level so does the half of the data appearing more significant than the median (it can be more than this if the test is such that there is a non-negligble probability that two samples will be given the same significance level, in which case there may be samples to the left of the median with the same significance level). Similarly, if the probability of achieving a given level at the 95th percentile is q, the probability of achieving it in our randomly chosen single sample is at least q/20. So if we decide beforehand to e.g. look at a single quantile of the significance level achieved, we can correct for the multiple-test effect without killing ourselves due to very highly correlated samples by dividing the raw significance by the fraction of the data above the quantile - so multiplying by two for a median, by 20 for a 95th percentile, and so on. Of course, we typically won't specify one particular quantile in writing ahead of time, just as we don't necessarily write out a complete testing strategy in detail before seeing the data, but we are now into the realms of common-sense judgement. If you want to make this rigorous, you might consider strategies such as considering the 99.9%, 99%, 95%, 75%, and 50% percentiles, and using a second Bonferroni corection to account for this.

Example

Suppose that we have 1000 observations (actually I have 1170 so far). If these all turn out to be to be at different times I will have 1000 points to plot on the graph (actually I only have 447 points). If I use the standard Bonferroni correction, I would have to multiply any tail probability by 1000, so to produce a 1% tail probability after the correction I would need a tail probability in the raw data of 1E-5 or less. My bird observations are generally between 06:00 - 20:00, which covers 14 hours. If I expect any period of significance to extend for pretty much an hour, I might hope to see at least 5% of the values significant at some level. So it would be reasonable for me to look at the 95% quantile significance level from my 1000 observations. Since I only need to multiply this by 20 to get a credible significance value, I need to see a tail probability of 5E-4 or less - still quite demanding, but 50 times less so that the standard correction.

On the other hand, this quantile-based test shows me that something in my data is significant, but doesn't point out what it was. Suppose I make 1001 tests, and the real situation is that 496 of them are subject to some influence that produces extremely significant results. If I look at the median significance level, I will be looking at the 5th highest value from 505 non-causal tests, so I will probably see a tail probability of about 1 in 100. If I multiply this by two (because I am looking at the median significance level) I will see a value of about 1 in 50, which is significant - but the test occupying the median position is not significant at all: the true non-random behaviour is confined to the 496 high-scoring values above it.

Ruger and Simes

Looking round the web for Bonferroni, it seems that these 'quantile-based Bonferroni corrections' are an example of Ruger's inequality: see Combining statistical tests by multiplying p-values. This paper also quotes Simes as pointing out that if the many tests are in fact independent of each other, you can test the kth probability, in order, against kp/n, and reject the null hypothesis if it is smaller than this, while achieving a false alarm rate, if the probabilities are independently uniformly distributed in [0, 1] of exactly p.

This isn't too hard to show with a multiple integral, if you avoid getting tangled up setting the limits. Call the probabilities, in order, x1 ≤ x2 ≤, ... xn. The random variables xi are uniformly distributed on the section of [0, 1]n cut out by the inequalities that say that they have been sorted: sorting them maps an out-of-order section holding the unsorted values evenly on to the single ordered section. If you work out the multiple integral
I(xn=0..1) I(xn-1=0..xn) .. I(x1=0..x2) dx1 .. dxn.
you get 1/n!, which tells you that the chance that the original values turn up in sorted order is 1/n!. The probability of not producing a false alarm is
I(xn=p..1) I(xn-1=(n-1)p/n..xn) .. I(x1=p/n..x2) dx1 .. dxn.
The bounds of the various integrations don't collide, because putting the integral over xn on the outside makes sure that they decrease from the outside in. The value of the innermost integral is x2-p/n. The value of the two innermost integrals then is (after a cancelation) x32/2 -px3/n. By induction, the value of the i innermost integrals turns out to be xi+1i/i! - pxi+1(i-1)/(n(i-1)!). To get the value of the whole thing, define xn+1=1 and you get (1-p)/n!. The factor of n! again represents the chance that the n values come pre-sorted, so the chance of a false alarm is exactly 1-p.

Holm-Bonferroni

See Holm's Sequentially Selective Bonferroni Method for more stuff on Bonferroni's inequality.

Results

So far I don't really have enough data to be very sure of anything much - even plotting using the normal distribution-based confidence limits and 2.5758 sigma (for 99% confidence before Bonferroni, ignoring the problems with the normal distribution) most of the confidence limits include zero.

I am considering only 5 species: Blackbird, Blue Tit, Greenfinch, Sparrow, and Starling (so my 99% confidence interval before Bonferroni becomes a 95% confidence interval afterwards). I do record a wider range of species, but I have picked just five to look at here: they are among the most common in my records, but I have also allowed personal interest to sway my choice. The most striking sleepy head so far is the Starling (except for a burst of energy early on), and the best early riser the Blackbird. Actually, the early risers aren't anything like as striking as the Starling's sleepiness: they may only be early by comparison. Here are the plots. I plotted the difference between the mean fraction of birds seen of that species before any given time and the mean fraction of all species before any given time, using symbol 'o', then plotted points above it ('+') and below it ('-') based on the width of the confidence interval for the per-species value alone. Times are in seconds since sunrise. The plots were generated using the R statistical and graphical package.
graph for starlings
graph for blackbirds