This describes a collection of events in the Harry Potter stories (up to and including "Harry Potter and the Half-Blood Prince"). I count them and draw graphs from the result, in an attempt to see if the personalities described in the books can be squared with a cold-blooded accounting of their actions. Here is the link: Who attacks Who.
Here's some fragments (mostly to do with Statistics) that may or may not be well known (to those that know them). If you have any comments on the ideas - especially references to books where I can find out more about them (or why they're a bad idea) - please send them along to firstname.lastname@example.org. Feel free to copy and reuse everything on this web site (at your own risk), but I would appreciate an acknowledgement if you do.
I am extremely short-sighted. A few years ago there didn't seem to be a lot known about the causes of this; in particular, there were no really rigorous experiments done to establish the effects of exposure to near work, such as reading. Myopia and Near Work argued for just such an experiment; it gives a few references to the state of knowledge at present and shows how such an experiment might be done, in sufficient detail to get an idea of the number of volunteers that might be needed. There is sufficient precedent from the few studies that have been done to assess other interventions, in particular the use of special glasses, to establish this.
More recent work shows that the connection may be between lack of exposure to daylight and myopia. URLs like http://www.newscientist.com/article/mg20427331.100-generation-specs-stopping-the-shortsight-epidemic.html?full=true report both observational studies of children and experiments with chicks, a very powerful combination: the studies of children show relevance in Homo Sapiens, and the studies in chicks show that lack of exposure to daylight can cause myopia, so that the effect in children need not be due to myopic children avoiding outdoor activities.
Most attempted interventions have been unsuccessful. Undercorrection (prescribing weaker glasses than required, or not correcting at all) has been tried widely, but may make things worse than accurate correction, which at first seems counter-intuitive, if we believe that (as correlation between myopia and literacy appeared to show) near work can cause myopia. There are models of the growth of myopia that account for this. a toy model of growth in the normal and myopic eye explains a very simple mathematical model, inspired by more complex models validated by computer simulation. I am leaving it here in case anybody is interested in what little control theory it contains: it shows, for instance, that reversing the sign of information derived from sensors is likely to confuse any control system (hints of Dr Who "reversing the polarity of the neutron flow").
Many books on software engineering are top down; they present the one true way to develop a computer system, justified by the author's experience and judgement. Comparing one book with another, we conclude that they cannot all be correct. In Design Notes I have written a collection of notes from the bottom up; each note can be read in isolation, and makes a few observations that I hope will justify themselves to you on their merits, or by references made within the note. Even the entire collection taken together doesn't tell you how to develop a computer system, but I hope that what it does say is correct, and at worst irrelevant.
The design of APIs is an especially important part of software design: the design of an API, and of the example code demonstrating that API, is tremendously influential, and, once published to its users, cannot be easily changed. While reading around this topic I have noticed that, although it is not plausible that any one method will mechanically produce a perfect API design, it is plausible that a checklist might be useful in evaluating an API design, just as a cognitive walkthrough, and other checklist-driven methods, can be useful in evaluating a GUI design. API Design Checklist contains some notes from my reading, and culminates in a couple of checklists, one for checking that an API is minimal, and one that it is easy to use.
Long ago now, I was set the problem of finding the fastest route for a ship from one point to another, taking account of weather. I was surprised to find that the obvious building block of finding the shortest route through a graph has drawbacks here: much more recently I returned to this and wrote and tested a possible solution - see Ship Routing / network abstractions of continuous spaces.
A publish-subscribe server receives publications and subscriptions from other programs. It matches incoming publications with existing subscriptions and notifies those subscribers. It turns out that this is a very flexible and robust way to link multiple concurrent distributed programs; because publishers do not talk directly to subscribers, they do not need to be changed if new subscribers are added that wish to be notified of their data. Publish-Subscribe gives more details of this and points at a simple Java implementation, with demo code.
A use case is a page or less of text usually showing, by example, how a person can interact with a computer to perform a task. I believe that these are exceptionally useful in dealing with computer systems, and they may be one of the most widely useful discoveries of software engineering. In Uses of Use Cases I describe some of the reasons why they are useful when developing computer systems.
Beam Search (called "Stratified Greed" by Knuth), is a general way for computers to solve problems in which you can decide on just part of the final answer, look to see if that partial answer looks promising, and then make another decision to extend that partial answer towards a full answer, or give up and try another partial answer. For example, if an answer is 100 characters long, you might be able to decide how promising a guess for just the first character is.
Beam Search is a simple and natural process that works well enough to be used in practice. Not that much has been said about it because, for completely general applications, there are counter-examples that show that its behaviour is not completely predictable. In particular, you can do more work and get worse answers. In Beam Search I compare it with less general alternatives, for which there are stronger guarantees of performance, and show that, if you run Beam Search in these sorts of situations, you can do about as well as those more sophisticated alternatives, without having to know quite as much about the problem.
I show that this justifies the use of Beam Search in some situations, and tells you how you should tune it to approach the performance of its more sophisticated competitors more closely.
In this world of so-called big data, I like the idea of writing a program that could read in a large dataset and pick out a small region that doesn't behave like the rest of the data: it might be interesting to find out why not. Here be Dragons - outlier detection for Weka describes progress so far at a very early stage of writing Weka plug-ins for this (Weka is a Java framework for data mining). This is very much work in progress - the main reason for writing it up is to provoke more ideas on it, and to motivate me to continue working on it.
The UK government has released a large variety of datasets. I started by looking at some NHS prescription data, then moved on to some data about schools, and then some crime statistics. I found I could use a similar approach on each dataset, so reusing code and giving that code a more thorough test. In all of these cases, I fit mixture models with EM and then look for cases where no mixture of the fitted distributions fits well.
The UK government has released information on prescriptions written by all General Practices in England. I wrote a program to look for outliers in this dataset, and I describe the results in Prescription (big?) Data and Outliers.
This data gives summary results for local authorities: the percentage of pupils in different groups passing a threshold of achievement. The groups are split by ethnicity, by whether they speak English as their first language, and whether they are eligible for free school meals (a measure of poverty). I describe this in Schools (actually quite small) Data and Outliers.
This data shows the offences recorded in (pretty much) local authority regions in England and Wales. Again I look for cases where the distribution recorded is unusual by fitting a mixture model and looking for exceptions. The writeup is in Crime (actually very small) Data and outliers.
If you can read in a file in CSV format and arrange its records into trees, then you can draw affinity diagrams and compute Franklin's "Moral Algebra", (You can sort of do this by cutting and pasting rows in spreadsheets, but the result isn't quite as pretty). Trees, SpreadSheets, Affinity Diagrams, and Ben Franklin describes a Java program that supports this.
Another web page along similar lines is Latin Hypercube sampling. This allows you to write fragments of ideas in the cells of a table, and then permutes the columns for you, so that each row is a different combination.
I am told that, in order to poison rats effectively, you must supply the poison in at least three different forms, or they will work out what is ailing them and avoid it. I'm not sure that humans could do as well. Smarter than the average Rat? tries to design efficient experiments to isolate up to k different foods (or exercises, or whatever) than have disadvantages that aren't obviously linked to them.
Intel's "tick-tock" design strategy shortens the interval between the releases of new chips. More general strategies might exist to decrease the response time of teams, and so give them a competitive advantage. Psychological refractory period, OODA, and Tick-Tock describes and measures one feature that increases the time taken to respond to events, and suggests possible ways to reduce this effect.
Penance could also be called "things I wished I'd thought of at the time". A little on freedom of speech, a lot on my spectacular unsuccess at fathoming people's reasons for doing things, and pointers on (not) being sore after exercise.
The Ethics of Creating Virtual Universes attempts to describe some of the questions an academic research project might have to answer if it did in fact have the ability to simulate a virtual universe detailed enough to be representative of our own. It also suggests how an even more powerful entity might end up doing something very similar almost by accident. (While I have tried to write something that is self-consistent, and even accurately reflects the relevant parts of theoretical computer science, I am well aware that we do not, in fact, have such resources, and so I do not expect this to be taken entirely seriously).
Once you have broken a computer-related project into a collection of smaller tasks, you have to work out which to do first (unless, of course, you have no more tasks than you have people). One reasonable recommendation is to do the most risky tasks first, so that you find out the worst as early as possible, while you still have the flexibility to reschedule round it. If you have to fit your answer into a few sentences, this is a pretty good one, but it's unlikely to be the whole story, because scheduling work is a whole subject in its own right. Scheduling Risk works through a few examples to make "do the most risky tasks first" a bit more precise, and to show that there are circumstances where this rule of thumb does not produce the best possible solution.
Software projects, especially large, government-run, software projects, have a tendency to consume much more time (and resources) than originally planned. One scheme for estimating progress and time to completion, in the tradition of Agile Project Management, is to divide the work up into tasks, and use the time taken for the work so far to estimate the time required to complete the entire project. In Estimating the progress of a project I consider one challenging, but not completely implausible, probabality distribution for the time taken to complete tasks and show that, with this assumptions, quite large safety factors are required if projects are not likely to exceed the predicted time with noticeable probability.
If you want to generate a small permutation at random - say, on a few thousand, or even a few hundred thousand objects, there are well-known ways to shuffle an array of numbers - e.g. java.util.Collections.shuffle(). If you want to create a permutation on a few billion objects, you probably want to generate the permutation without creating an array of this size, by working out how how to compute a function which tells you the position of a particular input object after applying the permutation to it. This requirement is also known as format-preserving encryption, because it is related to ordinary encryption: the main difference is that normal ciphers are permutations, but permutations on a fixed number of objects: DES and AES are keyed permutations on 264 objects. Permutations from hash functions and block ciphers describes this, and points out, by the way, that most block ciphers, treated as keyed permutation generators, can generate only half of all possible permutations.
mcdowella.zip is a jar file containing Java source referred to here, and in the pages linked to this. It's a .zip in the hope that this will prevent this web server from translating line feeds. It's source because I've sometimes found it's easiest to spot mismatches or messed up CLASSPATHs in Java if you recompile everything, and to encourage you to keep thinking of these pages as a source of ideas that you may or may not decide to agree with, rather than anything I've claimed must be true. Which brings me to...
I'm just having fun here! I don't work at statistics for a living. So by all means pick over this piece by piece and make your own mind up. Don't say you weren't warned if all the code is buggy and all the proofs invalid.
I spent a few months of evenings, weekends, and holidays considering how you might look for useful plants if you were, for instance, a drug company looking for plants with medicinal properties. People actually do this - the basic tool is mass screening, but I was interested in how you might decide which plants are worth screening. I was also interested in the general strategy of picking out a good shortlist of things in general for further testing, and in analysing datasets downloaded from the internet, as a second best for going out there and getting your own data. I'm not able to test my ideas for selecting plants by going out there and running mass screening sessions, so I had no choice about that one, anyway! Summary of plant search project is a link to a summary of my report. The report itself is in StarOffice word format, with loads of graphs. Here is a link to a zip file containing an html translation of the report text: writeup.zip. I separated the summary because the writeup is quite long, and excluded the pictures because they make it just too big to be a sensible download.
Devizes canal has a series of apparently identical pools, populated by different birds. It turns out that the birds don't see these pools as identical: they are picky enough that data from only six visits can prove that a difference exists. This is described in Birds in Devizes Pools.
In my quest to see more birds, I have accumulated 4 mid-price binoculars with specifications of 7x42, 8x30, 8.5x44, and 10x40. Which of these actually let me see more birds? The answer appears to be that none of them are all that much more productive than the others - details in Birds by Binocular.
Birdwatching Morning and Evening shows that you see more birds in the morning than in the early afternoon, even if you are looking around a park with a lake, where you might expect the birds in the lake to stay put (they do, but the land birds nearby make themselves scarce in the afternoon).
I have started to take an interest in birdwatching, as an opportunity for amateur science. Having collected a list of times various birds were sighted in my garden, I have tried to find out what species were early risers and what sleepyheads. Analysing the data exposes some interesting little statistical puzzles: the writeup Early Birds and Sleepy Heads concentrates on those, but there is a summary of the actual results at the bottom, together with two graphs. (Spoiler: not quite enough data yet).
Duck Ponds aren't all identical! Your local duck pond may well host a slightly different collection of birds to my local pond, and to the pond in the next town along. So don't say "oh, it's just a duck pond" and ignore it - have a look and see what is there and what isn't. Duck Ponds demonstrates this by looking at what birds I saw, over a number of visits, in "duck ponds" (two of the three are actually rivers) in Calne, Chippenham, and Devizes.
Suppose you want to examine a hedgerow. Is it best to scan slowly along it, or to hold your point of view steady for a succession of short periods, flicking quickly from period to period? Some evidence on this question is at Scanning for Motion.
Prediction, Laplace, and the Second Coming is a rather light-hearted attempt to work out what we can say after waiting a measured length of time for an event and observing no occurences of it whatsoever. It was originally written with an event in mind that has been promised for nearly the past 2000 years, but that's really just an excuse to talk - don't take my views on that too seriously.
Perhaps you just need to look at the problem in a different way, modify your set ideas, redefine the client's perception of time :-) The Eternity Clock is one way of wishing away these problems.
The motivation for Exact and semi-Exact statistics is one of distrust of officialdom. I have seen people work hard and gather statistics showing this, that, or the other, and present them for public view. The official response was not "It's a fair cop, guv!", but "This is a small collection of isolated examples. Everybody knows you can't tell anything from small samples". Well, you can. A lot of the time what you can tell is just that there isn't enough data, but you can at least tell that for sure, and not just hear it as a convenient excuse from somebody.
Prediction and modelling is an particular application of a standard idea I haven't seen elsewhere (that's a warning, not a boast!). It offers reliable confidence intervals for predicted values under very general assumptions (which unfortunately are not quite general enough to be that useful in real life).
Use your skill and judgement: human judgement and statistical tests describes two programs that allow you to compare the judgements you make by eye about curve-fitting and time series prediction with those made by a computer program. For the simple (but common) curve-fitting task, I think that any curve-fitting judgements should be backed up by a computer program. For time series prediction, I believe that the relative effectiveness of the two methods, and very likely any two methods, depends so much on the characteristics of the particular time series under study that the first step should be more detailed study of the characteristics of the time series, rather than attempts at either automatic or manual prediction.
Three (and a bit) Smoothers describes a simple smoothing algorithm using locally weighted linear fits and compares it with R's lowess and with what seems to be an abuse of the Kalman filter.
Linear smoothing and binomial models describes programs intended to draw smooth lines through observations of binomially distributed, or nearly binomially distributed, values, such as the win/lose/draw results of football games, or integer counts restricted to a small range. (It is illustrated with data from the 2006/2007 football season, because that is what it was easy to get my hands on, but I originally planned to use it with records of the sightings of birds throughout the day).
This is just a slightly different version of a proof of a one-tailed version of Chebyshev's theorem that I came across because of Henry Bottomley's web page, at http://www.btinternet.com/~se16/hgb/cheb.htm.
A counterexample Involving medians shows that if you know the medians of the distributions behind X and Y, you don't necessarily know which of P(X < Y) and P(X > Y) is more likely.
Posterior predictive p-values are used for model checking in Bayesian statistics. The idea is to compare the data actually seen with predictions based on the posterior distribution of the model parameters after taking account of that same data. This sounds a bit fishy, but the web page shows that there are some relevant theorems, including one that the probability of the observation under the posterior is always at least its probability under the prior: a mathematical proof that hindsight is easier than foresight!
Suppose that you run a computer program to make N statistical tests and get a lowest tail probability of p. Are you surprised? Holm's Sequentially Selective Bonferroni Method justifies the Bonferroni principle of examining Np, and describes a neat improvement which never provides a less sensitive test, and is slightly better when it is possible that more than one of your program's tests can find significant results simultaneously.
The specification for a JVM capable of running on a palmtop or mobile phone does not include floating point. This is a problem if you are trying to write some sort of domain-specific calculator. I believe that SUN should define the interface to a floating point library as part of the MIDP specification, and provide a reference implementation. Floating Point for CLDC Java provides more details, and a demonstration implementation for an example interface.
V. Anashin found a neat way of constructing full-cycle random number generators from a wide variety of building blocks, and Mok-Kong Shen drew sci.crypt's attention to it. This construction doesn't guarantee good properties (apart from the full cycle) but I think it's lovely. Here's a link to a short writeup: Anashin's long cycle random number generators.
Here is a brief description of the programs and classes in the jar file. All of this is in Java, and most requires Java 1.2 = Java 2. (This is because I am a great fan of Collection Classes, which Java 1.2 provides in java.util).
I am delighted by any algorithm that is clever and effective, but still simple enough for me to understand and implement quickly. I plan (as a very low priority task) to write prototype pre-beta quality implementations of whatever happens to catch my fancy and put them here. For now, here is a very short list.
Patricia Variations is mostly about uses of the Patricia datastructure. This is best known as a way of implementing a lookup table where the keys are variable length strings, with a restriction that no key is a prefix of any other key. I show that you can work around that restriction efficiently by virtually recoding the keys to a prefix-free form. Via this encoded world you can implement a useful extra operation efficiently: find the longest key which is a prefix (unencoded) of a given target string. You can also use a Patricia tree when the keys are integers or floating point numbers, not just as a lookup table, but also as a priority queue. I provide C++ code implementing and timing these routines, comparing Patricia trees with the STL map datastructure, and with STL priority queues, and also with leftist trees (a less common priority queue structure, described in Knuth).
It is also worth noting that the Patricia tree data-structure described above does not require storage allocation while in use, and so can never fail due to storage exhaustion. Leftist trees share this advantage. The priority queue structures described here also allow the deletion from the queue of nodes other than the topmost item, a feature not shared by the STL priority queues (which are, however, significantly faster). Of course, I provide (without warranty) code for all of this. As a bonus, you also get an allocation-free list class, a class for timing routines, and code to override new(), new() and delete(), so as to detect common storage allocation and overwriting errors.
A Java-based Patricia tree implementation, PatriciaGeneric, is one of the programs described in Short Descriptions of Java classes. This is quite flexible and configurable, and has been used to produce a class to find the nearest neighbours of query points in n-dimensional space. See Nearest Neighbours.
I would like to try a prediction method that starts by clustering the observations into a tree without looking at the variable to be predicted, and then prunes this tree to split out groups of observations to be included in the same regression. So far I have only got the clustering coded, but that does include an implementation of a recent clustering algorithm. See Trees with prediction in mind.
What if you want a list where insertion, deletion, and moving to the nth position in the list are all reasonably efficient? A well known idea is to hold the list as a balanced tree, and keep track at each point of the number of items in the tree below that point. Stefan Nilsson has written Java code to do this that implements the Java collection class List interface. He uses "generalised balanced trees", which have less overhead for balancing than red-black trees and the like, but offer only amortised complexity guarantees. That is, over a long period of time, pretty much every operation takes time that goes up only with the logarithm of the number of items in the tree, but any single operation involving modification may take longer. Note that I said involving modification - the most common datastructure justified via amortised complexity is the splay tree, where even searching is only efficient in amortised terms. However, with generalised balanced trees, the cost of a search is only logarithmic. You can find these off this link to Nilsson's home page. (Look for "An efficient implementation of the Java List interface", under software).
I have been working on a Java client-server system where the server provides callbacks to the client. With straightforward use of RMI this means that the server requires stub class files that depend on the client code. These are supposed to be dynamically loaded, e.g. from a web server, which may be a pain. In our case we package both server and client in a single jar file, which is actually more convenient in this particular case.
This is all very well, but I would be interested in a pattern that allows the server to provide callbacks to arbitrary client code without importing its stub classes onto the server.
The pattern Remote Callback uses a second interface and a wrapper class that delegates to the real client code. The stub class is generated from the wrapper, and so is guaranteed to depend only on the interface, and not on the client code: it can be distributed with the interface. The second interface and wrapper are so constrainted by the original interface that they can be mechanically generated from it. A Java program for doing this (via Javadoc) is provided, as part of a zip file pointed to by the Remote Callback writeup.
I have a set of Java classes that make it easy and practical to keep enough information to recover the state of a set of files and directories at any time in the past. If you are working as a single programmer and don't have convenient access to a proper Source Code Control System, this may help: A simple versioning system in Java.
A heavyweight example of this is PHP, ASP, JSP, and so on, but using template files to separate boilerplate from variable program-generated chunks is too useful to use only via these. Classes for template generation of XML, HTML, etc. describes a very simple template-generation package that is small enough to be called from almost any Java program, and that acts as a subroutine, not a substitute for a main program. It is not limited to XML and HTML either, because it knows nothing whatsoever about the structure of the data it is producing.
As the saying goes, correlation does not prove causation. Furthermore, in the most common statistical circumstances, there are symmetries that make it obvious that we cannot tell cause from effect. In Cause and Effect: Which is Which?" I investigate cases when the most natural statistical models do not have these symmetries, but come to the conclusion that this provides no practical advantage. In most cases, we wish to distinguish cause from effect in order to apply a cause, and I doubt that, even if we guessed right, we could predict the consequences of applying a cause that we have never before applied.
Finite State Machine have long been used to describe requirements, and UML popularised a more expressive extension, Harel Statecharts. The package uk.co.demon.mcdowella.fsm contains classes to emulate a Finite State Machine, given, at run time, a description of the machine. There is more info at A Finite State Machine Package
Suppose that you have a number of buttons and labels to put onto a Swing GUI. Laying them out in a grid looks neat and tidy. Grouping related components together into a bordered rectangle would emphasise their relationship for the user - but isn't that natural to do with a GridLayout or GridBagLayout. With the new SpringLayout, one possible trick is to create a bordered panel and give it constraints that put it over a group of related components. SpringBorders.java is a Java file demonstrating this.. Since SpringLayout is new, here is are some notes on it.
Partly for fun, and partly as homage to my Biology teacher, here follows a link to the writeup of An experiment with Cuttings. I will try to do better next year.