Short descriptions of Java classes

Java Jar file

Here is a jar file containing Java source referred to here, renamed to a .zip in the hope that it will prevent this web server from translating its line feeds. (So you'll have to use jar vfx mcdowella.zip to extract all the programs before you can do anything sensible with it). If you have Java you have jar. If you really don't have jar, use pkzip or one of its imitators to extract the files. It's source because I've sometimes found it's easiest to spot mismatches or messed up CLASSPATHs in Java if you recompile everything, and to encourage you to keep thinking of these pages as a source of ideas that you may or may not decide to agree with, rather than anything I've claimed must be true. Which brings me to...

This is not my day job

I'm just having fun here! I don't work at statistics for a living. So by all means pick over this piece by piece and make your own mind up. Don't say you weren't warned if all the code is buggy and all the proofs invalid.

Programs

Here is a brief description of the programs and classes in the jar file. All of this is in Java, and most requires Java 1.2 = Java 2. Some now requires Java 1.5 (This is because I am a great fan of Collection Classes, which Java 1.2 provides in java.util, which Java 1.5 provides Generics, to make Collections type-safe). Most of the programs take arguments from the command line, which is of course not 100% Java (Java guarantees access to a GUI but not that users have access to a command line). They will all give you some sort of usage message if passed a '-garbage' parameter. I'm afraid the definitive documentation is the program source.

Action

This class represents Actions in the Finite State Machine package. You subclass it, overriding act(), to get your own code called when the Finite State Machine receives the correct Event.

AhoCorasick

This is an implementation of the Aho-Corasick algorithm, which is a quick and easily used algorithm for searching for strings. It only handles exact matches, but it is very cheap (constant amortised time per character, independent of the number of strings to look for) and it can accept characters one at a time and signal a match as soon as it receives the final character of the match.

AllPairs

This is mostly used for generating sets of test data. If you have a program that takes a number of input variables, each with a small number of options (or classes of values) it turns out that it is reasonably practical to put together a set of input data, such that each pair of input variables has each of its possible combinations of two options in at least one of the input data points. This is described further in All-Pairs Testing.

Assignment

This solves the Assignment problem, closely following the method given by D.E.Knuth in "The Stanford Graphbase", which uses the Hungarian algorithm. It expects a matrix of size n2 as input and takes, in the worst case, time n3. (It will in fact run on a matrix that is not square, as long as there are at least as many columns as rows, so that it can find a column for each row).

ByRank

This implements the Ranker interface. This allows you to hold an array of longs and address the array by rank order as well as by position. So you can find out the index of the highest or lowest value in the array, or the median value, or whatever. This is implemented using PatriciaGeneric, so it should be reasonably efficient, even for large arrays.

Callback

This is a simple callback interface.

Closest

This is a class to find nearest neighbours of a query point. It uses a balanced tree, but isn't very careful of where to put new points, so the bounding boxes for children of nodes in the tree can overlap, which means that nothing very good can be said about the query time, even if you are prepared to be pretty optimistic. Use PatriciaClosest instead.

Columns

This program reads in a table of data and works out the sum down a single column in that table, or down a subset of a group of columns from that table. Then it works out the probability of getting a value at least as extreme as this, if the values in each row were permuted at random. Normally it does this by considering all possible permuations for each row individually, then convolving together the distributions this gives for the contribution from each row to the sum. For it to do this it needs the sums to be of integers, and it will scale and round the input values as directed to achieve this. It will optionally run a monte-carlo test to confirm the probabilities calculated. Combined with Bonferroni's inequality, this gives you an exact, or semi-exact version of the one-way analysis of variance.

Connected

This works out the strongly connected components that make up a graph. See Connected Components for details.

CorPredictor

This is the rank-based prediction and confidence interval code referred to elsewhere. It's just a class, and has no main routine.

CycleDetector

Objects of this class are presented with a series of non-null pointers, which will have their equals() method called. If the series repeats continually, the objects are guaranteed to throw an exception sooner or later. So they can be used to detect infinite loops caused by e.g. passing a cyclic graph to a depth first search routine designed for a tree. This means that your depth first search scan doesn't have to mark nodes as visited to detect such cycles. This class is based on algorithms in Knuth Vol II designed to detect cycles in random number generators.

Deviant

Accumulates a running total of the mean, variance, min, max, sum, etc. of a sequence of doubles, using the algorithms in Knuth Volume II section 4.2.2 to guard against numerical instability. This can get bad enough in other approaches to produce apparently negative variance. As somebody pointed out on usenet, some calculators are prone to this, and make mistakes if you ask them the standard deviation of {1000000, 1000001, 1000002}. I believe early versions of Microsoft Excel were too, but later ones may not be - or may at least require more digits.

To run deviant you just let it read in a list of numbers from its standard input - e.g. java uk.co.demon.mcdowella.Deviant < inputFile. After it's read to the end of file it will output the min, max, mean, variance, and so on. It will ignore everything in its input from '#' to the end of the line, so you can put comments into its input if you want.

DoubleColumns

This was written to process the data for Birds in Devizes pools. It reads in a file in a format that says how many birds of what sort were seen in what locations and spits out the data reformated in a way that allows the first half of it (after it has been split out) to be tested with Columns, and the second half (again after editing) to be displayed with R or Minitab. Partly because of the editing required, and partly because it was written for one specific problem, it's not as generally applicable as most of the programs described here.

DoubleHeap

Keeps track of a double-ended heap: you can quickly insert items into this, find the largest and smallest items, and optionally remove them. See Double-ended heaps

DynamicClosest

This is a data structure used to find nearest neighbours. It is actually a wrapper round StaticClosest, which can find nearest neighbours, but not update its data structures efficiently. DynamicClosest works round this problem by keeping multiple StaticClosest data structures, of sizes roughly corresponding to powers of two, to provide a decent amortised cost for updates (that is, some operations will be expensive, because they require large StaticClosest data structures to be created or destroyed, but this happens infrequently enough that the per-update cost is reasonable). PatriciaClosest, which uses a completely different data structure, is reliably cheap to update, and actually has better query performance in my tests, so you should use it instead.

EditCost

This uses a simple dynamic programming method to work out the least cost edit distance between two strings. See Edit Distance/ diff for details.

EuclidNode

This class is used by Closest and StaticClosest, which perform nearest neighbour searches. It represents a point in space that can work out the Euclidean distance from itself to other such points. You are probably better off using PatriciaClosest than Closest or StaticClosest.

Event

This represents Events in the Finite State Machine class. You use these to set up transitions in the description of the machine, and then pass them in when the machine is being run to trigger transitions and actions.

EventListener

I have noticed that one of the things that tends to change with a Finite State Machine, in practice, is that you want to have more and more things happen when it changes state, or fires actions. So I provided an EventListener class, so that you can have stuff pick up state transitions and events without modifying the definition of the state machine itself. You attach an object implementing the EventListener interface to a node in the tree of states and substates, and it will then receive all events received at that state, and, optionally, at its substates. This includes Entry and Exit actions, except that you only receive the lowermost Entry and Exit action per transition.

FindScattered

This class has one useful static method, and some other definitions that are now largely obsolete, because the classes that use them are inferior to PatriciaClosest. The useful static method is also called findScattered(), and it is one of the few clustering routines whose cost does not grow quadratically with the number of points to be clustered. The other obvious candidate for this acolade is K-means, but it is best started off from a good set of candidate points. FindScattered could be used to find such a set of points. Its output consists of a set of N points (where N is one of its input parameters), each of which represents a cluster centre.

FindScattered deals with the points handed to it one by one, and has the disadvantage that its result depends on the order in which the points are handed to it. It can, however, guarantee that if there are N clusters in the data, and the smallest distance between any two points in different clusters is larger than the largest distance between any two points in the same cluster, then the cluster centres (actually points) returned by findScattered() will include points from all N clusters, as long as it is asked for at least N points.

FindScattered starts off by grabbing the first N points it sees. It then works through the data points one by one. If the minimum distance from the N points it has to the new points is larger than the minimum distance between any two of the N points, then one of the two points separated by that minimum distance is removed from the set of N points and replaced by the new point. This then increases the minimum distance between those two points (except where the minimum distance is tied, in which case it decreases the number of minimum distances).

If the points can be divided into N clusters as described above, then as long as the N points do not contain representatives from all N clusters, then whenever a new point from one of the missing clusters appears the minimum distance from any of the N points held to the new points will be an outside-cluster distance, but at least two of the N points held will be from the same cluster. So one of these duplicated points will be ejected and replaced by the new point. This will continue until all N clusters are represented.

FiniteStateMachine

This represents a Finite State Machine. Its constructor takes a tree of states and subsets, annotated to describe transitions, and it initialises itself from this, ignoring any later changes to the tree passed in. You call its acceptEvent() routine to pass it events. For more information see A Finite State Machine package.

FsmTest

This is a test/demonstration routine for the Finite State Machine package, uk.co.demon.mcdowella.fsm.

GlassesLinear

This program works out the likely result of an experiment to see if reducing near work changes the progression of myopia, under specified assumptions about the underlying model and the amount of random noise perturbing the results. It assumes that the data will be analysed using a pre-specified linear discriminant function. This isn't very realistic, but makes it easy to work out the number of volunteers required to produce a result of any specified strength. It expresses many of its results and sigmages; a sigmage is the mean value of a particular statistic divided by its standard deviation. See also GlassesReg, which works out the likely result for a much more reasonable analysis, using linear regression, but which does this by simulating a very large number of experimental runs.

The parameters to GlassesLinear are:

-age #
This specifies the age of the participants. In the model, the participants are assigned a linear rate of progression of myopia at birth, from a normal distribution. This therefore influences the variability of the myopia in the participants. The default is equivalent to -age 20.
-change #
The change in the progression of myopia brought about by the intervention. This is assumed to be linear; the myopia seen in the intervention group is taken to be the myopia they would have had in the control group plus this figure. The default is equivalent to -change 0.2.
-meas #
This is the standard deviation of the measurement error in the readings observed. The default is equivalent to -meas 0.4.
-rate #
This is the standard deviation of the rate of progression of myopia in the participants. This is modelled as if assigned to them at birth. This therefore combines with -age to affect the variability of myopia between participants. The default is equivalent to -rate 0.3.
-size #
This is the size of each group, for both the control group and the intervention group (so the total number of participants is twice this). The default is equivalent to -size 100.
-target #
This is the target sigmage; the program will print out the group size required to achieve this sigmage for the linear discriminant test, holding the other parameters unchanged. The default is equivalent to -target 5.0.
-tries #
The program calculates what should be the best possible linear discriminant. As a check, it also generates a number of linear discriminants at random and prints the sigmage of the best one. This parameter controls how many of these to generate. The default is equivalent to -tries 1000.

The output from the program starts with a copy of its input parameters. Then it describes the variance-covariance matrix expected for control group (and the intervention group, since the intervention is pressumed only to shift the mean). After that it shows the best linear function to apply to the observations to show a difference between the two groups, the resulting sigmage, and the group size required to attain the target sigmage. Finally it shows the sigmages attained for a variety of other linear discriminants, including the best one out of the random linear discriminants generated.

The program doesn't pay very much attention to the sign of the refractive error it generates, or attempt to apply the intervention only to data points that correspond to myopes. Because almost everything here is linear, this does not affect the results.

GlassesReg

This program works out the likely result of an experiment to see if reducing near work changes the progression of myopia, under specified assumptions about the underlying model and the amount of random noise perturbing the results. It analyses the data using linear regression. To get an idea of the expected results and their variablity it generates a large number of different versions of the data, at random, analysese them, and prints out statistics summarising the results found. See also GlassesLinear, which works out the results that would be produced from a simpler, but less realistic, analysis, without using simulation.

The parameters to GlassesReg are:

-age #
This specifies the age of the participants. In the model, the participants are assigned a linear rate of progression of myopia at birth, from a normal distribution. This therefore influences the variability of the myopia in the participants. The default is equivalent to -age 20.
-change #
The change in the progression of myopia brought about by the intervention. This is assumed to be linear; the myopia seen in the intervention group is taken to be the myopia they would have had in the control group plus this figure. The default is equivalent to -change 0.2.
-meas #
This is the standard deviation of the measurement error in the readings observed. The default is equivalent to -meas 0.4.
-rate #
This is the standard deviation of the rate of progression of myopia in the participants. This is modelled as if assigned to them at birth. This therefore combines with -age to affect the variability of myopia between participants. The default is equivalent to -rate 0.3.
-seed #
This is the seed used to start off the random number generator used. Different runs with the same seed and parameters should be produce exactly the same 'random' observations and so should produce exactly the same results; if you want to try an independent rerun, for instance after tuning the parameters, you need to specify a different seed. The default is equivalent to -seed 42.
-size #
This is the size of each group, for both the control group and the intervention group (so the total number of participants is twice this). The default is equivalent to -size 100.
-thresh #
This can be used to set a threshold for the level of myopia required to enter the study. If the level of myopia seen at the first prescription is not this value or less (myopia is -ve in dioptres, so lower values are stronger), then the observation is discarded and a new data point is generated in its place, repeatedly if necessary. This is intended as a check only, since the power of the experiment should not be reduced by this. The default value is so large as to turn off thresholding. Running with -thresh -6.0 appears to make only a very small difference to the results, most of which is probably just random fluctuation.
-tries #
This is the number of experimental runs to simulate. The default is equivalent to -tries 1000. The results in the writeup Myopia and Near Work were produced with -tries 50000.

The output from the program starts by printing out its parameters. Then it prints out some summary statistics for the deviance parameter (which looks to see if the control and intervention group are different). This includes the mean and variance of this parameter in all the different experimental runs. As you can see by running the program with -change 0, this will have mean 4 and variance 8 if there is no difference. If there is a real difference then, with enough data, the value seen will be far enough away from 4 to show that there is a difference. The expected difference summarised next is a measure of the effect of the intervention, as seen by the analysis; it is the mean effect expected due to the intervention. The program next prints out the mean claimed standard error for that effect, which can be checked against with the actual standard deviation printed out above for the expected difference. This gives a measure of the claimed accuracy of the value reported for the intervention.

The next three lines describe the least significance level achieved for different proportions of the experiments run. The first gives the significance level achieved or exceeded by 95% of the experiments (labelled as the 0.05 level), the next the significance level achieved or exceeded by 99% of the experiments (labelled as the 0.01) level, and the third the significance level achieved or exceeded by 50% of the experiments (labelled as the 0.5 level). These are so you can see what level of significance you might expect, depending on how risk-averse (or how lucky) you feel. The tail probability (marked just as 'tail') is an indication of the level of significance achieved. This is the probability of seeing a value at least as extreme as that observed, if there is no underlying effect, so more significant values correspond to lower numbers. Exactly how impressed people are by different levels of significance varies to some extent between different research areas, but most people would be at least interested in anything below 0.01.

The final line shows the linear statistic as predicted by GlassesLinear for a particular linear function; it is there mostly as a cross-check between the two programs.

JunitLCA

This class exists to test LCA, using the Junit test framework of Kent Beck and Erich Gamma - see links off http://www.xprogramming.com/software.htm . It isn't stored as JunitLCA.java, because it doesn't compile on its own unless you have Junit installed and in your classpath.

LCA

This is an implementation of the sequential portion of Schieber and Vishkin's Least Common Ancestor algorithm. You give it a tree (represented via the TreeNode interface) and it builds a datastructure in space and time linear in the size of the tree. It can then answer Least Common Ancestor queries in constant time. That is, given two nodes in the tree, it can find the node furthest from the root which is still an ancestor of both the specified nodes. This algorithm, which is reasonably simple (arguably beautifully simple) wasn't published till 1988 and was preceded by a more complicated algorithm, by Harel and Tarjan in 1984. So it gives me hope that there might still be simple and efficient algorithms out there waiting to be discovered. As far as I know, there aren't any direct uses for this algorithm, but it is used in some of the algorithms that build suffix trees. These data structures are used to hold information about strings which can be used, for instance, to search them for substrings very quickly - this area has taken off since the introduction of DNA sequencing.

Log2

Given an integer >= 1, a static method in this class can return the logarithm base two of that number, rounded down to an integer. It uses a static table extended as necessary. Another method based on this will return the number of contiguous low order zeros in the binary representation of its argument. These methods are used by LCA.

LogFact

Returns the natural logarithm of a factorial, using a table to provide decent speed for small numbers. Large ones will run your program out of memory. This is just a class, not a program.

LU

This works out the LU decomposition of a square matrix, so you can use it do solve linear equations. This permutes both rows and columns to pivot, so that the pivot chosen at each stage of the decomposition is the largest element anywhere in the square of available elements. This may help stability a little, although according to Numerical Recipies, the gain is in fact minimal.

make.bat

This is an msdos batch file I use to compile everything here and create javadoc documentation from the source.

MaxInRange

Allows you to maintain an array, and find the location of the maximum element within any subrange of the array, without searching the entire subrange. See Finding the maximum value in a range. Also contains methods mapping longs to doubles and back again in a way that preserves relative order.

MaxInRangeTest

Test harness for MaxInRange.

MannWhitney

This contains a variety of methods to do with the Mann-Whitney test. The main() method here is just a test harness; the class MannWhitneyCli will allow you to run Mann-Whitney tests from the command line.

MannWhitneyCli

This class contains a main program to do Mann-Whitney tests from the command line. It takes a single parameter, -conf #, e.g. -conf 0.95 (the default). This is the target confidence to use when expressing the result as a confidence interval. It reads its input from the standard input stream. This should contain lines of numbers, with # meaning a comment from there to the end of the line. Alternate lines with at least one number of them are assigned to the two groups used by the Mann-Whitney test. It gives a significance against the null hypothesis that the two groups are identically distributed, and a confidence interval for an offset between the two groups, assuming that this difference is offset is the only difference between the distributions of the two groups. The significance test accounts for ties, but the confidence interval does not. This means that if there are ties, it will be slightly conservative (produce a confidence interval that, under the assumptions of the test, will be slightly more likely than it claims to contain the true offset). The confidence interval achieved is stated, as two probabilities, one for each side of the confidence interval. It will usually have at least the confidence requested, but this is not always possible.

MbTest

This uses MultiBin, and so is incidentally a test harness for it. It is used to produce some of the results in Compare and Contrast. It generates random comparisons between two items. The probability with which one item is preferred to another is based on a log-odds model. MultiBin does a preset number of comparisons, assesses the significance of the result with MultiBin, and then repeats the exercise to build up a picture of the likely results of running experiments with the specified number of comparisons and then assessing their significance with MultiBin.

MinSpan

This creates a minimum spanning tree, given a distance function. It uses Prim's algorithm, and computes all possible distances. This means that, for a graph with n nodes, it takes n2 time, but thankfully only n space.

MoverFrame

This brings up a gui that you can use to tweak a number of parameters. Then you hit Go and it displays a lot of small squares. Either they move smoothly from right to left or you get brief glimpses at fixed squares before it changes picture. A few of the squares will be moving down, either right down, or down as well as the general sideways movement. When it stops changing the picture, you fill in the box labelled User Count with the number of squares moving down and hit "Count" before starting again from Go. The program will tell you how many squares were really moving down and write this out to standard output, which you might want to direct to a file. The point of this is to see whether movement from right to left makes it harder to detect the downward component of the motion. See Scanning for Motion

MultiBin

This reads in a file line by line. Lines that start with a # character are ignored. Other lines must contain two integers, which are assumed to be the counts of wins and losses for a given situation. The program sums either the numbers in the left hand column or the squares of the differences between the two columns. It then computes the probability of the observed sum, under the hypothesis that the wins and losses in each row come from a binomial distribution with a given parameter, just as they would do if each row recorded the number of heads and tails seen after tossing a coin a given number of times.

The option to work with the sum down the left hand column is intended to look for situations when all the rows show a bias away from the specified probability in the same direction; in fact the result is the same if all the rows are summed down columns to form a single row. The option to square the difference is for cases when the rows are biased, but different rows may have biases in different directions. It cannot be calculated by simply summing down rows; the two groups {{0, 10}, {10, 0}} show a bias within each group that changes direction, whilst their sum {10, 10} shows no bias at all.

The parameters to MultiBin are:

-sqr
This asks it compute the probabilities for the sum produced by squaring the difference in each row and then summing the result. The default is to compute the probabilities for the sum formed by summing down the left hand column.
-probLeft #
This is the probability to associate with the left hand column. The default is 0.5.

ParaClust

This reads a file containing paragraphs separated by blank lines. It uses the cosine similarity measure to compute a distance between paragraphs, and then works out the minimum spanning tree of the resulting graph and prints it out. The time taken for this goes up with the square of the number of paragraphs, but a e.g. 182 paragraphs take it almost no time at all.

The idea of this is to group related paragraphs close together in the resulting output document, in the hope that reading it in the new order will make you notice new connections between the paragraphs.

The parameters to ParaClust are:

-clust
This asks it to use the minimum spanning tree to produce a true clustering, so that all the original paragraphs appear as leaf nodes, and the tree structure shows the nesting of groups of leaf nodes. It cannot be used with -dim.
-comp
This asks it to use Java's compression library as a distance measure. If we have cl(a, b) = size of (a,b) after compression, then the distance measure used is distance(a, b) = cl(a, b) / (sqrt(cl(a,a) * cl(b,b))) - 1. This obeys at least distance(x, x) = 0, but doesn't necessarily even obey distance(a, b) = distance(b, a). It is also slower than the cosine similarity measure.
-dim
This asks it to print out the minimum spanning tree, but to choose the root and order of printing more carefully. It finds the longest path through the tree, then uses one end of this as the root, and prints out the children of each node in the tree so that any child connected to the node by an edge on the longest path is printed last. This produces quite a deep tree, but the order in which it visits the nodes makes the total distance travelled along the tree (counting backtracking between nodes) as short as possible. This is because we travel along every edge of the tree twice, except for the edges from the first node printed to the last node printed, which we traverse only once. This option cannot be used with -clust.

PartialSums

This class holds an array of doubles, the size of which is given in its constructor. You can read and write the elements of the array. Reading costs you constant time, as you would expect. Writing costs you time that grows with the logarithm of the number of elements in the array. As a return for this, you can retrieve the sum of the first n elements of the array, for any n, in time that also grows with the logarithm of the number of elements in the array, independent of the value of n. I have added a convenience function that simply subtracts two of these partial sums to work out the the sum of any continguous interval of array elements, in time that grows only with the logarithm of the number of elements in the array.

This class works by keeping a binary tree, storing it as arrays of size len/2, len/4,... 2, 1. Like MaxInRange, it shows that it sometimes possible to frame problems in a way that allows you to solve them with balanced binary trees without actually writing any code to keep the trees balanced, which saves you a huge amount of effort.

There is a statistical application of this class. Statistics such as Kendall's Tau require you to take a sequence of values and count the number of pairs of values where the later value is greater/less/equal to the earlier value. There are n(n-1)/2 pairs, but you can work out the counts in time n log n. First of all sort the values, keeping track of their original rank. Now deal with them in sorted order, dealing with ties as one chunk of values. It is enough to know, for each value, how many lesser values were originally seen before and after it, because all the lesser values seen before produce one sort of pair, and all the lesser values seen after produce another sort of pair. We can work this out by keeping an array with as many slots as there are values, and assigning one to each element's value slot once we have dealt with it. The number of lesser values seen before the current element is the partial sum from 0 to just before that element's value slot.

PartialSumsTest

This is a test harness for PartialSums.

Patricia

This algorithm is written up in "Handbook of Algorithms and Data Structures", by Gonnet and Baeza-Yates (and also in Knuth Vol III, but I found the Handbook easier to produce Java from). This was my first version of a Java version of the Patricia tree: see PatriciaGeneric for a more recent version. This version is designed to maintain a Map between Strings and values, where the Strings are constrained by the rule that no key is a prefix of any other key. In return for this, you produce a binary tree whose maximum possible depth is the length in bits of the longest string, and you navigate this tree by comparing only single bits within strings, not by making string compares, so it should in theory be good for long strings, especially those starting with a common prefix. I have also added some book-keeping to make it easy to find the nth String in lexicographical order, or the nth String in lexicographical order starting with a given prefix. Since not everything is a String, and not all collections of strings obey the prefix rule, I also provide routines to encode ints, longs, floats, doubles, and arbitrary strings in ways that keep to the rules but maintain the original comparison order.

Unfortunately tests show that the result isn't overwhelmingly efficient. Not counting the costs of any string encoding required, it runs about as quickly as TreeMap, which makes it much slower than HashMap. Being able to move quickly to the nth position in sorted order might be handy in some cases, though.

More recently, I have returned to this in C++, writing a version where any necessary string encoding is performed on the fly. This seems to claw back some of the performance gap, gaining a factor of two over STL maps when the keys are strings. This work is written up separately in Patricia Variations.

PatriciaClosest

This is a class, constructed with PatriciaGeneric, for looking up keys of arrays of doubles to find associated values of arbitrary class, and for finding values close to a query point, treating the arrays of doubles as coordinates of points and using the euclidean distance between points. Each double in the array is mapped to a long using a monotonic invertible transform whose existence is guaranteed by the floating point standard. These longs are then (virtually) bit-interleaved to form a single vector of bits, which is used as a key in the Patricia tree implemented by PatriciaGeneric. Information is kept at each internal node on the bounding box of the points corresponding to keys held at the children of this node, and this is used to guide the search for nearest neighbours. The Patricia tree data structure means that all the children of the node start with the same prefix of bits, and this prefix is different from that starting the children of the sibling of this node. Taking account of the mapping from arrays of doubles to bit vectors, this means that the children of two sibling nodes will always be separated by a one of the co-ordinate planes, so searches for nearest neighbours should be reasonably efficient. In particular, if there is a close enough match to the query node and we only want one result, this will be found in one trip down the tree, and the bounding box information will then rule out all other possibilities on the way back up the tree.

The nearest neighbour search will take a bound on the maximum distance that is interesting. If this is set very small, then even failures to find close matches can be made efficient.

See Nearest Neighbours for more info on nearest neighbour search.

PatriciaGeneric

This is a version of the Patricia tree algorithm modified to use Java generics, and to use 'virtual' bit extraction, so that keys don't have to be reformatted to be placed in the tree. For inserted keys, this isn't a big deal, but I thought avoiding reformatting would speed up queries. The Patricia keys may now be of any type that implements an interface for providing bit values and finding the first difference between two sources of bits. The values stored can be of any type whatsoever, and Java Generics mean that the resulting code is known to be typesafe at compile time.

A third type parameter is also provided, which is the real point of the code. This allows the caller to have objects created at the internal nodes, which are given callbacks so that they can accumulate information about the nodes below them in the tree. There is also enough access for clients to navigate the Patricia tree using the information kept in the nodes. The result is a general framework for annotated trees. Patricia is especially suitable for this, because inserts result simply in the editing into the tree of a single internal and a single data node, with no further remodelling of the tree, while inserts result simply in the deletion of a single internal node and a single data node. This means that most updates will require relatively little book-keeping to adjust the information held at the internal nodes (depending on how far changes propagate up the tree, of course).

The class PatriciaClosest is an example of this. Here the PatriciaGeneric data structure is used to build a map between arrays of doubles representing coordinates and arbitrary values which can be searched efficiently to find the closest N points in the map to a query point. I have tried various methods of searching for closest points, and this seems to be one of the fastest, as well as one of the simplest.

PatriciaGenericTest

This is a test and timing harness for the PatriciaGeneric class described above.

PatriciaTest

This is a test and timing harness for the Patricia class described above.

PermFHL

This is an implementation of the algorithm due to Furst, Hopcroft, and Luks to do a variety of tests on groups of permutations. Given a set of permutations it builds a table which can be used to work out the order of the subgroup generated by them, test if other permutations are members of that subgroup, and work out how they might be generated if so. I only work out how many steps that would take, because it generally takes too long to make it sensible to print out the answer. My implementation comes partly from a memory of a report on their original work and partly from the book 'Fundamental Algorithms for Permutation Groups', by G. Butler (which goes on to describe faster methods which have now superseded FHL). This implementation also tests a wrinkle intended as a heuristic to get it to build a table that might produce shorter sequences. The sequences it would produce are still far too long to make it your preferred way of solving Rubik's cube, though. In fact (born and brought up in Northern Ireland) I'd say the Irish approach is still faster: repaint the thing!.

PermTester

The main program in this class runs PermFHL to read sets of permutations from its standard input, and report on the subgroups they generate. This includes not only the size, but the mean and standard deviation of the sequences that would be produced if you used the PermFHL algorithm to generated sequences of generators that would multiply together to produce randomly chosen elements of the subgroup. This is checked (to some extent) via Monte Carlo. Sample input, in PermTester.in, includes generators for the Rubik's cube, in its own particular format (which represents e.g. the identity permutation of order 4 as [0, 1, 2, 3]).

The parameters to PermTester are:

-file <filename>
The name of a file to which it will write comma-separated output which you can feed to any statistics package to find out if any of my ideas for sequence length reduction make any difference in practice. As far as I can see, they don't. The default is not to write this info out.
-goes #
The number of random permutations to attempt to decompose to check the predictions of the mean and standard deviation of the length of the sequences used to decompose a randomly chosen permutation from a given subgroup, and the calculation of the size of the subgroup. The default is 1000.
-runs #
The number of random sets of permutations to generate as internal test data. The default is 10.
-seed #
The seed for the random number generator (this is a long). The default is 42.

Permutation

This class can multiply and invert permutations, generate them at random, print them out, and read them in. It is used by PermFHL and PermTester.

Permute

Contains a static method to randomly permute an array of ints, and a method to generate random contingency tables with the same marginals as a previous example given in a constructor. You should probably use java.util.Collections.Shuffle() instead, if you can. This is just a class and doesn't contain any programs.

PredTest

This is the test code referred to elsewhere that showed that rank-based prediction wasn't competitive with normal linear prediction and leave-out-one cross-validation.

The main routine in this class doesn't do anything generally useful: it just runs the tests I wrote to compare rank-based prediction against prediction based on normal linear regression.

ProbCheck

This allows you to run a soak test that covers some features of RoughRandom, RowDist, and TabDist (described below). It creates a large number of random 1xn and 2xn tables with RoughRandom, and uses TabDist and RowDist to assess their significance. You might spot any remaining bugs in these routines if it falls over with some sort of exception (There are internal checks in TabDist and RowDist to check that distributions calculated so far sum to 1.0, for one thing). You might also reckon something odd if it thinks it has found a very unusual 1xn or 2xn table. To help you spot this, it uses TopN to keep track of the most significant results so far. When it finishes, it outputs these in order, with the significance found, both raw and adjusted by multiplying it by the number of tests made. This follows from Bonferroni's inequality: P(A | B) = P(A) + P(¬A & B) <= P(A) + P(B). It checks for input every now and again, so if you give it a long run and get fed up, you can get it to stop by typing some text at it. Any text will do, but Java won't send it on till you type a carriage return, so you need to include that in it.

The parameters to ProbCheck are

-goes <number>
The number of iterations of the main loop to take, if it is not stopped early by typing some text at it, followed by a carriage return.
-grain <number>
This is the grain parameter for the semi-exact significance calculations. The bigger this number is, the closer the statistic actually tested is to the test you actually intend to make, but the more expensive the test is.
-range <number>
This number is used to choose the size of the 1xn and 2xn tables created. It should be > 0. The table size will be chosen at random between 2 and 2 + range - 1.
-samples <number>
This is the maximum number of samples in all in the 1xn tables used. The 2xn tables are actually two 1xn tables reused, so it affects these too.
-seed <number>
This is the seed used to initialise the random number generator.
-top <number>
This is the number of most significant answers to save.

QuantileBounds

Works out quantile bounds for a given confidence. For instance, it can tell you that if you take the lowest and highest values in 6 independent samples of some statistic, the probability that this range does not include the underlying median is 0.03125. More generally, if you tell it that there are N samples, and that you want a bound for the pth Quantile (0.5 is the median), then it well tell you that if you pick the ith to the jth (counting from 1) you will have covered the true pth Quantile up to some error rate (which will hopefully be no more than the error rate you have requested: if the error rate you requested can't be achieved you will still get an answer, but with a higher error rate). Its parameters are

-error #
This is maximum error rate requested. The default is 0.05, which is equivalent to the usual 95% confidence level.
-quantile #
This is the quantile required. The default is 0.5, which is the median. Other common choices are 0.25 and 0.75, the upper and lower quartiles.
-samples #
This is the number of samples available

The page Quantile Bounds (or why to take at least six samples) describes this in more detail.

Query

This is a test/utility program to run a specified Query against a specified JDBC database and output the result.

Ranker

This interface is for an object that defines a fixed length array of longs. Longs can be addressed by sorted rank, as well as position, with ties being broken by considering the position of the competing values within the long. This allows you to maintain an array with changing contents while always being able to pull off e.g. the highest value in the array, or the lowest value, or the median value, or whatever. There is only one reasonably efficient implementation of this here, ByRank.

RoughRandom

Can create random numbers in the range 0..n with arbitrary pre-specified probabilities. Uses Walker's method of aliases, as described in Knuth Volume II, 3.4.1. This is just a class and has no main routine.

RowDist

Will provide exact and semi-exact statistics as described elsewhere for a 1xn table of counts, where the null hypothesis is that the counts come from some specified distribution (e.g., all values equally likely). It can test the log-likelihood chi-squared, the repeat rate, and a trend statistic. It will also do Monte Carlo tests of these for you (intended partly as a check on the calculated exact values - the code is complex enough that this might be a good idea).

The main routine reads in the counts of a 1xn table from its standard input, optionally preceded by a list of probabilities. The command line arguments are as follows.

-chi
If this argument is given, RowDist will calculate the significance of a (grained version of) the log-likelihood chi-squared statistic against the theory that the underlying probabilities are as specified (equally probable if not specified).
-cols <number>
This parameter is mandatory. It gives the number of columns in the 1xn table.
-grain <number>
This gives the grain used to turn real numbers into integers to compute semi-exact statistics. The larger the grain, the nearer the test actually done is to the test you are aiming at, but the more expensive the calculation.
-mc <number>
This gives the number of Monte Carlo runs to do. The default is not to do any. In these runs random 1xn tables with the same total number of observations as the real input are generated, using the probabilities under the null hypothesis, and the statistics for these random tables are compared with the real table.
-probs
If this parameter is given, the program will attempt to read a list of probabilities before it reads the list of counts. These probabilities will be taken to be the probability of each count under the null hypothesis.
-rpt
If this parameter is given, the program will compute the exact significance of the repeat rate under the null hypothesis. If the null hypothesis is that all possibilities are equally likely, this statistic is a monotonic function of the Pearson chi-squared. If not, it's still accurate, but not really very useful. This calculation is not affected by the -grain parameter.
-seed <number>
This sets the seed used by the random number generator for Monte Carlo tests.
-trend
If this parameter is given, the program will compute the semi-exact significance of a trend statistic. It's only semi exact because if grain is too small the statistic will be grained (converted to a scaled integer) even though it is in fact an integer in the first place.

Smoother/TriangleSmoother

This smooths the data given to it. It can read a file with two numbers per line, an X value followed by its corresponding Y value. It fits a straight line to some given span of data and uses this to work out the smoothed value of Y. By default, it uses a weighted straight line fit, with weights forming a triangle like 1, 2, 3, 2, 1 where the centre weight is highest and points at the location where that straight line will be evaluated. This is actually done by the class TriangleSmoother: the smoothing built into Smoother does less well as it is not weighted. By default it will use cross-validation to choose a span after evaluating every possible span. This is practical up to about a few thousand data points because doing cross-validation for a single span takes only linear time. The command line arguments are as follows. This is relatively simple and seems to work pretty well in practice but be warned that (since everything is linear) it makes no attempt to be robust against outliers. See Three Smoothers for more info.

-rect
Use a 'rectangular' (unweighted) fit for each span. Not recommended, as the result isn't very smooth.
-span <number>
Requests a particular span. Without this, all possible spans are tried and cross-validation used to select one.
-x <number>
Requests that the fitted curve be evaluated at a particular point. Without this, the fitted curve is evaluated at all the x values in the input data.

SplineFit

This fits splines to binomially distributed observations, as described in Linear smoothing and binomial models. The idea is to draw a smooth line through the data and to display not only that line, with a confidence interval, but its derivative, complete with a confidence interval for that. It reads in a comma-separated input file and produces a comma-separated output file, which is intended to be fed into some other program to draw graphs. In practice, the program WeightLS appears to produce better results.

SparseEditCost

This uses a combination dynamic programming and spiral search to work out the least cost edit distance between two strings. It may be faster than EditCost if the two strings are quite closely related. See Edit Distance/ diff for details.

State

This represents one state in the Finite State Machine package. You build up a tree of states and substates using this, add transitions where required, and then pass it to the FiniteStateMachine class to have it build your Finite State Machine.

StaticClosest

This is a data-structure used to hold points so that we can easily find the nearest neighbour to a query point. In practice, it turns out that the PatriciaClosest class is both simpler and more efficient, so you should probably use that instead. StaticClosest is based on variation on k-d trees. The main difference is that it doesn't chose the dimension to split on based on a simple round robin, but uses a related splitter class, which tries to chose the dimension at each stage that provides a best split (whichever dimension contains the largest range of values before the split).

TabDist

Will provide exact and semi-exact statistics as described elsewhere for a 2xn table of counts, where the null hypothesis is that the row and column choices are independent. It can test the log-likelihood chi-squared, minus the log of the probability of the table given its marginals, and a trend statistic. The code is complex enough that a sanity check via tabMC might be a good idea.

The main routine reads in the counts of a 2xn table from its standard input, as a top row followed by a bottom row. The command line arguments are as follows.

-cols <number>
This parameter is mandatory. It gives the number of columns in the 2xn table.
-grain <number>
This gives the grain used to turn real numbers into integers to compute semi-exact statistics. The larger the grain, the nearer the test actually done is to the test you are aiming at, but the more expensive the calculation.
-ll
If this argument is given, TabDist will calculate the significance of a (grained version of) the log-likelihood chi-squared statistic against the theory that the underlying probabilities are as specified (equally probable if not specified).
-prob
If this argument is given, TabDist will calculate the significance of the statistic given by minus the log likelihood of the table given its marginals.
-trend
If this parameter is given, the program will compute the semi-exact significance of a trend statistic. It's only semi exact because if grain is too small the statistic will be grained (converted to a scaled integer) even though it is in fact an integer in the first place.

TabMC

Will run Monte Carlo tests on general contingency tables of counts, under the null hypothesis that the row and column choices are independent. It provides statistics for minus the logarithm of the probability of the table given its marginals, the log-likelihood chi-squared (printed out as G2 - it is sometimes referred to as G2), and two sorts of trend statistics. Both test for a trend along the rows. The first is the squared sum of the trend along each row, so it assumes no particular trend along the columns. The second tests for a trend down the columns at the same time as a trend down the rows.

The main routine reads in the counts of a mxn table from its standard input, as a top row followed by a bottom row. The command line arguments are as follows.

-cols <number>
This parameter is mandatory. It gives the number of columns in the 2xn table.
-goes <number>
This gives the number of random tables to generate.
-rows <number>
This parameter is mandatory. It gives the number of rows in the 2xn table.
-seed <number>
This parameter sets the seed for the random number generator used.

TestConnected

This tests a program to work out the strongly connected components that make up a graph. See Connected Components for details.

TestCorasick

This is a very simple test harness for the AhoCorasick class that turns it into a very simple imitation of fgrep.

TestCorasickRand

This test harness uses random sequences and random patterns to give the AhoCorasick class a more thorough test.

TestDemo

This is a batch file that runs all of the main programs listed here with some simple test input. It's partly a sanity check on the programs, and partly a list of very simple examples for you.

TestEditCost

This tests and times routines that work out the least cost edit distance between two strings. See Edit Distance/ diff for details.

TestLCA

This class exists to test the LCA class, which finds Lowest Common Ancestors in a tree.

TopN

This uses the Heap datastructure to keep track of up to the N best answers seen so far. (It's rather confusing, but to do this you use the heap to keep the worst answer still stored at the top, so you can compare it with each incoming candidate and replace it if necessary. Typically, you don't need to find the best answer quickly, although there are such things as double-ended heaps, if you really truly do). Very useful in any program for computer search. Knuth uses a heap in a heuristic algorithm for tree search called 'Stratified Greed' - divide the tree into strata, keep the best N nodes from the current strata, and use them to select points to search from in the next strata. See 'The Stanford GraphBase', by D.E.Knuth, published by Addison-Wesley, ISBN 0-201-54275-7. Computer search includes programs that perform a large number of statistical tests and keep the N most significant answers found so far (remembering, of course, to correct the significance using something like the Bonferroni inequality: P(A | B) = P(A) + P(B & ~A) <= P(A) + P(B)). Further pointers to search algorithms in the AI world are in the small survey article "Space-Efficient Search Algorithms", by Richard E. Korf, in ACM Computing Surveys, Volume 27, Number 3, September 1995, and chapter 36 (Artificial Intelligence Search Algorithms) in the massive "Algorithms and Theory Handbook" editied by Atallah, published by CRC press. I love heaps - I'd go on about heapsort except that I also love collection classes, and heapsort isn't one of the sort algorithms offered by the Java collection classes.

Unfortunately with good reason - heaps have appalling memory locality. One fix for this is to use blocked heaps. This is in L. M. Wegner and J. I. Teuhola, The external heapsort, IEEE Transactions on Software Engineering 15 (1989), 917-925. It came out of IBM work on databases using lazy evaluation. Read up on heaps, and then consider a heap of blocks. Each block contains several items, but has at most two child blocks, just as with a normal heap each item has at most two children. It is convenient to keep the items in a block in sorted order. The invariant is that the lowest key in any block is higher than the highest key in any child, so, as with a heap, it's easy to find the element with the highest key. To fix the heap invariant after modifying a block, you note which child has the lowest key in it. Now merge the two child blocks and put the low end of the resulting merge in the child with the previous low key. That child's lowest key cannot have decreased, so its heap invariant w.r.t its children is OK. The child with the higher low key gets merged with the dodgy parent, and gets the low half of the resulting merge. The parent must now have regained its invariant w.r.t its children, but the child with the higher low key may now fail its invariant w.r.t. its own children, so we procede down the heap as per normal. The point of this is that the cost of moving between blocks (which in the IBM case, may be 10ms to move a disk head) is amortized over all the many items in a block.

I haven't implemented any of this here, and it's probably dominated by the ideas about space-efficient search in practice. These amount to reverting to depth first search with some sort of threshold scheme that stops you covering all the nodes in the search tree. Spiral search is doing repeated passes with steadily increasing thresholds if you don't find the answer. Limited Discrepancy Search requires a heuristic that gives you the best child of each node. You first of all follow the heuristic straight down to a leaf. If that doesn't give you the right answer, try all paths allowing exactly one error, starting with the assumption that the first choice made was in error (as the one with the least evidence to consult). Then you could try for exactly two errors, and so on - but the originators (Harvey and Ginsberg at www.cirl.uoregon.edu) suggest that if you are making more than one error you may be better off finding better heuristics. Another option is to do depth first search after (before?) you use up all your memory on a heap-based scheme such as stratified greed.

TorczonSimplex

A routine in this will minimise a function passed to it by hillclimbing, using only function evaluations (so it doesn't need to know about derivatives). Unlike the Nelder-Mead algorithm, this can be proved to converge to a local minimum given reasonable assumptions. This doesn't necessarily make it efficient, even in relative terms.

TreeDefault

Pre-processes a tree and builds an index structure that allows you to answer queries of the form "what is the value of a given attribute at this node?", where attribute-value settings on a node are inherited by its children, unless explicitly overridden. See Default values in trees.

TreeDefaultTest

Test harness for TreeDefault.

TreeNode

This interface is used to pass a tree to the LCA algorithm described above.

WeightLS

This fits locally weighted least squares to binomially distributed observations, as described in Linear smoothing and binomial models. The idea is to draw a smooth (but usually curved) line through the data and to display not only that line, with a confidence interval, but its derivative, complete with a confidence interval for that. It reads in a line-oriented input file and produces a similar output file, which is intended to be fed into some other program to draw graphs.

The input file format is intended for records of wins and losses produced according to some underlying win probability pi, which changes with i. Each line describes a set of observations with the same underlying pi. The first column is the number of trials on that line. The second is the value of xi, the x-coordinate used to produce weights for the smoother. Typically you will also use this as x coordinate of a graph displaying the smoothed results. The third column displays the total number of successes in those trials. This does not have to be an integer; for example, when using this to display the results of football matches, I enter each match on its own line as one observation (so the first column is always one), use the second column to give the number of the match in sequence within the sequence, and then encode wins, losses, and draws in the third column as 1, 0, and 0.5 respectively. This is all the input file needs.

The output file produced has the same format as the input file, but has more columns. The first three are duplicated from the input file, and then we have, in turn, the fitted value, its upper and lower bounds, the derivative of the fitted value, and its upper and lower bounds.

The command line arguments -sig # and -width # affect the fit and its bounds. The line drawn is produced by making estimates of the underlying value at a series of points. The estimate of the value at any point is produced by using weighted least squares to fit a straight line to the observations, and then evaluating that line at the point in question. Observations with x values near the point in question get more weight that those far away, and the -width # parameter controls this. The weight given to an observation for a point at x when working on a point at p is exp(-(x-p)2/2w2), where w is the width, so the dimensions of the width are those of the x-coordinates, and large widths produce a lot of smoothing. If the points are regularly spaced, about 70% of the weight should be given to observations within w of the point at which the fitted line will be evaluated, and if you were to plot the weight function you would find that it was symmetrical, with a peak at the central point that we are trying to fit, and points of inflection at +/-w. -sig # gives the sigmage to use for the confidence limits, so the default of 2.5758 should give approximately a 99% confidence limit; that is, the limits will be chosen using a rule that will place the confidence limits on either side of the underlying value 99% of the time. In practice, you may wish to use a smaller sigmage to get narrower confidence limits, accepting the fact that is then less likely that the confidence limits will in fact enclose the underlying value.

The parameters -postFits # and -seed # allow you to check to see if its likely that ths sigmage you have specified has given you confidence limits that are as reliable as they should be. After the program has processed the real observations, it makes up random observations for which it knows the true underlying pi, fits them by the same process, and looks to see if the fitted confidence limits do in fact contain the value you would get if you smoothed the true underlying pi. The number given to -postFits is the number of such pass to make, where each pass amounts to generating a set of random data and then fitting a line through it. The number given to -seed is the seed for the random number generator to be used. Two sorts of random data are in fact generated; one uses the fitted curve from the real data, and one uses estimates at each point of the observed data, estimating the probability of success there as (score + 1) / (obs + 2), where score is the total at that point and obs the number of observations there.

The parameter -testFits # gives the number of internal tests of the fitting process to make, before the data is seen.

Xsl

This is a test/utility class to parse and then optionally validate XML (against either a DTD or an XSD schema, or indeed both), and then optionally pass the result through an XSL transform.

Home

Here is a link back up to my home page.