This dissertation considers strategies for finding plant species with desired properties, such as the ability to withstand adverse conditions, to grow particularly tall, or to produce useful substances (such as drugs). One way to do this, without exhaustively screening all available species, is to use a first random sample, together with information recorded on all available species, to select a second sample for screening. This process has been simulated using a number of different datasets, with a variety of different strategies. Examples have been found where one of these strategies did better than random sampling. Examples have also been found where plausible strategies did worse than random sampling. Theoretical results have been obtained showing that random sampling is a minimax strategy for this problem, and emphasising the dangers of selecting a strategy solely on the basis of its performance on a single dataset. It seems likely that choice of strategy will require problem-specific information and experience, and not the simple application of an automatic rule to the first random sample.
As stated under generalisability, because our datasets do not constitute a random sample from any interesting population, the best that can be offered as a conclusion is a set of general statements based on the experience gained during the project. We cannot even place much emphasis on the existence of successful strategies as examples, because we have demonstrated that a sufficiently hard search will always find successful strategies, even for completely random datasets, and that for the problem considered here the search does not have to be very hard to find surprisingly successful strategies.
Examples have been found where searches for plants with particular properties could be made more productive by using information already known on other properties of those plants, and a small preliminary sample of plants where the values of the required attribute were known, to guide the selection of a second sample.
The most striking examples of this occur when the attributes used to guide the search are closely related to the attribute searched for: for instance, if they can all be regarded as measures of the size of the plant. Under these conditions, principal components regression can be useful.
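As a rough illustration of principal components regression in this setting, the sketch below reduces several correlated "size" measurements to their first principal component and regresses the desired attribute on that single score. It is a minimal sketch under stated assumptions, not the procedure used in this work: the data are synthetic, the function names are hypothetical, only the first component is kept, the component is computed from all species (since the guiding attributes are known for all of them), and power iteration stands in for a full eigendecomposition.

```python
import math
import random

def simple_regression(xs, ys):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def first_component(rows, iterations=200):
    """Column means and unit loading vector of the first principal component,
    found by power iteration on the sample covariance matrix."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    loading = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * loading[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        loading = [x / norm for x in w]
    return means, loading

def score(row, means, loading):
    """Project one species onto the first principal component."""
    return sum((x - m) * l for x, m, l in zip(row, means, loading))

# Synthetic data, for illustration only: three noisy measures of one latent "size".
rng = random.Random(1)
size = [rng.gauss(0, 1) for _ in range(150)]
predictors = [[s + rng.gauss(0, 0.3) for _ in range(3)] for s in size]
response = [s + rng.gauss(0, 0.5) for s in size]

means, loading = first_component(predictors)   # component from all species
first = rng.sample(range(150), 30)             # first random sample
a, b = simple_regression(
    [score(predictors[i], means, loading) for i in first],
    [response[i] for i in first],
)
rest = [i for i in range(150) if i not in first]
rest.sort(key=lambda i: a + b * score(predictors[i], means, loading), reverse=True)
second = rest[:10]                             # second sample: highest predictions
```

With real measurements on different scales, the attributes would usually be standardised or log-transformed before the component is extracted; that step is omitted here for brevity.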
When properties of plants are being used to guide the search, the simplest and most practical method seems to be to use linear regression to fit a model predicting the value of the desired attribute from the values of the attributes that are known for all plants, and then to choose the plants whose predicted value is highest. This produced the most successful strategies for the Grass species culm height (where there is an obvious measure of size) and also for the Angiosperm species parts of perianth, where the first principal component may be a general measure of flower complexity.
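The regression strategy described above can be sketched as follows. This is an illustrative sketch only: it uses a single predictor for brevity where the text describes several known attributes, the data are synthetic, and the names `fit_simple_regression` and `choose_second_sample` are hypothetical.

```python
import random

def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def choose_second_sample(known, first_sample, measured, size):
    """Pick the `size` unscreened species with the highest predicted value.

    known        -- dict: species -> attribute recorded for every species
    first_sample -- list of species already screened
    measured     -- dict: species -> desired attribute, first sample only
    """
    xs = [known[s] for s in first_sample]
    ys = [measured[s] for s in first_sample]
    intercept, slope = fit_simple_regression(xs, ys)
    candidates = [s for s in known if s not in measured]
    candidates.sort(key=lambda s: intercept + slope * known[s], reverse=True)
    return candidates[:size]

# Synthetic illustration: the desired attribute is a noisy linear function
# of the known attribute.
rng = random.Random(0)
known = {f"sp{i}": rng.uniform(0, 10) for i in range(200)}
truth = {s: 2.0 * x + rng.gauss(0, 1) for s, x in known.items()}
first_sample = rng.sample(sorted(known), 30)
measured = {s: truth[s] for s in first_sample}
second_sample = choose_second_sample(known, first_sample, measured, 10)
```

Extending this to several predictors means replacing `fit_simple_regression` with a multiple regression fit; the selection step is unchanged.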
Less familiar strategies are explained, and have been tried, but there is little evidence of an improvement large enough in practice to support leaving the familiar ground of linear regression.
Although beaten by principal components regression, one good strategy for the grass species dataset was to compute the mean log height of each genus in the first sample, and to choose the second sample to include species from the genera with the highest mean log height. The first sample size was not reduced from 100, so the chance of getting a reasonable coverage of the available genera was quite high. When there are too many different genera for all of them to be represented in the first random sample, an alternative is to use cluster analysis to group the species into however many clusters seems reasonable. This worked quite well on the Calflora dataset.
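The genus-based strategy can be sketched as below. The genus and species labels are placeholders, the heights are invented for illustration, and the rule for drawing species from the ranked genera (take everything from the best genus before moving to the next) is an assumption; the text does not specify it.

```python
import math
from collections import defaultdict

def rank_genera_by_mean_log_height(first_sample):
    """first_sample: list of (genus, height) pairs with height > 0.
    Returns the genera seen in the sample, highest mean log height first."""
    logs = defaultdict(list)
    for genus, height in first_sample:
        logs[genus].append(math.log(height))
    means = {g: sum(v) / len(v) for g, v in logs.items()}
    return sorted(means, key=means.get, reverse=True)

def choose_from_top_genera(unscreened, genus_order, size):
    """unscreened: list of (species, genus) pairs.
    Species from genera absent from the first sample are skipped,
    because no mean is available for them."""
    position = {g: i for i, g in enumerate(genus_order)}
    eligible = [(position[g], s) for s, g in unscreened if g in position]
    eligible.sort()
    return [s for _, s in eligible[:size]]

# Placeholder data, not taken from any of the datasets discussed.
first_sample = [("GenusA", 2.0), ("GenusA", 4.0),
                ("GenusB", 0.5), ("GenusB", 0.25),
                ("GenusC", 1.0)]
unscreened = [("a1", "GenusA"), ("b1", "GenusB"), ("c1", "GenusC"),
              ("a2", "GenusA"), ("d1", "GenusD")]

genus_order = rank_genera_by_mean_log_height(first_sample)
second_sample = choose_from_top_genera(unscreened, genus_order, 3)
```

The same code applies unchanged when the grouping comes from a cluster analysis rather than from genera: only the group labels differ.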
There are examples where attempts to guide the search have been shown to be no better than, or markedly worse than, random choice. A strategy of attempting to ensure diversity by picking similar numbers of samples from each higher-level grouping did significantly worse than random choice in the datasets considered here. With the sample sizes studied, it is not easy to decide from the data provided alone which attributes should be considered in linear regression, or whether a particular strategy is likely to be effective, without trying it out in practice. It seems likely that information specific to the particular problem, such as experience of the relationship of the property being searched for to other properties, or of the plants being considered, will have to be brought to bear to decide whether it is worth using the methods considered, in place of random sampling.
When random sampling is used, it is possible to calculate ahead of time the likely rank in the collection of plants being sampled of the best-performing plant in the sample. No such reassurance is provided for the other methods considered here, or likely to be available, in the absence of reliable information on the structure of the dataset.
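One standard way to make this calculation, offered here as a sketch and not necessarily the form used elsewhere in this work, assumes a simple random sample of n plants drawn without replacement from a collection of N, with no ties, and rank 1 denoting the best plant in the collection. The best plant in the sample then has expected rank (N + 1)/(n + 1), and the probability that it lies within the top k is 1 - C(N - k, n)/C(N, n). The function names are hypothetical.

```python
from math import comb

def prob_best_within_top(N, n, k):
    """Probability that a random sample of n from N contains at least one
    of the top k plants, i.e. that the best sampled plant has rank <= k."""
    if not 0 <= k <= N:
        raise ValueError("k must lie between 0 and N")
    # comb(N - k, n) counts the samples that miss all of the top k plants.
    return 1 - comb(N - k, n) / comb(N, n)

def expected_best_rank(N, n):
    """Expected rank, within the whole collection, of the best sampled plant."""
    return (N + 1) / (n + 1)
```

For example, `expected_best_rank(1000, 100)` is about 9.9, so a random sample of 100 from 1000 plants can be expected to contain a plant ranked around tenth in the collection.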
Measurements such as size are necessarily non-negative, and may vary by at least an order of magnitude. In this case, a normal distribution with matching mean and variance will predict a substantial proportion of negative values, which is absurd. Large deviations from normality were seen in the angiosperm and grass species datasets (but not in the maximum elevation seen in the Calflora dataset, which would have been better left untransformed).
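The size of this effect can be illustrated with a small calculation. As an assumption made purely for illustration, suppose the measurement follows a lognormal distribution; the sketch below computes its mean and standard deviation and then the fraction of a normal distribution with those same moments that falls below zero. The function names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def lognormal_moments(mu, sigma):
    """Mean and standard deviation of exp(Z) where Z ~ Normal(mu, sigma^2)."""
    mean = math.exp(mu + sigma ** 2 / 2)
    variance = (math.exp(sigma ** 2) - 1) * math.exp(2 * mu + sigma ** 2)
    return mean, math.sqrt(variance)

def negative_fraction_under_normal(mean, sd):
    """Fraction of a Normal(mean, sd^2) distribution lying below zero."""
    return normal_cdf(-mean / sd)

# A log-scale standard deviation of 1 means the central 95% of values
# span a factor of roughly 50.
mean, sd = lognormal_moments(0.0, 1.0)
fraction = negative_fraction_under_normal(mean, sd)
```

Under these assumptions `fraction` is about 0.22: the matching normal distribution puts more than a fifth of its mass on impossible negative sizes. The result depends only on `sigma`, not on `mu`, because it is determined by the coefficient of variation.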
Estimates of the mean on the original scale may also have an error variance that grows with the estimated mean, which violates a basic assumption of linear regression. A family of power transformations is often used to make data better behaved: see e.g. Chapters 4 and 8 of Hoaglin, Mosteller, and Tukey (1983). One simple alternative to this is to use the ranks of the observations.
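Both options can be sketched briefly. The power family below uses the Box-Cox parameterisation, (x^λ - 1)/λ with the logarithm as the λ = 0 case; this is one common form and is assumed here for illustration, not taken from the cited chapters. The rank function assigns tied observations their average rank. Both function names are hypothetical.

```python
import math

def power_transform(values, lam):
    """Box-Cox style power transformation of strictly positive values.
    lam = 1 leaves the shape unchanged, lam = 0.5 is a square-root scale,
    and lam = 0 gives the natural logarithm."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

def average_ranks(values):
    """Ranks starting at 1 for the smallest value; ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks
```

Ranks discard the spacing between observations, so they are robust to extreme values but cannot be converted back into a predicted value on the original scale without further assumptions.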
Minor errors were found in very respectable datasets, in one case due to a presumed error in a published source document, even though I was capable of spotting little more than obvious inconsistencies. Since they do exist, and since this task of searching for extreme cases will inevitably be affected by them, it is worth checking for them when possible, and remembering their possible existence at all times.