Suppose that you have some format that exists in multiple versions, with new versions appearing regularly. You want to convert everything you see to the most recent version. The following choices are standard, but can still be unsatisfactory.
Every time you produce a new version, create software to convert directly from all older versions to the new version.
With N versions, there are N(N-1)/2 different combinations of conversion, so providing and maintaining all possible direct conversions will quickly become expensive.
Every time you produce a new version, create software to convert only from the previous most up to date version, and use multiple conversions when necessary.
You now only need N-1 converters, but many conversions will provide multiple conversions - in the worst case N-1 1-step transformations. Perhaps it would be possible to use this as a base from which to automatically generate direct converters, though.
Use some common intermediate format.
The problem with this is that if you had a stable intermediate format, you'ld use it instead of messing around with N different versions of an unstable format. You may not be able to come up with an intermediate format that is any more stable than the format you are constantly having to update.
If you view a system of converters as a network, it is obvious not only that intermediate choices not yet mentioned are possible, but also that lots of people have put a lot of work into this problem. The nodes of the network are the different versions of your format. The links of the network are the conversions you decide to implement directly. You are trying to minimise both the length of each path from one node to another and the total number of links in the network. Just to demonstrate that this approach can lead to new solutions, I will outline a compromise scheme very loosely inspired by what I can remember of some multi-processor routing networks.
In the previous cases you fix either the number of conversion steps required or the number of new converters per version required for all N, but whatever you vary grows at least as fast as N. With this middle way, both the number of conversion steps required and the number of different converters per version required increase, but only about as fast as log2 N. The message of this section is not that this particular scheme (dreamt up in fifteen minutes while trying to sleep in a plane) is the best option, but that there are potentially useful compromises here. In practice, I would expect the real network of conversions to depend a lot more on the particular changes made to the data format than on the simple structure used here.
Number the versions in order, starting at one. This means that every version number requires about log2 N bits. It also means that every version number, written out in binary, contains at least one '1'. In fact each version number consists of some arbitrary sequence of bits, followed by a '1', followed by some number of '0's.
You can think of a conversion as a way to change the binary representation of a version number. You need converters that correspond to simultaneously clearing the lowest '1' bit and incrementing the number formed by the sequence of bits above it. This amounts to clearing the 2n's bit by adding 2n to the version number. You also need converters that correspond to setting any one of a low order string of '0' bits.
This allows you to perform any conversion from an older to a more recent version. You compare the source and destination version numbers, looking for bit differences, and apply the following rules:
These two rules correspond to two phases. The first moves in steps of at least one bit from right to left, and the second moves in steps of at least one bit from left to right. So you need no more than 2n different conversion steps to convert between two n-bit version numbers. That is, between any two versions in the range 1..2n-1.
There are at most k+1 ways to reach any version number with k low order zeros. If you got to it by setting a single bit, that bit must be the lowest order '1' bit, so you know exactly what the previous version was. If you got to it by clearing a low order '1' bit, it must have been where one of the k low order zeros now is. So in the worst case, when you introduce version 2k, You have to write only k converters. You can also see that each conversion step corresponds to adding on a number with only 1 '1' bit in it, and there are at most n of those, not all of which are applicable in every case.
| Version Number | Must convert from |
|---|---|
| 00001 | (nothing) |
| 00010 | 00001 |
| 00011 | 00010 |
| 00100 | 00010, 00011 |
| 00101 | 00100 |
| 00110 | 00100, 00101 |
| 00111 | 00110 |
| 01000 | 00100, 00110, 00111 |
| 01001 | 01000 |
| 01010 | 01000, 01001 |
| 01011 | 01010 |
| 01100 | 01000, 01010, 01011 |
| 01101 | 01100 |
| 01110 | 01100, 01101 |
| 01111 | 01110 |
| 10000 | 01000, 01100, 01110, 01111 |
| 10001 | 10000 |
| 10010 | 10000, 10001 |
| 10011 | 10010 |
| 10100 | 10000, 10010, 10011 |
| 10101 | 10100 |
| 10110 | 10100, 10101 |
| 10111 | 10110 |
| 11000 | 10000, 10100, 10110, 10111 |
| 11001 | 11000 |
| 11010 | 11000, 11001 |
| 11011 | 11010 |
| 11100 | 11000, 11010, 11011 |
| 11101 | 11100 |
| 11110 | 11100, 11101 |
| 11111 | 11110 |
A longest path here is 00001-00010-00100-01000-10000-11000-11100-11110-11111. This moves a single bit from one end to the other and back in 2(n-1) = 8 steps, so the worst case path length in general for this scheme is 2(n-1), not 2n, and this bound will always be achieved.