MUTCOMP / MUTPROF 27.03.00
I have written two programs for the comparison of mutation
profiles. The first one, MUTPROF, is a Markov chain Monte Carlo
(MCMC) implementation of the hypergeometric test for
sparse 3-dimensional contingency tables. It is designed in the
spirit of Fisher's exact test and allows consideration of
different categories of mutation. The format of the input file is
as follows:
Line 1: three integers, separated by spaces, namely
  NPOS  the number of nucleotide positions (max 200)
  NPRO  the number of profiles (max 4)
  NCAT  the number of categories per profile (max 3).
Positions where none of the profiles has a mutation should
be excluded since these provide no information to distinguish
profiles. Thus, 200 positions should do for most comparisons.
If not, profiles have to be split up into smaller chunks. The
maximum number of categories is 3, which suffices to encode the
actual nucleotide change (each nucleotide can change into one of
three others).
Lines 2 to NPOS+1: the observed numbers of mutations at each
position, separated by spaces. One line is entered per position,
so each line holds NPRO*NCAT entries. Within a line, enter the
categories of the 1st profile first, then the categories of the
2nd profile, etc.
Example (NPOS=10, NPRO=2, NCAT=3):
10 2 3
9 15 18 11 15 12
17 15 10 15 11 7
13 13 4 11 7 3
13 5 7 6 8 10
8 8 6 8 9 12
4 9 13 10 12 15
9 10 12 13 20 11
12 15 13 12 14 9
15 10 7 11 10 4
15 9 9 6 3 9
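The input format above can be checked before running MUTPROF. The
following is only a minimal Python sketch of a reader for that
format; it is not part of MUTPROF, and the function name
parse_mutprof and the nested-list layout of its result are
assumptions made for illustration.

```python
def parse_mutprof(text):
    """Parse MUTPROF-style input text.

    Returns (npos, npro, ncat, table), where table[i][p][c] is the
    count at position i, profile p, category c (all zero-based).
    This layout is an illustrative choice, not MUTPROF's own.
    """
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    npos, npro, ncat = (int(x) for x in lines[0])
    # Limits as stated in the format description.
    if not (1 <= npos <= 200 and 1 <= npro <= 4 and 1 <= ncat <= 3):
        raise ValueError("NPOS, NPRO or NCAT outside the documented limits")
    rows = lines[1:]
    if len(rows) != npos:
        raise ValueError(f"expected {npos} position lines, got {len(rows)}")
    table = []
    for n, row in enumerate(rows, start=2):
        if len(row) != npro * ncat:
            raise ValueError(f"line {n}: expected {npro * ncat} entries")
        counts = [int(x) for x in row]
        # Categories of profile 1 come first, then profile 2, and so on.
        table.append([counts[p * ncat:(p + 1) * ncat] for p in range(npro)])
    return npos, npro, ncat, table


# The example input given above.
EXAMPLE = """10 2 3
9 15 18 11 15 12
17 15 10 15 11 7
13 13 4 11 7 3
13 5 7 6 8 10
8 8 6 8 9 12
4 9 13 10 12 15
9 10 12 13 20 11
12 15 13 12 14 9
15 10 7 11 10 4
15 9 9 6 3 9
"""

npos, npro, ncat, table = parse_mutprof(EXAMPLE)
print(table[0])  # counts at position 1, grouped by profile
```

Positions where all counts are zero would still have to be removed
by hand, as noted above; the sketch does not do this.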
The other program, MUTCOMP, compares relative mutation rates.
The program requires two input files, each containing the nucleotide
substitution matrix for a given profile. The format of the input
files is as follows.
Each file has 4 lines of 5 columns. All entries are separated by
spaces. The first four columns contain, as integers, the observed
frequencies of the different substitutions. The 5th column contains
the overall frequency of the nucleotide for the respective line.
This can be an integer (i.e. an absolute number) or a decimal real
(i.e. a relative frequency). The order T,C,A,G is used for coding
and is referred to in the output. In principle, however, the order
does not matter.
Example:
0 24 17 2 1234
22 0 16 8 1452
21 13 0 9 1322
14 7 3 0 945
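A MUTCOMP input file of this shape can be read as follows. This is
a minimal Python sketch, not code from MUTCOMP; the function name
parse_mutcomp and the (counts, totals) return layout are
assumptions made for illustration.

```python
def parse_mutcomp(text):
    """Parse one MUTCOMP-style substitution matrix.

    Returns (counts, totals): counts is a 4x4 list of integer
    substitution counts, totals holds the 5th column as floats so
    that both absolute numbers and relative frequencies are accepted.
    """
    rows = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    if len(rows) != 4 or any(len(r) != 5 for r in rows):
        raise ValueError("expected 4 lines with 5 columns each")
    counts = [[int(x) for x in r[:4]] for r in rows]
    totals = [float(r[4]) for r in rows]
    return counts, totals


# The example matrix given above (lines in the order T, C, A, G).
EXAMPLE = """0 24 17 2 1234
22 0 16 8 1452
21 13 0 9 1322
14 7 3 0 945
"""

counts, totals = parse_mutcomp(EXAMPLE)
print(counts[0], totals[0])  # substitution counts and total for T
```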
The output gives several statistics and the corresponding p values.
The statistic in the first column of each line compares the overall
mutability of the respective nucleotide between the two profiles.
To this end, the observed frequency is divided by the expected
frequency (5th column of input), and the four resulting values are
normalized so that they sum to unity. The statistic is the
difference between the two values determined for each nucleotide in
the two profiles. The other three statistics in each line compare
the relative rates of each individual substitution event. These are
estimated by dividing the observed frequency of a cell by the sum
taken over the respective line. The statistic is the difference
between the individual cell values obtained for each profile. All
p values are determined by bootstrapping over the whole mutation
sample (5000 simulations).
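The two kinds of statistics described above can be sketched in
Python as follows. This is not MUTCOMP's code. It assumes that the
"observed frequency" of a nucleotide is the sum of the four
substitution counts in its line, which the text does not state
explicitly; the function names are hypothetical. The bootstrap for
the p values is left out because the resampling scheme is not
specified here.

```python
def profile_values(counts, totals):
    """Return (mutability, rates) for one profile.

    mutability[i]: line sum divided by the 5th-column value,
                   normalized so the four values sum to 1.
                   (Line sum as "observed frequency" is an assumption.)
    rates[i][j]:   cell count divided by the sum over its line.
    """
    row_sums = [sum(row) for row in counts]
    raw = [s / t for s, t in zip(row_sums, totals)]
    norm = sum(raw)
    mutability = [r / norm for r in raw]
    rates = [[c / s if s else 0.0 for c in row]
             for row, s in zip(counts, row_sums)]
    return mutability, rates


def statistics(profile_a, profile_b):
    """Differences between two profiles, each given as (counts, totals)."""
    mut_a, rates_a = profile_values(*profile_a)
    mut_b, rates_b = profile_values(*profile_b)
    d_mut = [a - b for a, b in zip(mut_a, mut_b)]
    d_rates = [[a - b for a, b in zip(ra, rb)]
               for ra, rb in zip(rates_a, rates_b)]
    return d_mut, d_rates


# The example matrix from the input description (order T, C, A, G).
counts = [[0, 24, 17, 2],
          [22, 0, 16, 8],
          [21, 13, 0, 9],
          [14, 7, 3, 0]]
totals = [1234.0, 1452.0, 1322.0, 945.0]

mutability, rates = profile_values(counts, totals)
# Comparing a profile with itself gives differences of zero throughout.
d_mut, d_rates = statistics((counts, totals), (counts, totals))
print(mutability)
print(rates[0])
```

The diagonal entries of rates are zero for this example, so only
the three off-diagonal values per line correspond to the three
substitution statistics mentioned above.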
Best wishes
MK