TREEWEIGHT - a program to calculate sequence weights from a tree
User Guide 12 oct 1993
Erik Sonnhammer, esr@sanger.ac.uk
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Contents:
1. Introduction
2. Installation
3. Usage
4. Disclaimer and References
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1. Introduction
If you have a set of related sequences and want to construct a
classifier for them, e.g. a profile, you may have a bias problem.
Your data is likely to be biased towards sequences from well-studied
organisms and pathways. The underrepresentation of certain members of
the family and overrepresentation of others is a common problem which
can lead to poor results. The general solution is to calculate
weights for each sequence in order to downweight the importance of the
overrepresented sequences and upweight the others.
A fast algorithm for calculating sequence weights based on a tree was
described by Gerstein, Sonnhammer and Chothia (1994). Treewgt is an
implementation of their algorithm. Treewgt does NOT construct the
tree itself, but reads an existing tree from a file. Treewgt is
compatible with the output generated from the tree-building program
clustalv (Higgins & Sharp); trees generated from other programs will
need some reformatting.
2. Installation
Files:
   README      - This file
   treewgt.SGI - Silicon Graphics executable
   treewgt.SUN - SUN executable (distributed gzip-compressed as treewgt.SUN.gz)
   treewgt.c   - ANSI C source code
   db          - example sequence file
   dnd         - example tree file
If you have a Silicon Graphics machine running Irix 4 or a SUN running
SunOS, you can use the provided executables directly. Otherwise,
compile the source code with an ANSI C compiler.
3. Usage
See the files db and dnd. They correspond to the example given by
Gerstein et al. (1994) and contain all necessary
information about the sequences and the tree. The format is taken
from clustalv and is described briefly below. The file db contains
the sequence names:
>A
>B
>C
>D
and dnd the tree topology:
80.0 1200
50.0 1120
20.0 1112
Each line in dnd describes one node of the tree. The first number on
a line is the percentage identity level of that node. The digits to
the right are 0's, 1's and 2's, one column per sequence (in the
example, in the order the names appear in db), and tell what that
sequence does at that node: 1 means the sequence is part of the left
subtree, 2 means it is part of the right subtree, and 0 means it does
not take part in that node.
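To make the column convention concrete, here is a minimal Python sketch
that reads the three example lines and lists which sequences fall in the
left and right subtree of each node. It is an illustration only and is
not taken from treewgt.c. The function name parse_dnd is hypothetical,
and the sketch assumes the digits are written as one unbroken string per
line, as in the example above.

```python
def parse_dnd(text):
    """Parse dnd-style lines into (identity, left_columns, right_columns) tuples.

    Assumption: each line is "<identity> <digits>", with one digit per sequence.
    """
    nodes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        identity, codes = float(fields[0]), fields[1]
        left = [i for i, c in enumerate(codes) if c == "1"]   # left subtree
        right = [i for i, c in enumerate(codes) if c == "2"]  # right subtree
        nodes.append((identity, left, right))                 # 0's are ignored
    return nodes


names = ["A", "B", "C", "D"]  # order of the names in db
dnd = "80.0 1200\n50.0 1120\n20.0 1112\n"

for identity, left, right in parse_dnd(dnd):
    print(identity, [names[i] for i in left], [names[i] for i in right])
```

For the example this lists A against B at 80%, A and B against C at
50%, and A, B and C against D at 20%.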
If you have your sequence names and tree in this format, do:
% treewgt db dnd
which gives:
>A 0.76
>B 0.76
>C 1.09
>D 1.39
These are the weight values calculated by treewgt. By default the
average weight is 1. If you prefer that the weights sum to 1, do:
% treewgt db dnd SumTo1
which gives:
>A 0.19
>B 0.19
>C 0.27
>D 0.35
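The numbers above can be reproduced with a short Python sketch of a
tree-based weighting scheme. This is not the code in treewgt.c, and it
rests on assumptions that the text above does not state: each sequence
starts at the 100% identity level, the length of a branch is the
difference in percentage identity between its two ends, and the length
of the branch above a subtree is shared among the sequences in that
subtree in proportion to the weight they have accumulated so far. The
function name tree_weights and its sum_to_one parameter are
hypothetical.

```python
def tree_weights(nodes, n_seqs, sum_to_one=False):
    """Sketch of tree-based sequence weighting (assumptions in the text above).

    nodes: list of (identity, left_columns, right_columns) tuples.
    """
    weight = [0.0] * n_seqs
    # Identity level of the node currently on top of each sequence.
    level = [100.0] * n_seqs
    # Work from the nodes nearest the leaves (highest identity) to the root.
    for identity, left, right in sorted(nodes, key=lambda node: -node[0]):
        for side in (left, right):
            branch = level[side[0]] - identity  # branch above this subtree
            total = sum(weight[i] for i in side)
            for i in side:
                # Proportional share; equal share if nothing accumulated yet.
                share = weight[i] / total if total > 0 else 1.0 / len(side)
                weight[i] += branch * share
                level[i] = identity
    # Default: average weight 1. Optionally: weights sum to 1.
    scale = (1.0 if sum_to_one else n_seqs) / sum(weight)
    return [w * scale for w in weight]


# The example tree from dnd, columns 0..3 standing for A..D.
nodes = [(80.0, [0], [1]), (50.0, [0, 1], [2]), (20.0, [0, 1, 2], [3])]
print([round(w, 2) for w in tree_weights(nodes, 4)])
# [0.76, 0.76, 1.09, 1.39]
print([round(w, 2) for w in tree_weights(nodes, 4, sum_to_one=True)])
# [0.19, 0.19, 0.27, 0.35]
```

The two normalisations differ only by a constant factor: dividing the
default weights by the number of sequences (here 4) gives the weights
that sum to 1.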
4. Disclaimer and References
No guarantee is given that the weights have a strict biological
meaning. The weights calculated by treewgt should primarily be seen
as a tool to compensate for unequal representation. If you have any
problems, please report them to the author.
Gerstein M., Sonnhammer E.L.L. and Chothia C.
"Volume Changes in Protein Evolution"
J. Mol. Biol. (in press)

Higgins D.G. and Sharp P.M.
"CLUSTAL: a package for performing multiple sequence alignment on a microcomputer"
Gene 73:237-244 (1988)