Nucleic Acids Research, 1998, Pages 2230-2236 © 1998 Oxford University Press

Using neural networks for prediction of the subcellular location of proteins

A. Reinhardt*, T. Hubbard

The Sanger Centre, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK

Received October 16, 1997; Revised and Accepted March 9, 1998

ABSTRACT

Neural networks have been trained to predict the subcellular location of proteins in prokaryotic or eukaryotic cells from their amino acid composition. For three possible subcellular locations in prokaryotic organisms a prediction accuracy of 81% can be achieved. By assigning a reliability index, 33% of the predictions can be made with an accuracy of 91%. For eukaryotic proteins (excluding plant sequences) an overall prediction accuracy of 66% for four locations was achieved, with 33% of the sequences being predicted with an accuracy of 82% or better. As the subcellular location restricts a protein's possible function, this method should be a useful tool for the systematic analysis of genome data and is available via a server on the world wide web.

INTRODUCTION

Within the last few years the complete sequence has been determined for a number of genomes (1,2). This has created the need for fully automated methods to analyse the vast amount of sequence data now available. The assignment of a function to a given protein has proved to be especially difficult where no clear homology to proteins of known function exists (3). Knowing the subcellular location in which a protein resides can give important insights into its possible function, making an automated method that assigns proteins to a certain subcellular location a useful tool for analysis. For example, a strong location prediction may help to distinguish between a number of alternative functional predictions for a protein. Even when the basic function of a protein is known, knowing its location in the cell may give insights into which pathway it is part of. As previous studies have shown (4), intra- and extracellular proteins differ significantly in their amino acid composition and these differences are strong enough to be used as the basis for a prediction method. However, to be useful for genome analysis a larger number of subcellular locations needs to be distinguished.

This study examines whether the differences in amino acid composition between other subcellular locations are strong enough to establish a prediction method. As yet only two automatic methods for assignment of the subcellular location are publicly available. One of these (5) does not distinguish intracellular proteins as cytoplasmic or mitochondrial and handles eukaryotic and prokaryotic sequences together, while the other is based on an expert system that relies strongly on the existence of targeting or leader sequences (6,7). In large genome analysis projects genes are usually assigned automatically and these assignments are often unreliable for the 5′-regions. For Caenorhabditis elegans, for example, automatic assignment methods alone predict <70% of the start codons correctly (S.J.M.Jones, personal communication). This can lead to leader sequences being missing or only partially included, thereby causing problems for prediction algorithms that depend on them. A method based on the amino acid composition should be comparatively robust to this sort of ambiguous assignment.

Initial trials using standard statistical methods for prediction (e.g. Mahalanobis distance; 8) did not yield satisfactory results, as cross-validation showed a large variation in prediction accuracy. This method has previously been shown to be sensitive to noise within the data set (9). Neural networks, on the other hand, have been shown to be reliable tools for protein structure prediction (10), so it was decided to apply them in this study.

MATERIALS AND METHODS

The database

Sequences whose subcellular location was annotated were extracted from release 33.0 of the SWISSPROT database (11). Subcellular location annotation was found for 15 775 out of 52 205 sequences in this release. This set of sequences was filtered to remove: sequences that were annotated as fragments of larger proteins; sequences that contained ambiguities (such as amino acids denoted by X within the sequence); sequences that were annotated as residing in more than one subcellular location; and sequences whose location annotation was made by similarity or marked as probable or possible. In essence, sequences were only kept if they appeared complete and had location annotations that appeared reliable and came directly from experiment. For this study transmembrane proteins were also excluded, as reliable prediction methods for this group of proteins already exist (12). It has also been shown that the extra- and intracellular domains of transmembrane proteins differ in their amino acid composition in the same way as whole proteins do (13) and therefore do not need to be considered as a separate compartment. Plant sequences were also excluded, as initial tests showed that their composition appears to be sufficiently different to have a negative influence on prediction accuracy for eukaryotic proteins (plant sequences were predicted at an accuracy 20-30% lower than other eukaryotic proteins by a neural network trained on a combined data set). As there were not enough plant sequences within the various subcellular locations, it was not possible to treat them as an independent group. After these steps 5134 sequences remained (9.8% of the whole release). The sequences were divided into 11 different groups according to their subcellular location and whether they belonged to eukaryotic or prokaryotic species. Within each group the sequence identity was calculated between all pairs and sequences were kept such that none had >90% sequence identity to any other (see the sketch at the end of this section). This was done to avoid a bias towards large sequence families with high similarity. Overall 3420 sequences remained, distributed over the 11 groups as shown in Table 1.

Table 1. Number of sequences within each subcellular location group
Location Number of sequences
Cytoplasmic (eukaryotic) 684
Cytoplasmic (prokaryotic) 687
Extracellular (eukaryotic) 325
Extracellular (prokaryotic) 105
Glycosomal 9
Glyoxysomal 21
Lysosomal 15
Mitochondrial 321
Nuclear 1097
Periplasmic 201
Peroxisomal 66
Number of sequences in the 11 different subcellular locations that were distinguished for analysis. The glycosomal, glyoxysomal, lysosomal and peroxisomal groups were considered to contain too few data to be statistically analysed.

As can be seen from Table 1, for four of these groups the amount of data available is too small for a statistical analysis to be performed. As this leads to the exclusion of only 3.2% of all sequences in this database, a distinction between the remaining groups should still prove useful for analysis. Once the number of sequences available for the excluded groups becomes large enough for statistical analysis they can be included in the prediction method. To provide a further independent data set the above procedure was performed on sequences which first appeared in SWISSPROT releases 34 and 35. This yielded another 749 eukaryotic and 243 prokaryotic sequences. A list of the sequences within each group is available on request.
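
The >90% identity filter described above amounts to a greedy redundancy reduction within each location group. The following Python fragment is a minimal sketch of that step, not the procedure actually used in this work; in particular, percent_identity is a hypothetical stand-in for a pairwise alignment routine and toy_identity is for illustration only.

from typing import Callable, Dict, List

def remove_redundancy(sequences: Dict[str, str],
                      percent_identity: Callable[[str, str], float],
                      threshold: float = 90.0) -> List[str]:
    # Greedily keep sequence identifiers so that no retained pair
    # exceeds the identity threshold (greedy order is an assumption).
    kept: List[str] = []
    for seq_id, seq in sequences.items():
        if all(percent_identity(seq, sequences[k]) <= threshold for k in kept):
            kept.append(seq_id)
    return kept

def toy_identity(a: str, b: str) -> float:
    # Crude positional identity for illustration; real use would align the sequences.
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))

print(remove_redundancy({"P1": "MKTA", "P2": "MKTA", "P3": "GGGG"}, toy_identity))
# -> ['P1', 'P3']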

The neural network

The Stuttgart Neural Network Simulator (14) was used to build and train all the neural networks used.

Two different types of neural network were used for prediction. A simple fully connected architecture with 20 input units (one for the fraction of each amino acid), two output units and no hidden units was used for predictions that distinguish between two possible locations. Each input unit was connected to each output unit. An output scheme of {1, 0} or {0, 1}, indicating one or the other location, was selected, which made it possible to use the difference between the values of the two output units as a reliability measure.
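
Both network types take the same 20-dimensional input, the fraction of each amino acid in the sequence. A minimal Python sketch of this feature vector follows; the alphabetical ordering of the residues and the function name are illustrative assumptions, not taken from the original implementation.

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes, alphabetical (assumed order)

def composition(sequence: str) -> np.ndarray:
    # Fraction of each of the 20 standard amino acids in the sequence.
    sequence = sequence.upper()
    counts = np.array([sequence.count(aa) for aa in AMINO_ACIDS], dtype=float)
    if counts.sum() == 0:
        raise ValueError("sequence contains no standard amino acids")
    return counts / counts.sum()

# Example on a short, hypothetical sequence; the 20 fractions sum to 1.
print(composition("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ").sum())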

Two more general neural networks, predicting a sequence as belonging to one of three locations for prokaryotic or one of four locations for eukaryotic sequences, were built with a somewhat more complex architecture. Each consisted of 20 input units and a hidden layer of three units for prokaryotic and four units for eukaryotic sequences. Extensive tests showed that this number of units in the hidden layer yields optimal results (see Results). The number of output units matched the number of possible locations. Each input unit was connected to each hidden unit as well as to each output unit, and each hidden unit was connected to each output unit. Again, a coding scheme for the output was chosen which assigns 1 to the correct location and 0 to all other locations. A standard back-propagation algorithm was used during training, with η = 0.2.
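
The networks themselves were built with the Stuttgart Neural Network Simulator; the fragment below is only an illustrative re-implementation in Python of a network of the kind described (20 inputs, a small sigmoid hidden layer, one output unit per location) trained by plain back-propagation with η = 0.2. The direct input-to-output connections mentioned above are omitted for brevity, and the squared-error loss and weight initialisation are assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SmallMLP:
    def __init__(self, n_in=20, n_hidden=4, n_out=4, eta=0.2, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)
        self.eta = eta

    def forward(self, x):
        self.h = sigmoid(x @ self.W1 + self.b1)       # hidden layer activations
        self.o = sigmoid(self.h @ self.W2 + self.b2)  # one output per location
        return self.o

    def train_step(self, x, target):
        o = self.forward(x)
        # Back-propagation of a squared-error loss through sigmoid units.
        delta_o = (o - target) * o * (1.0 - o)
        delta_h = (delta_o @ self.W2.T) * self.h * (1.0 - self.h)
        self.W2 -= self.eta * np.outer(self.h, delta_o)
        self.b2 -= self.eta * delta_o
        self.W1 -= self.eta * np.outer(x, delta_h)
        self.b1 -= self.eta * delta_h

# One training step on a uniform composition vector with target location 0.
net = SmallMLP()
net.train_step(np.full(20, 0.05), np.array([1.0, 0.0, 0.0, 0.0]))
print(net.forward(np.full(20, 0.05)))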

Cross-validation testing

When using neural networks, three data sets are needed to perform a jack-knife test. While the neural network learns from a training set, a test set is used to determine when the training process has to be stopped. As the information within the test set is implicitly used during this procedure, a third, completely independent data set is needed to evaluate the prediction accuracy of the trained neural network. Accordingly, all data sets were split into three equally sized subsets. To provide cross-validation the sets were used for training, testing and evaluation in every possible combination, yielding six different neural networks. The overall prediction accuracy was determined as the average of the prediction accuracies of all six neural networks.
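
A minimal Python sketch of this rotation is given below; train_and_evaluate is a hypothetical callback that trains a network on the first set, stops training on the second and reports accuracy on the third, and the random split is an assumption.

from itertools import permutations
import numpy as np

def jackknife_accuracy(X, y, train_and_evaluate, seed=0):
    # Split the data into three equally sized subsets and use them as
    # training, test and evaluation sets in all six possible role assignments.
    rng = np.random.default_rng(seed)
    subsets = np.array_split(rng.permutation(len(X)), 3)
    accuracies = []
    for train_idx, test_idx, eval_idx in permutations(subsets, 3):
        accuracies.append(train_and_evaluate(X[train_idx], y[train_idx],
                                             X[test_idx], y[test_idx],
                                             X[eval_idx], y[eval_idx]))
    return float(np.mean(accuracies))  # average over the six networks

# Dummy scorer for illustration: always predicts the training-set majority class.
def majority_baseline(Xtr, ytr, Xte, yte, Xev, yev):
    return float(np.mean(yev == np.bincount(ytr).argmax()))

X = np.random.default_rng(1).random((30, 20))
y = np.random.default_rng(2).integers(0, 4, 30)
print(jackknife_accuracy(X, y, majority_baseline))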

Weighted training

To prevent the neural network from becoming biased, weighted training has to be performed: the same number of sequences for each location has to be presented to the network during training. This causes a problem, as some of the groups are considerably larger than others. To include all of the information in the large groups, some of the sequences in the small groups have to be used repeatedly. This is done by first splitting the data into three subsets for training, testing and evaluation and then repeatedly using sequences within each subset.
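
A minimal Python sketch of this balancing step is shown below; random resampling with replacement is an assumption, as the text only states that sequences from the smaller groups are used repeatedly.

import numpy as np

def balance_by_oversampling(X, y, seed=0):
    # Reuse sequences from smaller location groups so that every location
    # is presented to the network equally often during training.
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    rows = []
    for c in classes:
        idx = np.where(y == c)[0]
        extra = rng.choice(idx, target - len(idx), replace=True)
        rows.append(np.concatenate([idx, extra]))
    rows = np.concatenate(rows)
    return X[rows], y[rows]

X = np.random.default_rng(1).random((10, 20))
y = np.array([0] * 7 + [1] * 3)
Xb, yb = balance_by_oversampling(X, y)
print(np.bincount(yb))   # both locations now contribute seven examples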

Applying a reliability index

As the output nodes of the neural networks have values between 0 and 1, the difference between the highest and next highest node (ΔO) can be used as a reliability index for a prediction. Reliabilities were binned into five groups (with ascending reliability index) for analysis: 0 < ΔO < 0.2; 0.2 < ΔO < 0.4; 0.4 < ΔO < 0.6; 0.6 < ΔO < 0.8; 0.8 < ΔO.
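
A minimal Python sketch of the reliability index and its binning is given below; the treatment of values that fall exactly on a bin boundary is an assumption, as the text leaves it unspecified.

import numpy as np

def reliability_index(outputs):
    # Difference between the highest and second-highest output values.
    top_two = np.sort(np.asarray(outputs))[-2:]
    return float(top_two[1] - top_two[0])

def reliability_bin(delta):
    # Map the output gap to one of five groups (1 = lowest, 5 = highest).
    return 1 + sum(delta >= edge for edge in (0.2, 0.4, 0.6, 0.8))

outputs = [0.07, 0.81, 0.22, 0.10]
delta = reliability_index(outputs)
print(delta, reliability_bin(delta))   # 0.59 -> group 3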

Calculation of prediction accuracies

The prediction accuracies quoted throughout are the average of the accuracies determined for each subcellular location independently. This procedure is necessary because a weighted training of the neural networks for the cross-validation tests was performed. As a result, their prediction accuracy weights each subcellular location equally, regardless of the number of sequences within the group. Accuracies should therefore be compared with random values of 50% for 2 states, 33.3% for 3 states and 25% for 4 states.
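
The following Python fragment sketches this location-averaged accuracy; function and variable names are illustrative and not taken from the original implementation.

import numpy as np

def location_averaged_accuracy(y_true, y_pred):
    # Accuracy is computed per location and then averaged, so every
    # location counts equally regardless of how many sequences it contains.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_location = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_location))

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
print(location_averaged_accuracy(y_true, y_pred))   # (1.0 + 0.5) / 2 = 0.75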

RESULTS

Pairwise neural network prediction accuracy

The average fraction of each amino acid and its standard error were calculated for all subcellular locations which featured enough data for analysis (data not shown, but available at <URL http://predict.sanger.ac.uk/nnpsl/aminoacidcomposition.html>). Only phenylalanine (F), histidine (H), methionine (M) and tryptophan (W) show minor fluctuations, while the other amino acids show strong differences between different subcellular locations. Although the fractions of some amino acids are similar between locations [e.g. A, D, E, F, G, H, L, M, N, T, V, W and Y between extracellular (eukaryotic) and mitochondrial], other amino acids differ substantially (e.g. C, I, K, P, Q, R and S). No uniform behaviour distinguishing eukaryotic from prokaryotic sequences is apparent except for alanine, for which prokaryotic proteins show a clearly higher average fraction than eukaryotic proteins. Overall, the differences in amino acid composition between the groups appear strong enough for prediction purposes. To determine how effective a measure the amino acid composition is, neural networks were trained to distinguish subcellular locations in a pairwise manner. Their prediction accuracy, as determined by cross-validation, varies from 74 to 94%, as shown in Table 2.


Table 2. Prediction accuracies achieved for the prediction of all subcellular locations against each other. The accuracy achieved by neural networks in predicting the subcellular location using only the amino acid composition as input. For each prediction accuracy the standard deviation (in percent), as yielded by the cross-validation tests, is given.

Neural networks distinguishing between a subcellular location of eukaryotic origin on the one hand and one of prokaryotic origin on the other tend to achieve a very high prediction accuracy, showing that eukaryotic and prokaryotic organisms exhibit substantial differences with respect to their amino acid composition. This makes it necessary to handle them separately, although other studies have implied that this is not necessary (5).

Comparing subcellular locations of prokaryotic organisms against each other also shows substantial differences between all compartments, with the lowest prediction accuracy at 82.6%. It is worth noting that the standard deviation (SD), as determined by cross-validation, is especially high in this case. This is most likely because for both of the groups distinguished (extracellular and periplasmic) only a comparatively small number of sequences could be used (see Table 1). Neural networks are known to improve in performance as the amount of training data increases, so the considerably smaller number of sequences used for training of this specific neural network may have resulted in the network being less stable.


Table 3. Summary of the prediction performance of the neural networks for eukaryotic and prokaryotic sequences. Shown are the overall accuracy and the accuracy for the various reliability groups, together with the standard deviation σ as yielded by the cross-validation tests.

The subcellular locations in eukaryotic organisms seem to be less distinct from each other. Cytoplasmic and mitochondrial proteins in particular appear to show common features, with the neural network distinguishing them only reaching a prediction accuracy of ~74%, while the standard deviation for prediction is low (2.6), indicating good convergence of the neural networks trained on the data. The same is true for the neural network distinguishing cytoplasmic and extracellular proteins (prediction accuracy 77%, SD 2.4). The fact that the neural network for extracellular and mitochondrial proteins reaches a prediction accuracy of ~83% indicates that cytoplasmic proteins share some features with mitochondrial and some with extracellular proteins, while these features do not overlap. The accuracy for predictions with a high reliability index was considerably higher than the overall accuracy (the prediction accuracy for all sequences) (data not shown).

General prediction of subcellular location

Pairwise networks are useful for investigating the relative differences between compartments; however, a practical prediction system requires the ability to distinguish between multiple compartments. Neural networks were therefore built and trained to assign proteins to one of three possible subcellular locations for prokaryotic and one of four for eukaryotic sequences.

The four different subcellular locations taken into account for eukaryotic proteins were cytoplasmic, extracellular, mitochondrial and nuclear. The overall prediction accuracy reached 66.1%, with individual neural networks scoring between 64.5 and 68.7% (σ = 1.59). The low variation in the accuracy of individual networks indicates that the results are independent of the specific sequences within the training, test and evaluation sets. The accuracy for predictions with a high reliability index was considerably higher than the overall accuracy (the prediction accuracy for all sequences), as can be seen in Table 3. This compares with an accuracy of 25.0% from random guesses for four locations with a balanced set of data as considered here. For comparison, 66.1% corresponds to a slightly higher real-life prediction accuracy of 67.2%, which includes the bias from the different fraction of proteins in each location in the current database.

Testing various neural networks with different numbers of units within the hidden layer reveals that the network with no hidden units not only performs considerably less well (overall prediction accuracy of 61.7%), but also deviates from the behaviour observed for networks with hidden units when the cumulative prediction accuracy for the reliability index groups is plotted against the cumulative number of sequences within each group (specificity against sensitivity), as can be seen in Figure 1a. This indicates that further information is gained through the introduction of hidden units into the neural networks. Networks with hidden units show a uniform behaviour, which converges for networks with three or more hidden units (the overall prediction accuracy for three to nine hidden units varies only from 65.8 to 66.3%). However, changes in the distribution of sequences over groups with different reliability indices make the neural network with four hidden units the best choice, as it performs slightly better for the groups with the highest reliability index, achieving a prediction accuracy of 82.5% for 21.5% of all sequences and predicting a further 11.7% of the sequences at an accuracy of 81.9%. As these groups are the most useful for practical prediction purposes, the neural network architecture with four units in the hidden layer was chosen for further use, although it does not feature the highest overall prediction accuracy. A further plot of specificity against sensitivity for the best network, subdivided by location, is shown in Figure 1b. It can be seen that the accuracies are best for extracellular and nuclear proteins. This is not an effect of data size, since the extracellular group has the second smallest data set, and is therefore more likely to reflect the strength of the signal for the different locations.


Figure 1. (a) The prediction accuracy for eukaryotic proteins was calculated cumulatively with respect to the various reliability indices by starting with sequences with the highest reliability index and then progressively including those with lower indices until the number calculated for the lowest reliability index is equal to the overall prediction accuracy. The percentage of the total number of sequences being considered was similarly calculated cumulatively. Plotting both variables against each other for a number of neural networks with a varying number of units within the hidden layer reveals that the neural network with no hidden units shows a pattern strongly deviating from those of the other neural networks. This indicates that introduction of hidden units leads to further information being picked up by the neural network. (b) As for (a) except showing only the results for the best network, with data broken down by location.
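
The curves in Figures 1 and 2 can be reproduced in outline with the following Python sketch, which accumulates accuracy from the most reliable predictions downwards. It works per sequence rather than per reliability bin, which is a simplifying assumption.

import numpy as np

def cumulative_accuracy_curve(correct, reliability):
    # Sort predictions by descending reliability index, then accumulate
    # coverage (fraction of sequences included) and accuracy so far.
    order = np.argsort(reliability)[::-1]
    correct = np.asarray(correct, dtype=float)[order]
    n = np.arange(1, len(correct) + 1)
    coverage = n / len(correct)
    accuracy = np.cumsum(correct) / n
    return coverage, accuracy

cov, acc = cumulative_accuracy_curve([1, 1, 0, 1, 0], [0.9, 0.7, 0.5, 0.3, 0.1])
print(cov)   # [0.2 0.4 0.6 0.8 1. ]
print(acc)   # the final value equals the overall prediction accuracy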

It was also found that for eukaryotic sequences the correct subcellular location has either the highest or second highest neural network output value in 80-91% of all cases, throughout the reliability index groups. Attempts were made to better distinguish between these top two locations. Proteins predicted within the lowest reliability group were again predicted with the appropriate pairwise neural network. However, no improvement could be achieved in this way, as all such proteins were predicted within the lowest reliability group of the pairwise network as well.

The three possible subcellular locations for prokaryotic proteins were cytoplasmic, extracellular and periplasmic. This neural network yielded an overall prediction accuracy of 80.9%, with the accuracy of the various neural networks lying between 78.2 and 83.4% (σ = 1.99), which compares with an accuracy for random predictions of 33.3%. Varying the number of units in the hidden layer from zero to nine does not cause any considerable change in the overall prediction accuracy (80.9 to 81.7%). However, plotting the cumulative prediction accuracy for the different reliability groups against the cumulative number of sequences covered by the groups, as shown in Figure 2a, showed that again the neural networks behave in a fairly uniform way once at least one hidden unit is present, while the network with no hidden units clearly differs in its pattern from the others. Although the difference in the overall prediction accuracy between the networks with and without hidden units is not as large as for eukaryotic sequences, the deviating behaviour of the network without hidden units indicates that additional information is gained by the introduction of hidden units. Overall, the architecture with three units within the hidden layer performs slightly better than the rest, with 33.4% of all sequences predicted at an accuracy of 91.0% and another 21.3% at an accuracy of 84.9%. As for eukaryotes, a further plot of specificity against sensitivity for the best network, subdivided by location, is shown in Figure 2b, with similar conclusions.


Figure 2. (a) The cumulative prediction accuracy and percentage of sequences for prokaryotic proteins in groups with different reliability indices was calculated as for eukaryotic proteins. Plotting both variables against each other for a number of neural networks with a varying number of units within the hidden layer reveals that the neural network with no hidden units shows a pattern strongly deviating from those of the other neural networks. (b) As for (a) except showing only the results for the best network, with data broken down by location.


Figure 3. (a) For each predicted sequence the sequence with the highest sequence identity within the training set of the neural network was determined. The predicted sequences were then grouped according to this similarity in 5% steps for eukaryotic proteins and 10% steps for prokaryotic proteins (for the latter there were too few sequences to use 5% steps). For each group of sequences obtained in this way the prediction accuracy was determined. (b) For each predicted sequence the sequence with the highest sequence identity within the training set of the neural network was determined. Sequence pairs that had the same location were then grouped according to this similarity in 5% steps for eukaryotic proteins and 10% steps for prokaryotic proteins, as for (a). The fraction with identical prediction (both correct or both incorrect) was determined for each group.

In the introduction we claimed that a method based on composition would be more robust to errors in 5′ gene annotation than other methods. We tested this by repeating the generation of both eukaryotic and prokaryotic networks with identical training and test data but with the leading 10 amino acids removed, to represent the effect of such uncertainty. The accuracies changed little (63.5 instead of 66.1% for eukaryotic and 80.5 instead of 80.9% for prokaryotic proteins), leading us to conclude that the method is robust in this respect.

Tests on new data

The specific requirements during training of the neural networks led to only one third of the available data being included in the training set. As it is well known that neural networks tend to improve with the amount of data presented to them during training, it was interesting to see how a network trained with a much larger number of sequences than the jack-knifing procedure allows would perform on independent data. Final versions of both the eukaryotic and prokaryotic neural networks were created using nine tenths of the available data for training and one tenth as a test set to prevent over-training on the specific data set. These neural networks were trained on ~2.5 times more data than those used for cross-validation. They were used to predict the subcellular location of proteins which appeared as new in SWISSPROT database release 34 or 35. These new sequences can be treated as a randomly chosen and completely independent data set that should have no systematic connection to the training set used.

For eukaryotic proteins the prediction accuracy increases by ~1% (from 66.1 to 67.0%). With at least 50 sequences within each group and 749 sequences altogether this outcome is not very likely to be due to random fluctuations. For prokaryotic proteins the situation is less clear. Although the calculation shows an increased prediction accuracy (from 80.9 to 82.7%), the very small number of sequences within the extracellular (only six sequences) and periplasmic groups (only 25 sequences) makes an influence of random fluctuations on the outcome quite possible.

This method relies on sequence composition, which is fairly orthogonal to sequence homology (data not shown); however, since training and testing sequences share some sequence homology, it was felt that any effect of this on prediction accuracy should be checked for. The new sequences from SWISSPROT releases 34 and 35 were therefore grouped according to their highest similarity to a sequence in the training set. The prediction accuracy for each group was calculated and the results are shown in Figure 3a. For prokaryotic proteins, sequences with higher similarity to those within the training set do not achieve higher prediction accuracy than less similar ones, and for eukaryotic proteins only a slight trend seems to exist. Another possibility is that similar sequences are more likely to be predicted in the same way as the closest training set member as the homology between them increases. Figure 3b shows that for pairs which share the same location there is only a weak correlation between sequence similarity and the chance of the same prediction (i.e. both predicted right or both predicted wrong). We conclude that the effect of sequence homology in biasing this prediction is minimal and does not decrease the value of the prediction methods.
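
The grouping behind Figure 3a can be sketched in Python as follows; percent_identity is again a hypothetical stand-in for a pairwise alignment routine, and the binning convention is an assumption.

import numpy as np
from collections import defaultdict

def accuracy_by_similarity(new_seqs, correct, train_seqs,
                           percent_identity, bin_width=5.0):
    # Assign each new sequence to a bin by its highest identity to any
    # training sequence, then compute the prediction accuracy per bin.
    bins = defaultdict(list)
    for seq, is_correct in zip(new_seqs, correct):
        best = max(percent_identity(seq, t) for t in train_seqs)
        bins[int(best // bin_width)].append(is_correct)
    return {b * bin_width: float(np.mean(v)) for b, v in sorted(bins.items())}

def toy_identity(a, b):
    # Crude positional identity for illustration only.
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))

print(accuracy_by_similarity(["MKTA", "GGGG"], [True, False],
                             ["MKTT", "AAAA"], toy_identity))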

DISCUSSION

Amino acid composition alone has been shown to contain sufficient information to distinguish proteins of different subcellular locations at a detailed level.

At the present level of prediction accuracy the method is not reliable enough for eukaryotic proteins to be used for blindly assigning a subcellular location to large numbers of potential proteins. It can, however, be used to give initial clues for further analysis. A further improvement in prediction accuracy may be achieved by passing ambiguous cases to an expert system for a final decision. This approach appears to be especially promising, as proteins from one of the two subcellular locations that are hard to distinguish (cytoplasmic and mitochondrial) feature targeting sequences, although the previously mentioned problem of incorrect assignments in the 5′-region of automatically annotated genes is an issue when doing this.

As was shown through a test on independent data, the prediction accuracy can be improved by including more sequences in training of the neural network, although the increase of only ~1% in prediction accuracy for the eukaryotic neural network indicates that an upper limit may have been reached in this case. Taking into account that the cross-validation test for this network showed very good convergence (σ = 1.55%), it seems likely that the amount of data used for initial training was already sufficient, which explains why only fairly small improvements can be achieved by including more sequences in the training. The neural network for prediction of prokaryotic sequences, on the other hand, converged somewhat less well (σ = 1.99%), indicating that additional sequences in training may result in a further improvement in prediction accuracy.

The prediction method is available on the world wide web at location <URL: http://predict.sanger.ac.uk/nnpsl>. An Email-based service for large numbers of sequences will be made available at the same address shortly. The site includes a link to the world wide web-based service for prediction of transmembrane proteins (12), to make it easier for users to test sequences for transmembrane regions before making a subcellular location prediction. Large non-membrane spanning domains of transmembrane proteins can be predicted by the described method, by handling them as independent protein chains. Also, work is in progress to add predicted subcellular location annotations to TREMBL (15,16) entries based on a combination of transmembrane prediction and this method.

ACKNOWLEDGEMENTS

A.R. thanks the Stiftung Stipendien-Fonds des Verbandes der Chemischen Industrie e.V. for support by way of a Kekulé scholarship and the Wellcome Trust and ZENECA for support.

REFERENCES

1. Himmelreich,R., Hilbert,H., Plagens,H., Pirkl,E., Li,B.-C. and Herrmann,R. (1996) Nucleic Acids Res., 24, 4420-4449.

2. Bult,C.J., White,O., Olsen,G.J., Zhou,L., Fleischmann,R.D., Sutton,G.G., Blake,J.A., Fitzgerald,L.M., Clayton,R.A., Gocayne,J.D., et al. (1996) Science, 273, 1058-1073.

3. Bork,P., Ouzounis,C. and Sander,C. (1994) Curr. Opin. Struct. Biol., 4, 393-403.

4. Nakashima,H. and Nishikawa,K. (1994) J. Mol. Biol., 238, 54-61.

5. Cedano,J., Aloy,P., Perez-Pons,J.A. and Querol,E. (1997) J. Mol. Biol., 266, 594-600.

6. Nakai,K. and Kanehisa,M. (1992) Genomics, 14, 897-911.

7. Nakai,K. and Kanehisa,M. (1991) Proteins Struct. Funct. Genet., 11, 95-119.

8. Chou,K.-C. (1995) Proteins Struct. Funct. Genet., 21, 319-344.

9. Eisenhaber,F., Frömmel,C. and Argos,P. (1996) Proteins Struct. Funct. Genet., 25, 169-179.

10. Rost,B. and Sander,C. (1994) Proteins Struct. Funct. Genet., 19, 55-72.

11. Bairoch,A. and Boeckmann,B. (1993) Nucleic Acids Res., 21, 3093-3096.

12. Rost,B., Casadio,R., Fariselli,P. and Sander,C. (1995) Protein Sci., 4, 521-533.

13. Nakashima,H. and Nishikawa,K. (1992) FEBS Lett., 303, 141-146.

14. Zell,A., Mamier,G., Vogt,M., Mache,N., Hübner,R., Döring,S., Herrmann,K.-U., Soyez,T., Schmalzl,M., Sommer,T., Hatzigeorgiou,A., Posselt,D., Schreiner,T., Kett,B., Clemente,G. and Wieland,J. (1995) SNNS: Stuttgart Neural Network Simulator. University of Stuttgart, Institute for Parallel and Distributed High Performance Systems (IPVR), Stuttgart, Germany.

15. Bairoch,A. and Apweiler,R. (1997) Nucleic Acids Res., 25, 31-36.

16. Apweiler,R., Gateau,A., Contrino,S., Martin,M.J., Junker,V., O'Donovan,C., Lang,F., Mitaritonna,N., Kappus,S. and Bairoch,A. (1997) In Gaasterland,T., Karp,P., Karplus,K., Ouzounis,C., Sander,C. and Valencia,A. (eds), Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology (ISMB), Greece, pp. 33-43.