Biochimie 104 (2014) 100-107
Contents lists available at ScienceDirect
Biochimie
journal homepage: www.elsevier.com/locate/biochi

Research paper

Prediction of bacterial protein subcellular localization by incorporating various features into Chou's PseAAC and a backward feature selection approach

Liqi Li a,1, Sanjiu Yu b,1, Weidong Xiao a, Yongsheng Li c, Maolin Li a, Lan Huang b, Xiaoqi Zheng d,*, Shiwen Zhou e,**, Hua Yang a,***

a Department of General Surgery, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
b Institute of Cardiovascular Diseases of PLA, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
c Institute of Cancer, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
d Department of Mathematics, Shanghai Normal University, Shanghai 200234, China
e National Drug Clinical Trial Institution, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China

Article history: Received 1 May 2014; Accepted 1 June 2014; Available online 11 June 2014

Keywords: Protein subcellular localization; Position-specific score matrix; Gene ontology; PROFEAT; Backward feature selection

* Corresponding author (X. Zheng). Tel./fax: +86 021 64324284.
** Corresponding author (S. Zhou). Tel./fax: +86 023 68774105.
*** Corresponding author (H. Yang). Tel./fax: +86 023 68774605.
1 These authors contributed equally to this work.
http://dx.doi.org/10.1016/j.biochi.2014.06.001
0300-9084/© 2014 Elsevier Masson SAS. All rights reserved.

Abstract

Information on the subcellular localization of bacterial proteins is essential for protein function prediction, genome annotation and drug design. Here we propose a novel approach to predict the subcellular localization of bacterial proteins by fusing features from the position-specific score matrix (PSSM), Gene Ontology (GO) and PROFEAT. A backward feature selection approach based on a linear-kernel SVM was then used to rank the integrated feature vectors and extract the optimal features. Finally, an SVM was applied to predict protein subcellular locations based on these optimal features. To validate the performance of our method, we employed jackknife cross-validation tests on three low-similarity datasets, i.e., M638, Gneg1456 and Gpos523. Overall accuracies of 94.98%, 93.21% and 94.57% were achieved for these three datasets, which are higher (by 1.8% to 10.9%) than those of state-of-the-art tools. The comparison results suggest that our method could serve as a very useful vehicle for expediting the prediction of bacterial protein subcellular localization.

1. Introduction

Subcellular localization is a fundamental biological attribute of proteins. Determination of protein subcellular localization (PSL) can provide valuable information for explaining protein functions, elucidating the interactions between different proteins and other molecules, and understanding the mechanisms of human diseases. For example, determination of bacterial PSL is used for screening and prioritizing suitable bacterial proteins as drug targets [1]. Nowadays, with the development of high-throughput technology, a huge number of protein sequences are being identified and deposited in public biological databanks.
According to the statistical release of February 2014, UniProtKB/Swiss-Prot contains 542,258 sequence entries, whereas the number was just 3939 in 1986. Thus, identifying PSL by experimental approaches alone has become a resource-intensive, time-consuming and impractical task. Therefore, reliable and effective computational methods for PSL prediction are urgently needed.

Many computational methods have been developed to identify PSL in various organisms during the past few years [2-4]. As a typical pattern recognition problem, computational approaches for PSL consist of three main steps: (i) protein feature representation; (ii) optimal feature selection; (iii) algorithm selection for classification. The most crucial factor is the extraction of protein features for prediction. Two kinds of features are generally used: sequence-based and biological. The former are derived from protein sequences, including amino acid compositions, N-terminal amino acid sequences, and pseudo-amino acid compositions [5,6]. The latter are derived from the physico-chemical properties of amino acids or obtained by using detection models provided in biological databases, such as PROSITE, Pfam and Gene Ontology (GO) [7-9]. As evidenced in numerous prediction tasks, a carefully engineered and integrated feature model generally offers higher accuracy and stability than a model with a single feature. To this end, our principal features were constructed by integrating features from the physico-chemical properties of amino acids, the PSI-BLAST profile and GO annotations of proteins.

After feature representation, selecting a proper and reliable classification algorithm is a crucial step towards a satisfactory result. In recent decades, many computational algorithms have been proposed to predict PSL [10-12], such as the k-nearest neighbor algorithm, hidden Markov models, artificial neural networks and the support vector machine (SVM). Among them, SVM is particularly attractive for prediction analysis due to its computational efficiency in processing multidimensional datasets with complex relationships among the data elements [13]. Moreover, SVM is readily adaptable to new data, allowing model updates in parallel with the continuing growth of biological databases. Thus it has been widely applied in many models for identifying PSL, such as CELLO, PSLpred and PSORTb 3.0 [14-16], and serves as the prediction algorithm in our model.

Table 1
The detailed information of the three datasets used in our predictor.

M638                                Gneg1456                            Gpos523
Subcellular localization  Proteins  Subcellular localization  Proteins  Subcellular localization  Proteins
Cytoplasmic               265       Cell inner membrane       557       Cell membrane             174
Integral membranes        314       Cell outer membrane       124       Cell wall                 18
Secretory                 29        Cytoplasm                 410       Cytoplasm                 208
Attached to the membrane  30        Extracellular             133       Extracell                 123
                                    Fimbrium                  32
                                    Flagellum                 12
                                    Nucleoid                  8
                                    Periplasm                 180
Total                     638       Total                     1456      Total                     523

Before training the SVM, a feature selection tool is needed to eliminate noise from the integrated features and to obtain informative features from the initial vectors. Selection techniques can be organized into three categories: filter, wrapper and embedded methods [17]. In most filter selection tools, such as information gain, Euclidean distance, t-test and χ²-statistics, a feature relevance score is calculated and low-scoring features are removed. Because they focus on ranking individual features, these tools often ignore feature dependencies, resulting in poor classification performance. On the other hand, wrapper selection tools, such as sequential forward selection (SFS), fast correlation-based filter (FCBF) and genetic algorithms, can rank features as subsets, but they are very computationally intensive and have a high risk of overfitting. As a powerful embedded method, SVM-RFE avoids these problems by taking feature correlations into account and removing only one feature at a time from the full feature vector. It is thus much more robust to data overfitting than other feature selection techniques [18], and was adopted in our model.

In this work, an SVM-based model was developed to further improve the prediction of bacterial PSL by recursively selecting features from the PSI-BLAST profile, physical-chemical properties and protein functional annotations. Before being input to an SVM classifier to perform the prediction, the original profile was transformed by a simple but well-established sequence representation model. The prediction quality was examined by jackknife tests on three widely used benchmark datasets. The results indicate that our method yields significant improvements in predictive accuracy compared with other existing methods.

As summarized in a comprehensive review [19] and demonstrated in a series of recent articles [20-27], to establish a really useful statistical predictor for a biological system, we need to consider the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the biological samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the attribute to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (v) establish a user-friendly web-server for the predictor that is accessible to the public. Below, we describe how to deal with these steps one by one.

2. Materials and methods

2.1. Datasets

In this study, three benchmark datasets were used to evaluate the performance of the proposed method. The first is M638 [28], which includes 638 proteins with 40% pairwise sequence identity. The second dataset is Gneg1456 [1], which includes 1456 locative proteins (1392 different proteins). As described in many references (e.g., [2,29,30]), to avoid homology bias while covering as many locations as possible, we conducted a sequence identity cutoff procedure to make sure that no sequence has ≥25% sequence identity to any other in the same subcellular location. The third dataset, Gpos523 [31], consists of 523 Gram-positive bacterial protein sequences from 519 different proteins and has less than 25% pairwise sequence similarity within each subcellular localization. Table 1 lists the detailed information of these three datasets.

2.2. Feature preparation

To develop a powerful predictor for a protein system, one of the keys is to formulate the protein samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted [19].
To realize this, the concept of pseudo amino acid composition [32], or Chou's PseAAC [33], was proposed to replace the simple amino acid composition (AAC) for representing the sample of a protein. Ever since the concept of PseAAC was introduced, it has been widely used to study various problems in proteins and protein-related systems (see, e.g., [11,28,34-44]). For the various modes of PseAAC, see Ref. [45]. Because it has been widely and increasingly used, in addition to the web-server 'PseAAC' [46] built in 2008, three powerful open-access software packages, called 'PseAAC-Builder' [47], 'propy' [48] and 'PseAAC-General' [49], were recently established: the former two generate various modes of Chou's special PseAAC, while the third generates those of Chou's general PseAAC. According to a recent comprehensive review [19], the general form of PseAAC can be formulated as (see Eq. (6) of [19]):

$$\mathbf{P} = [\psi_1 \ \psi_2 \ \cdots \ \psi_u \ \cdots \ \psi_\Omega]^{\mathbf{T}} \tag{1}$$

where T is the transpose operator, the subscript Ω is an integer, and its value as well as the components ψ1, ψ2, … depend on how the desired information is extracted from the amino acid sequence of P. Here, we use a combination of evolutionary information, physicochemical/structural features and GO information to represent the protein samples via the PseAAC of Eq. (1). Thus Ω is the length of the initial feature vector for P obtained by combining the above features.

Fig. 1. Top33 features in the three datasets.

2.2.1. Evolutionary information from PSI-BLAST profile

Evolutionary information extracted from the PSSM profile was chosen as the feature descriptor in this study. The PSSM profile for each protein was generated using PSI-BLAST by searching the protein against the non-redundant (NR) database obtained from NCBI. The parameters j and h were set to 3 and 0.001, respectively.
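As a rough illustration of this PSI-BLAST step, the sketch below assembles a command line for the BLAST+ `psiblast` program. It rests on assumptions that are not stated in the paper: that j and h correspond to the legacy blastpgp options -j (number of iterations) and -h (E-value threshold for inclusion in the profile), whose BLAST+ counterparts are -num_iterations and -inclusion_ethresh, and that the database is called "nr". The function name and file names are hypothetical placeholders. The function only builds the argument list; it does not run the search.

```python
import shlex


def psiblast_command(query_fasta, db="nr", iterations=3,
                     inclusion_evalue=0.001, pssm_out="query.pssm"):
    """Build (but do not run) a BLAST+ psiblast command line.

    iterations and inclusion_evalue mirror j = 3 and h = 0.001 from the text,
    assuming these are the blastpgp -j and -h options.
    """
    return [
        "psiblast",
        "-query", query_fasta,                         # hypothetical input file
        "-db", db,                                     # assumed database name
        "-num_iterations", str(iterations),
        "-inclusion_ethresh", str(inclusion_evalue),
        "-out_ascii_pssm", pssm_out,                   # hypothetical output file
    ]


cmd = psiblast_command("P35672.fasta")
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run` on a machine where BLAST+ and a local copy of the database are installed.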
The PSSM elements were mapped to the range [0, 1] according to a standard sigmoid function as follows:

$$f(x) = \frac{1}{1 + e^{-x}} \tag{2}$$

where x represents the original PSSM value.

Next, we employed linear predictive coding (LPC) to optimally parameterize the signal. The PSSM profile for each sequence is an L × 20 matrix, where L is the length of the corresponding sequence. For each column in the PSSM, LPC was used to extract p features, where p is the order of the prediction filter polynomial. Each PSSM profile could then be transformed into a 20 × p feature vector for each protein. The value of p here reflects the order of the Markov model. Given the scale of the protein datasets in this study, the value of p was generally considered to be no more than 10, but too small a value of p would lose much sequence-order information. Therefore, we chose values of p from 3 to 10.

2.2.2. Physicochemical properties and structural features by PROFEAT

PROFEAT is employed for computing commonly used physicochemical and structural features of proteins and peptides from their amino acid sequences [50-52]. The computed features include the normalized Moreau-Broto autocorrelation, Geary autocorrelation, Moran autocorrelation, dipeptide composition, sequence-order-coupling number, quasi-sequence-order descriptors and various physicochemical properties and structural features. In this study, by uploading the primary sequence of a query protein and selecting all the PROFEAT features, we obtained a 1080-dimensional PROFEAT feature vector for each query protein.

2.2.3. Gene ontology formulation

Gene ontology data are available from the UniProt GO database (released on March 2, 2014). We first searched all accession numbers in the three datasets against the GO database to find the corresponding GO numbers. Note that the currently available GO terms did not cover all proteins.
Hence, for a protein P without known GO terms, we used BLAST to search for its homologous proteins under the expect parameter E ≤ 0.001, and collected proteins with ≥60% pairwise sequence similarity to P. We then used the geometrical center of the GO features of these homologs to represent protein P. After this step, 996, 1597 and 819 different GO terms were obtained for the M638, Gneg1456 and Gpos523 datasets, respectively. To simplify the protein representation in all datasets, we created a vector to represent the GO terms of each protein as described in Ref. [53]. Finally, each protein entry was represented as a feature vector of dimension 996, 1597 and 819 for M638, Gneg1456 and Gpos523, respectively.

Because of its low pairwise sequence similarity and its large size, which boosts the statistical power, Gneg1456 was chosen in this study to optimize the parameters in LIBSVM and was used to predict the subcellular location of a query protein. To ensure the consistency of our model among the three datasets, feature extraction was based on only one dataset. Despite the different numbers of GO features in the three datasets, SVM-RFE ranked the integrated features separately in each dataset. The optimal dimension of the top feature vectors was then calculated based on Gneg1456 and was therefore consistent among the three datasets.

2.3. Backward feature selection

The representation of each protein described above has thousands of features, which leads to a high computational cost in machine learning. Moreover, noise in the data also degrades the prediction performance. To find the informative features and reduce the computational cost, we used backward feature selection to rank the features. The initial feature vector for each protein was constructed by combining the PSSM, PROFEAT and GO features.
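The backward selection idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' LIBSVM-based implementation: it handles a two-class problem only, trains the linear SVM by plain subgradient descent on the hinge loss (the hyperparameters lam, lr and epochs are arbitrary choices), and uses the squared weight of each feature as the ranking criterion, removing one feature per round. All function names are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b of a linear SVM by subgradient descent on the hinge loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            violated = margin < 1.0
            for j in range(n_features):
                # L2 shrinkage plus the hinge term when the margin is violated.
                grad = lam * w[j] - (yi * xi[j] if violated else 0.0)
                w[j] -= lr * grad
            if violated:
                b += lr * yi
    return w, b


def svm_rfe_ranking(X, y):
    """Return feature indices ordered from most to least important."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(X_sub, y)
        # Ranking criterion: squared weight; drop the weakest feature.
        weakest = min(range(len(remaining)), key=lambda k: w[k] ** 2)
        eliminated.append(remaining.pop(weakest))
    return eliminated[::-1]


# Toy data: feature 0 separates the classes, feature 1 is small noise,
# feature 2 is constant.
X = [[2.0, 0.1, 0.0], [1.5, -0.1, 0.0], [-2.0, 0.1, 0.0], [-1.5, -0.1, 0.0]]
y = [1, 1, -1, -1]
ranking = svm_rfe_ranking(X, y)
top_k = ranking[:1]  # keep the top K features, here K = 1
print(ranking, top_k)
```

Retraining after every single removal is what makes this procedure expensive for vectors with thousands of features, since the number of SVM fits equals the number of features.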
For each dataset, the feature vectors of all proteins constituted a feature matrix, in which each row corresponded to a sample and each column to a feature. SVM-RFE was then implemented by training an SVM with a linear kernel on the feature matrix. Finally, we obtained the top K features by eliminating the features with the smallest ranking criteria, and used them in the subsequent steps.

2.4. The SVM ensemble classifier

The support vector machine (SVM) was adopted in this study. The basic idea behind SVM is to represent an example as a point in a high-dimensional feature space and then assign it to a category based on the optimal separating hyperplane [54,55]. In this study, we employed the package LIBSVM 3.17 to conduct the SVM prediction [56,57]. The top feature vectors obtained by backward feature selection were used for training one-versus-one SVMs. Thus, for a Q-class problem, Q(Q - 1)/2 independent binary SVMs were trained. As the performance of an SVM is affected by the choice of kernel function, the popular radial basis function (RBF) kernel was adopted for its good general performance and small number of parameters. Two free parameters (the parameter γ of the RBF kernel and the regularization parameter C) needed to be optimized on the Gneg1456 dataset through a grid search strategy. However, feature vectors optimized on different datasets may also differ subtly (Fig. 1). Finally, the SVM module predicted the subcellular location of a protein using the top features and the optimal combination of the two parameters.

2.5. Evaluation parameters

To evaluate these parameters, we applied the jackknife test [20,58] in this prediction model. In brief, each protein in the learning dataset was singled out in turn as a test protein. The classifier was
Three widely used measures were used to estimate the performance of our method, which were: ACC (accuracy), OA (overall accuracy) and MCC (Matthews correlation coefficient) [59,60]. The measures were calculated as: ACCi ¼ TPi Vi � 100% (3) OA ¼ PQ i¼1 TPi V � 100% (4) MCCi ¼ TPi � TNi � FPi � FNiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiðTPi þ FPiÞðTPi þ FNiÞðTNi þ FPiÞðTNi þ FNiÞp (5) Here, V and Vi respectively represented the numbers of pro- teins in total and in class i. Q was the class number. TPi, FPi, TNi and FNi denoted true positives, false positives, true negatives, and false negatives in class i, respectively. It is instructive to point out that the above equation set is often used in literature [20e22,24,25,61,62] for examining the performance quality of a predictor. The set of metrics is valid only for the single-label systems. For the multi-label systems whose existence has become more frequent in system biology [5,63] and system medicine [64,65], a completely different set of metrics as defined in Ref. [66] is needed”. To provide an intuitive picture, a flowchart was provided in Fig. 2 to illustrate the prediction process of our method. 3. Results and discussion 3.1. Parameter selection In this study, a grid search strategy was implemented in the Gneg1456 dataset for selection of parameter g of RBF kernel and regularization parameter C in LIBSVM. Both of the two parameters depend on the dimension Dim of top feature vector. Firstly, we constructed an initial feature vector by combining the lpc3, lpc4, lpc5, …, lpc10, PROFEAT, and GO features. 
Then, according to their m � n � ¼ � 1; if all the subcellular locations of the nth protein are exactly predicted without any overprediction or underprediction 0; otherwise (7) importance, a ranking list of all the features was returned based on backward feature selection. After this step, we calculated the pre- diction accuracies for top N features, where N ¼ 10 � 2n�1 (n ¼ 1, 2, 3, …, 9), and found that the accuracy at top40 (n ¼ 3) was the highest for Gneg1456 dataset. So the best overall prediction accu- racy would appear around top40. Therefore, the further feature selection was evenly chosen in a small range from top20 to top80 (correspond to n ¼ 2 to n ¼ 4, respectively) with the step of 1. Results showed that the accuracies in the range of [top33, top49] were all the highest (Fig. 3). As the dimension of top33 was the smallest among them, thus we selected top33 as the best selected features. Therefore, top33 features and the corresponding param- eters (C ¼ 0.5, g ¼ 0.0078125, and Dim ¼ 33) were selected as the optimal parameter group to calculate the accuracies for all three datasets. As shown in Fig. 1, PROFEAT consistently makes up the majority of Top33 features in each dataset, followed by GO and PSSM in turn. The numbers of PROFEAT features in the Top33 selected features weremore than 15 for all datasets. For instance, the number was up to 20 for M638. These results indicated that the subcellular locali- zation of a protein could be characterized by the physicalechemical properties reflected by PROFEAT features. 3.2. Comparison with other methods In this study, all the three datasets were used to test our method. We first compared the performance of our method on dataset M638 with the method proposed by Fan and Li [28]. As shown in Table 2, our method achieved an overall accuracy of 94.98%, which was 7.21% higher than that produced with the Fan and Li's. 
With respect to accuracy and MCC of a specific localization, our method also performed better in most localization sites. For Gneg1456, our method achieved an overall accuracy of 93.21%, which was higher than those achieved using methods listed in Table 3 (from 1.81% to 7.51%). For Gpos523, our method achieved an accuracy of 94.57% and also outperformed all other method listed in Table 4 (from 1.45% to 12.37%). Then we implemented receiver operating characteristic (ROC) curves on three datasets since ROC curvewas applicable to evaluate the prediction performance of a binary classifier. However, protein subcellular location prediction was a multi-class prediction prob- lem. To solve this problem, we first transformed protein subcellular location prediction to multiple binary classifiers using one-versus- rest strategy, and then averaged all the binary ROC curves as the final output of a method. Fig. 4 showed the averaged ROC curves for Gneg1456 by this method and the other three approaches. The area under curve (AUC) of this method was 0.9882, which was signifi- cantly higher than those by PSSM, PROFEAT and GO features indi- vidually (AUCs are 0.7969, 0.7857 and 0.8115, respectively). Similar results were obtained for the other two datasets (Figs. S1eS2). To provide a more strict measure, we introduce “absolute true overall accuracy”, which was defined by L ¼ PV 0 n¼1 mðnÞ V 0 (6) Here, L was the “absolute true overall accuracy”. V 0 was the number of total proteins investigated. According to the above definition, onlywhen all the subcellular locations of a query protein n were correctly predicted, m(n) could equal 1. However, even if using such a stringent measure on Gneg1456, the “absolute true overall accuracy” achieved by this method was 1322/1456 ¼ 90.8%, which was still 0.9% higher than that by iLoc-Gneg. As is well known, proteins may simultaneously exist at, or move between, two or more different subcellular locations. 
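For proteins with more than one location, the "absolute true" criterion amounts to an exact match between the predicted and the true sets of locations. The following sketch shows one way to compute it; the function name and the four example proteins are made up for illustration and are not taken from the benchmark datasets.

```python
def absolute_true_accuracy(true_sets, predicted_sets):
    """Fraction of proteins whose predicted location set equals the true set.

    A protein counts as a hit only if there is no overprediction and no
    underprediction, i.e. the two sets are identical.
    """
    if len(true_sets) != len(predicted_sets):
        raise ValueError("inputs must have the same length")
    hits = sum(1 for t, p in zip(true_sets, predicted_sets) if set(t) == set(p))
    return hits / len(true_sets)


# Hypothetical example with four proteins.
true_locs = [
    {"Cytoplasm"},
    {"Periplasm"},
    {"Cell inner membrane", "Cytoplasm"},
    {"Extracellular"},
]
pred_locs = [
    {"Cytoplasm"},
    {"Periplasm", "Extracellular"},        # overprediction, counts as a miss
    {"Cytoplasm", "Cell inner membrane"},  # same set, order is irrelevant
    {"Extracellular"},
]
score = absolute_true_accuracy(true_locs, pred_locs)
print(f"{score:.2%}")  # 75.00%
```

With single-location proteins only, this measure reduces to the ordinary fraction of correctly predicted proteins.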
Proteins with multiple locations or dynamic features of this kind are particularly interesting because they may have some very special biological functions intriguing to investigators in both basic research and drug discovery. In this study, however, we did not cover multiplex proteins because the number of multiplex proteins in the organisms investigated is not large enough to construct statistically meaningful benchmark datasets for studying the case of multiple locations. The following web-servers, iLoc-Euk [63], iLoc-Hum [5], iLoc-Plant [67], iLoc-Gpos [31], iLoc-Gneg [1] and iLoc-Virus [68], can be used to cope with the multiple-location problem in eukaryotic, human, plant, Gram-positive, Gram-negative and virus proteins, respectively.

Fig. 2. The pipeline that goes from the query sequence to the final output and all intermediate steps.

Table 2
Prediction performance comparisons based on the dataset M638 (jackknife test).

                          SVM by Fan and Li [28]  The proposed method
Subcellular localization  Accuracy (%)  MCC       Accuracy (%)  MCC
Attached to the membrane  97.49         0.695     0.00          -
Cytoplasmic               92.48         0.847     100.00        0.9936
Integral membranes        89.18         0.784     100.00        0.9101
Secretory                 96.39         0.481     93.10         0.9633
Overall accuracy          87.77         -         94.98         -

Fig. 3. Comparison of prediction accuracies of different top features.

Table 3
Prediction performance comparisons based on the dataset Gneg1456 (jackknife test).
Methods: Gneg-mPLoc by Shen and Chou [70]; iLoc-Gneg by Xiao et al. [1].

                          Gneg-mPLoc [70]  iLoc-Gneg [1]  The proposed method
Subcellular localization  Accuracy (%)     Accuracy (%)   Accuracy (%)  MCC
Cell inner membrane       94.3             96.8           98.38         0.9654
Cell outer membrane       84.7             83.1           91.87         0.9277
Cytoplasm                 87.1             89.5           89.88         0.9102
Extracellular             59.4             86.5           95.38         0.9615
Fimbrium                  87.5             93.8           93.75         0.9514
Flagellum                 0.0              100.0          0.00          -
Nucleoid                  0.0              50.0           0.00          -
Periplasm                 85.6             89.4           94.41         0.9643
Overall accuracy          85.7             91.4           93.21         -

Table 4
Prediction performance comparisons based on the dataset Gpos523.
Methods: Gpos-mPLoc by Shen and Chou [71]; iLoc-Gpos by Wu et al. [31]; PSSM + RBF by Huang and Yuan [42].

                          Gpos-mPLoc [71]  iLoc-Gpos [31]  PSSM+RBF [42]  The proposed method
                          Jackknife        Jackknife       5-fold cross   Jackknife
Subcellular localization  Accuracy (%)     Accuracy (%)    Accuracy (%)   Accuracy (%)  MCC
Cell membrane             -                95.98           -              98.26         0.9781
Cell wall                 -                66.67           -              0.00          -
Cytoplasm                 -                95.19           -              97.09         0.9675
Extracell                 -                89.43           -              98.35         0.8906
Overall accuracy          82.2             93.12           83.66          94.57         -

Fig. 4. The ROC curves of the Gneg1456 dataset.

3.3. Case study

We next predicted the subcellular locations of 17 intestinal bacterial proteins using our new method. These proteins may disrupt the intestinal epithelial barrier and be involved in the pathogenesis of inflammatory bowel disease and even colorectal cancer. As shown in Table 5, all 17 proteins were correctly assigned to their subcellular locations by our predictor trained on Gneg1456. For example, O34163 is an extracellular protein that is able to induce colonic lesions and to disrupt the integrity of epithelial cell monolayers. In our study, this protein was consistently predicted as an extracellular protein by our predictor on Gneg1456 and Gpos523. Another example is P35672, an outer membrane protein, which is involved in cancer cell invasion in the intestinal epithelium and could be necessary for the export of invasion-related determinants. Our predictor trained on Gneg1456 also correctly predicted it as an outer membrane protein. These results indicate that our method is appropriate for bacterial PSL prediction.

Table 5
Examples to show the predicted results by different predictors.
Methods: Gneg-mPLoc by Shen and Chou [70]; iLoc-Gneg by Xiao et al. [1]; the proposed method trained by Gneg1456 and by Gpos523.

Accession  Entry name   Subcellular location  Gneg-mPLoc [70]         iLoc-Gneg [1]         Proposed (Gneg1456)   Proposed (Gpos523)
O34163     ACP_BRAHO    Extracell             Cytoplasm               Extracell             Extracell             Extracell
P35672     INVG_SALTY   Cell outer membrane   Cell outer membrane, Extracell, Fimbrium (a)  Cell outer membrane   Cell membrane
P07965     HST3_ECOLX   Extracell             Extracell               Extracell             Extracell             Cytoplasm
O82882     STCE_ECO57   Extracell             Extracell               Extracell             Extracell             Extracell
P22542     HSTI_ECOLX   Extracell             Cytoplasm, Extracell    Extracell             Extracell             Extracell
Q8X582     ELFA_ECO57   Fimbrium              Extracell, Fimbrium     Fimbrium              Fimbrium              -
P01559     HST1_ECOLX   Extracell             Extracell               Extracell             Extracell             Extracell
Q47185     HST2_ECOLX   Extracell             Extracell               Extracell             Extracell             Extracell
P07593     HSTA_YEREN   Extracell             Extracell               Extracell             Extracell             Extracell
P23024     TCPA_VIBCL   Fimbrium              Extracell               Extracell             Fimbrium              -
E1W8M5     FOXA_SALTS   Cell outer membrane   Cell outer membrane     Cell outer membrane   Cell outer membrane   Cell membrane
P74977     HSTB_YEREN   Extracell             Extracell               Extracell             Extracell             Extracell
P80672     PORA_CAMJE   Cell outer membrane   Cell outer membrane     Cell outer membrane   Cell outer membrane   Cell membrane
P0A1I4     INVA_SALTI   Cell inner membrane   Cell inner membrane     Cell inner membrane   Cell inner membrane   Cell membrane
A5F7A4     NANH_VIBC3   Extracell             Extracell               Extracell             Extracell             Extracell
P0C6E9     NANH_VIBCH   Extracell             Extracell               Extracell             Extracell             Extracell
P0A1I3     INVA_SALTY   Cell inner membrane   Cell inner membrane     Cell inner membrane   Cell inner membrane   Cell membrane

(a) The source layout does not show how these three predicted locations divide between the Gneg-mPLoc and iLoc-Gneg columns.

4. Conclusions

In this work, we proposed an SVM-based backward feature selection scheme for predicting bacterial PSL by selecting the optimal features from three kinds of important features, i.e., protein GO function annotation, amino acid physical-chemical properties and the PSI-BLAST profile. The prediction results show that our proposed model achieves high prediction accuracies for all three low-similarity datasets, thus supporting the assumption that a carefully engineered integrated feature model is a strikingly effective way to improve prediction performance. In addition, the results show that backward feature selection can capture the combination effects among different features, making our proposed method powerful and reliable for PSL prediction.

Admittedly, some challenges remain to be addressed in our method. For example, during feature ranking, only the weakest feature was removed from the feature list in each elimination step, so the proposed method suffers from very high computational complexity. Features selected on different benchmark datasets also differ slightly, but this does not severely affect the prediction results of our predictor. Since most datasets have very different location categories, we illustrate this phenomenon only with two datasets that have similar location categories, i.e., Gneg1456 and Gpos523. We predicted the subcellular locations of 15 intestinal bacterial proteins based on these two datasets (Table 5). The results show that 14/15 proteins are predicted correctly and consistently based on Gneg1456 and Gpos523, which indicates that the choice of training dataset does not seriously affect the prediction results of our predictor. In general, our method can effectively capture core features to improve the prediction of bacterial PSL. Future attention could be paid to applying our method to other novel pattern recognition problems, including RNA binding residue prediction, drug efficacy prediction and protein-protein interaction network analysis.

Since user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful models, simulated methods, or predictors [33,69], we shall make efforts in our future work to provide a web-server for the method presented in this paper.

Conflict of interest

The authors declare no competing financial interests.

Acknowledgments

We thank Ning Huang for her constructive comments on the presentation of this paper. This work was partially supported by the National Natural Science Foundation of China (No. 81302134 and No. 31100953), the Program for Changjiang Scholars and Innovative Research Team in University (IRT 13050 to HY), the Innovation Program of Shanghai Municipal Education Commission (No. 12YZ088) and the Program of Shanghai Normal University (DZL121).

Appendix A. Supplementary data

Supplementary data related to this article can be found at http://dx.doi.org/10.1016/j.biochi.2014.06.001.

References

[1] X. Xiao, Z.C. Wu, K.C. Chou, A multi-label classifier for predicting the subcellular localization of gram-negative bacterial proteins with both single and multiple sites, PLoS One 6 (2011) e20592.
[2] K.C. Chou, H.B. Shen, Cell-PLoc: a package of web servers for predicting subcellular localization of proteins in various organisms, Nat. Protoc. 3 (2008) 153-162.
[3] L. Li, Y. Zhang, L. Zou, C. Li, B. Yu, X. Zheng, Y. Zhou, An ensemble classifier for eukaryotic protein subcellular location prediction using gene ontology categories and amino acid hydrophobicity, PLoS One 7 (2012) e31057.
[4] C. Mooney, A. Cessieux, D.C. Shields, G. Pollastri, SCL-Epred: a generalised de novo eukaryotic protein subcellular localisation predictor, Amino Acids 45 (2013) 291-299.
[5] K.C. Chou, Z.C. Wu, X. Xiao, iLoc-Hum: using the accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites, Mol. Biosyst. 8 (2012) 629-641.
[6] A.S.Mer,M.A.Andrade-Navarro,Anovel approach forprotein subcellular location prediction using amino acid exposure,, BMC Bioinformatics 14 (2013) 342. [7] C.J. Sigrist, E. de Castro, L. Cerutti, B.A. Cuche, N. Hulo, A. Bridge, L. Bougueleret, I. Xenarios, New and continuing developments at PROSITE, Nucleic Acids Res. 41 (2013) D344eD347. [8] T. Schuepbach, M. Pagni, A. Bridge, L. Bougueleret, I. Xenarios, L. Cerutti, pfsearchV3: a code acceleration and heuristic to search PROSITE profiles, Bioinformatics 29 (2013) 1215e1217. [9] Z. Ramsak, S. Baebler, A. Rotter, M. Korbar, I. Mozetic, B. Usadel, K. Gruden, GoMapMan: integration, consolidation and visualization of plant gene anno- tations within the MapMan ontology, Nucleic Acids Res. 42 (2014) D1167eD1175. [10] L.Q. Li, Y. Zhang, L.Y. Zou, Y. Zhou, X.Q. Zheng, Prediction of protein subcellular multi-localization based on the general form of Chou's pseudo amino acid composition, Protein Pept. Lett. 19 (2012) 375e387. [11] T.H. Chang, L.C. Wu, T.Y. Lee, S.P. Chen, H.D. Huang, J.T. Horng, EuLoc: a web- server for accurately predict protein subcellular localization in eukaryotes by incorporating various features of sequence segments into the general form of Chou's PseAAC, J. Comput Aided Mol. Des. 27 (2013) 91e103. [12] S. Tang, T. Li, P. Cong, W. Xiong, Z. Wang, J. Sun, PlantLoc: an accurate web server for predicting plant protein subcellular localization by substantiality motif, Nucleic Acids Res. 41 (2013) W441eW447. [13] Y. Dou, B. Yao, C. Zhang, PhosphoSVM: prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector ma- chine, Amino Acids 46 (2014) 1459e1469. [14] C.S. Yu, Y.C. Chen, C.H. Lu, J.K. Hwang, Prediction of protein subcellular localization, Proteins 64 (2006) 643e651. [15] M. Bhasin, A. Garg, G.P. Raghava, PSLpred: prediction of subcellular localiza- tion of bacterial proteins, Bioinformatics 21 (2005) 2522e2524. [16] N.Y. Yu, J.R. Wagner, M.R. 
Laird, G. Melli, S. Rey, R. Lo, P. Dao, S.C. Sahinalp, M. Ester, L.J. Foster, F.S. Brinkman, PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes, Bioinformatics 26 (2010) 1608e1615. [17] Y. Saeys, I. Inza, P. Larranaga, A review of feature selection techniques in bioinformatics,, Bioinformatics 23 (2007) 2507e2517. [18] C. Fernandez-Lozano, E. Fernandez-Blanco, K. Dave, N. Pedreira, M. Gestal, J. Dorado, C.R. Munteanu, Improving enzyme regulatory protein classification by means of SVM-RFE feature selection, Mol. Biosyst. 10 (2014) 1063e1071. [19] K.C. Chou, Some remarks on protein attribute prediction and pseudo amino acid composition, J. Theor. Biol. 273 (2011) 236e247. [20] W. Chen, P.M. Feng, H. Lin, K.C. Chou, iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition, Nucleic Acids Res. 41 (2013) e68. [21] J.L. Min, X. Xiao, K.C. Chou, iEzy-drug: a web server for identifying the interaction between enzymes and drugs in cellular networking, Biomed. Res. Int. 2013 (2013) 701317. [22] Y. Xu, X.J. Shao, L.Y. Wu, N.Y. Deng, K.C. Chou, iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitro- sylation sites in proteins, Peer J. 1 (2013) e171. [23] X. Xiao, J.L. Min, P. Wang, K.C. Chou, iCDI-PseFpt: identify the channel-drug interaction in cellular networking with PseAAC and molecular fingerprints, J. Theor. Biol. 337 (2013) 71e79. 
[24] Y.N. Fan, X. Xiao, J.L. Min, K.C. Chou, iNR-Drug: predicting the interaction of drugs with nuclear receptors in cellular networking, Int. J. Mol. Sci. 15 (2014) 4915–4937.
[25] S.H. Guo, E.Z. Deng, L.Q. Xu, H. Ding, H. Lin, W. Chen, K.C. Chou, iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition, Bioinformatics 30 (2014) 1522–1529.
[26] B. Liu, D. Zhang, R. Xu, J. Xu, X. Wang, Q. Chen, Q. Dong, K.C. Chou, Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection, Bioinformatics 30 (2014) 472–479.
[27] W.R. Qiu, X. Xiao, K.C. Chou, iRSpot-TNCPseAAC: identify recombination spots with trinucleotide composition and pseudo amino acid components, Int. J. Mol. Sci. 15 (2014) 1746–1766.
[28] G.L. Fan, Q.Z. Li, Predict mycobacterial proteins subcellular locations by incorporating pseudo-average chemical shift into the general form of Chou's pseudo amino acid composition, J. Theor. Biol. 304 (2012) 88–95.
[29] K.C. Chou, H.B. Shen, Recent progress in protein subcellular location prediction, Anal. Biochem. 370 (2007) 1–16.
[30] W.Z. Lin, J.A. Fang, X. Xiao, K.C. Chou, iLoc-Animal: a multi-label learning classifier for predicting subcellular localization of animal proteins, Mol. Biosyst. 9 (2013) 634–644.
[31] Z.C. Wu, X. Xiao, K.C. Chou, iLoc-Gpos: a multi-layer classifier for predicting the subcellular localization of singleplex and multiplex gram-positive bacterial proteins, Protein Pept. Lett. 19 (2012) 4–14.
[32] K.C. Chou, Prediction of protein cellular attributes using pseudo-amino acid composition, Proteins 43 (2001) 246–255.
[33] S.X. Lin, J. Lapointe, Theoretical and experimental biology in one, J. Biomed. Sci. Eng. 6 (2013) 435–442.
[34] H. Mohabatkar, Prediction of cyclin proteins using Chou's pseudo amino acid composition, Protein Pept. Lett. 17 (2010) 1207–1214.
[35] M. Esmaeili, H. Mohabatkar, S. Mohsenzadeh, Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses, J. Theor. Biol. 263 (2010) 203–209.
[36] H. Lin, The modified Mahalanobis discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition, J. Theor. Biol. 252 (2008) 350–356.
[37] Y.H. Zeng, Y.Z. Guo, R.Q. Xiao, L. Yang, L.Z. Yu, M.L. Li, Using the augmented Chou's pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach, J. Theor. Biol. 259 (2009) 366–372.
[38] X.B. Zhou, C. Chen, Z.C. Li, X.Y. Zou, Using Chou's amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes, J. Theor. Biol. 248 (2007) 546–551.
[39] T.E. Karakasidis, D.N. Georgiou, J.J. Nieto, A. Torres, Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou's pseudo amino acid composition, J. Theor. Biol. 257 (2009) 17–26.
[40] H. Mohabatkar, M. Mohammad Beigi, A. Esmaeili, Prediction of GABA(A) receptor proteins using the concept of Chou's pseudo-amino acid composition and support vector machine, J. Theor. Biol. 281 (2011) 18–23.
[41] L. Yu, Y. Guo, Y. Li, G. Li, M. Li, J. Luo, W. Xiong, W. Qin, SecretP: identifying bacterial secreted proteins by fusing new features into Chou's pseudo-amino acid composition, J. Theor. Biol. 267 (2010) 1–6.
[42] C. Huang, J. Yuan, Using radial basis function on the general form of Chou's pseudo amino acid composition and PSSM to predict subcellular locations of proteins with both single and multiple sites, Biosystems 113 (2013) 50–57.
[43] S. Wan, M.W. Mak, S.Y. Kung, GOASVM: a subcellular location predictor by incorporating term-frequency gene ontology into the general form of Chou's pseudo-amino acid composition, J. Theor. Biol. 323 (2013) 40–48.
[44] S. Mei, Predicting plant protein subcellular multi-localization by Chou's PseAAC formulation based multi-label homolog knowledge transfer learning, J. Theor. Biol. 310 (2012) 80–87.
[45] K.C. Chou, Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology, Curr. Proteomics 6 (2009) 262–274.
[46] H.B. Shen, K.C. Chou, PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition, Anal. Biochem. 373 (2008) 386–388.
[47] P. Du, X. Wang, C. Xu, Y. Gao, PseAAC-Builder: a cross-platform stand-alone program for generating various special Chou's pseudo-amino acid compositions, Anal. Biochem. 425 (2012) 117–119.
[48] D.S. Cao, Q.S. Xu, Y.Z. Liang, propy: a tool to generate various modes of Chou's PseAAC, Bioinformatics 29 (2013) 960–962.
[49] P. Du, S. Gu, Y. Jiao, PseAAC-General: fast building various modes of general form of Chou's pseudo-amino acid composition for large-scale protein datasets, Int. J. Mol. Sci. 15 (2014) 3495–3506.
[50] H.B. Rao, F. Zhu, G.B. Yang, Z.R. Li, Y.Z. Chen, Update of PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence, Nucleic Acids Res. 39 (2011) W385–W390.
[51] A.N. Sarangi, M. Lohani, R. Aggarwal, Prediction of essential proteins in prokaryotes by incorporating various physico-chemical features into the general form of Chou's pseudo amino acid composition, Protein Pept. Lett. 20 (2013) 781–795.
[52] Z.R. Li, H.H. Lin, L.Y. Han, L. Jiang, X. Chen, Y.Z. Chen, PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence, Nucleic Acids Res. 34 (2006) W32–W37.
[53] L. Li, X. Cui, S. Yu, Y. Zhang, Z. Luo, H. Yang, Y. Zhou, X. Zheng, PSSP-RFE: accurate prediction of protein structural class by recursive feature extraction from PSI-BLAST profile, physical-chemical property and functional annotations, PLoS One 9 (2014) e92863.
[54] E.E. Bron, R.M. Steketee, G.C. Houston, R.A. Oliver, H.C. Achterberg, M. Loog, J.C. van Swieten, A. Hammers, W.J. Niessen, M. Smits, S. Klein, Diagnostic classification of arterial spin labeling and structural MRI in presenile early stage dementia, Hum. Brain Mapp. (2014) [Epub ahead of print].
[55] L. Kong, L. Zhang, J. Lv, Accurate prediction of protein structural classes by incorporating predicted secondary structure information into the general form of Chou's pseudo amino acid composition, J. Theor. Biol. 344 (2014) 12–18.
[56] X. Wei, J. Ai, Y. Deng, X. Guan, D.R. Johnson, C.Y. Ang, C. Zhang, E.J. Perkins, Identification of biomarkers that distinguish chemical contaminants based on gene expression profiles, BMC Genom. 15 (2014) 248.
[57] B. Panwar, A. Arora, G.P. Raghava, Prediction and classification of ncRNAs using structural information, BMC Genom. 15 (2014) 127.
[58] L. Chen, J. Lu, N. Zhang, T. Huang, Y.D. Cai, A hybrid method for prediction and repositioning of drug anatomical therapeutic chemical classes, Mol. Biosyst. 10 (2014) 868–877.
[59] R. Das Roy, D. Dash, Selection of relevant features from amino acids enables development of robust classifiers, Amino Acids 46 (2014) 1343–1351.
[60] Y. Yang, B. Chen, G. Tan, M. Vihinen, B. Shen, Structure-based prediction of the effects of a missense variant on protein stability, Amino Acids 44 (2013) 847–855.
[61] Y. Xu, J. Ding, L.Y. Wu, K.C. Chou, iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition, PLoS One 8 (2013) e55844.
[62] Y. Xu, X. Wen, X.J. Shao, iHyd-PseAAC: predicting hydroxyproline and hydroxylysine in proteins by incorporating dipeptide position-specific propensity into pseudo amino acid composition, Int. J. Mol. Sci. 15 (2014) 7594–7610.
[63] K.C. Chou, Z.C. Wu, X. Xiao, iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins, PLoS One 6 (2011) e18258.
[64] L. Chen, W.M. Zeng, Y.D. Cai, K.Y. Feng, K.C. Chou, Predicting anatomical therapeutic chemical (ATC) classification of drugs by integrating chemical-chemical interactions and similarities, PLoS One 7 (2012) e35254.
[65] X. Xiao, P. Wang, W.Z. Lin, J.H. Jia, K.C. Chou, iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types, Anal. Biochem. 436 (2013) 168–177.
[66] K.C. Chou, Some remarks on predicting multi-label attributes in molecular biosystems, Mol. Biosyst. 9 (2013) 1092–1100.
[67] Z.C. Wu, X. Xiao, K.C. Chou, iLoc-Plant: a multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites, Mol. Biosyst. 7 (2011) 3287–3297.
[68] X. Xiao, Z.C. Wu, K.C. Chou, iLoc-Virus: a multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites, J. Theor. Biol. 284 (2011) 42–51.
[69] K.C. Chou, H.B. Shen, Review: recent advances in developing web-servers for predicting protein attributes, Nat. Sci. 2 (2009) 63–92.
[70] H.B. Shen, K.C. Chou, Gneg-mPLoc: a top-down strategy to enhance the quality of predicting subcellular localization of gram-negative bacterial proteins, J. Theor. Biol. 264 (2010) 326–333.
[71] K.C. Chou, H.B. Shen, Gpos-mPLoc: a top-down approach to improve the quality of predicting subcellular localization of gram-positive bacterial proteins, Protein Pept. Lett. 16 (2009) 1478–1484.