Implementation of machine learning models to determine the appropriate model for protein function prediction

Yekaterina Golenko; Aisulu Ismailova; Anargul Shaushenova; Zhazira Mutalova; Damir Dossalyanov; Aliya Ainagulova; Akgul Naizagarayeva

doi:10.15587/1729-4061.2022.263270

Authors

Yekaterina Golenko, S. Seifullin Kazakh Agrotechnical University, Kazakhstan, https://orcid.org/0000-0002-4643-4571
Aisulu Ismailova, S. Seifullin Kazakh Agrotechnical University, Kazakhstan, https://orcid.org/0000-0002-3164-3688
Anargul Shaushenova, S. Seifullin Kazakh Agrotechnical University, Kazakhstan, https://orcid.org/0000-0002-3164-3688
Zhazira Mutalova, Zhangir khan West Kazakhstan Agrarian Technical University, Kazakhstan, https://orcid.org/0000-0001-9912-5978
Damir Dossalyanov, Narxoz University, Kazakhstan, https://orcid.org/0000-0003-4872-2822
Aliya Ainagulova, S. Seifullin Kazakh Agrotechnical University, Kazakhstan, https://orcid.org/0000-0003-4872-2822
Akgul Naizagarayeva, S. Seifullin Kazakh Agrotechnical University, Kazakhstan, https://orcid.org/0000-0002-3616-0925

Keywords:

protein function prediction, classification, neural networks, ProtCNN, bidirectional long short-term memory (BiLSTM)

Abstract

Predicting protein function is a crucial part of genome annotation and can help solve a wide range of biological problems. Many methods are available for this task. However, apart from sequence, most features are difficult to obtain or unavailable for many proteins, which limits the scope of these methods. In addition, the performance of sequence-based function prediction methods is often lower than that of methods that combine multiple features, and protein function prediction can be time-consuming. Recent advances in this field are associated with the development of machine learning, which has shown great progress in predicting protein functions. Even so, most protein sequences today still have the status of "uncharacterized" or "putative".

Assessing how accurately protein functions are identified is therefore an urgent task for machine learning approaches to protein function prediction. In this study, the performance of two popular function prediction algorithms (ProtCNN and BiLSTM) was assessed from two perspectives, and the procedures for building these models were described.

In the study of Pfam families, ProtCNN achieved an accuracy of 0.988 and the bidirectional LSTM an accuracy of 0.9506. Using the Pfam dataset increased classification accuracy because of its large training set: prediction quality improves as the amount of training data grows.
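
As a minimal illustration of the metric behind these figures, the sketch below computes classification accuracy as the fraction of sequences assigned to the correct family. It assumes that accuracy here means exact-match (top-1) accuracy; the function name and the toy Pfam-style labels are hypothetical and are not taken from the study.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Return the fraction of sequences whose predicted family equals the true family."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical toy labels (Pfam-style accessions), not data from the study.
true_families = ["PF00001", "PF00002", "PF00003", "PF00001"]
pred_families = ["PF00001", "PF00002", "PF00001", "PF00001"]

acc = accuracy(true_families, pred_families)
# Print the value both as a fraction and as a percentage.
print(f"accuracy = {acc:.4f} ({acc:.1%})")
```

On this scale, a value of 0.988 corresponds to 98.8 % of sequences classified correctly.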

The study demonstrated that machine learning algorithms can be an effective tool for building protein function prediction models. In particular, a CNN can be adapted as an accurate tool for annotating protein functions when large datasets are available.

Author Biographies

Yekaterina Golenko, S. Seifullin Kazakh Agrotechnical University

Master of Science in Engineering

Department of Information Systems

Aisulu Ismailova, S. Seifullin Kazakh Agrotechnical University

PhD

Department of Information Systems

Anargul Shaushenova, S. Seifullin Kazakh Agrotechnical University

Candidate of Technical Sciences

Department of Information Systems

Zhazira Mutalova, Zhangir khan West Kazakhstan Agrarian Technical University

Master of Technical Sciences

Higher School of Information Technologies

Damir Dossalyanov, Narxoz University

PhD

Department of Public and Local Management

Aliya Ainagulova, S. Seifullin Kazakh Agrotechnical University

Candidate of Technical Sciences

Akgul Naizagarayeva, S. Seifullin Kazakh Agrotechnical University

Master of Science in Engineering

Department of Information Systems
