Journal of Zhejiang University (Science Edition)  2022, Vol. 49 Issue (3): 280-286    DOI: 10.3785/j.issn.1008-9497.2022.03.003
Mathematics and Computer Science     
A machine learning study on Gloeobacter violaceus rhodopsin spectral properties
Lili JIA, Tingting SUN
School of Sciences College, Zhejiang University of Science and Technology, Hangzhou 310023, China

Abstract  

In recent years, artificial intelligence techniques such as machine learning have been applied to protein engineering and have shown unique advantages in studies of protein structure, function prediction, and catalytic activity. When the protein structure is unknown, combining protein sequences and functional properties with machine learning is a new research direction. In this paper, the innovative sequence-activity relationship (ISAR) method is used to model, by machine learning, the relationship between a mutant library of Gloeobacter violaceus rhodopsin (GR) and the maximum absorption wavelength of the spectrum. The method can fit a good model even when the data set is small. It digitizes the protein amino acid sequence, preprocesses it with the fast Fourier transform (FFT), and then performs partial least squares regression (PLSR) modeling, which yields the best model relating the amino acid sequences of the rhodopsin mutants to the maximum absorption wavelength. With the best index, LEVM760106, the coefficient of determination R² is 0.944 and the mean square error E is 11.64. In contrast, when the wavelet transform is used to preprocess the data, R² is also close to 0.944 but E is greater than 11.64, so the result is not as good as with FFT preprocessing. These results show that the method effectively addresses the mathematical modeling of the relationship between protein sequence and functional properties, and that it can support the prediction of better mutants in later protein engineering.
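The regression step described in the abstract can be illustrated with a minimal NumPy sketch of single-response partial least squares (PLS1, in its NIPALS form). This is an illustration under stated assumptions, not the authors' implementation: the function names `pls1_fit` and `pls1_predict`, the number of components, and the synthetic data are all hypothetical. In the setting of the paper, each row of `X` would hold the FFT spectrum of one encoded GR variant and `y` the measured maximum absorption wavelengths.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS and return (coef, intercept) of the linear model."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean      # centered working copies
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                    # direction of maximal covariance with y
        norm = np.linalg.norm(w)
        if norm < 1e-12:                 # nothing left to explain
            break
        w = w / norm
        t = Xk @ w                       # scores
        tt = t @ t
        p = Xk.T @ t / tt                # X loadings
        c = (yk @ t) / tt                # y loading
        Xk = Xk - np.outer(t, p)         # deflate X
        yk = yk - c * t                  # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

def pls1_predict(X, coef, intercept):
    """Predict responses for the rows of X."""
    return X @ coef + intercept

# Synthetic stand-in data (assumption): 30 "variants", 4 spectral features.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
true_coef = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_coef + 540.0                # noise-free linear response

coef, intercept = pls1_fit(X, y, n_components=4)
y_hat = pls1_predict(X, coef, intercept)
```

With as many components as features, PLS1 reproduces the ordinary least squares fit on this noise-free toy data. With fewer components it gives a lower-dimensional approximation, which is the usual reason to prefer PLSR when samples are few relative to the number of spectral coefficients.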



Key words: machine learning; digital signal processing (DSP); spectral characteristics
Received: 02 March 2021      Published: 24 May 2022
CLC:  Q 332  
Corresponding Authors: Tingting SUN     E-mail: tingtingsun@zust.edu.cn
Cite this article:

Lili JIA, Tingting SUN. A machine learning study on Gloeobacter violaceus rhodopsin spectral properties. Journal of Zhejiang University (Science Edition), 2022, 49(3): 280-286.

URL:

https://www.zjujournals.com/sci/EN/Y2022/V49/I3/280


Fig. 1 Workflow of the ISAR methodology
Fig. 2 Different protein spectra obtained from the GR data by the ISAR method
Fig. 3 Three proteins transformed into protein spectra by FFT
Fig. 4 The E of λmax 550 models according to different parameters
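Index | R² | E | Index description | Ref.
LEVM760106 | 0.944 | 11.64 | van der Waals parameter R0 | [25]
LEVM760102 | 0.858 | 17.85 | Distance between C-alpha and centroid of side chain | [25]
CEDJ970104 | 0.800 | 20.75 | Composition of amino acids in intracellular proteins (percent) | [26]
CHOC760104 | 0.483 | 26.35 | Proportion of residues 100% buried | [26]
FINA910102 | 0.058 | 33.70 | Helix initiation parameter at position i, i+1, i+2 | [27]
Table 1 The R² and E under different indexes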
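Table 1 ranks the indexes by R² and E. The page does not give the formulas, so the sketch below assumes the common definitions: E as the mean of the squared prediction errors, and R² as one minus the ratio of the residual sum of squares to the total sum of squares. The function names and the three wavelength values are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Mean of squared differences between measured and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical measured and predicted wavelengths (nm), for illustration only.
measured = [540.0, 550.0, 560.0]
predicted = [542.0, 549.0, 557.0]

mse = mean_squared_error(measured, predicted)   # (4 + 1 + 9) / 3
r2 = r_squared(measured, predicted)             # 1 - 14 / 200
```

Under these definitions a higher R² and a lower E both indicate a better fit, which matches the ordering of the rows in Table 1.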
Fig. 5 Prediction of GR and mutants by LOOCV when R² = 0.944
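Index | Data processing method | Cross-validation method | Number of samples | E | R²
LEVM760106 | FFT | LOOCV | 81 | 11.64 | 0.944
LEVM760106 | FFT | 10-fold cross-validation | 81 | 14.76 | 0.908
LEVM760106 | Wavelet transform | LOOCV | 81 | 12.12 | 0.940
LEVM760107 | Wavelet transform | 10-fold cross-validation | 81 | 17.35 | 0.871
Table 2 Verification results of GR by different methods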
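Table 2 compares leave-one-out cross-validation (LOOCV) with 10-fold cross-validation on 81 samples. As a rough sketch of how the two partitions differ, the helper below (a hypothetical `kfold_indices`, with an arbitrary shuffling seed) splits sample indices into test folds. LOOCV is the special case in which the number of folds equals the number of samples.

```python
import numpy as np

def kfold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and split them into n_folds test folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_folds)

def train_test_pairs(n_samples, folds):
    """Yield (train_idx, test_idx) for each test fold."""
    all_idx = np.arange(n_samples)
    for test_idx in folds:
        # Everything not held out in this fold is used for training.
        yield np.setdiff1d(all_idx, test_idx), test_idx

n = 81                                    # sample count reported in Table 2
loocv_folds = kfold_indices(n, n)         # one sample per fold
tenfold_folds = kfold_indices(n, 10)      # folds of 8 or 9 samples

loocv_train_sizes = {len(tr) for tr, _ in train_test_pairs(n, loocv_folds)}
tenfold_test_sizes = sorted(len(te) for _, te in train_test_pairs(n, tenfold_folds))
```

Each LOOCV model is trained on 80 variants and tested on the remaining one, while each 10-fold model is trained on 72 or 73 variants. The smaller training sets may partly explain the larger E reported for 10-fold validation in Table 2, although the page itself does not discuss the cause.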
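Amino acid | Value | Amino acid | Value
Alanine | 5.2 | Leucine | 7.0
Arginine | 6.0 | Lysine | 6.0
Asparagine | 5.0 | Methionine | 6.8
Aspartic acid | 5.0 | Phenylalanine | 7.1
Cysteine | 6.1 | Proline | 6.2
Glutamine | 6.0 | Serine | 4.9
Glutamic acid | 6.0 | Threonine | 5.0
Glycine | 4.2 | Tryptophan | 7.6
Histidine | 6.0 | Tyrosine | 7.1
Isoleucine | 7.0 | Valine | 6.4
Table 3 Values of the 20 amino acids in index LEVM760106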
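Using the values in Table 3, the digitization and FFT preprocessing steps might look like the following minimal sketch. The one-letter codes, the helper names `encode` and `protein_spectrum`, and the two 12-residue toy sequences are illustrative assumptions and are not taken from the GR data set. The published method may also normalize or pad the signal, which is omitted here.

```python
import numpy as np

# LEVM760106 values as listed in Table 3, keyed by one-letter amino acid code.
LEVM760106 = {
    "A": 5.2, "R": 6.0, "N": 5.0, "D": 5.0, "C": 6.1,
    "Q": 6.0, "E": 6.0, "G": 4.2, "H": 6.0, "I": 7.0,
    "L": 7.0, "K": 6.0, "M": 6.8, "F": 7.1, "P": 6.2,
    "S": 4.9, "T": 5.0, "W": 7.6, "Y": 7.1, "V": 6.4,
}

def encode(sequence, index=LEVM760106):
    """Replace each residue by its numerical index value."""
    return np.array([index[aa] for aa in sequence], dtype=float)

def protein_spectrum(sequence, index=LEVM760106):
    """Amplitude spectrum of the encoded sequence (real-input FFT)."""
    return np.abs(np.fft.rfft(encode(sequence, index)))

# Toy sequences (assumption): a made-up fragment and a single-point mutant.
wild_type = "ACDEFGHIKLMN"
mutant = "ACDEWGHIKLMN"      # F -> W at position 5

spectrum_wt = protein_spectrum(wild_type)
spectrum_mut = protein_spectrum(mutant)
```

A single substitution changes one sample of the encoded signal, and that change spreads over every frequency of the spectrum. Spectra like these, one per variant, would form the rows of the feature matrix passed to the regression step.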
[1]   MUGGLETON S, KING R D, STERNBERG M J E. Protein secondary structure prediction using logic-based machine learning[J]. Protein Engineering, 1992, 5(7): 647-657. DOI:10.1093/protein/5.7.647
[2]   YI H W, TANG X F. Research progress on the prediction of protein stability based on amino acid sequence and simulated structure[J]. Biotechnology Bulletin, 2017, 33(4): 83-89. DOI:10.13560/j.cnki.biotech.bull.1985.2017.04.011
[3]   CHENG S P, TAN J J, MEN J R. Prediction of ncRNA-protein interactions based on machine learning methods[J]. Beijing Biomedical Engineering, 2019, 38(4): 353-359. DOI:10.3969/j.issn.1002-3208.2019.04.004
[4]   XU K K, HAN M F, HUANG C X, et al. Research progress of feature selection and machine learning methods for mass spectrometry-based protein biomarker discovery[J]. Chinese Journal of Biotechnology, 2019, 35(9): 1619-1632. DOI:10.13345/j.cjb.190064
[5]   HU R Y, ZHANG S Y, MENG H L, et al. Machine learning for synthetic biology: Methods and applications[J]. Chinese Science Bulletin, 2021, 66(3): 284-299. DOI:10.1360/TB-2020-0456
[6]   HAMMER S C, KNIGHT A M, ARNOLD F H. Design and evolution of enzymes for non-natural chemistry[J]. Current Opinion in Green and Sustainable Chemistry, 2017, 7: 23-30. DOI:10.1016/j.cogsc.2017.06.002
[7]   CHOI Y H, KIM J H, PARK B S, et al. Solubilization and iterative saturation mutagenesis of α1,3-fucosyltransferase from Helicobacter pylori to enhance its catalytic efficiency[J]. Biotechnology and Bioengineering, 2016, 113(8): 1666-1675. DOI:10.1002/bit.25944
[8]   QU G, ZHU T, JIANG Y Y, et al. Protein engineering: From directed evolution to computational design[J]. Chinese Journal of Biotechnology, 2019, 35(10): 1843-1856. DOI:10.13345/j.cjb.190221
[9]   JIANG Y Y, QU G, SUN Z T. Machine learning assisted enzyme directed evolution[J]. Journal of Biology, 2020, 37(4): 1-11. DOI:10.3969/j.issn.2095-1736.2020.04.001
[10]   MOSELEY L G. Introduction to machine learning[J]. Engineering Applications of Artificial Intelligence, 1988, 1(4): 334. DOI:10.1016/0952-1976(88)90057-7
[11]   CADET F, FONTAINE N, LI G Y, et al. A machine learning approach for reliable prediction of amino acid interactions and its application in the directed evolution of enantioselective enzymes[J]. Scientific Reports, 2018, 8(1): 16757. DOI:10.1038/s41598-018-35033-y
[12]   FONTAINE N, CADET F. Method and electronic system for predicting at least one fitness value of a protein, related computer program product: U.S. Patent Application 15/565,893[P]. 2018-04-05.
[13]   CADET F, FONTAINE N, VETRIVEL I, et al. Application of Fourier transform and proteochemometrics principles to protein engineering[J]. BMC Bioinformatics, 2018, 19(1): 382. DOI:10.1186/s12859-018-2407-8
[14]   FONTAINE N, CADET F, VETRIVEL I. Novel descriptors and digital signal processing-based method for protein sequence activity relationship study[J]. International Journal of Molecular Sciences, 2019, 20(22): 5640. DOI:10.3390/ijms20225640
[15]   OSTAFE R, FONTAINE N, FRANK D, et al. One-shot optimization of multiple enzyme parameters: Tailoring glucose oxidase for pH and electron mediators[J]. Biotechnology and Bioengineering, 2020, 117(1): 17-29. DOI:10.1002/bit.27169
[16]   BÉJÀ O, ARAVIND L, KOONIN E V, et al. Bacterial rhodopsin: Evidence for a new type of phototrophy in the sea[J]. Science, 2000, 289(5486): 1902-1906. DOI:10.1126/science.289.5486.1902
[17]   BROWN L S, JUNG K H. Bacteriorhodopsin-like proteins of eubacteria and fungi: The extent of conservation of the haloarchaeal proton-pumping mechanism[J]. Photochemical & Photobiological Sciences, 2006, 5(6): 538-546. DOI:10.1039/b514537f
[18]   CLAASSENS N J, VOLPERS M, SANTOS V A P M D, et al. Potential of proton-pumping rhodopsins: Engineering photosystems into microorganisms[J]. Trends in Biotechnology, 2013, 31(11): 633-642. DOI:10.1016/j.tibtech.2013.08.006
[19]   ENGQVIST M K M, MCISAAC R S, DOLLINGER P, et al. Directed evolution of Gloeobacter violaceus rhodopsin spectral properties[J]. Journal of Molecular Biology, 2015, 427(1): 205-220. DOI:10.1016/j.jmb.2014.06.015
[20]   COOLEY J W, TUKEY J W. An algorithm for the machine calculation of complex Fourier series[J]. Mathematics of Computation, 1965, 19(90): 297-301. DOI:10.1090/s0025-5718-1965-0178586-1
[21]   KAWASHIMA S, POKAROWSKI P, POKAROWSKA M, et al. AAindex: Amino acid index database, progress report 2008[J]. Nucleic Acids Research, 2008, 36(Database): D202-D205. DOI:10.1093/nar/gkm998
[22]   BENSON D C. Digital signal processing methods for biosequence comparison[J]. Nucleic Acids Research, 1990, 18(10): 3001-3006. DOI:10.1093/nar/18.10.3001
[23]   YANG K K, WU Z, BEDBROOK C N, et al. Learned protein embeddings for machine learning[J]. Bioinformatics, 2018, 34(15): 2642-2648. DOI:10.1093/bioinformatics/bty178
[24]   NWANKWO N, SEKER H. Digital signal processing techniques: Calculating biological functionalities[J]. Journal of Proteomics & Bioinformatics, 2011, 4(12): 260-268. DOI:10.4172/jpb.1000199
[25]   LEVITT M. A simplified representation of protein conformations for rapid simulation of protein folding[J]. Journal of Molecular Biology, 1976, 104(1): 59-107. DOI:10.1016/0022-2836(76)90004-8
[26]   CEDANO J, ALOY P, PÉREZ-PONS J A, et al. Relation between amino acid composition and cellular location of proteins[J]. Journal of Molecular Biology, 1997, 266(3): 594-600. DOI:10.1006/jmbi.1996.0804