Journal of Zhejiang University (Science Edition)  2022, Vol. 49 Issue (3): 280-286    DOI: 10.3785/j.issn.1008-9497.2022.03.003
Mathematics and Computer Science
A machine learning study on Gloeobacter violaceus rhodopsin spectral properties
Lili JIA, Tingting SUN
School of Sciences, Zhejiang University of Science and Technology, Hangzhou 310023, China
Abstract:

In recent years, artificial intelligence techniques such as machine learning have been applied to protein engineering, where they show distinct advantages in studies of protein structure, function prediction, and catalytic activity. When the protein structure is unknown, combining protein sequences and functional properties with machine learning is a new research direction. In this paper, based on the innovative sequence-activity relationship (ISAR) algorithm, the amino acid sequences of a Gloeobacter violaceus rhodopsin (GR) mutant library and their maximum absorption wavelengths are modeled by machine learning. The method digitizes each protein amino acid sequence, preprocesses it with the fast Fourier transform (FFT), and then performs partial least squares regression (PLSR) modeling; it can fit a best model even when the data set is small. With the best index, LEVM760106, the model has a coefficient of determination R2 of 0.944 and a mean square error E of 11.64. When the wavelet transform is used for preprocessing instead, R2 is also close to 0.944, but E is greater than 11.64, so the result is not as good as with FFT preprocessing. The method addresses the mathematical modeling of the relationship between protein sequence and functional properties, and can support the prediction of better mutants in protein engineering.

Key words: machine learning    digital signal processing (DSP)    spectral characteristics
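
The regression step named in the abstract, PLSR, can be illustrated with a minimal single-response PLS (PLS1, NIPALS-style) sketch in plain Python. The function names `pls1_fit` and `pls1_predict`, the toy data, and the choice of two components are assumptions made for illustration; the page does not state which implementation or how many components the authors used.

```python
import math

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model (NIPALS-style deflation)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Work on centered copies of the data.
    Xr = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yr = [v - y_mean for v in y]
    comps = []
    for _ in range(n_components):
        # Weight vector: direction in X most covariant with the residual y.
        w = [sum(Xr[i][j] * yr[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(Xr[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        p_load = [sum(Xr[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(yr[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next component.
        Xr = [[Xr[i][j] - t[i] * p_load[j] for j in range(p)] for i in range(n)]
        yr = [yr[i] - q * t[i] for i in range(n)]
        comps.append((w, p_load, q))
    return x_mean, y_mean, comps

def pls1_predict(model, x):
    """Predict the response for one feature vector x."""
    x_mean, y_mean, comps = model
    xr = [x[j] - x_mean[j] for j in range(len(x))]
    y_hat = y_mean
    for w, p_load, q in comps:
        t = sum(a * b for a, b in zip(xr, w))
        y_hat += q * t
        xr = [a - t * b for a, b in zip(xr, p_load)]
    return y_hat

# Synthetic toy data (not from the paper): y = 2*x1 - x2 + 3.
X_toy = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8], [6, 4]]
y_toy = [2 * a - b + 3 for a, b in X_toy]
toy_model = pls1_fit(X_toy, y_toy, n_components=2)
```

In an ISAR-style workflow, each row of `X` would hold the spectral features of one variant and `y` its measured maximum absorption wavelength; the number of components would normally be chosen by cross-validation.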
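Received: 2021-03-02    Published: 2022-05-24
CLC:  Q 332
Fund: Natural Science Foundation of Zhejiang Province (LY17A040001)
Corresponding author: Tingting SUN     E-mail: tingtingsun@zust.edu.cn
About the author: Lili JIA (born 1993), ORCID: https://orcid.org/0000-0002-3215-5627, female, master's degree; research interests: machine learning and biostatistics.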

Cite this article:

Lili JIA, Tingting SUN. A machine learning study on Gloeobacter violaceus rhodopsin spectral properties[J]. Journal of Zhejiang University (Science Edition), 2022, 49(3): 280-286.

Link to this article:

https://www.zjujournals.com/sci/CN/10.3785/j.issn.1008-9497.2022.03.003        https://www.zjujournals.com/sci/CN/Y2022/V49/I3/280

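Fig. 1  Flowchart of the ISAR algorithm
Fig. 2  Different protein spectra obtained from the GR data with the ISAR algorithm
Fig. 3  Three proteins converted into protein spectra by FFT
Fig. 4  E values of 550 λmax models obtained with different parameters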
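Index | R2 | E | Index description | Ref.
LEVM760106 | 0.944 | 11.64 | van der Waals parameter R0 | 25
LEVM760102 | 0.858 | 17.85 | Distance between C-α and the side-chain centroid | 25
CEDJ970104 | 0.800 | 20.75 | Amino acid composition of intracellular proteins (percent) | 26
CHOC760104 | 0.483 | 26.35 | Proportion of residues 100% buried | 26
FINA910102 | 0.058 | 33.70 | Helix initiation parameter at positions i, i+1, i+2 | 27
Table 1  R2 and E for different indices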
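
Table 1 compares the indices by R2 and E. The page does not give the formulas, so the block below assumes the conventional definitions, with y_i the measured λmax of variant i, ŷ_i its cross-validated prediction, ȳ the mean of the measured values, and n the number of variants:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
E = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
```

Under these assumed definitions, a higher R2 and a lower E both indicate a better fit, which is consistent with the ordering of the rows in Table 1.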
Fig. 5  LOOCV prediction for GR and its mutants at R2 = 0.944
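Index | Data processing method | Cross-validation method | Number of samples | E | R2
LEVM760106 | FFT | LOOCV | 81 | 11.64 | 0.944
LEVM760106 | FFT | Ten-fold cross-validation | 81 | 14.76 | 0.908
LEVM760106 | Wavelet transform | LOOCV | 81 | 12.12 | 0.940
LEVM760107 | Wavelet transform | Ten-fold cross-validation | 81 | 17.35 | 0.871
Table 2  Validation results of different methods for GR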
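
Table 2 contrasts leave-one-out cross-validation (LOOCV) with ten-fold cross-validation on 81 samples. The sketch below shows how the two schemes differ only in the number of folds. The helper names, the interleaved (unshuffled) fold assignment, and the mean-value placeholder model are assumptions for illustration, not the authors' procedure.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train, test) index lists; n_folds == n_samples gives LOOCV."""
    # Interleaved assignment: sample i goes to fold i % n_folds.
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, test

def cross_validated_predictions(xs, ys, n_folds, fit, predict):
    """Predict every sample from a model fitted without that sample's fold."""
    preds = [None] * len(xs)
    for train, test in kfold_indices(len(xs), n_folds):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            preds[i] = predict(model, xs[i])
    return preds

# Placeholder model (hypothetical): always predicts the training mean.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def predict_mean(model, x):
    return model

loocv_splits = list(kfold_indices(81, 81))    # 81 models, 1 held-out sample each
tenfold_splits = list(kfold_indices(81, 10))  # 10 models, 8 or 9 held-out samples each
```

Any regression model with matching `fit` and `predict` callables could replace the placeholder; the cross-validated predictions are then what a score such as E or R2 would be computed from.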
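Amino acid | Value | Amino acid | Value
Alanine | 5.2 | Leucine | 7.0
Arginine | 6.0 | Lysine | 6.0
Asparagine | 5.0 | Methionine | 6.8
Aspartic acid | 5.0 | Phenylalanine | 7.1
Cysteine | 6.1 | Proline | 6.2
Glutamine | 6.0 | Serine | 4.9
Glutamic acid | 6.0 | Threonine | 5.0
Glycine | 4.2 | Tryptophan | 7.6
Histidine | 6.0 | Tyrosine | 7.1
Isoleucine | 7.0 | Valine | 6.4
Table 3  Numerical values of the 20 amino acids in index LEVM760106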
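
The values in Table 3 are what turns a sequence into a numerical signal before the Fourier step. A minimal sketch of that digitization follows, using the standard one-letter amino acid codes. The function names are hypothetical, the peptide is a made-up example, and a naive discrete Fourier transform stands in for the FFT (it yields the same coefficients, only more slowly). Whether the authors use the amplitude spectrum exactly as computed here is an assumption.

```python
import cmath

# LEVM760106 values keyed by one-letter code, transcribed from Table 3.
LEVM760106 = {
    "A": 5.2, "R": 6.0, "N": 5.0, "D": 5.0, "C": 6.1,
    "Q": 6.0, "E": 6.0, "G": 4.2, "H": 6.0, "I": 7.0,
    "L": 7.0, "K": 6.0, "M": 6.8, "F": 7.1, "P": 6.2,
    "S": 4.9, "T": 5.0, "W": 7.6, "Y": 7.1, "V": 6.4,
}

def encode(sequence):
    """Map each residue of a one-letter sequence to its index value."""
    return [LEVM760106[aa] for aa in sequence.upper()]

def amplitude_spectrum(signal):
    """Magnitudes of the discrete Fourier transform of a numeric signal."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff))
    return spectrum

# Hypothetical peptide, for illustration only.
signal = encode("MLGW")
spectrum = amplitude_spectrum(signal)
```

Each variant's spectrum would then form one row of the feature matrix passed to the regression model.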
1 MUGGLETON S, KING R D, STERNBERG M J E. Protein secondary structure prediction using logic-based machine learning[J]. Protein Engineering, 1992, 5(7): 647-657. DOI:10.1093/protein/5.7.647
2 YI H W, TANG X F. Research progress on the prediction of protein stability based on amino acid sequence and simulated structure[J]. Biotechnology Bulletin, 2017, 33(4): 83-89. DOI:10.13560/j.cnki.biotech.bull.1985.2017.04.011
3 CHENG S P, TAN J J, MEN J R. Prediction of ncRNA-protein interactions based on machine learning methods[J]. Beijing Biomedical Engineering, 2019, 38(4): 353-359. DOI:10.3969/j.issn.1002-3208.2019.04.004
4 XU K K, HAN M F, HUANG C X, et al. Research progress of feature selection and machine learning methods for mass spectrometry-based protein biomarker discovery[J]. Chinese Journal of Biotechnology, 2019, 35(9): 1619-1632. DOI:10.13345/j.cjb.190064
5 HU R Y, ZHANG S Y, MENG H L, et al. Machine learning for synthetic biology: Methods and applications[J]. Chinese Science Bulletin, 2021, 66(3): 284-299. DOI:10.1360/TB-2020-0456
6 HAMMER S C, KNIGHT A M, ARNOLD F H. Design and evolution of enzymes for non-natural chemistry[J]. Current Opinion in Green and Sustainable Chemistry, 2017, 7: 23-30. DOI:10.1016/j.cogsc.2017.06.002
7 CHOI Y H, KIM J H, PARK B S, et al. Solubilization and iterative saturation mutagenesis of α1,3-fucosyltransferase from Helicobacter pylori to enhance its catalytic efficiency[J]. Biotechnology and Bioengineering, 2016, 113(8): 1666-1675. DOI:10.1002/bit.25944
8 QU G, ZHU T, JIANG Y Y, et al. Protein engineering: From directed evolution to computational design[J]. Chinese Journal of Biotechnology, 2019, 35(10): 1843-1856. DOI:10.13345/j.cjb.190221
9 JIANG Y Y, QU G, SUN Z T. Machine learning assisted enzyme directed evolution[J]. Journal of Biology, 2020, 37(4): 1-11. DOI:10.3969/j.issn.2095-1736.2020.04.001
10 MOSELEY L G. Introduction to machine learning[J]. Engineering Applications of Artificial Intelligence, 1988, 1(4): 334. DOI:10.1016/0952-1976(88)90057-7
11 CADET F, FONTAINE N, LI G Y, et al. A machine learning approach for reliable prediction of amino acid interactions and its application in the directed evolution of enantioselective enzymes[J]. Scientific Reports, 2018, 8(1): 16757. DOI:10.1038/s41598-018-35033-y
12 FONTAINE N, CADET F. Method and electronic system for predicting at least one fitness value of a protein, related computer program product: U.S. Patent Application 15/565,893[P]. 2018-04-05.
13 CADET F, FONTAINE N, VETRIVEL I, et al. Application of Fourier transform and proteochemometrics principles to protein engineering[J]. BMC Bioinformatics, 2018, 19(1): 382. DOI:10.1186/s12859-018-2407-8
14 FONTAINE N, CADET F, VETRIVEL I. Novel descriptors and digital signal processing-based method for protein sequence activity relationship study[J]. International Journal of Molecular Sciences, 2019, 20(22): 5640. DOI:10.3390/ijms20225640
15 OSTAFE R, FONTAINE N, FRANK D, et al. One-shot optimization of multiple enzyme parameters: Tailoring glucose oxidase for pH and electron mediators[J]. Biotechnology and Bioengineering, 2020, 117(1): 17-29. DOI:10.1002/bit.27169
16 BÉJÀ O, ARAVIND L, KOONIN E V, et al. Bacterial rhodopsin: Evidence for a new type of phototrophy in the sea[J]. Science, 2000, 289(5486): 1902-1906. DOI:10.1126/science.289.5486.1902
17 BROWN L S, JUNG K H. Bacteriorhodopsin-like proteins of eubacteria and fungi: The extent of conservation of the haloarchaeal proton-pumping mechanism[J]. Photochemical & Photobiological Sciences, 2006, 5(6): 538-546. DOI:10.1039/b514537f
18 CLAASSENS N J, VOLPERS M, SANTOS V A P M D, et al. Potential of proton-pumping rhodopsins: Engineering photosystems into microorganisms[J]. Trends in Biotechnology, 2013, 31(11): 633-642. DOI:10.1016/j.tibtech.2013.08.006
19 ENGQVIST M K M, MCISAAC R S, DOLLINGER P, et al. Directed evolution of Gloeobacter violaceus rhodopsin spectral properties[J]. Journal of Molecular Biology, 2015, 427(1): 205-220. DOI:10.1016/j.jmb.2014.06.015
20 COOLEY J W, TUKEY J W. An algorithm for the machine calculation of complex Fourier series[J]. Mathematics of Computation, 1965, 19(90): 297-301. DOI:10.1090/s0025-5718-1965-0178586-1
21 KAWASHIMA S, POKAROWSKI P, POKAROWSKA M, et al. AAindex: Amino acid index database, progress report 2008[J]. Nucleic Acids Research, 2008, 36(Database): D202-D205. DOI:10.1093/nar/gkm998
22 BENSON D C. Digital signal processing methods for biosequence comparison[J]. Nucleic Acids Research, 1990, 18(10): 3001-3006. DOI:10.1093/nar/18.10.3001
23 YANG K K, WU Z, BEDBROOK C N, et al. Learned protein embeddings for machine learning[J]. Bioinformatics, 2018, 34(15): 2642-2648. DOI:10.1093/bioinformatics/bty178
24 NWANKWO N, SEKER H. Digital signal processing techniques: Calculating biological functionalities[J]. Journal of Proteomics & Bioinformatics, 2011, 4(12): 260-268. DOI:10.4172/jpb.1000199
25 LEVITT M. A simplified representation of protein conformations for rapid simulation of protein folding[J]. Journal of Molecular Biology, 1976, 104(1): 59-107. DOI:10.1016/0022-2836(76)90004-8
26 CEDANO J, ALOY P, PÉREZ-PONS J A, et al. Relation between amino acid composition and cellular location of proteins[J]. Journal of Molecular Biology, 1997, 266(3): 594-600. DOI:10.1006/jmbi.1996.0804