Journal of Zhejiang University (Engineering Science)  2019, Vol. 53 Issue (5): 819-828    DOI: 10.3785/j.issn.1008-973X.2019.05.001
Computer and Control Engineering
Large-scale empirical study on machine learning related questions on Stack Overflow
Zhi-yuan WAN1, Jia-heng TAO2, Jia-kun LIANG2, Zhen-gong CAI2,*, Cheng CHANG1, Lin QIAO3, Qiao-ni ZHOU3
1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
2. College of Software Technology, Zhejiang University, Ningbo 315048, China
3. Information and Communication Branch, State Grid Liaoning Electric Power Supply Co. Ltd, Shenyang 110006, China
Abstract:

To investigate the distribution and trends of machine learning related topics, 60 028 machine learning related question posts were extracted, by tag filtering, from more than 41.78 million posts on the online Q&A website Stack Overflow. Counting the discussion volume of each machine learning platform in the extracted posts showed that Scikit-learn, TensorFlow and Keras were the three most frequently discussed platforms, together accounting for 58% of the total discussion volume. To further analyze the discussion topics, latent Dirichlet allocation (LDA) topic models were trained. A progressive search method for the number of topics in adaptive LDA was proposed, which uses the topic coherence coefficient to evaluate the model output and select the optimal number of topics. Nine discussion topics were discovered, which fell into three categories: code-related, model-related and theory-related. The popularity and answering difficulty of the topics were then analyzed based on the view counts and comment counts of their question posts.

Key words: empirical research    machine learning    Stack Overflow    latent Dirichlet allocation (LDA)    topic coherence
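Received: 2018-10-23    Published: 2019-05-17
CLC:  TP 311
Corresponding author: Zhen-gong CAI     E-mail: wanzhiyuan@zju.edu.cn;cstcaizg@zju.edu.cn
About the author: Zhi-yuan WAN (1984-), female, Ph.D., works on software engineering, software security and programming languages. orcid.org/0000-0001-7657-6653. E-mail: wanzhiyuan@zju.edu.cn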
Cite this article:

Zhi-yuan WAN, Jia-heng TAO, Jia-kun LIANG, Zhen-gong CAI, Cheng CHANG, Lin QIAO, Qiao-ni ZHOU. Large-scale empirical study on machine learning related questions on Stack Overflow [J]. Journal of Zhejiang University (Engineering Science), 2019, 53(5): 819-828.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2019.05.001
http://www.zjujournals.com/eng/CN/Y2019/V53/I5/819

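| Name | Description |
| --- | --- |
| Id | ID of the post |
| PostTypeId | Type of the post: 1 denotes a question post, 2 denotes an answer post; other types are not considered |
| AcceptedAnswerId | ID of the answer post corresponding to a question post (optional; present only when PostTypeId=1) |
| ParentId | ID of the question post corresponding to an answer post (optional; present only when PostTypeId=2) |
| CreationDate | Creation time of the post |
| Score | Average score given to the post by visitors |
| ViewCount | Total number of views of the post (optional; present only when PostTypeId=1) |
| Body | Body of the post, in HTML format |
| OwnerUserId | ID of the post owner (optional) |
| OwnerDisplayName | Display name of the post owner (optional) |
| LastEditorUserId | ID of the user who last edited the post (optional) |
| LastEditorDisplayName | Display name of the user who last edited the post (optional) |
| LastEditDate | Time of the last edit of the post (optional) |
| LastActivityDate | Time of the most recent activity on the post |
| Title | Title of the post (optional) |
| Tags | Tags of the post (optional) |
| AnswerCount | Number of answer posts of the post (optional; present only when PostTypeId=1) |
| CommentCount | Number of comments on the post |
| FavoriteCount | Number of users who favorited the post (optional; present only when PostTypeId=1) |
| ClosedDate | Time at which the post was closed (optional; present only when the post is closed) |

Table 1  Metadata fields contained in a post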
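To make the use of these fields concrete, here is a minimal sketch of tag-based filtering of question posts. It rests on assumptions not stated in this section: that posts are stored as `<row>` elements with the attributes of Table 1 (as in the public Stack Overflow data dump), and that the `Tags` attribute has the form `<tag1><tag2>`. The sample rows and the `ML_TAGS` set are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-row sample shaped like a posts file; attribute names
# follow Table 1, the values are invented for illustration.
SAMPLE_XML = """<posts>
  <row Id="1" PostTypeId="1" Tags="&lt;python&gt;&lt;machine-learning&gt;" Title="Q about SVM" ViewCount="120" CommentCount="2" />
  <row Id="2" PostTypeId="2" ParentId="1" CommentCount="0" />
  <row Id="3" PostTypeId="1" Tags="&lt;java&gt;&lt;swing&gt;" Title="Q about GUIs" ViewCount="40" CommentCount="1" />
</posts>"""

# Assumed filter set; the study derives its own tag set (see Table 2).
ML_TAGS = {"machine-learning", "svm", "classification"}


def parse_tags(raw):
    """Split a Tags attribute such as '<python><svm>' into a set of tag names."""
    return set(raw.replace("<", " ").replace(">", " ").split())


def ml_questions(xml_text, ml_tags):
    """Return attribute dicts of question posts carrying at least one ML tag."""
    root = ET.fromstring(xml_text)
    selected = []
    for row in root.iter("row"):
        if row.get("PostTypeId") != "1":  # keep question posts only
            continue
        if parse_tags(row.get("Tags", "")) & ml_tags:
            selected.append(dict(row.attrib))
    return selected


questions = ml_questions(SAMPLE_XML, ML_TAGS)
print([q["Id"] for q in questions])
```

For a file with tens of millions of rows, a streaming parser such as `xml.etree.ElementTree.iterparse` would be the more practical choice than loading the whole document at once.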
Fig. 1  Example of a machine learning related post on Stack Overflow
| T1, T2 | Tag set | Number of tags |
| --- | --- | --- |
| (0.150, 0.025) | classification, deep-learning, scikit-learn, neural-network, artificial-intelligence, machine-learning, keras, svm, conv-neural-network, weka | 10 |
| (0.150, 0.030) | classification, deep-learning, scikit-learn, neural-network, artificial-intelligence, machine-learning, keras, svm | 8 |
| (0.250, 0.025) | classification, deep-learning, scikit-learn, neural-network, machine-learning, svm, conv-neural-network, weka | 8 |
| (0.250, 0.030) | classification, deep-learning, scikit-learn, neural-network, machine-learning, svm | 6 |
| (0.350, 0.025) or (0.350, 0.030) | classification, machine-learning, svm | 3 |
| (>0.400, >0.080) | machine-learning | 1 |

Table 2  Tag sets related to "machine-learning" obtained under different threshold configurations
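The table shows that raising either threshold shrinks the tag set. This section does not define T1 and T2, so the sketch below is only one plausible reading, borrowed from common practice in tag-based Stack Overflow studies: T1 bounds the fraction of a tag's posts that also carry the seed tag, and T2 bounds the fraction of seed-tagged posts that carry the tag. The function name `expand_tags` and the toy corpus are hypothetical.

```python
from collections import Counter


def expand_tags(posts, seed_tag, t1, t2):
    """Select tags co-occurring with seed_tag whose two ratios reach t1 and t2.

    posts: list of tag sets, one per question post.
    Assumed ratios (not defined in this section):
      significance = posts with tag AND seed / posts with tag
      relevance    = posts with tag AND seed / posts with seed
    """
    seed_posts = [tags for tags in posts if seed_tag in tags]
    all_counts = Counter(tag for tags in posts for tag in tags)
    seed_counts = Counter(tag for tags in seed_posts for tag in tags)
    selected = set()
    for tag, in_seed in seed_counts.items():
        significance = in_seed / all_counts[tag]
        relevance = in_seed / len(seed_posts)
        if significance >= t1 and relevance >= t2:
            selected.add(tag)
    return selected


# Hypothetical toy corpus of tag sets.
posts = [
    {"machine-learning", "svm"},
    {"machine-learning", "svm", "python"},
    {"machine-learning", "python"},
    {"machine-learning", "keras"},
    {"python", "django"},
    {"python", "flask"},
    {"python", "pandas"},
    {"svm"},
]
loose = expand_tags(posts, "machine-learning", t1=0.3, t2=0.2)
strict = expand_tags(posts, "machine-learning", t1=0.7, t2=0.5)
print(sorted(loose), sorted(strict))
```

Under this reading the seed tag itself always survives, since both of its ratios equal 1. That matches the last row of the table, where only `machine-learning` remains.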
Fig. 2  Distribution of the coherence coefficient during the search for the optimal number of topics
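The search procedure itself is not spelled out in this section. As an illustration only, the following sketch shows one possible coarse-to-fine search over the topic number: score a coarse grid, then repeatedly narrow the window around the best candidate with a smaller step. The function `progressive_search`, its parameters and the toy scoring function are assumptions, not the authors' algorithm. In practice `coherence_of` would train an LDA model with `k` topics and return its coherence score.

```python
def progressive_search(coherence_of, low, high, coarse_step=5):
    """Coarse-to-fine search for the topic number with the highest coherence.

    coherence_of: callable mapping a topic number k to a coherence score,
                  e.g. by training an LDA model with k topics and scoring it.
    Returns the best k found and a dict of all evaluated scores.
    """
    scores = {}

    def score(k):
        if k not in scores:  # each topic number is evaluated only once
            scores[k] = coherence_of(k)
        return scores[k]

    step, lo, hi = coarse_step, low, high
    while True:
        candidates = range(lo, hi + 1, step)
        best = max(candidates, key=score)
        if step == 1:
            return best, scores
        # Narrow the window to the neighbourhood of the current best.
        lo = max(low, best - step)
        hi = min(high, best + step)
        step = max(1, step // 2)


# Hypothetical stand-in for a real coherence computation, peaking at k = 9.
def toy_coherence(k):
    return -(k - 9) ** 2


best_k, evaluated = progressive_search(toy_coherence, low=2, high=40)
print(best_k, len(evaluated))
```

The point of such a scheme is to avoid training one model per candidate topic number. It only finds the global optimum when the coherence curve has a single peak, which real coherence curves need not have.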
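| No. | Machine learning platform | Number of discussions | Share/% |
| --- | --- | --- | --- |
| 1 | Scikit-learn | 12 641 | 21.06 |
| 2 | TensorFlow | 11 524 | 19.20 |
| 3 | Keras | 10 842 | 18.06 |
| 4 | Caffe | 1 501 | 2.50 |
| 5 | Torch | 1 366 | 2.28 |
| 6 | Theano | 1 300 | 2.17 |
| 7 | Microsoft Cognitive Toolkit (CNTK) | 583 | 0.97 |
| 8 | AWS | 567 | 0.94 |
| 9 | XGBoost | 373 | 0.62 |
| 10 | Caret | 344 | 0.57 |
| 11 | Spark ML&MLlib | 339 | 0.56 |
| 12 | Watson | 286 | 0.48 |
| 13 | Tflearn | 224 | 0.37 |
| 14 | Mahout | 223 | 0.37 |
| 15 | H2o.ai | 209 | 0.35 |
| 16 | Blocks | 198 | 0.33 |
| 17 | MXNet | 155 | 0.26 |
| 18 | Deeplearning4j | 119 | 0.20 |
| 19 | SINGA | 98 | 0.16 |
| 20 | Accord.NET | 84 | 0.14 |

Table 3  Number of discussion posts for different machine learning platforms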
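The percentages in the table appear to be taken over the 60 028 extracted question posts. The short check below recomputes the shares of the three leading platforms under that assumption; it reproduces the tabulated values and the roughly 58% combined share quoted in the abstract.

```python
TOTAL_QUESTIONS = 60028  # ML-related question posts reported in the abstract

# Discussion counts of the three most-discussed platforms, from Table 3.
counts = {"Scikit-learn": 12641, "TensorFlow": 11524, "Keras": 10842}

# Share of each platform in percent, assuming TOTAL_QUESTIONS is the denominator.
shares = {name: round(100 * n / TOTAL_QUESTIONS, 2) for name, n in counts.items()}
top_three = round(100 * sum(counts.values()) / TOTAL_QUESTIONS, 2)
print(shares, top_three)
```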
Fig. 3  Share of discussion posts for different machine learning platforms
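| No. | Topic name (category) | Topic keywords (after stemming) |
| --- | --- | --- |
| 1 | Code execution (code-related) | use, error, code, run, tri, model, get, python, imag, train |
| 2 | Model training (model-related) | model, imag, train, use, kera, layer, data, tensorflow, size, batch |
| 3 | Dataset classification (model-related) | data, use, train, class, classifi, test, set, classif, model, dataset |
| 4 | Neural networks (model-related) | network, neural, output, layer, loss, function, train, use, weight, input |
| 5 | Model performance evaluation (model-related) | use, predict, featur, model, word, data, time, text, score, valu |
| 6 | Implementation details (theory-related) | use, gradient, function, implement, learn, x, vector, calcul, algorithm, comput |
| 7 | Programming and libraries (code-related) | file, line, py, packag, python, tensorflow, error, lib, c, site |
| 8 | Model input issues (model-related) | input, error, shape, kera, lstm, array, model, use, sequenc, tri |
| 9 | Learning algorithms (theory-related) | algorithm, use, would, like, learn, tree, valu, one, data, problem |

Table 4  Summarized names and top-10 most relevant keywords of the nine topics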
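| No. | Title | Description | Topic probability |
| --- | --- | --- | --- |
| 1 | Captcha Recognition using CNN not giving expected results | The asker tries to build a CAPTCHA recognition program with TensorFlow in Python, but running it does not give the expected results | Code execution: 0.46; Implementation details: 0.31 |
| 2 | How to create Training data for Text classification on 4 categories | The asker asks how to create training data for text classification into four categories | Model training: 0.42; Learning algorithms: 0.38 |
| 3 | How to use glmnet in R for classification problems | The asker wants to use glmnet in R for a classification problem and provides sample data and requirements | Dataset classification: 0.92 |
| 4 | Properly declaring input_shape for neural network in Keras? | The asker describes a problem with declaring the input_shape parameter of a neural network in Keras | Neural networks: 0.73; Code execution: 0.18 |
| 5 | R and PCA Explanation for machine learning | The asker wants an explanation of what happens during the machine learning process; using Alzheimer's disease data as an example, two predictive models are built and their accuracy is evaluated | Model performance evaluation: 0.39; Model training: 0.37 |
| 6 | Keras ImageDataGenerator for Cloud ML Engine | The asker tries to call Keras's flow_from_directory method to process images in cloud storage, but reading the data keeps failing | Implementation details: 0.77; Programming and libraries: 0.22 |
| 7 | ModuleNotFoundError: No module named 'keras' in AI DevCloud Intel | After installing the Keras and TensorFlow environment, the asker is still told that the Keras library is not installed when calling the ImageDataGenerator package | Programming and libraries: 0.90 |
| 8 | How to make this neural network traning function faster | The asker feeds in jagged-array variables, each processing stage takes at least 5 min, and the asker asks how to train faster | Model input issues: 0.49; Code execution: 0.26 |
| 9 | How to classify documents using Naive Bayes and Principal Component Analysis (C#,Accord.NET) | The asker wants to learn how to classify documents using naive Bayes and principal component analysis | Learning algorithms: 0.45; Model training: 0.21 |

Table 5  Randomly sampled question posts used to validate the topic names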
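Each sampled post carries a probability for one or two topics. A simple way to turn such distributions into per-topic post counts is to assign every post to its most probable topic. Whether the study counts posts this way is an assumption. The sketch uses the probabilities of the first four sampled posts above, keyed by the topic numbers of Table 4.

```python
from collections import Counter

# Topic probabilities of the first four sampled posts in Table 5,
# keyed by the topic numbers of Table 4.
post_topics = {
    1: {1: 0.46, 6: 0.31},
    2: {2: 0.42, 9: 0.38},
    3: {3: 0.92},
    4: {4: 0.73, 1: 0.18},
}


def dominant_topic(distribution):
    """Return the topic with the highest probability for one post."""
    return max(distribution, key=distribution.get)


# Map each post to its dominant topic, then count posts per topic.
assignment = {post: dominant_topic(dist) for post, dist in post_topics.items()}
posts_per_topic = Counter(assignment.values())
print(assignment, posts_per_topic)
```

Posts 2 and 5 in the table have two topics with nearly equal probability, so a hard assignment of this kind hides some ambiguity.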
Fig. 4  Number of question posts for each of the nine topics
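| Topic name (category) | V | C | F | S |
| --- | --- | --- | --- | --- |
| Learning algorithms (theory-related) | 1 509.89 | 1.92 | 4.38 | 2.64 |
| Dataset classification (model-related) | 1 281.56 | 1.61 | 2.38 | 1.46 |
| Implementation details (theory-related) | 1 230.19 | 1.42 | 2.74 | 1.92 |
| Code execution (code-related) | 1 150.17 | 1.72 | 2.11 | 1.38 |
| Neural networks (model-related) | 971.61 | 1.55 | 2.46 | 1.67 |
| Programming and libraries (code-related) | 968.26 | 1.71 | 1.38 | 0.93 |
| Model performance evaluation (model-related) | 912.53 | 1.46 | 2.56 | 1.42 |
| Model input issues (model-related) | 888.61 | 1.41 | 2.06 | 1.39 |
| Model training (model-related) | 836.63 | 1.25 | 2.30 | 1.51 |
| Mean | 1 088.27 | 1.56 | 2.49 | 1.59 |

Table 6  Topic popularity based on four evaluation metrics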
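The column symbols V, C, F and S are not spelled out in this section. A natural reading, assumed here, is that they are per-topic means of the `ViewCount`, `CommentCount`, `FavoriteCount` and `Score` fields of Table 1. The sketch below computes such means for a handful of invented posts; the records and topic labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical question posts:
# (topic, ViewCount, CommentCount, FavoriteCount, Score)
posts = [
    ("learning algorithms", 1800, 2, 5, 3),
    ("learning algorithms", 1200, 1, 3, 2),
    ("model training", 900, 1, 2, 1),
    ("model training", 700, 2, 3, 2),
]


def popularity(posts):
    """Average the four per-post counters within each topic."""
    grouped = defaultdict(list)
    for topic, *metrics in posts:
        grouped[topic].append(metrics)
    return {
        topic: tuple(round(mean(column), 2) for column in zip(*rows))
        for topic, rows in grouped.items()
    }


table = popularity(posts)
print(table)
```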
Fig. 5  Yearly trend in the number of question posts for each topic
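| Topic name (category) | Δt/d | PD/% |
| --- | --- | --- |
| Implementation details (theory-related) | 14.40 | 0.08 |
| Code execution (code-related) | 14.23 | 0.10 |
| Learning algorithms (theory-related) | 10.75 | 0.12 |
| Neural networks (model-related) | 10.41 | 0.09 |
| Model performance evaluation (model-related) | 9.85 | 0.14 |
| Programming and libraries (code-related) | 9.27 | 0.11 |
| Dataset classification (model-related) | 8.75 | 0.09 |
| Model training (model-related) | 5.54 | 0.10 |
| Model input issues (model-related) | 4.21 | 0.11 |

Table 7  Topic difficulty based on two metrics (average time span until a question is answered; ratio of answer count to view count)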
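As a rough illustration of the two difficulty metrics, the sketch below computes a mean answering delay in days and an answers-per-view percentage for one topic. The records are invented. Which answer marks a question as answered, and whether the ratio is taken over totals or averaged per post, are assumptions, since this section does not say.

```python
from datetime import datetime
from statistics import mean

# Hypothetical question posts of one topic: creation time, time of the
# answer taken to mark the question as answered, AnswerCount and ViewCount.
questions = [
    {"asked": "2018-03-01T10:00:00", "answered": "2018-03-03T10:00:00",
     "answers": 2, "views": 1000},
    {"asked": "2018-04-10T08:00:00", "answered": "2018-04-16T08:00:00",
     "answers": 1, "views": 2000},
]


def mean_days_to_answer(questions):
    """Average span in days between asking and answering."""
    spans = [
        (datetime.fromisoformat(q["answered"])
         - datetime.fromisoformat(q["asked"])).total_seconds() / 86400
        for q in questions
    ]
    return mean(spans)


def answers_per_view_percent(questions):
    """Total answers divided by total views, in percent."""
    total_answers = sum(q["answers"] for q in questions)
    total_views = sum(q["views"] for q in questions)
    return 100 * total_answers / total_views


print(mean_days_to_answer(questions), answers_per_view_percent(questions))
```

A longer delay and a lower answers-per-view ratio both point to a topic that is harder to answer, which is how the table's two columns would be read together.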