Journal of ZheJiang University (Engineering Science)  2019, Vol. 53 Issue (5): 819-828    DOI: 10.3785/j.issn.1008-973X.2019.05.001
    
Large-scale empirical study on machine learning related questions on Stack Overflow
Zhi-yuan WAN1, Jia-heng TAO2, Jia-kun LIANG2, Zhen-gong CAI2,*, Cheng CHANG1, Lin QIAO3, Qiao-ni ZHOU3
1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
2. College of Software Technology, Zhejiang University, Ningbo 315048, China
3. Information and Communication Branch, State Grid Liaoning Electric Power Supply Co. Ltd, Shenyang 110006, China

Abstract  

To investigate the distribution and trends of machine learning topics, 60 028 machine learning related questions were extracted, using filtered tags, from more than 41.78 million posts on the online Q&A website Stack Overflow. The extracted question posts were analyzed by counting the amount of discussion about each machine learning platform. The three most frequently discussed platforms were Scikit-learn, TensorFlow and Keras, which together accounted for 58% of these posts. Latent Dirichlet allocation (LDA) topic models were then trained to further explore the discussion topics related to machine learning. A progressive search approach for the number of topics in adaptive LDA was proposed, which uses the topic coherence coefficient to evaluate the model output and find the optimal number of topics. Nine discussion topics were discovered, which fell into three broad categories: code-related, model-related, and theory-related. In addition, the popularity and difficulty of the topics were analyzed according to the view counts and comment counts of the question posts.



Key words: empirical research, machine learning, Stack Overflow, latent Dirichlet allocation (LDA), topic coherence
Received: 23 October 2018      Published: 17 May 2019
CLC:  TP 311  
Corresponding Authors: Zhen-gong CAI     E-mail: wanzhiyuan@zju.edu.cn;cstcaizg@zju.edu.cn
Cite this article:

Zhi-yuan WAN,Jia-heng TAO,Jia-kun LIANG,Zhen-gong CAI,Cheng CHANG,Lin QIAO,Qiao-ni ZHOU. Large-scale empirical study on machine learning related questions on Stack Overflow. Journal of ZheJiang University (Engineering Science), 2019, 53(5): 819-828.

URL:

http://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2019.05.001     OR     http://www.zjujournals.com/eng/Y2019/V53/I5/819


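Name | Description
Id | ID of the post
PostTypeId | Type of the post: 1 for a question post, 2 for an answer post; other types are not considered
AcceptedAnswerId | ID of the answer post for the question (optional, present only when PostTypeId=1)
ParentId | ID of the question post for the answer (optional, present only when PostTypeId=2)
CreationDate | Time the post was created
Score | Average score given to the post by visitors
ViewCount | Total number of views of the post (optional, present only when PostTypeId=1)
Body | Body of the post, in HTML
OwnerUserId | ID of the post owner (optional)
OwnerDisplayName | Display name of the post owner (optional)
LastEditorUserId | ID of the user who last edited the post (optional)
LastEditorDisplayName | Display name of the user who last edited the post (optional)
LastEditDate | Time the post was last edited (optional)
LastActivityDate | Time of the latest activity on the post
Title | Title of the post (optional)
Tags | Tags of the post (optional)
AnswerCount | Number of answer posts (optional, present only when PostTypeId=1)
CommentCount | Number of comments on the post
FavoriteCount | Number of users who favorited the post (optional, present only when PostTypeId=1)
ClosedDate | Time the post was closed (optional, present only when the post is closed)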
Tab.1 Metadata information in a post
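The fields in Tab.1 can be read with a few lines of code. The following is a minimal sketch, assuming the posts come as `row` elements with the Tab.1 names as XML attributes and that tags are encoded as `<tag1><tag2>`; the sample rows, the helper names `parse_tags` and `question_posts`, and the returned keys are invented for illustration and are not taken from the study.

```python
import xml.etree.ElementTree as ET

# Made-up two-row sample; attribute names follow Tab.1, values are invented.
SAMPLE_XML = """<posts>
  <row Id="1" PostTypeId="1" CreationDate="2018-01-05T10:00:00.000"
       Score="3" ViewCount="120" AnswerCount="2" CommentCount="1"
       Title="How to set input_shape in Keras?"
       Tags="&lt;machine-learning&gt;&lt;keras&gt;" Body="&lt;p&gt;example&lt;/p&gt;" />
  <row Id="2" PostTypeId="2" ParentId="1" CreationDate="2018-01-05T11:30:00.000"
       Score="1" CommentCount="0" Body="&lt;p&gt;example&lt;/p&gt;" />
</posts>"""


def parse_tags(raw):
    """Split a tag string such as '<machine-learning><keras>' into a list."""
    return [t for t in raw.replace(">", "").split("<") if t]


def question_posts(xml_text, wanted_tags):
    """Return question posts (PostTypeId == 1) carrying at least one wanted tag."""
    selected = []
    for row in ET.fromstring(xml_text).iter("row"):
        if row.get("PostTypeId") != "1":
            continue  # skip answers and other post types
        tags = parse_tags(row.get("Tags", ""))
        if wanted_tags.intersection(tags):
            selected.append({
                "id": int(row.get("Id")),
                "title": row.get("Title"),
                "tags": tags,
                "view_count": int(row.get("ViewCount", 0)),
                "comment_count": int(row.get("CommentCount", 0)),
            })
    return selected


questions = question_posts(SAMPLE_XML, {"machine-learning"})
```

Optional fields such as ViewCount are read with a default because, as Tab.1 notes, they are present only on question posts.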
Fig.1 An example of machine-learning-related post on Stack Overflow
T1, T2 | Tag set | Number of tags
(0.150,0.025) | classification,deep-learning,scikit-learn,neural-network,artificial-intelligence,machine-learning,keras,svm,conv-neural-network,weka | 10
(0.150,0.030) | classification,deep-learning,scikit-learn,neural-network,artificial-intelligence,machine-learning,keras,svm | 8
(0.250,0.025) | classification,deep-learning,scikit-learn,neural-network,machine-learning,svm,conv-neural-network,weka | 8
(0.250,0.030) | classification,deep-learning,scikit-learn,neural-network,machine-learning,svm | 6
(0.350,0.025) or (0.350,0.030) | classification,machine-learning,svm | 3
(>0.400,>0.080) | machine-learning | 1
Tab.2 "machine-learning" related tag set from different threshold configurations
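This page does not define T1 and T2. The sketch below assumes, following common practice in Stack Overflow tag-filtering studies, that T1 is a threshold on a tag's significance (share of the tag's posts that also carry the seed tag) and T2 a threshold on its relevance (share of seed-tag posts that carry the tag). Both definitions, the function name `related_tags`, and the toy data are assumptions made for illustration.

```python
from collections import Counter


def related_tags(posts, seed_tag, t1, t2):
    """Select tags related to seed_tag.

    posts: list of tag lists, one per question post.
    t1: assumed threshold on significance = (seed posts with tag) / (all posts with tag).
    t2: assumed threshold on relevance = (seed posts with tag) / (seed posts).
    """
    seed_posts = [tags for tags in posts if seed_tag in tags]
    all_counts = Counter(tag for tags in posts for tag in set(tags))
    seed_counts = Counter(tag for tags in seed_posts for tag in set(tags))

    selected = set()
    for tag, in_seed in seed_counts.items():
        significance = in_seed / all_counts[tag]
        relevance = in_seed / len(seed_posts)
        if significance >= t1 and relevance >= t2:
            selected.add(tag)
    return selected


# Toy data, invented for illustration only.
toy_posts = [
    ["machine-learning", "svm"],
    ["machine-learning", "svm", "python"],
    ["machine-learning", "keras"],
    ["machine-learning", "python"],
    ["python"], ["python"], ["python"], ["python"], ["python"], ["python"],
    ["keras"], ["keras"], ["keras"],
]

loose = related_tags(toy_posts, "machine-learning", t1=0.15, t2=0.25)
strict = related_tags(toy_posts, "machine-learning", t1=0.5, t2=0.5)
```

As in Tab.2, raising either threshold can only shrink the selected tag set, and the seed tag itself always survives.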
Fig.2 Distribution of coherence coefficient in process of optimal topic number search algorithm
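The exact search procedure is not given on this page. One plausible reading of a progressive search over the number of topics is a coarse-to-fine grid search on the coherence coefficient, sketched below. The function name, the halving schedule, and the injected `coherence_of` callable are assumptions; in practice the callable would train an LDA model with k topics and return its coherence score.

```python
def progressive_topic_search(coherence_of, low, high, coarse_step):
    """Coarse-to-fine search for the topic number with the highest coherence.

    coherence_of: callable k -> coherence score (injected here; a real run
                  would train an LDA model with k topics and score it).
    Returns (best_k, scores), where scores maps every evaluated k to its score.
    """
    scores = {}

    def evaluate(k):
        if k not in scores:  # avoid retraining for a k already scored
            scores[k] = coherence_of(k)
        return scores[k]

    step = coarse_step
    lo, hi = low, high
    while True:
        for k in range(lo, hi + 1, step):
            evaluate(k)
        best_k = max(scores, key=scores.get)
        if step == 1:
            return best_k, scores
        # Narrow the window to one step on either side of the current best
        # and halve the step size for the next pass.
        lo = max(low, best_k - step)
        hi = min(high, best_k + step)
        step = max(1, step // 2)


def toy_coherence(k):
    """Synthetic single-peaked score with its maximum at k = 9."""
    return -(k - 9) ** 2


best_k, scores = progressive_topic_search(toy_coherence, low=2, high=50, coarse_step=8)
```

The sketch trades exhaustiveness for cost: it trains far fewer models than a full scan, but it can miss the global optimum when the coherence curve has several separated peaks.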
No. | Machine learning platform | Number of discussions | Share/%
1 | Scikit-learn | 12 641 | 21.06
2 | TensorFlow | 11 524 | 19.20
3 | Keras | 10 842 | 18.06
4 | Caffe | 1 501 | 2.50
5 | Torch | 1 366 | 2.28
6 | Theano | 1 300 | 2.17
7 | Microsoft Cognitive Toolkit (CNTK) | 583 | 0.97
8 | AWS | 567 | 0.94
9 | XGBoost | 373 | 0.62
10 | Caret | 344 | 0.57
11 | Spark ML&MLlib | 339 | 0.56
12 | Watson | 286 | 0.48
13 | Tflearn | 224 | 0.37
14 | Mahout | 223 | 0.37
15 | H2o.ai | 209 | 0.35
16 | Blocks | 198 | 0.33
17 | MXNet | 155 | 0.26
18 | Deeplearning4j | 119 | 0.20
19 | SINGA | 98 | 0.16
20 | Accord.NET | 84 | 0.14
Tab.3 Number of discussion posts about different machine learning platforms
Fig.3 Proportion of number of discussion posts about different machine learning platforms
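The shares in Tab.3 are consistent with dividing each platform's count by the 60 028 extracted question posts. A short check, using only the counts reported in Tab.3 (the helper name `share` is illustrative):

```python
# Question counts for the three most discussed platforms, copied from Tab.3.
TOTAL_QUESTIONS = 60028
counts = {"Scikit-learn": 12641, "TensorFlow": 11524, "Keras": 10842}


def share(count, total=TOTAL_QUESTIONS):
    """Percentage of the extracted question posts, rounded to two decimals."""
    return round(100 * count / total, 2)


shares = {name: share(n) for name, n in counts.items()}
top_three_share = share(sum(counts.values()))
```

The combined share of the three platforms comes out slightly above 58%, matching the figure quoted in the abstract.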
No. | Topic name (category) | Topic keywords (stemmed)
1 | Code running (code-related) | use,error,code,run,tri,model,get,python,imag,train
2 | Model training (model-related) | model,imag,train,use,kera,layer,data,tensorflow,size,batch
3 | Dataset classification (model-related) | data,use,train,class,classifi,test,set,classif,model,dataset
4 | Neural network (model-related) | network,neural,output,layer,loss,function,train,use,weight,input
5 | Model performance evaluation (model-related) | use,predict,featur,model,word,data,time,text,score,valu
6 | Implementation details (theory-related) | use,gradient,function,implement,learn,x,vector,calcul,algorithm,comput
7 | Programming and libraries (code-related) | file,line,py,packag,python,tensorflow,error,lib,c,site
8 | Model input issues (model-related) | input,error,shape,kera,lstm,array,model,use,sequenc,tri
9 | Learning algorithms (theory-related) | algorithm,use,would,like,learn,tree,valu,one,data,problem
Tab.4 Inferred names of the nine topics and the ten most relevant keywords of each
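No. | Title | Description | Topic probability
1 | Captcha Recognition using CNN not giving expected results | The asker tries to build a CAPTCHA recognition program with TensorFlow in Python, but running it does not give the expected results | Code running: 0.46; Implementation details: 0.31
2 | How to create Training data for Text classification on 4 categories | The asker asks how to create training data for text classification into 4 categories | Model training: 0.42; Learning algorithms: 0.38
3 | How to use glmnet in R for classification problems | The asker wants to use glmnet in R for a classification problem and provides sample data and requirements | Dataset classification: 0.92
4 | Properly declaring input_shape for neural network in Keras? | The asker describes a problem with declaring the input_shape parameter of a neural network in Keras | Neural network: 0.73; Code running: 0.18
5 | R and PCA Explanation for machine learning | The asker wants an explanation of what happens during the machine learning process; using Alzheimer's disease data as an example, two prediction models are built and their accuracy is evaluated | Model performance evaluation: 0.39; Model training: 0.37
6 | Keras ImageDataGenerator for Cloud ML Engine | The asker tries to call Keras's flow_from_directory method to process images in cloud storage, but reading the data keeps failing | Implementation details: 0.77; Programming and libraries: 0.22
7 | ModuleNotFoundError: No module named 'keras' in AI DevCloud Intel | After installing the Keras and TensorFlow environment, the asker is still told that the Keras library is not installed when calling the ImageDataGenerator package | Programming and libraries: 0.90
8 | How to make this neural network traning function faster | The asker feeds in variables stored as jagged arrays; each processing stage takes at least 5 min, and the asker asks how to train faster | Model input issues: 0.49; Code running: 0.26
9 | How to classify documents using Naive Bayes and Principal Component Analysis (C#,Accord.NET) | The asker wants to learn how to classify documents using naive Bayes and principal component analysis | Learning algorithms: 0.45; Model training: 0.21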
Tab.5 Random sampling of question posts to verify topic names
Fig.4 Numbers of question posts corresponding to nine topics
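Topic name (category) | V | C | F | S
Learning algorithms (theory-related) | 1 509.89 | 1.92 | 4.38 | 2.64
Dataset classification (model-related) | 1 281.56 | 1.61 | 2.38 | 1.46
Implementation details (theory-related) | 1 230.19 | 1.42 | 2.74 | 1.92
Code running (code-related) | 1 150.17 | 1.72 | 2.11 | 1.38
Neural network (model-related) | 971.61 | 1.55 | 2.46 | 1.67
Programming and libraries (code-related) | 968.26 | 1.71 | 1.38 | 0.93
Model performance evaluation (model-related) | 912.53 | 1.46 | 2.56 | 1.42
Model input issues (model-related) | 888.61 | 1.41 | 2.06 | 1.39
Model training (model-related) | 836.63 | 1.25 | 2.30 | 1.51
Average | 1 088.27 | 1.56 | 2.49 | 1.59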
Tab.6 Popularity of each topic based on four evaluation indexes
Fig.5 Trends in number of question posts on each topic yearly
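Topic name (category) | Δt/d | PD/%
Implementation details (theory-related) | 14.40 | 0.08
Code running (code-related) | 14.23 | 0.10
Learning algorithms (theory-related) | 10.75 | 0.12
Neural network (model-related) | 10.41 | 0.09
Model performance evaluation (model-related) | 9.85 | 0.14
Programming and libraries (code-related) | 9.27 | 0.11
Dataset classification (model-related) | 8.75 | 0.09
Model training (model-related) | 5.54 | 0.10
Model input issues (model-related) | 4.21 | 0.11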
Tab.7 Difficulty of each topic based on two evaluation indexes (average time span of answering question and ratio of answer count to view count)
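The two indexes in Tab.7 could be computed per topic roughly as follows. This is a sketch under assumptions: the page does not say which answer timestamp is used or whether PD is a ratio of totals or a mean of per-post ratios, so the code takes a single answer time per post and the ratio of per-topic totals. The function name and the dictionary keys are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def difficulty_by_topic(posts):
    """Per topic: mean days until an answer (delta t) and answers per view in % (PD).

    Each post is a dict with hypothetical keys: topic, asked, answered
    (datetime or None), answer_count, view_count.
    """
    spans = defaultdict(list)
    answers = defaultdict(int)
    views = defaultdict(int)
    for p in posts:
        if p["answered"] is not None:
            delta = p["answered"] - p["asked"]
            spans[p["topic"]].append(delta.total_seconds() / 86400)
        answers[p["topic"]] += p["answer_count"]
        views[p["topic"]] += p["view_count"]

    result = {}
    for topic in views:
        days = spans[topic]
        result[topic] = {
            "delta_t_days": sum(days) / len(days) if days else None,
            "pd_percent": 100 * answers[topic] / views[topic] if views[topic] else None,
        }
    return result


# Toy posts, invented for illustration only.
toy_posts = [
    {"topic": "Model training", "asked": datetime(2018, 1, 1),
     "answered": datetime(2018, 1, 3), "answer_count": 2, "view_count": 1000},
    {"topic": "Model training", "asked": datetime(2018, 1, 1),
     "answered": datetime(2018, 1, 5), "answer_count": 1, "view_count": 1500},
    {"topic": "Model training", "asked": datetime(2018, 1, 2),
     "answered": None, "answer_count": 0, "view_count": 500},
]

difficulty = difficulty_by_topic(toy_posts)
```

Unanswered posts are excluded from the time span but still count toward views, which lowers PD for topics with many unanswered questions.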