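Interactive Labelling of a Multivariate Dataset for Supervised Machine Learning using Linked Visualisations, Clustering, and Active Learning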

Vis Inf, 2019, Vol. 3, Issue 1: 9-17. DOI: 10.1016/j.visinf.2019.03.002

Article

Interactive Labelling of a Multivariate Dataset for Supervised Machine Learning using Linked Visualisations, Clustering, and Active Learning

Mohammad Chegini^a,b, Jürgen Bernard^c, Philip Berger^d, Alexei Sourin^b, Keith Andrews^a, Tobias Schreck^a

^aGraz University of Technology, Austria ^bSchool of Computer Science and Engineering, Nanyang Technological University, Singapore ^cTU Darmstadt, Germany ^dUniversity of Rostock, Germany

Abstract:

Supervised machine learning techniques require labelled multivariate training datasets. Many approaches address the issue of unlabelled datasets by tightly coupling machine learning algorithms with interactive visualisations. Using appropriate techniques, analysts can play an active role in a highly interactive and iterative machine learning process to label the dataset and create meaningful partitions. While this principle has been implemented separately for unsupervised, semi-supervised, and supervised machine learning tasks, the combination of all three methodologies remains challenging.

In this paper, a visual analytics approach is presented that combines a variety of machine learning capabilities with four linked visualisation views, all integrated within the mVis (multivariate Visualiser) system. The available palette of techniques allows an analyst to perform exploratory data analysis on a multivariate dataset and divide it into meaningful labelled partitions, from which a classifier can be built. In the workflow, the analyst can label interesting patterns or outliers in a semi-supervised process supported by active learning. Once a dataset has been interactively labelled, the analyst can continue the workflow with supervised machine learning to assess to what degree the subsequent classifier has effectively learned the concepts expressed in the labelled training dataset. Using a novel technique called automatic dimension selection, the interactions the analyst had with dimensions of the multivariate dataset are used to steer the machine learning algorithms.

A real-world football dataset is used to show the utility of mVis for a series of analysis and labelling tasks, from initial labelling through iterations of data exploration, clustering, classification, and active learning to refine the named partitions, to finally producing a high-quality labelled training dataset suitable for training a classifier. The tool empowers the analyst with interactive visualisations including scatterplots, parallel coordinates, similarity maps for records, and a new similarity map for partitions.

Key words: labelling; clustering; classification; active learning; multivariate data; visualization

Published: 2019-03-15

Corresponding author: Mohammad Chegini, E-mail: m.chegini@cgv.tugraz.at

Cite this article:

Mohammad Chegini, Jürgen Bernard, Philip Berger, Alexei Sourin, Keith Andrews, Tobias Schreck. Interactive Labelling of a Multivariate Dataset for Supervised Machine Learning using Linked Visualisations, Clustering, and Active Learning. Vis Inf, 2019, 3(1): 9-17.

Link to this article:

http://www.zjujournals.com/vi/CN/10.1016/j.visinf.2019.03.002 or http://www.zjujournals.com/vi/CN/Y2019/V3/I1/9
