A context-aware index based text extraction framework

doi:10.3785/j.issn.1008-973X.2013.09.004

2013, Vol. 47, Issue (9): 1537-1546


JIN Cang-hong¹, WU Ming-hui², YING Jing¹,²

1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China;
2. Department of Computer Science and Engineering, Zhejiang University City College, Hangzhou 310015, China


Abstract

To improve the efficiency of text extraction and the flexibility of extraction patterns, and to support on-line definition of extraction patterns, a novel extraction framework is proposed. It comprises context-aware indexes, a context-related extraction language, and matching algorithms. The framework obtains the context of a query directly from the index, without relying on text comparison, which improves extraction performance. Analysis and experimental results show that, compared with document-parsing approaches and inverted-index-based approaches, the framework has a more stable running time and better performance on corpora of different sizes and formats. In addition, the length of the extraction pattern has little influence on the framework's performance. Consequently, the framework supports on-line information extraction over large data corpora.
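The idea of reading context from the index instead of comparing text can be illustrated with a minimal sketch. It is not the authors' implementation: the index layout (each posting stores the token that follows it), the `?` slot marker, and the names `build_context_index` and `extract` are all assumptions made for illustration, and the pattern is assumed to start with a literal token.

```python
from collections import defaultdict

WILDCARD = "?"  # hypothetical slot marker, not the paper's pattern syntax


def build_context_index(docs):
    """Map token -> {(doc_id, position): following token, or None at the end}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        tokens = text.lower().split()
        for pos, token in enumerate(tokens):
            nxt = tokens[pos + 1] if pos + 1 < len(tokens) else None
            index[token][(doc_id, pos)] = nxt
    return index


def extract(index, pattern):
    """Return (doc_id, captured tokens) for each match, using index lookups only."""
    first, rest = pattern[0], pattern[1:]  # assumption: first element is a literal
    results = []
    for (doc_id, pos), nxt in index.get(first, {}).items():
        captured, current, matched = [], nxt, True
        for offset, want in enumerate(rest, start=1):
            if current is None or (want != WILDCARD and want != current):
                matched = False
                break
            if want == WILDCARD:
                captured.append(current)
            # Follow the stored context to the posting of the next token.
            current = index[current][(doc_id, pos + offset)]
        if matched:
            results.append((doc_id, tuple(captured)))
    return results
```

With the documents `"Alan Turing was born in London"` and `"Grace Hopper was born in New York"`, calling `extract(index, ["born", "in", "?"])` returns `[(0, ("london",)), (1, ("new",))]`. The documents are read only once, when the index is built; each extraction afterwards is a chain of dictionary lookups, which is one way the cost can stay largely independent of corpus size.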

Published: 01 September 2013

CLC:

TP 311


Cite this article:

JIN Cang-hong, WU Ming-hui, YING Jing. A context-aware index based text extraction framework. J4, 2013, 47(9): 1537-1546.

URL:

http://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2013.09.004 OR http://www.zjujournals.com/eng/Y2013/V47/I9/1537

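A text matching framework based on context indexes

To improve the efficiency and flexibility of information mining methods and to support on-line definition of knowledge extraction patterns, a fast text matching framework is proposed. The framework comprises modules including a context index, a context mining language, and a context matching algorithm. It obtains the context information of the extracted content directly from the index, without relying on text filtering, which improves information extraction performance. Theoretical analysis and experiments show that, compared with text-based extraction methods and inverted-index extraction methods, the framework's running time is more stable and efficient on datasets of different sizes and structures, and the length of the extraction pattern has little effect on the framework. It is therefore suitable for on-line extraction over massive data.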


