J4  2013, Vol. 47 Issue (9): 1537-1546    DOI: 10.3785/j.issn.1008-973X.2013.09.004
A context-aware index based text extraction framework
JIN Cang-hong1, WU Ming-hui2, YING Jing1,2
1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China;
2. Department of Computer Science and Engineering, Zhejiang University City College, Hangzhou 310015, China



To improve the efficiency of text extraction, make extraction patterns more dynamic, and support on-line definition of extraction patterns, a novel extraction framework was proposed, comprising context-aware indexes, a context-related extraction language, and matching algorithms. The framework extracts the query context directly, without text comparison, which improves extraction performance. Analysis and experimental results show that, compared with document-parsing approaches and inverted-index-based approaches, the framework has stable running time and better performance on corpora of different sizes and formats. In addition, the length of the extraction pattern has little influence on the framework. Consequently, the framework supports on-line information extraction over large corpora.
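The idea of answering an extraction query from an index rather than by comparing text can be illustrated with a small sketch. This is not the authors' implementation: the function names `build_context_index` and `extract_right_context`, the whitespace tokenizer, and the fixed `window` parameter are all assumptions made for illustration. Each posting stores the tokens surrounding a term, so the context of a query term is read from the index without rescanning the documents.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def build_context_index(docs, window=1):
    """Map each token to postings of (doc_id, position, left_context, right_context)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        tokens = tokenize(text)
        for pos, tok in enumerate(tokens):
            # Store up to `window` neighbouring tokens on each side of the term.
            left = tuple(tokens[max(0, pos - window):pos])
            right = tuple(tokens[pos + 1:pos + 1 + window])
            index[tok].append((doc_id, pos, left, right))
    return index


def extract_right_context(index, term):
    """Return (doc_id, right_context) pairs for a term using index lookups only."""
    postings = index.get(term.lower(), [])
    return [(doc_id, right) for doc_id, _pos, _left, right in postings]


docs = ["Zhejiang University is in Hangzhou", "Hangzhou is a city"]
index = build_context_index(docs)
print(extract_right_context(index, "is"))
```

In this sketch the cost of a query depends on the number of postings for the term rather than on the size of the raw text, at the price of a larger index.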

Published: 2013-09-01
CLC number: TP 311


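Corresponding author: WU Ming-hui, male, professor.     E-mail:
About the author: JIN Cang-hong (born 1982), male, Ph.D. candidate, working on information retrieval and software engineering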


JIN Cang-hong, WU Ming-hui, YING Jing. A text matching framework based on context indexes [J]. J4, 2013, 47(9): 1537-1546. (in Chinese)

JIN Cang-hong, WU Ming-hui, YING Jing. A context-aware index based text extraction framework. J4, 2013, 47(9): 1537-1546.


