Journal of Zhejiang University-SCIENCE A (Applied Physics & Engineering)  2009, Vol. 10 Issue (8): 1114-1124    DOI: 10.1631/jzus.A0820481
Electrical & Electronic Engineering     
On-line topical importance estimation: an effective focused crawling algorithm combining link and content analysis
Can WANG, Zi-yu GUAN, Chun CHEN, Jia-jun BU, Jun-feng WANG, Huai-zhong LIN
School of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China

Abstract  Focused crawling is an important technique for topical resource discovery on the Web. The key issue in focused crawling is to prioritize uncrawled uniform resource locators (URLs) in the frontier so that crawling stays focused on relevant pages. Traditional focused crawlers rely mainly on content analysis; link-based techniques are not effectively exploited despite their usefulness. In this paper, we propose a new frontier prioritizing algorithm, namely the on-line topical importance estimation (OTIE) algorithm. OTIE combines link- and content-based analysis to evaluate the priority of an uncrawled URL in the frontier. We performed real crawling experiments over 30 topics selected from the Open Directory Project (ODP) and compared the harvest rate and target recall of four crawling algorithms: breadth-first, link-context-prediction, on-line page importance computation (OPIC), and our OTIE. Experimental results showed that OTIE significantly outperforms the other three algorithms in average target recall while maintaining an acceptable harvest rate. Moreover, OTIE is much faster than the traditional focused crawling algorithm.
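
To make the idea of frontier prioritization concrete, the following is a minimal sketch of a priority-queue frontier and of the two evaluation measures named in the abstract. It is not the OTIE algorithm: the abstract does not give the scoring formula, so the linear blend of a content score and a link score, the weight `alpha`, and all class and function names here are hypothetical. The metric definitions are the ones commonly used for focused crawlers (harvest rate as the fraction of crawled pages that are relevant, target recall as the fraction of known target pages that were crawled) and are assumed rather than taken from the paper.

```python
import heapq
from itertools import count


class Frontier:
    """Priority queue of uncrawled URLs; the highest score is popped first."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha      # hypothetical weight between the two signals
        self._heap = []         # entries: (-score, tie_breaker, url)
        self._best = {}         # url -> best score seen so far
        self._done = set()      # URLs already handed to the crawler
        self._tie = count()     # keeps heap ordering stable for equal scores

    def score(self, content_score, link_score):
        # Illustrative linear blend only; NOT the OTIE formula from the paper.
        return self.alpha * content_score + (1 - self.alpha) * link_score

    def push(self, url, content_score, link_score):
        """Add a URL, or raise its priority if a better score arrives."""
        if url in self._done:
            return
        s = self.score(content_score, link_score)
        if s > self._best.get(url, float("-inf")):
            self._best[url] = s
            heapq.heappush(self._heap, (-s, next(self._tie), url))

    def pop(self):
        """Return the highest-priority uncrawled URL, or None if empty."""
        while self._heap:
            neg, _, url = heapq.heappop(self._heap)
            if url in self._done or -neg != self._best.get(url):
                continue  # stale entry left behind by a later, better push
            self._done.add(url)
            del self._best[url]
            return url
        return None


def harvest_rate(crawled, relevant):
    """Fraction of crawled pages that are relevant (assumed definition)."""
    if not crawled:
        return 0.0
    return sum(1 for u in crawled if u in relevant) / len(crawled)


def target_recall(crawled, targets):
    """Fraction of known target pages that were crawled (assumed definition)."""
    if not targets:
        return 0.0
    return len(set(crawled) & set(targets)) / len(targets)
```

In this sketch a URL that is discovered again with a higher score simply gets a new heap entry, and the older entry is skipped when popped; this avoids re-heapifying on every update at the cost of some stale entries in the queue.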

Key words: Focused crawlers; Topical crawlers; PageRank; Classifiers; On-line topical importance estimation (OTIE) algorithm
Received: 23 June 2008     
CLC:  TP391.3  
Cite this article:

Can WANG, Zi-yu GUAN, Chun CHEN, Jia-jun BU, Jun-feng WANG, Huai-zhong LIN. On-line topical importance estimation: an effective focused crawling algorithm combining link and content analysis. Journal of Zhejiang University-SCIENCE A (Applied Physics & Engineering), 2009, 10(8): 1114-1124.

URL:

http://www.zjujournals.com/xueshu/zjus-a/10.1631/jzus.A0820481     OR     http://www.zjujournals.com/xueshu/zjus-a/Y2009/V10/I8/1114
