1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China;
2. Zhejiang Branch of Industrial and Commercial Bank of China, Hangzhou 310009, China
A new object cache optimization strategy for vertical search engines was proposed to address challenges such as the shifting set of popular objects and query-triggered data crawling. A popular-object prediction model, based on the relationships between objects and their properties, was proposed to predict trends in the distribution of popular objects. Assuming that user queries and data changes both follow Poisson processes, a procedure to maximize data freshness and an optimal strategy to allocate and balance resources were proposed. Experimental results show that the increase in time complexity is relatively limited, while the average freshness of user query results and the query precision exceed those of the traditional fixed-rate cache strategy.
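The abstract does not state the freshness formulas, so the following is only a minimal sketch of the standard Poisson-change model for a periodically refreshed cache, not necessarily the procedure the authors used. It assumes an object changes as a Poisson process with rate `change_rate` and is re-crawled every `refresh_interval` time units; under that assumption the time-averaged probability that the cached copy is current is (1 - e^(-λI)) / (λI). The function and parameter names are hypothetical.

```python
import math


def expected_freshness(change_rate: float, refresh_interval: float) -> float:
    """Time-averaged probability that a cached object is up to date.

    Assumes (illustrative, not taken from the paper) that the object
    changes as a Poisson process with rate `change_rate` and that the
    cache re-crawls it every `refresh_interval` time units.
    """
    if change_rate < 0 or refresh_interval <= 0:
        raise ValueError("change_rate must be >= 0 and refresh_interval > 0")
    x = change_rate * refresh_interval
    if x == 0:
        return 1.0  # an object that never changes is always fresh
    # (1 - exp(-x)) / x, written with expm1 for accuracy when x is small
    return -math.expm1(-x) / x


def average_freshness(change_rates, refresh_intervals) -> float:
    """Unweighted mean freshness over a set of cached objects."""
    values = [
        expected_freshness(rate, interval)
        for rate, interval in zip(change_rates, refresh_intervals)
    ]
    return sum(values) / len(values)
```

A fixed-rate strategy corresponds to passing the same interval for every object. A strategy that adapts to popularity or change rate would pass different intervals under the same total crawl budget, and the two can then be compared with `average_freshness`.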