Abstract
Because the WWW provides a vast amount of data, retrieving interesting and useful information from it effectively and efficiently supports innovative and creative human activities. In this paper, we propose a focused crawler for individual activities. We develop an algorithm that decides where a focused crawler should crawl next by integrating the concept of PageRank into the decision. We empirically evaluate our proposal in terms of precision and target recall. The results suggest that our system achieves good target recall regardless of the topic on which the crawler focuses.
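The abstract does not give the scoring rule itself, so the following is only a minimal sketch of how a PageRank signal might be folded into the "where to crawl next" decision. Every name here (`pagerank`, `next_url`, `precision`, `target_recall`) and the linear weight `alpha` are assumptions made for illustration, not the paper's algorithm. Precision is simplified to the fraction of crawled pages judged relevant, and target recall to the fraction of a held-out target set that the crawl reached.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over {page: [linked pages]} (crawled subgraph)."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = graph.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling page (no known out-links): spread its rank uniformly.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank


def next_url(frontier, graph, relevance, alpha=0.5):
    """Pick the frontier URL with the best blended score.

    Assumed rule: alpha * topical relevance + (1 - alpha) * normalized PageRank.
    """
    rank = pagerank(graph)
    top = max(rank.values(), default=1.0)

    def score(url):
        return alpha * relevance.get(url, 0.0) + (1.0 - alpha) * rank.get(url, 0.0) / top

    return max(frontier, key=score)


def precision(crawled, relevant):
    """Simplified precision: share of crawled pages that are relevant."""
    return len(set(crawled) & set(relevant)) / len(crawled) if crawled else 0.0


def target_recall(crawled, targets):
    """Share of a held-out target set that the crawler reached."""
    return len(set(crawled) & set(targets)) / len(targets) if targets else 0.0


if __name__ == "__main__":
    # Hypothetical link graph seen so far and hypothetical relevance estimates.
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    relevance = {"b": 0.9, "c": 0.1}
    print(next_url(["b", "c"], graph, relevance, alpha=1.0))  # relevance only
    print(next_url(["b", "c"], graph, relevance, alpha=0.0))  # PageRank only
```

Setting `alpha` to 1.0 reduces the choice to a purely relevance-driven best-first crawl, while 0.0 follows link structure alone; intermediate values trade the two off.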
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Uemura, Y., Itokawa, T., Kitasuka, T., Aritsugi, M. (2010). Where to Crawl Next for Focused Crawlers. In: Setchi, R., Jordanov, I., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based and Intelligent Information and Engineering Systems. KES 2010. Lecture Notes in Computer Science, vol 6279. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15384-6_24
DOI: https://doi.org/10.1007/978-3-642-15384-6_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15383-9
Online ISBN: 978-3-642-15384-6
eBook Packages: Computer Science, Computer Science (R0)