Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6279)

Abstract

Because the World Wide Web (WWW) provides a large amount of data, retrieving interesting and useful information from it effectively and efficiently supports people's innovative and creative activities. In this paper, we propose a focused crawler for individual activities. We develop an algorithm that decides where a focused crawler should crawl next by integrating the concept of PageRank into the decision. We empirically evaluate our proposal in terms of precision and target recall. Some of our results show that the system achieves good target recall regardless of the topic on which the crawler focuses.
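To make the terms in the abstract concrete, the following Python sketch shows one way a crawler could combine a PageRank score with a topical relevance score when choosing the next URL, together with simple versions of the two evaluation measures. It is an illustration under stated assumptions, not the algorithm proposed in the paper: the function names (`pagerank`, `choose_next`, `precision`, `target_recall`), the scoring rule, the damping value of 0.85, and the toy link graph are all hypothetical. Precision is taken here as the fraction of crawled pages that are relevant, and target recall as the fraction of a known set of target pages that have been crawled; the paper may define both differently.

```python
from collections import defaultdict


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    if not pages:
        return {}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


def choose_next(links, crawled, relevance):
    """Return the uncrawled URL receiving the most relevance-weighted rank.

    Assumed scoring rule (illustrative only): each crawled page passes
    relevance * rank / outdegree to every uncrawled page it links to.
    """
    rank = pagerank(links)
    score = defaultdict(float)
    for page in crawled:
        targets = links.get(page, [])
        for t in targets:
            if t not in crawled:
                score[t] += relevance.get(page, 0.0) * rank[page] / len(targets)
    return max(score, key=score.get) if score else None


def precision(crawled, relevant):
    """Fraction of crawled pages that are relevant to the topic."""
    return len(crawled & relevant) / len(crawled) if crawled else 0.0


def target_recall(crawled, targets):
    """Fraction of the known target pages that have been crawled."""
    return len(crawled & targets) / len(targets) if targets else 0.0


# Toy graph: "s" is the seed; "x" and "y" are known only through links.
links = {"s": ["a", "b"], "a": ["x"], "b": ["y"]}
crawled = {"s", "a", "b"}
relevance = {"s": 1.0, "a": 0.9, "b": 0.1}  # hypothetical classifier scores
next_url = choose_next(links, crawled, relevance)
```

In this toy graph, pages `a` and `b` receive the same PageRank by symmetry, so the relevance scores break the tie and the sketch selects `x`, the page linked from the more relevant parent.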

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Uemura, Y., Itokawa, T., Kitasuka, T., Aritsugi, M. (2010). Where to Crawl Next for Focused Crawlers. In: Setchi, R., Jordanov, I., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based and Intelligent Information and Engineering Systems. KES 2010. Lecture Notes in Computer Science, vol 6279. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15384-6_24

  • DOI: https://doi.org/10.1007/978-3-642-15384-6_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15383-9

  • Online ISBN: 978-3-642-15384-6

  • eBook Packages: Computer Science, Computer Science (R0)
