Humkinar: Construction of a Large Scale Web Repository and Information System for Low Resource Urdu Language


Architecture of Urdu Search Engine

Abstract:

Online content availability, commercial viability, and technological advancements for English and European languages lead mainstream search engines to prioritize search results in these high-resource languages. This makes it difficult for users of low-resource languages to access search results in regional languages, even though such access is essential for promoting literacy, inclusion, and digital accessibility. In this article, we create Humkinar, an Urdu language search engine built with open-source tools. Our search engine is designed with five key components: computing infrastructure, data collector, search manager, web analytics engine, and user interface. First, our in-house computing infrastructure offers 160 GB RAM, 80 cores, and 30 TB memory to support the operations of the search engine. Next, we customize an open-source web crawler with a specialized Urdu language-focused URL selection algorithm, webpage parser, and content selection mechanism to collect Urdu webpages while making efficient use of computing and Internet resources. We also employ specialized content scrapers to collect targeted, high-priority Urdu content such as news articles, Wikipedia, poetry, and books. Overall, our data collector module has curated a repository containing 14 million crawled webpages and 2.2 million scraped Urdu documents. We also design post-processing tools for topic classification, de-duplication, profanity assessment, text summarization, and website quality scoring specific to the Urdu language. In addition, acknowledging the limitations of applying conventional ranking signals to the Urdu language, the search manager uses seven ranking signals that we derive for search results. These signals are tuned to emphasize the richness and quality of Urdu language websites and content in search results. Moreover, we incorporate a web analytics engine into our search engine to collect and analyze user actions and metadata to enhance the overall functionality and effectiveness of...
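
The abstract does not describe how the crawler's content selection mechanism decides that a page is Urdu. As a rough illustration only, the sketch below shows one plausible heuristic: keep a page when most of its letters fall in the Arabic Unicode block and at least one letter is distinctive to Urdu orthography. The function names, the letter set, and the 0.6 threshold are all assumptions made for this example and are not taken from the paper.

```python
# Hypothetical heuristic, NOT the paper's algorithm: treat text as Urdu when
# most letters are Arabic-script and at least one is Urdu-distinctive.

# Letters used in Urdu but not in standard Arabic orthography
# (tte, ddal, rre, noon ghunna, do-chashmi he, he goal, bari ye).
URDU_SPECIFIC = set("\u0679\u0688\u0691\u06ba\u06be\u06c1\u06d2")


def arabic_script_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters in U+0600..U+06FF."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06ff")
    return arabic / len(letters)


def looks_like_urdu(text: str, min_ratio: float = 0.6) -> bool:
    """Assumed rule: enough Arabic-script letters plus one Urdu-specific letter."""
    has_urdu_letter = any(ch in URDU_SPECIFIC for ch in text)
    return has_urdu_letter and arabic_script_ratio(text) >= min_ratio


if __name__ == "__main__":
    urdu_sample = "\u06cc\u06c1 \u0627\u0631\u062f\u0648 \u06c1\u06d2"  # "this is Urdu"
    arabic_sample = "\u0645\u0631\u062d\u0628\u0627"  # Arabic greeting
    print(looks_like_urdu(urdu_sample))
    print(looks_like_urdu(arabic_sample))
    print(looks_like_urdu("This is English"))
```

A script-ratio check like this is cheap enough to run on every fetched page, but it cannot separate Urdu from other languages that share the same extended letters, so a production system would likely pair it with a trained language identifier.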
Published in: IEEE Access ( Volume: 12)
Page(s): 128404 - 128423
Date of Publication: 05 September 2024
Electronic ISSN: 2169-3536
