The Web as a Parallel Corpus [PDF]
Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements.
Philip Resnik, Noah A. Smith
doaj +2 more sources
In this article the potential of the multilingual Web to function as a corpus, in addition to a source for corpus creation, is examined. Despite the fact that English dominates the Web, and despite the fact that most work in corpus linguistics revolves ...
Gilles-Maurice de Schryver
doaj +3 more sources
Introduction to the Special Issue on the Web as Corpus
The Web, teeming as it is with language data, of all manner of varieties and languages, in vast quantity and freely available, is a fabulous linguists' playground. This special issue of Computational Linguistics examines ways in which this dream is being explored.
Adam Kilgarriff, Gregory Grefenstette
doaj +2 more sources
The Web as Corpus and Online Corpora for Legal Translations
Legal language is hallmarked by a pedantic and user-unfriendly jargon whose constructs are anything but intuitive, not to mention the legal system specificity which makes it unique in every country.
Patrizia Giampieri
doaj +7 more sources
The web as a corpus: a resource for translation
Accessing ready-made corpora may not always be easy. This is especially true for less dominant languages such as Persian, for which the number of available corpora is very limited.
Helia Vaezian
doaj +3 more sources
Focused Web Corpus Crawling [PDF]
In web corpus construction, crawling is a necessary step, and it is probably the most costly of all, because it requires expensive bandwidth usage, and excess crawling increases storage requirements. Excess crawling results from the fact that the web contains a lot of redundant content (duplicates and near-duplicates), as well as other material not ...
Roland Schäfer +2 more
openaire +2 more sources
The PAISÀ Corpus of Italian Web Texts [PDF]
PAISÀ is a Creative Commons licensed, large web corpus of contemporary Italian. We describe the design, harvesting, and processing steps involved in its creation.
Verena Lyding +8 more
openaire +4 more sources
Corpulyzer: A Novel Framework for Building Low Resource Language Corpora
The rapid proliferation of artificial intelligence has led to the development of sophisticated cutting-edge systems in natural language processing and computational linguistics domains.
Bilal Tahir, Muhammad Amir Mehmood
doaj +1 more source
“Chatbot Communication” as an Object of Linguistic Research in the System of Digital Communications
Introduction. The authors of the article consider one of the interesting technologies in the digitalization segment – a chatbot – from a linguistic point of view.
S. V. Kiseleva +2 more
doaj +1 more source
This paper is based on the Corpus of Global Web-based English (GloWbE), which was compiled by Mark Davies in 2013. The GloWbE corpus consists of web data from 20 different English-speaking countries.
Kazi Amzad Hossain
doaj +1 more source

