
An annotated corpus with nanomedicine and pharmacokinetic parameters [PDF]

open access: yes
International Journal of Nanomedicine, 2017
Nastassja A Lewinski (1), Ivan Jimenez (1), Bridget T McInnes (2). (1) Department of Chemical and Life Science Engineering, Virginia Commonwealth University, Richmond, VA; (2) Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA.
Lewinski NA, Jimenez I, McInnes BT
doaj   +3 more sources

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only [PDF]

open access: yes
arXiv.org, 2023
Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers.
Guilherme Penedo   +8 more
semanticscholar   +1 more source

Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus [PDF]

open access: yes
Conference on Empirical Methods in Natural Language Processing, 2021
Large language models have led to remarkable progress on many NLP tasks, and researchers are turning to ever-larger text corpora to train them. Some of the largest corpora available are made by scraping significant portions of the internet, and are ...
Jesse Dodge   +6 more
semanticscholar   +1 more source

Word Alignment by Fine-tuning Embeddings on Parallel Corpora [PDF]

open access: yes
Conference of the European Chapter of the Association for Computational Linguistics, 2021
Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs. The great majority of past
Zi-Yi Dou, Graham Neubig
semanticscholar   +1 more source

Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages [PDF]

open access: yes
Transactions of the Association for Computational Linguistics, 2021
We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families).
Gowtham Ramesh   +16 more
semanticscholar   +1 more source

An Analysis of Negation in Natural Language Understanding Corpora [PDF]

open access: yes
Annual Meeting of the Association for Computational Linguistics, 2022
This paper analyzes negation in eight popular corpora spanning six natural language understanding tasks. We show that these corpora have few negations compared to general-purpose English, and that the few negations in them are often unimportant.
Md Mosharaf Hossain   +2 more
semanticscholar   +1 more source

Characteristics of expert's report as evidence [PDF]

open access: yes
SHS Web of Conferences, 2020
In recent years, there has been an increasing need for private expert's reports as judicial evidence. In practice, it appears that a well-developed expert's report is an important basis for the court's decision-making. It is no secret that the quality of
Kubica Milan, Švejdová Nikola
doaj   +1 more source

Language Evaluation of Covid-19 Vaccination News: Corpus of Indonesian Newspaper and Appraisal Insights

open access: yes
Ethical Lingua: Journal of Language Teaching and Literature, 2021
The current study explores the evaluative language of Covid-19 vaccination news in the post-pandemic era, using a corpus from The Jakarta Post, an Indonesian newspaper, through the study of Systemic Functional Linguistics, hereafter SFL, by ...
Yogi Setia Samsi   +2 more
doaj   +1 more source

Semantics derived automatically from language corpora contain human-like biases [PDF]

open access: yes
Science, 2016
Machines learn what people know implicitly. AlphaGo has demonstrated that a machine can learn how to do things that people spend many years of concentrated study learning, and it can rapidly learn how to do them better than any human can.
Aylin Caliskan   +2 more
semanticscholar   +1 more source

The ParlaMint corpora of parliamentary proceedings

open access: yes
Language Resources and Evaluation, 2022
This paper presents the ParlaMint corpora containing transcriptions of the sessions of the 17 European national parliaments with half a billion words.
T. Erjavec   +27 more
semanticscholar   +1 more source
