Computational Semantics | Computational Language and Education Research (CLEAR), University of Colorado Boulder
She has done work in social media analysis, sentiment mining, summarization, and text mining for the digital humanities. Our goal is richer and more accurate representations of utterances in English, Chinese, Hindi/Urdu, and Arabic. Our principal approach involves the application of supervised machine learning to data with linguistic annotation.
CLICS: World’s largest database of cross-linguistic lexical associations
Every language has cases in which two or more concepts are expressed by the same word, such as the English word fly, which refers both to the act of flying and to the insect. By comparing patterns in these cases, which linguists call colexifications, across languages, researchers can gain insights into a wide range of issues, including human perception, language evolution, and language contact. The third installment of the CLICS database significantly increases the number of languages, concepts, and data sources over earlier versions, allowing researchers to study colexifications on a global scale in unprecedented detail and depth. The new version also guarantees the reproducibility of the data aggregation process, conforming to best practices in research data management. “Thanks to the new standards and workflows we developed, our data is not only FAIR (findable, accessible, interoperable, and reusable), but the process of lifting linguistic data from their original forms to our cross-linguistic standards is also much more efficient than in the past,” says Robert Forkel.
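As a toy illustration of how such colexification patterns can be aggregated across languages, the sketch below counts how often two concepts are expressed by the same word form in a couple of invented wordlists. The wordlist entries and format are simplified placeholders and do not reflect the actual CLICS data model or workflow.

```python
# Minimal sketch: count colexifications (two concepts expressed by the
# same word form within a language) across a set of toy wordlists.
# The entries below are invented placeholders, not CLICS data.
from collections import defaultdict
from itertools import combinations

wordlists = {
    "LanguageA": [("FLY (INSECT)", "fly"), ("FLY (MOVE THROUGH AIR)", "fly"),
                  ("TREE", "tree"), ("WOOD", "wood")],
    "LanguageB": [("TREE", "kava"), ("WOOD", "kava"),
                  ("FLY (INSECT)", "soli"), ("FLY (MOVE THROUGH AIR)", "pala")],
}

# For every pair of concepts, count the number of languages in which
# both concepts are expressed by the same form.
edges = defaultdict(int)
for language, entries in wordlists.items():
    concepts_by_form = defaultdict(set)
    for concept, form in entries:
        concepts_by_form[form].add(concept)
    for concepts in concepts_by_form.values():
        for a, b in combinations(sorted(concepts), 2):
            edges[(a, b)] += 1

for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b}: colexified in {weight} language(s)")
```

The resulting weighted concept pairs are exactly the kind of cross-linguistic network that can then be queried for patterns like the FLY example above.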
His PhD, titled ‘Distributional Models of Semantic Change’, is supervised by Prof. Sabine Schulte im Walde.
Before entering the PhD programme in 2017, Dominik taught formal semantics at the Institute of Linguistics at the University of Stuttgart. He holds a BA in Linguistics and English as well as an MSc in Computational Linguistics from the same university.

Students were tasked with working through the different steps of dataset creation described in the study, e.g. data extraction, data mapping (to reference catalogs), and identification of sources. “Having people from outside of the core team use and test your tools is essential and helps tremendously in fine-tuning all processes,” says Christoph Rzymski.
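To make the data-mapping step mentioned above more concrete, here is a small hypothetical sketch of linking dataset-specific glosses to a reference catalog of concepts. The catalog entries, identifiers, and normalization rules are invented for illustration; they are not the Concepticon data or the project's actual tooling.

```python
# Hypothetical sketch of mapping raw glosses from a dataset to a reference
# catalog of concepts. Identifiers and entries are made up for illustration.
reference_catalog = {
    "hand": ("0001", "HAND"),
    "arm": ("0002", "ARM"),
    "water": ("0003", "WATER"),
}

raw_glosses = ["the hand", "Hand", "water (cold)", "arm or hand"]

def normalize(gloss: str) -> str:
    # Strip parentheticals, articles, and casing before the lookup.
    gloss = gloss.lower().split("(")[0].strip()
    for article in ("the ", "a "):
        if gloss.startswith(article):
            gloss = gloss[len(article):]
    return gloss

for gloss in raw_glosses:
    match = reference_catalog.get(normalize(gloss))
    if match:
        print(f"{gloss!r} -> {match[1]} (id {match[0]})")
    else:
        print(f"{gloss!r} -> unmapped, needs manual review")
```

Glosses that such simple normalization cannot resolve (like "arm or hand" above) are exactly the cases that benefit from the manual review and outside testing described in the quote.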
Distributional models of word embeddings (e.g., predictive models such as word2vec or co-occurrence count models such as PPMI) have become prevalent in NLP and related research. The use of these models has also become established in usage-based linguistics, ranging from studies of polysemy (Schütze 1998; Heylen et al. 2015) and language variation (Jenset et al. 2018) to semantic change (Dubossarsky et al. 2015; Perek 2015). In this workshop, which takes place in the context of the DH research seminar series at the University of Helsinki, we wish to foster discussion between NLP researchers, (digital) historians, and historical linguists. In that sense, it echoes the 2018 workshop on automatic detection of language change, co-located with SLTC.
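As a brief illustration of the count-based family of models mentioned above, the sketch below builds a toy co-occurrence table from a placeholder corpus and reweights it with positive pointwise mutual information (PPMI). The corpus and window size are arbitrary choices for illustration, not a recipe from any of the cited studies.

```python
# Toy count-based distributional model: symmetric-window co-occurrence
# counts reweighted with positive PMI (PPMI). Corpus is a placeholder.
import math
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat on the rug",
          "a cat and a dog played"]
window = 2

cooc = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooc[(target, tokens[j])] += 1

total = sum(cooc.values())

def ppmi(target: str, context: str) -> float:
    joint = cooc[(target, context)] / total
    if joint == 0:
        return 0.0
    p_target = sum(c for (t, _), c in cooc.items() if t == target) / total
    p_context = sum(c for (_, ctx), c in cooc.items() if ctx == context) / total
    return max(0.0, math.log2(joint / (p_target * p_context)))

print(ppmi("cat", "sat"), ppmi("cat", "rug"))
```

Each word's vector of PPMI scores over all contexts can then be compared with cosine similarity, just like the dense vectors produced by predictive models such as word2vec.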
With CLICS and its workflow now accessible to a wider audience, scholars can not only contribute directly to the database in the future; they can also profit from the established machinery and start their own targeted collections.

There are several different layers of annotation, and correspondingly several individual NLP components, many of which are trained on a single layer. We begin by describing several end-to-end systems we are building that incorporate these components, then describe the individual components themselves. Next we describe the lexical resources that inform the linguistic annotation, and then the individual layers of annotation and the different domains and genres they have been applied to.
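The layered setup described above can be pictured, very roughly, as a chain of per-layer components feeding into one another. The sketch below is purely hypothetical: the component names, interfaces, and placeholder logic are invented for illustration and do not represent the group's actual systems or models.

```python
# Hypothetical sketch of composing per-layer NLP components (each imagined
# as trained on a single annotation layer) into an end-to-end pipeline.
from typing import Callable, Dict, List

Annotation = Dict[str, object]

def tokenize(text: str) -> Annotation:
    return {"tokens": text.split()}

def pos_tag(ann: Annotation) -> Annotation:
    # Placeholder tagger; a real component would be a trained model.
    ann["pos"] = ["NOUN" if tok[0].isupper() else "X" for tok in ann["tokens"]]
    return ann

def label_roles(ann: Annotation) -> Annotation:
    # Placeholder semantic-role layer built on top of the earlier layers.
    ann["roles"] = [None] * len(ann["tokens"])
    return ann

PIPELINE: List[Callable[[Annotation], Annotation]] = [pos_tag, label_roles]

def annotate(text: str) -> Annotation:
    ann = tokenize(text)
    for component in PIPELINE:
        ann = component(ann)
    return ann

print(annotate("Mary flew to Boulder"))
```

An end-to-end system in this picture is simply a particular ordering of such components, with the lexical resources and annotated layers mentioned above supplying their training data.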
“In this study, CLICS was used to study differences in the lexical coding of emotion in languages around the world, but the potential of the database is not limited to emotion concepts. Many more interesting questions can be tackled in the future,” says Johann-Mattis List. These studies might seem of little relevance for linguists who want to use word embeddings as off-the-shelf models to study linguistic phenomena. However, contrary to this naïve impression, it has been shown that these deficiencies can drastically bias the analysis of the linguistic phenomena under study and, as a consequence, lead researchers to unsound conclusions (Dubossarsky, Grossman & Weinshall 2017).
In this talk I will give an overview of the work done in the computational detection of semantic change over the past decade. I will present both lexical replacements and semantic change, and the impact these have on research in, e.g., the digital humanities. I will talk about the challenges of detecting as well as evaluating lexical semantic change, and about our new project connecting computational work with high-quality studies in historical linguistics.

Dr Haim Dubossarsky completed his PhD at the Hebrew University of Jerusalem under the supervision of Prof. Daphna Weinshall (CS department) and Dr Eitan Grossman (Linguistics department). Though he obtained training in psycholinguistics and computational neuroscience, Dr Dubossarsky devoted his doctoral training to the study of computational linguistics, and in particular to the field of semantic change. Building on his multidisciplinary skills, his work made both scientific and methodological contributions to the field, which were published in top-tier venues.
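One widely used approach to the detection task described in the abstract above is to train embeddings on corpora from two time periods, align the two spaces, and rank words by how far their vectors move. The sketch below illustrates that idea with orthogonal Procrustes alignment and cosine distance; the random matrices stand in for embeddings trained on real diachronic corpora, and this is not a reconstruction of Dr Dubossarsky's specific method.

```python
# Sketch of period-to-period semantic change detection: align two embedding
# spaces with orthogonal Procrustes, then rank words by cosine distance.
# Random matrices stand in for embeddings trained on real diachronic corpora.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["gay", "broadcast", "cell", "mouse"]
emb_t1 = rng.normal(size=(len(vocab), 50))  # placeholder: period 1 vectors
emb_t2 = rng.normal(size=(len(vocab), 50))  # placeholder: period 2 vectors

# Orthogonal Procrustes: rotation W minimizing ||emb_t1 @ W - emb_t2||_F.
u, _, vt = np.linalg.svd(emb_t1.T @ emb_t2)
aligned_t1 = emb_t1 @ (u @ vt)

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

scores = {w: cosine_distance(aligned_t1[i], emb_t2[i]) for i, w in enumerate(vocab)}
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {score:.3f}")
```

Deciding whether a high score reflects genuine change rather than noise is precisely the evaluation problem the talk highlights.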
“The number of linguists who actively use our standards and workflows is constantly increasing. We hope that the release of this new version of CLICS will propagate them further,” says Simon Greenhill. With detailed computer-assisted workflows, CLICS facilitates the standardization of linguistic datasets and provides solutions to many of the persistent challenges in linguistic research. “While data aggregation was generally based on ad-hoc procedures in the past, our new workflows and guidelines for best practice are an important step to guarantee the reproducibility of linguistic research,” says Tiago Tresoldi. The effectiveness of the workflow developed for CLICS has been tested and confirmed in various validation experiments involving a wide range of scholars and students. Two different student tasks were conducted, resulting in the creation of new datasets and the progressive improvement of the existing data.
She obtained a Ph.D. in Computer Science from the L3S Research Center at the University of Hanover, Germany, in 2013. Her main research interest lies in the automatic detection of diachronic language change, in particular word sense change, but she is interested in information extraction and change detection in general.

Our studies focus on the role that word frequency and sampling play in the accuracy of word embeddings, and show how these two factors have far-reaching consequences for the study of semantic change (Dubossarsky, Grossman & Weinshall 2017) and polysemy research (Dubossarsky, Grossman & Weinshall 2018). In addition to both empirical and theoretical analyses, we propose a general method that allows the continued use of word embeddings by mitigating their deficiencies through carefully crafted control conditions.
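The control-condition idea mentioned above can be sketched roughly as follows: re-measure 'change' on data shuffled across time bins, so that whatever signal remains reflects frequency and sampling artifacts rather than genuine change, and use it as a baseline. The scoring function and the tiny corpus below are placeholders for illustration, not the published method.

```python
# Sketch of a shuffled-corpus control condition for semantic change scores.
# change_score is a placeholder; a real study would compare embeddings.
import random

def change_score(docs_t1, docs_t2, word):
    # Placeholder measure: difference in the word's relative frequency.
    def rel_freq(docs):
        tokens = [tok for doc in docs for tok in doc.split()]
        return tokens.count(word) / max(len(tokens), 1)
    return abs(rel_freq(docs_t1) - rel_freq(docs_t2))

docs_t1 = ["the gay nineties were lively", "a gay tune played"]
docs_t2 = ["gay rights marches grew", "the gay community organized"]
observed = change_score(docs_t1, docs_t2, "gay")

# Control: shuffle documents across the two bins and re-measure repeatedly.
pooled = docs_t1 + docs_t2
baseline = []
for _ in range(1000):
    random.shuffle(pooled)
    baseline.append(change_score(pooled[:2], pooled[2:], "gay"))

noise_level = sum(baseline) / len(baseline)
print(f"observed change {observed:.3f} vs. shuffled baseline {noise_level:.3f}")
```

Only scores that clearly exceed the shuffled baseline would then be treated as evidence of change rather than as an artifact of frequency or sampling.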
A particular focus of these talks will be the need for proper evaluation frameworks for the study of semantic change. The ability of CLICS to provide new evidence to address cutting-edge questions in psychology and cognition has already been illustrated in a recent study published in Science, which concentrated on the world-wide coding of emotion concepts. The study compared colexification networks of words for emotion concepts from a global sample of languages, and revealed that the meanings of emotions vary greatly across language families.