
Code detailing the analyses from the CogSci paper "Words with consistent diachronic usage patterns are learned earlier. A computational analysis using temporally aligned word embeddings"

GiovanniCassani/semanticShift_AoA


Semantic Shift & Age of Acquisition

This repository contains the code for the analyses reported in the Cognitive Science paper 'Words with Consistent Diachronic Usage Patterns are Learned Earlier: A Computational Analysis Using Temporally Aligned Word Embeddings'.

Reference

@article{https://doi.org/10.1111/cogs.12963,
author = {Cassani, Giovanni and Bianchi, Federico and Marelli, Marco},
title = {Words with Consistent Diachronic Usage Patterns are Learned Earlier: A Computational Analysis Using Temporally Aligned Word Embeddings},
journal = {Cognitive Science},
volume = {45},
number = {4},
pages = {e12963},
keywords = {Age of acquisition, Language change, Temporally aligned word embeddings, Computational psycholinguistics},
doi = {https://doi.org/10.1111/cogs.12963},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/cogs.12963},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/cogs.12963},
year = {2021}
}

Content

The project is organized into three folders. The folder data/ provides all the necessary datasets. The folder src/ contains Python code that reads the relevant resources and computes OLD20 for the target words, as well as R code that preprocesses the raw data into input files for the statistical analyses, runs the linear models, and generates the plots included in the paper. The folder measures/ contains the code to compute the semantic change measures.
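For readers unfamiliar with OLD20, the following is a minimal sketch of the measure as it is commonly defined: the mean Levenshtein distance between a target word and its 20 closest orthographic neighbours in a reference lexicon. It is not the code from src/; the function names `levenshtein` and `old_n` and the toy lexicon are illustrative assumptions.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def old_n(target: str, lexicon: Iterable[str], n: int = 20) -> float:
    """Mean edit distance from `target` to its n closest words in `lexicon`.

    Hypothetical helper: the target itself is excluded, and if the lexicon
    holds fewer than n other words, all of them are used.
    """
    distances = sorted(levenshtein(target, word)
                       for word in set(lexicon) if word != target)
    nearest = distances[:n]
    if not nearest:
        raise ValueError("lexicon has no words other than the target")
    return sum(nearest) / len(nearest)


# Toy lexicon, for illustration only.
toy_lexicon = ["cat", "bat", "hat", "cart", "dog"]
print(old_n("cat", toy_lexicon, n=2))
```

With a real lexicon, `n` would stay at its default of 20; lower values of the measure indicate a denser orthographic neighbourhood.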

Temporal Embeddings with A Compass

To create the aligned embeddings, you first need to obtain the COHA corpus. The TWEC embedding alignment algorithm can then be used to align the slices. It is enough to split the COHA data into 5 sets: 1800-1840, 1840-1880, 1880-1920, 1920-1960, 1960-2000. You should pre-process the text yourself before using TWEC (we used spaCy to do this).
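As an illustration of the slicing step, the sketch below assigns documents to the five periods by publication year. It rests on assumptions that are not stated in this README: the boundaries are treated as half-open intervals (so 1840 falls in 1840-1880), documents are supplied as `(year, text)` pairs, and the names `SLICE_BOUNDS`, `slice_label` and `group_by_slice` are hypothetical. Reading the COHA files and the spaCy pre-processing are left out.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# The five periods listed above, read as [start, end) intervals.
# Half-open boundaries are an assumption made for this sketch.
SLICE_BOUNDS: List[Tuple[int, int]] = [
    (1800, 1840), (1840, 1880), (1880, 1920), (1920, 1960), (1960, 2000),
]


def slice_label(year: int) -> Optional[str]:
    """Return the label of the slice containing `year`, or None if outside."""
    for start, end in SLICE_BOUNDS:
        if start <= year < end:
            return f"{start}-{end}"
    return None


def group_by_slice(documents: Iterable[Tuple[int, str]]) -> Dict[str, List[str]]:
    """Group (year, text) pairs by slice; out-of-range years are skipped."""
    grouped: Dict[str, List[str]] = {f"{s}-{e}": [] for s, e in SLICE_BOUNDS}
    for year, text in documents:
        label = slice_label(year)
        if label is not None:
            grouped[label].append(text)
    return grouped


# Toy input, for illustration only.
docs = [(1810, "first text"), (1840, "second text"), (1965, "third text")]
for label, texts in group_by_slice(docs).items():
    print(label, len(texts))
```

Each group could then be written to its own text file, one file per slice, to serve as input for the alignment tool.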

Follow the instructions in the TWEC repository to install the tool.

Requirements

Authors

  • Giovanni Cassani, Tilburg University
  • Federico Bianchi, Bocconi University
  • Marco Marelli, University of Milano-Bicocca
