Geometry of Culture #13

Open
chengjun opened this issue Aug 13, 2020 · 2 comments

@chengjun

https://github.com/KnowledgeLab/GeometryofCulture

Geometry of Culture
Code and data associated with the American Sociological Review (ASR) paper on the Geometry of Culture. The full paper is available here: https://journals.sagepub.com/doi/full/10.1177/0003122419877135

Data
Word Embedding Models

We provide two pre-trained word embedding models that are used in our analyses, and link to the publicly available GloVe embeddings.
Google News embedding: https://www.dropbox.com/s/5m9s5326off2lcg/google_news_embedding.zip?dl=0
Google Ngrams US, 2000-12: https://www.dropbox.com/s/v823bz2hbalobhs/google_us_ngrams_embedding.zip?dl=0
GloVe embedding: https://nlp.stanford.edu/projects/glove/

Google Ngrams Raw Text

For our historical analyses and contemporary validation, we train embedding models on the full Google Ngrams US corpus for particular time periods. The Google Ngrams US corpus is publicly available for download and is hosted here: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
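
Restricting the corpus to a particular time period can be sketched as follows. This is a minimal illustration, not the logic of w2v_train_model.py, which is not shown here. It assumes each record has four tab-separated fields (ngram, year, match_count, volume_count), which is worth checking against the dataset page. The helper names `parse_ngram_line` and `ngrams_in_period` and the sample records are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_ngram_line(line: str) -> Tuple[List[str], int, int]:
    """Split one record into (tokens, year, match_count).

    Assumed layout: ngram <TAB> year <TAB> match_count <TAB> volume_count.
    """
    ngram, year, match_count, _volume_count = line.rstrip("\n").split("\t")
    return ngram.split(" "), int(year), int(match_count)


def ngrams_in_period(
    lines: Iterable[str], start_year: int, end_year: int
) -> Iterator[Tuple[List[str], int]]:
    """Yield (tokens, match_count) for records whose year is in the period."""
    for line in lines:
        tokens, year, count = parse_ngram_line(line)
        if start_year <= year <= end_year:
            yield tokens, count


# Made-up records, for illustration only.
sample = [
    "the cost of living is\t1999\t12\t9\n",
    "the cost of living is\t2003\t15\t11\n",
    "a member of the club\t2012\t4\t4\n",
    "a member of the club\t2013\t6\t5\n",
]

selected = list(ngrams_in_period(sample, 2000, 2012))
for tokens, count in selected:
    print(tokens, count)
```

Each yielded token list can be treated as one short "sentence" by an embedding trainer. Whether and how to use the match counts (for example, to repeat or weight n-grams) is a separate design choice that this sketch leaves open.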

Survey of Cultural Associations

We also provide results from the Mechanical Turk survey of cultural associations. Data files include mean associations on the race, class, and gender dimensions for 59 terms. We provide files with and without poststratification weights. These files are hosted here on GitHub in the "survey_data" directory. Details of the survey can be found in Appendix A of the article.
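
The difference between the weighted and unweighted files comes down to how respondent ratings are averaged. The sketch below shows a generic weighted mean. It is an assumption-laden illustration: the actual weighting scheme is described in Appendix A of the article, and the function name `weighted_mean`, the ratings, and the weights are all made up.

```python
from typing import Sequence


def weighted_mean(ratings: Sequence[float], weights: Sequence[float]) -> float:
    """Return sum(w_i * x_i) / sum(w_i) for respondent ratings x_i."""
    if len(ratings) != len(weights):
        raise ValueError("ratings and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(r * w for r, w in zip(ratings, weights)) / total


# Hypothetical ratings of one term by three respondents, with made-up
# poststratification weights (larger for under-represented groups).
ratings = [20.0, 40.0, 90.0]
weights = [0.5, 1.0, 1.5]

unweighted = weighted_mean(ratings, [1.0] * len(ratings))
weighted = weighted_mean(ratings, weights)
print(unweighted, weighted)
```

With equal weights the function reduces to the ordinary mean, so the two data files differ only when the sample composition departs from the target population.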

Code
We provide scripts to assist in training embeddings and building "cultural dimensions" according to the method described in the paper. Scripts for complete replication are forthcoming.
w2v_train_model.py trains an embedding model on raw text. It is set up to read 5-grams, but could be adapted with minor changes to read sentences of natural language.
build_cultural_dimensions.R loads the pre-trained models listed above, builds cultural dimensions from the antonym pairs in the accompanying CSV files, and validates the correspondence between survey estimates and embedding projections.
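
The dimension-building step can be sketched in a few lines of plain Python. This loosely mirrors the R script's approach (normalize each word vector, take the difference within each antonym pair, average across pairs, renormalize, then project by dot product) but is not the authors' code. The two-dimensional toy vectors and the word pairs are invented for illustration, and pairs with a missing word are simply skipped.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def normalize(v: Sequence[float]) -> Vector:
    n = norm(v)
    return [x / n for x in v]


def dimension(a: Sequence[float], b: Sequence[float]) -> Vector:
    """Unit-length difference between two unit-length word vectors."""
    return normalize([x - y for x, y in zip(normalize(a), normalize(b))])


def make_dim(embedding: Dict[str, Vector], pairs: Sequence[Tuple[str, str]]) -> Vector:
    """Average the pair-level dimensions, skipping pairs with missing words."""
    dims = [
        dimension(embedding[w1], embedding[w2])
        for w1, w2 in pairs
        if w1 in embedding and w2 in embedding
    ]
    if not dims:
        raise ValueError("no antonym pair was found in the embedding")
    mean = [sum(col) / len(dims) for col in zip(*dims)]
    return normalize(mean)


def project(embedding: Dict[str, Vector], word: str, dim: Sequence[float]) -> float:
    """Cosine similarity between a word vector and a unit-length dimension."""
    return sum(x * y for x, y in zip(normalize(embedding[word]), dim))


# Toy 2-d "embedding"; real models in this project have 300 dimensions.
toy = {
    "rich": [1.0, 0.2], "poor": [-1.0, 0.2],
    "affluent": [0.9, 0.1], "destitute": [-0.8, 0.3],
    "golf": [0.5, 0.5], "boxing": [-0.5, 0.5],
}
# The last pair is absent from the toy embedding and is skipped.
pairs = [("rich", "poor"), ("affluent", "destitute"), ("priceless", "worthless")]

aff_dim = make_dim(toy, pairs)
print(project(toy, "golf", aff_dim), project(toy, "boxing", aff_dim))
```

Averaging over several antonym pairs is what makes the resulting dimension less sensitive to the idiosyncrasies of any single pair. In this toy setup, a positive projection means a word sits closer to the first word of each pair.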

@chengjun

library("ggplot2")
#memory.limit() applies only to Windows and has no effect in R 4.2.0 or later#
memory.limit(1500000)


survey<-read.csv(file="/DIRECTORY/survey_dataset_means_weighted.csv",header=TRUE,row.names=1)
df<-read.csv(file="/DIRECTORY/GoogleNews_Embedding.csv", header=TRUE,row.names=1, sep=",")
#df<-read.csv(file="/DIRECTORY/US_Ngrams_2000_12.csv", header=TRUE,row.names=1, sep=",")

#####DEFINE FUNCTIONS##########
#Calculate norm of vector#
norm_vec <- function(x) sqrt(sum(x^2))

#Dot product#
dot <- function(x,y) (sum(x*y))

#Cosine Similarity (this definition masks base::cos for the rest of the script)#
cos <- function(x,y) dot(x,y)/norm_vec(x)/norm_vec(y)

#Normalize vector#
nrm <- function(x) x/norm_vec(x)

#Calculate semantic dimension from antonym pair#
dimension<-function(x,y) nrm(nrm(x)-nrm(y))

###STORE EMBEDDING AS MATRIX, NORMALIZE WORD VECTORS###
cdfm<-as.matrix(data.frame(df))
cdfmn<-t(apply(cdfm,1,nrm))


###IMPORT LISTS OF TERMS TO PROJECT AND ANTONYM PAIRS#####
ant_pairs_aff <- read.csv("DIR/affluence_pairs.csv",header=FALSE, stringsAsFactors=FALSE)
ant_pairs_gen <- read.csv("DIR/gender_pairs.csv",header=FALSE, stringsAsFactors=FALSE)
ant_pairs_race <- read.csv("DIR/race_pairs.csv",header=FALSE, stringsAsFactors=FALSE)


# ##SETUP "make_dim" FUNCTION, INPUT EMBEDDING AND ANTONYM PAIR LIST#######
# ##OUTPUT AVERAGE SEMANTIC DIMENSION###

make_dim<-function(embedding,pairs){
  #One row per antonym pair; a row stays NA when a word is missing from the embedding#
  word_dims<-data.frame(matrix(NA,nrow(pairs),ncol(embedding)))
  for (j in seq_len(nrow(pairs))){
    rp_word1<-pairs[j,1]
    rp_word2<-pairs[j,2]
    tryCatch(word_dims[j,]<-dimension(embedding[rp_word1,],embedding[rp_word2,]),error=function(e){})
  }
  #Average the pair-level dimensions, skipping NA rows, then renormalize#
  dim_ave<-colMeans(word_dims, na.rm = TRUE)
  dim_ave_n<-nrm(dim_ave)
  return(dim_ave_n)
}


#####CONSTRUCT AFFLUENCE, GENDER, AND RACE DIMENSIONS######
aff_dim<-make_dim(df,ant_pairs_aff)
gender_dim<-make_dim(df,ant_pairs_gen)
race_dim<-make_dim(df,ant_pairs_race)


####COSINE SIMILARITY (COSINE OF THE ANGLE) BETWEEN DIMENSIONS#######
cos(aff_dim,gender_dim)
cos(aff_dim,race_dim)
cos(gender_dim,race_dim)


####CALCULATE PROJECTIONS BY MATRIX MULTIPLICATION####
###(Equivalent to cosine similarity because vectors are normalized)###
aff_proj<-cdfmn%*%aff_dim
gender_proj<-cdfmn%*%gender_dim
race_proj<-cdfmn%*%race_dim

projections_df<-cbind(aff_proj, gender_proj, race_proj)
colnames(projections_df)<-c("aff_proj","gender_proj","race_proj")


####MERGE WITH SURVEY AND CALCULATE CORRELATION####
projections_sub<-subset(projections_df, rownames(projections_df) %in% rownames(survey))
colnames(projections_sub)<-c("aff_proj","gender_proj","race_proj")
survey_proj<-merge(survey,projections_sub,by=0)


cor(survey_proj$survey_class,survey_proj$aff_proj)
cor(survey_proj$survey_gender,survey_proj$gender_proj)
cor(survey_proj$survey_race,survey_proj$race_proj)

# #######################################################################


###CREATE VISUALIZATION###
wlist<-c("camping","baseball","boxing","volleyball","softball","golf","tennis","soccer","basketball","hockey")
#Keep "+" at the end of the line; a line starting with "+" is parsed as a separate statement#
Visualization<-ggplot(data=data.frame(projections_df[wlist,]),
                      aes(x=gender_proj,
                          y=aff_proj,
                          label=wlist)) +
                    geom_text()
Visualization + theme_bw() + ylim(-.25,.25) + xlim(-.25,.25)

# #######################################################################

@chengjun

[Three images attached]
