Abstract:
Cross-domain recommendation (CDR) has recently emerged as an effective way to alleviate the cold-start and sparsity issues faced by recommender systems, by transferring information from an auxiliary domain to a target domain to improve recommendations. Studying the similarity between domains is a novel direction in CDR research, potentially opening doors for further exploration. In this context, we introduce a systematic approach to quantify similarity between a pair of domains and explore how current CDR methods perform with both similar and dissimilar domain combinations. We achieve this by presenting two original similarity metrics. Our extensive empirical evaluation on different domain combinations demonstrates that the state-of-the-art CDR algorithms do not perform significantly better when using source domains that are more similar to the target domain, compared to those that are less similar. Importantly, we find that no matter how similarity is measured, it does not correlate with the recommendation performance of the state-of-the-art algorithms.
This is the code repository for this research project. It contains the source code of the two similarity metrics presented in the paper (Embedding-based Domain Similarity & Inter-domain Item Similarity). There are two folders: Full_Project_Code and Custom_Project_Code. The Full_Project_Code folder contains the code and the data we used to run the experiments from our paper. The Custom_Project_Code folder lets users apply our similarity metrics to their own source and target domain data.
The data that we used to run the experiments is in the Full_Project_Code directory. Its contents are:
- GloVe_File: This directory contains the pre-trained GloVe embeddings file that we use to retrieve embeddings for tags.
- dataframes: This directory contains the dataframes for every domain that we used in the study.
- domain_embeddings: This directory contains domain embeddings for each domain, which were created by running create_domain_embeddings.py.
- domain_embedding_similarity_results: This directory contains the similarity values between different domain combinations across three datasets using the Embedding-based Domain Similarity method.
- pairwise_similarities: This directory contains the similarity values between different domain combinations across three datasets using the Inter-domain Item Similarity method.
- create_domain_embeddings.py: This file creates the domain embeddings for each domain based on the dataframes for each domain.
- domain_embedding_similarities.py: This file computes the similarity between domain embeddings using the Embedding-based Domain Similarity method, and writes the results to the domain_embedding_similarity_results directory.
- pairwise_similarities.py: This file computes the similarity between domains using the Inter-domain Item Similarity method, and writes the results to the pairwise_similarities directory.
- utils.py: This file contains helper functions that are used throughout the Python files in this project repository.
- Enter the folder that contains the code and data we used for experimentation:
cd Full_Project_Code
- To retrieve the similarity values between domains using the Embedding-based Domain Similarity method, run the command below:
python3 domain_embedding_similarities.py
- To retrieve the similarity values between domains using the Inter-domain Item Similarity method, run the command below:
python3 pairwise_similarities.py
To run the similarity metrics on your own data, use the Custom_Project_Code directory. Its contents are:
- dataframes: This directory should contain the dataframes for your source and target domains. Make sure each dataframe has only two columns (item_id & tags). The item_id column should contain integers, and the tags column should contain strings with tags separated by commas. Convert your dataframes into pickle files using pandas.DataFrame.to_pickle() and place the pickle files in this directory. The names of the dataframes must be source_domain_df and target_domain_df.
- domain_embeddings: This directory will contain domain embeddings for your source and target domains, which are created by running custom_domain_embeddings.py.
- domain_embedding_similarity_results: This directory will contain the similarity values between your source and target domains using the Embedding-based Domain Similarity method.
- pairwise_similarities: This directory will contain the similarity values between your source and target domains using the Inter-domain Item Similarity method.
- custom_domain_embeddings.py: This file creates the domain embeddings for your source and target domains based on the dataframes in the dataframes directory.
- custom_domain_embedding_similarities.py: This file computes the similarity between your domain embeddings using the Embedding-based Domain Similarity method, and writes the results to the domain_embedding_similarity_results directory.
- custom_pairwise_similarities.py: This file computes the similarity between your domains using the Inter-domain Item Similarity method, and writes the results to the pairwise_similarities directory.
- custom_utils.py: This file contains the helper functions that are used throughout the Python files in this project repository, plus extra functions for handling your custom data.
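The dataframe format described for the dataframes directory can be produced with a few lines of pandas. The sketch below is only an illustration: the function name save_domain_dataframes and the example rows are hypothetical, and the ".pkl" suffix is an assumption, since the README only states the names source_domain_df and target_domain_df. Check custom_utils.py for the exact file names the scripts load.

```python
from pathlib import Path

import pandas as pd


def save_domain_dataframes(source_rows, target_rows, out_dir="dataframes", suffix=".pkl"):
    """Build the two (item_id, tags) dataframes and pickle them.

    Hypothetical helper, not part of the repository. `suffix` is an
    assumption; adjust it to whatever file names the scripts expect.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, rows in (("source_domain_df", source_rows),
                       ("target_domain_df", target_rows)):
        df = pd.DataFrame(rows, columns=["item_id", "tags"])
        df["item_id"] = df["item_id"].astype(int)  # item_id must be an integer
        df["tags"] = df["tags"].astype(str)        # tags: comma-separated string
        path = out / f"{name}{suffix}"
        df.to_pickle(path)
        paths[name] = path
    return paths


if __name__ == "__main__":
    # Made-up example rows: (item_id, "tag1, tag2, ...")
    source = [(1, "action, sci-fi"), (2, "drama, romance")]
    target = [(10, "rock, indie"), (11, "jazz")]
    print(save_domain_dataframes(source, target))
```

Run from inside Custom_Project_Code, this writes both pickle files into the dataframes directory, after which the custom scripts can be executed.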
- Enter the folder for running the metrics on your own data:
cd Custom_Project_Code
- Create two dataframes called source_domain_df and target_domain_df that have two columns (item_id & tags).
- Convert your dataframes into pickle files, and add them to the dataframes directory.
- Create domain embeddings for both your source and target domains:
python3 custom_domain_embeddings.py
- To retrieve the similarity values between domains using the Embedding-based Domain Similarity method, run the command below:
python3 custom_domain_embedding_similarities.py
- Enter the folder for running the metrics on your own data:
cd Custom_Project_Code
- Create two dataframes called source_domain_df and target_domain_df that have two columns (item_id & tags).
- Convert your dataframes into pickle files, and add them to the dataframes directory.
- To retrieve the similarity values between domains using the Inter-domain Item Similarity method, run the command below:
python3 custom_pairwise_similarities.py
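Before running either custom script, it can help to confirm that a dataframe matches the required format (exactly two columns, integer item_id, comma-separated string tags). The following is a minimal sketch under that reading of the requirements; check_domain_dataframe and split_tags are hypothetical helpers, not functions from custom_utils.py.

```python
import pandas as pd


def check_domain_dataframe(df):
    """Return a list of format problems; an empty list means the frame looks valid.

    Hypothetical helper, not part of the repository.
    """
    problems = []
    if len(df.columns) != 2 or set(df.columns) != {"item_id", "tags"}:
        problems.append(f"expected exactly the columns item_id and tags, got {list(df.columns)}")
        return problems
    if not pd.api.types.is_integer_dtype(df["item_id"]):
        problems.append("item_id must be an integer column")
    if not df["tags"].map(lambda t: isinstance(t, str)).all():
        problems.append("every tags value must be a string")
    return problems


def split_tags(tag_string):
    """Split a comma-separated tag string into a list, dropping empty entries."""
    return [t.strip() for t in tag_string.split(",") if t.strip()]
```

For example, calling check_domain_dataframe(pd.read_pickle(path)) on each pickle file and fixing any reported problems first avoids failures partway through a run.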