forked from almightyGOSU/ANR
___notes___preprocessing_part_1.txt
cd to the 'preprocessing' folder
Run the following command:
python3 preprocessing_simple.py -d amazon_instant_video -dev_test_in_train 1
NOTE:
The input reviews/ratings file must already be in the 'datasets' folder, e.g. amazon_instant_video.json
The file name without the .json extension is the 'dataset' passed to the -d argument
After the script finishes, there should be 8 files in the corresponding output folder, e.g. datasets/amazon_instant_video/
The log file:
(1) E.g. datasets/amazon_instant_video/amazon_instant_video___preprocessing_log.txt
Files required to run the model:
(2) E.g. datasets/amazon_instant_video/amazon_instant_video_env.pkl
(3) E.g. datasets/amazon_instant_video/amazon_instant_video_info.pkl
(4) E.g. datasets/amazon_instant_video/amazon_instant_video_split_train.pkl
(5) E.g. datasets/amazon_instant_video/amazon_instant_video_split_dev.pkl
(6) E.g. datasets/amazon_instant_video/amazon_instant_video_split_test.pkl
(7) E.g. datasets/amazon_instant_video/amazon_instant_video_uid_userDoc.npy
(8) E.g. datasets/amazon_instant_video/amazon_instant_video_iid_itemDoc.npy
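A quick way to confirm that all 8 files listed above were created is sketched below. The helper names (expected_files, missing_files) are hypothetical and not part of the repository; the suffixes are copied from the file list above, and the 'datasets' root is assumed to be relative to the current working directory.

```python
from pathlib import Path

# Suffixes taken from the 8 output files listed in these notes.
SUFFIXES = [
    "___preprocessing_log.txt",
    "_env.pkl",
    "_info.pkl",
    "_split_train.pkl",
    "_split_dev.pkl",
    "_split_test.pkl",
    "_uid_userDoc.npy",
    "_iid_itemDoc.npy",
]


def expected_files(dataset, root="datasets"):
    """Return the 8 paths the preprocessing step is expected to produce."""
    folder = Path(root) / dataset
    return [folder / f"{dataset}{suffix}" for suffix in SUFFIXES]


def missing_files(dataset, root="datasets"):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_files(dataset, root) if not p.is_file()]


paths = expected_files("amazon_instant_video")
for p in missing_files("amazon_instant_video"):
    print("missing:", p)
```

If nothing is printed, all 8 files are present. Adjust the root argument if the script is run from a different folder.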
Purpose of files:
(2) contains various mappings, e.g. word to wid, user to uid, etc
(3) contains various statistical information, e.g. number of users, number of items, number of data samples for train/dev/test
(4), (5), (6) contain data samples, in the form of (uid, iid, rating), for train, dev, and test, respectively
(7) is a (num_users x max_doc_len) numpy matrix
- num_users: Number of users, and this depends on the dataset
- max_doc_len: Maximum length of each user document, i.e. 500 in our experiments
- Each entry in this matrix is a wid
(8) is a (num_items x max_doc_len) numpy matrix
- num_items: Number of items, and this depends on the dataset
- max_doc_len: Maximum length of each item document, i.e. 500 in our experiments
- Each entry in this matrix is a wid
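To make the layout of (7) and (8) concrete, here is a minimal sketch of how one document could become one fixed-length row of wids. This is an illustration only, not the code in preprocessing_simple.py: the function doc_to_row, the padding id 0, and the unknown-word id 1 are assumptions, and the real word-to-wid mapping is the one stored in (2).

```python
# Hypothetical ids for illustration; the real values come from the *_env.pkl mappings.
PAD_WID = 0  # assumed id used to pad short documents
UNK_WID = 1  # assumed id for words missing from the vocabulary


def doc_to_row(words, word_to_wid, max_doc_len=500):
    """Map a tokenized document to a list of exactly max_doc_len wids."""
    # Truncate long documents, then look up each word's wid.
    wids = [word_to_wid.get(w, UNK_WID) for w in words[:max_doc_len]]
    # Pad short documents so every row has the same length.
    wids += [PAD_WID] * (max_doc_len - len(wids))
    return wids


word_to_wid = {"great": 2, "show": 3}  # toy vocabulary
row = doc_to_row(["great", "show", "overall"], word_to_wid, max_doc_len=5)
print(row)  # [2, 3, 1, 0, 0]
```

Stacking one such row per user gives a (num_users x max_doc_len) matrix, and one row per item gives a (num_items x max_doc_len) matrix.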