This is a document clustering project.
In the abstract, the input to this project is a set of research papers and the number of clusters.
The output is the set of clusters, each listing the names of the research papers assigned to it.
Explanation of the scripts:
1) pdf2text.py converts a research paper from PDF format to txt format.
2) convertAllPdf2Text.sh takes as input a directory containing all the research papers in PDF format,
converts them to txt format, and stores the resulting txt files in a folder called 'TextFiles'
in the same directory.
3) tidf.py takes as input the name of the directory containing the text files and the number of clusters.
Its output, the names of the research papers in each cluster, is printed to standard output.
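This README does not describe how tidf.py groups the papers. The sketch below is one plausible reading, assuming (from the script name) that each text file is turned into a TF-IDF vector and that the vectors are then grouped with a k-means style loop using cosine similarity. The function names (tokenize, tfidf_vectors, cluster_papers) and the farthest-first choice of starting centroids are illustrative assumptions, not taken from the project.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse, L2-normalised TF-IDF vector (a dict) per text."""
    counts = [Counter(tokenize(t)) for t in texts]
    n_docs = len(texts)
    # Document frequency: in how many texts each term appears.
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        vec = {
            term: (freq / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, freq in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def mean_vector(vectors):
    """Normalised sum of sparse vectors, used as the cluster centroid."""
    total = Counter()
    for v in vectors:
        for term, w in v.items():
            total[term] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {term: w / norm for term, w in total.items()}


def cluster_papers(texts_by_name, k, max_iterations=20):
    """Group paper names into k clusters; returns a list of name lists."""
    names = list(texts_by_name)
    vectors = tfidf_vectors([texts_by_name[n] for n in names])
    # Deterministic farthest-first start: begin with the first paper, then
    # repeatedly add the paper least similar to the centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    clusters = [[] for _ in range(k)]
    for name, lab in zip(names, labels):
        clusters[lab].append(name)
    return clusters
```

To mirror the described workflow, texts_by_name would be filled by reading every txt file in the 'TextFiles' folder, keyed by file name, and each returned list would be printed as one cluster.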
Order in which to run the code:
1) Run convertAllPdf2Text.sh on the directory containing the PDFs.
2) Run tidf.py with the text file directory and the number of clusters as arguments.
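The two steps above can be chained from a small driver script. The sketch below is only an illustration: the argument order for both scripts, the use of bash to run the shell script, and the helper names build_commands and run_pipeline are assumptions, since this README does not spell out the exact command lines.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(pdf_dir, num_clusters):
    """Build the two command lines, in the order they should be run."""
    # The conversion step is described as writing into 'TextFiles'
    # inside the directory that holds the PDFs.
    text_dir = str(Path(pdf_dir) / "TextFiles")
    return [
        # Assumed invocation: the PDF directory is the only argument.
        ["bash", "convertAllPdf2Text.sh", str(pdf_dir)],
        # Assumed argument order: text directory first, then cluster count.
        [sys.executable, "tidf.py", text_dir, str(num_clusters)],
    ]


def run_pipeline(pdf_dir, num_clusters):
    """Run both steps, stopping if either one exits with an error."""
    for command in build_commands(pdf_dir, num_clusters):
        subprocess.run(command, check=True)
```

For example, run_pipeline("papers", 5) would convert every PDF under a hypothetical papers directory and then request five clusters.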
The results for different numbers of clusters can be found in the Results directory.