This GitHub repository contains code for preprocessing text data from PDF and DOCX files for use with GPT-3. It includes steps such as tokenization, removal of stop words and punctuation, and formatting for GPT-3 input.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
You will need to have Python and pip installed on your machine. You will also need to install the following Python packages:
- PyPDF2
- python-docx
- NLTK
- Pandas
You can install these packages by running the following command:
```
pip install pypdf2 python-docx nltk pandas
```
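The code relies on NLTK's tokenizer and English stop word list (see below), so you will most likely also need to download the corresponding NLTK data once. This is standard NLTK setup rather than anything specific to this project:

```python
import nltk

# One-time download of the tokenizer models and English stop word list used by NLTK.
nltk.download("punkt")
nltk.download("stopwords")
```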
The code in this repository is meant to be used in a Jupyter notebook or any Python environment.
- To preprocess data from a PDF file, use the `pdf_preprocessing.ipynb` notebook or the `pdf_preprocessing.py` script.
- To preprocess data from a DOCX file, use the `docx_preprocessing.ipynb` notebook or the `docx_preprocessing.py` script.
- Both notebooks contain detailed comments explaining how to use the code and how it works.
If you have any suggestions for improvements or find any bugs, please feel free to submit a pull request or open an issue.
This code uses the PyPDF2 and python-docx libraries to read and extract text from PDF and DOCX files respectively, and the NLTK library for tokenization and stop word removal.
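As a rough sketch of that pipeline (not the exact code from the notebooks; the file name, PyPDF2 version, and output handling are assumptions), the PDF path might look like this:

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from PyPDF2 import PdfReader  # assumes PyPDF2 >= 3.0; older versions use PdfFileReader

# Extract raw text from a PDF (placeholder file name).
reader = PdfReader("example.pdf")
text = " ".join(page.extract_text() or "" for page in reader.pages)

# Tokenize, lower-case, and drop English stop words and pure punctuation tokens.
stop_words = set(stopwords.words("english"))
tokens = [
    token.lower()
    for token in word_tokenize(text)
    if token.lower() not in stop_words and any(ch.isalnum() for ch in token)
]

# Re-join the cleaned tokens into a single string ready for GPT-3 input.
cleaned_text = " ".join(tokens)
print(cleaned_text[:500])
```

The DOCX path would look similar, except that the raw text comes from python-docx's `Document(...).paragraphs` instead of PyPDF2.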
Note that the code provided here is a sample; you may want to adapt it to your specific use case and add more preprocessing steps as needed. Having a sufficiently large dataset is also important for achieving good results with GPT-3.
This project is licensed under the MIT License - see the LICENSE file for details.