Web scraping is the process of extracting data from websites and storing it in a desired file format such as CSV or Excel. Several Python modules are available for web scraping; this tutorial uses the following:
- Requests
- BeautifulSoup
- CSV
- To scrape websites with Python, we can rely on existing libraries to get the job done.
- Install the following libraries using pip:
pip install requests
pip install beautifulsoup4
- The csv module is part of Python's standard library, so it does not need to be installed with pip.
- To work with a page's HTML, we first need to obtain it as a string.
- Python's requests module handles this: it sends the HTTP request and exposes the response body as text.
- The next step is to parse the HTML into a tree-like structure that can be traversed.
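
The fetch step can be sketched as follows. This is a minimal example under stated assumptions rather than a definitive implementation: the helper name `fetch_html` and the 10-second timeout are choices made here for illustration and are not part of the original tutorial.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a page and return its HTML as a string (hypothetical helper)."""
    # Send an HTTP GET request; the timeout avoids waiting forever on a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx status codes instead of returning an error page.
    response.raise_for_status()
    # response.text is the response body decoded to a Python str.
    return response.text
```

Calling `fetch_html("https://example.com")` (a placeholder URL) would return the page's markup as a string, ready to be handed to a parser.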
- Once the HTML has been fetched as a string with requests, it needs to be parsed.
- For parsing, we will use Python's BeautifulSoup module, which builds a tree-like structure representing the DOM.
- Once the HTML is fetched and parsed, the next step is to navigate and search the tree with BeautifulSoup's functions to extract the data we need.
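
A minimal sketch of the parsing step is shown here. The inline HTML document is invented for illustration and stands in for a string fetched with requests; `html.parser` is Python's built-in parser, chosen so that nothing beyond beautifulsoup4 has to be installed.

```python
from bs4 import BeautifulSoup

# A small inline document stands in for HTML fetched with requests.
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Books</h1>
    <p class="intro">A short list.</p>
  </body>
</html>
"""

# Build the parse tree using Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string   # text inside the <title> element
heading = soup.h1.get_text()     # text of the first <h1> element
intro_classes = soup.p["class"]  # class is multi-valued, so this is a list

print(page_title, heading, intro_classes)
```

Attribute access such as `soup.title` or `soup.p` returns the first matching tag in the tree, which is convenient for quick inspection before writing more targeted searches.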
- This tutorial will teach you how to get started and traverse the tree.
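
As one possible illustration of traversing the tree, the sketch below walks a small table and collects its rows. The HTML, the `books` id, and the field names `title`, `url`, and `price` are all made up for this example; real pages will need their own selectors.

```python
from bs4 import BeautifulSoup

# Invented sample markup; a real page would be fetched with requests.
html = """
<table id="books">
  <tr><th>Title</th><th>Price</th></tr>
  <tr><td><a href="/b/1">Dune</a></td><td>9.99</td></tr>
  <tr><td><a href="/b/2">Emma</a></td><td>4.50</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match; find_all() returns every match.
table = soup.find("table", id="books")
rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:  # the header row contains only <th> cells
        continue
    link = cells[0].find("a")
    rows.append({
        "title": link.get_text(strip=True),
        "url": link["href"],  # tag attributes are read like dictionary keys
        "price": cells[1].get_text(strip=True),
    })

# select() accepts CSS selectors as an alternative to find_all().
titles = [a.get_text() for a in soup.select("#books a")]

# Tags also expose their neighbours in the tree.
first_cell = soup.find("td")
parent_name = first_cell.parent.name                        # the enclosing <tr>
next_cell_text = first_cell.find_next_sibling("td").get_text()

print(rows, titles, parent_name, next_cell_text)
```

Collecting each row as a dictionary keeps the scraped data in a shape that is easy to write out later, for example with the csv module.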
- Open Excel with a blank workbook
- On the Data tab, click the From Text button (if it is not enabled, make sure an empty cell is selected)
- Browse to and select the CSV file
- In the Text Import Wizard, change the File origin to "Unicode (UTF-8)"
- Click Next and, under Delimiters, select the delimiter used in your file, e.g. comma
- Click Finish and select where to import the data
The Arabic characters should now display correctly.
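
The import steps above assume a UTF-8 encoded CSV file. A minimal sketch of producing one with the standard csv module follows; the helper name `write_rows`, the column names, and the sample rows are assumptions made for illustration. The `utf-8-sig` encoding writes a byte order mark at the start of the file, which Excel generally uses to recognise UTF-8 when the file is opened directly; the import wizard remains available if it does not.

```python
import csv

# Invented sample data, including Arabic text to exercise the encoding.
rows = [
    {"title": "ألف ليلة وليلة", "price": "12.00"},
    {"title": "Dune", "price": "9.99"},
]


def write_rows(path, rows):
    """Write dictionaries to a UTF-8 CSV file (hypothetical helper)."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price"])
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_rows("books.csv", rows)` (the file name is a placeholder) creates a comma-delimited file with a header row followed by one line per dictionary.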