Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Community contributions: webscraping with Playwright #105

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
224 changes: 224 additions & 0 deletions week1/community-contributions/day1-webscraping-playwright.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,224 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI\n",
"\n",
"# If you get an error running this cell, then please head over to the troubleshooting notebook!"
]
},
{
"cell_type": "markdown",
"id": "6900b2a8-6384-4316-8aaa-5e519fca4254",
"metadata": {},
"source": [
"# Connecting to OpenAI\n",
"\n",
"The next cell is where we load in the environment variables in your `.env` file and connect to OpenAI.\n",
"\n",
"## Troubleshooting if you have problems:\n",
"\n",
"Head over to the [troubleshooting](troubleshooting.ipynb) notebook in this folder for step by step code to identify the root cause and fix it!\n",
"\n",
"If you make a change, try restarting the \"Kernel\" (the python process sitting behind this notebook) by Kernel menu >> Restart Kernel and Clear Outputs of All Cells. Then try this notebook again, starting at the top.\n",
"\n",
"Or, contact me! Message me or email [email protected] and we will get this to work.\n",
"\n",
"Any concerns about API costs? See my notes in the README - costs should be minimal, and you can control it at every point. You can also use Ollama as a free alternative, which we discuss during Day 2."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7b87cadb-d513-4303-baee-a37b6f938e4d",
"metadata": {},
"outputs": [],
"source": [
"# Load environment variables in a file called .env\n",
"\n",
"load_dotenv()\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# Check the key\n",
"\n",
"if not api_key:\n",
" print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
"elif not api_key.startswith(\"sk-proj-\"):\n",
" print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
"elif api_key.strip() != api_key:\n",
" print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
"else:\n",
" print(\"API key found and looks good so far!\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "019974d9-f3ad-4a8a-b5f9-0a3719aea2d3",
"metadata": {},
"outputs": [],
"source": [
"openai = OpenAI()\n",
"\n",
"# If this doesn't work, try Kernel menu >> Restart Kernel and Clear Outputs Of All Cells, then run the cells from the top of this notebook down.\n",
"# If it STILL doesn't work (horrors!) then please see the troubleshooting notebook, or try the below line instead:\n",
"# openai = OpenAI(api_key=\"your-key-here-starting-sk-proj-\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "abdb8417-c5dc-44bc-9bee-2e059d162699",
"metadata": {},
"outputs": [],
"source": [
"# Define our system prompt - you can experiment with this later, changing the last sentence to 'Respond in markdown in Spanish.\"\n",
"\n",
"system_prompt = \"You are an assistant that analyzes the contents of a website \\\n",
"and provides a short summary, ignoring text that might be navigation related. \\\n",
"Respond in markdown.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0275b1b-7cfe-4f9d-abfa-7650d378da0c",
"metadata": {},
"outputs": [],
"source": [
"# A function that writes a User Prompt that asks for summaries of websites:\n",
"\n",
"def user_prompt_for(website):\n",
" user_prompt = f\"You are looking at a website titled {website.title}\"\n",
" user_prompt += \"\\nThe contents of this website is as follows; \\\n",
"please provide a short summary of this website in markdown. \\\n",
"If it includes news or announcements, then summarize these too.\\n\\n\"\n",
" user_prompt += website.text\n",
" return user_prompt"
]
},
{
"cell_type": "markdown",
"id": "36ed9f14-b349-40e9-a42c-b367e77f8bda",
"metadata": {},
"source": [
"## An extra exercise for those who enjoy web scraping\n",
"\n",
"You may notice that if you try the course example with \"https://openai.com\" - it doesn't work! That's because OpenAI has a fancy website that uses Javascript. There are many ways around this that some of you might be familiar with. Below an example created with Playwright."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dca2768e",
"metadata": {},
"outputs": [],
"source": [
"! pip install playwright\n",
"! playwright install"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "682eff74-55c4-4d4b-b267-703edbc293c7",
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"from playwright.async_api import async_playwright\n",
"import nest_asyncio\n",
"from bs4 import BeautifulSoup\n",
"import time\n",
"\n",
"nest_asyncio.apply()\n",
"\n",
"class Website:\n",
" title: str\n",
" text: str\n",
" url: str\n",
"\n",
" def __init__(self, url):\n",
" self.url = url\n",
" \n",
" async def run(self, playwright):\n",
" browser = await playwright.chromium.launch(headless=False)\n",
" page = await browser.new_page()\n",
" await page.goto(self.url)\n",
" await page.wait_for_load_state('load')\n",
" \n",
" # Extract data from the page\n",
" self.title = await page.title()\n",
" text = await page.content()\n",
" await browser.close()\n",
" \n",
" soup = BeautifulSoup(text, 'html.parser')\n",
" for irrelevant in soup([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
" self.text = soup.get_text(separator=\"\\n\", strip=True)\n",
" \n",
" async def main(self):\n",
" async with async_playwright() as playwright:\n",
" await self.run(playwright) \n",
" \n",
"def messages_for(website):\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt_for(website)}\n",
" ]\n",
"\n",
"if __name__ == \"__main__\":\n",
" site = Website('https://openai.com')\n",
" asyncio.run(site.main())\n",
" response = openai.chat.completions.create(\n",
" model = \"gpt-4o-mini\",\n",
" messages = messages_for(site)\n",
" )\n",
"\n",
" web_summary = response.choices[0].message.content\n",
" display(Markdown(web_summary))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "69218dec-749c-412d-84a0-40a10fd80c73",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}