Commit

dbscan clusters on story features
dcolinmorgan committed Mar 29, 2024
1 parent e1e6bfc commit 490bf9b
Showing 8 changed files with 303 additions and 226 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/daily.yml
@@ -47,7 +47,7 @@ jobs:
run: |
source dots/bin/activate
python -m spacy download en_core_web_sm
- python -m main -d 0 -n 100 -f 10
+ python -m main -d 4 -n 200 -f 10
env:
OS_TOKEN: ${{ secrets.OS_TOKEN }}
LOBSTR_KEY: ${{ secrets.LOBSTR_KEY }}
80 changes: 80 additions & 0 deletions DOTS/output/full_small0_dots_feats.csv

Large diffs are not rendered by default.

80 changes: 80 additions & 0 deletions DOTS/output/small0_dots_feats.csv
@@ -95,3 +95,83 @@
"[""['Gaza City"", ' Israel (General)']","['widespread famine', 'cease', 'disease']"
"[""['Moreton Bay"", ' Queensland']","['western australia', 'floodwaters', 'mainland australia']"
"['[""[\'callaway county sheriff office\']""', "" '17-01-2024'""]","['morning', 'regan mertz', 'remains']"
famine,trailblazers,camps,bases,egypt,evacuation,cease,rebels,widespread famine,disease
waste,cataclysms,disasters,eruption,valorizzazione,mount vesuvius,ancient cataclysm,mariano nuzzo,costruzione,beach
district,discussion,firearm,clarendon,old harbour,sunday,new bowen,whatsapp,st catherine,fight
700 days,cancellations,travel delays,school cancellations,streak,snow shovels,meteorologists,snowstorm,accuweather alerts,accuweather meteorologists
vehicle,juvenile,detectives,wbsm,kate robinson,saturday,traffic,ariel dorsey,massachusetts law,conflicting statements
extensive damage,medical attention,breathing apparatus,emergency,damage,afternoon,breathing,rescue service,emergency services,firefighters
checklists,northeast,miles,magnitude,morning,preparedness,watsonville,earthquake,disaster,quake
saleem algerk,bread,early recovery,camps,pumping,cholera cases,neighboring countries,symptom,cholera,mohamad katoub
volcano,grindavík,volcanoes,lava flows,eruption,earthquakes,evacuation,jóhannesson,eruptions,lúðvík pétursson
glaesemann,30th street,39 horses,horses,farnam street,jose galeno,federal courthouse,winter storms,confiscated horses,heavy snow
reward,detective,victim,washington,decision,perpetrator,conviction,michael dorgan,minutes,southern border
shiite muslims,helicopter,revenge,israeli jews,women workers,bases,struggle,civilian ships,cease,ceasefire
stabbings,collapse,hard work,garcetti,volatile times,buy voltas,diktat,sonam wangchuk,poonawalla,underperformer
emotions,rain returns,couple,adrenaline,kitchen,swaying,neighbors,cason wolcott,mixed emotions,philip wolcott
resource,representation,server,mod_security,error
shifa hospital,winds nw,khan yunis,medwish,weeks,mph,israeli soldiers,ahmed kahlot,nominations,israeli interrogators
injuries,comments,eric sweda,sweda,standard procedure,confrontation,dispatcher,dispatchers,meagan drillinger,rachel cavanaugh
inbox,llc
material,occupants,block,day,cause,home,news,vehicle,casasia drive,orlando
surveillance,temperatures,wildfires,laffy taffy,surrounding counties,cooler temperatures,unwanted wildfires,gusty winds,disease,golden eagles
scene,home,evening,water,injuries,fredericktown,monday evening,highway oo,firefighters,temperatures
investigation,lyndaker,structure,rights,cause,department,flames,home,deputies,firefighters
programs,content,commissions,affiliate,retailer,products,links,retailer sites,marketing,hearst television
fishing,eruptive activity,atmosphere,swarm,eruption,earthquakes,machines,enormous clouds,keflavík airport,sundhnúksgígar
temperatures,frost,drizzle,west winds,cold weather,highest temperatures,lowest temperatures,widespread frost,freezing fog,west breezes
sister station,monday,kcra,california,sister,metal,overhang,damage,hearst television,pumps
snowfall,preparedness,vancouver island,afternoon,greater victoria,winter conditions,extreme weather,freezing rain,heavy snow,widespread snow
gospel
famine,heavy fighting,matthew wright,bases,rebels,israeli airstrikes,evacuation,baghdad,widespread famine,disease
opportunity,wednesday,newsmax,weather,votes,march,night,tuesday night,vote,shutdown
hometowns,new york,accomplice,damage,morning,stanivukovic,mclain street,federal charges,explosion,dispute
block,command,solution,page,phrase,word,ray id,attacks,malformed data,online attacks
search,whitelist,rescue service,secondary schools,rescue services,firefighters,dublin castle,afternoon,near miss,evacuation
site,button,page,browser
texas,calls,witnesses,debris,restaurant,conversation,fatalities,floor,atmos energy,explosion
flooding,flood advisory,mph,40 mph,35 mph,daytime highs,gusty winds,heavy snow,widespread snow,blowing snow
alaska airlines,cancellations,differences,flight disruption,tax brackets,wind chills,reagan airport,degrees,winter weather,severe weather
fri,redding,thu,vehicle,interactive radar,ashley gardner,completion,thursday,friday,firefighters
parks,opportunities,envisionwise technology,destination,trees,sledding,vincent moore,windstorm,gates,willamette week
boats,blaze,coast guard,large flames,leathead road,visible flame,emergency,firefighters,emergency crews,okanagan lake
half,headaches,headache,chaos,dulles international,bare pavement,victims,husbands,search,reagan national
protesters,hanukkah,december,los angeles,fm corvallis,ceremony,jewish voice,jewish portlanders,ceasefire,minneapolis
wine,trees,services,industry body,rainfall,emergency,grape,emergency services,afternoon,thunderstorm
death,smoke,result,victim,montgomery,morning,truck,search,rescue,faster
damage,smoke damage,morning,heating,lamp,emergency,traffic,required fields,firefighters,nominee
retailer sites,aromas,miles,magnitude,hearst television,damage,morning,earthquake,quake,watsonville
baluchistan,hosting insurgents,islamic state,provincial resources,baluch nationalists,baluchistan province,behest,insurgency,bases,switzerland
officials,kwqc,rights,scene,iowa,sivyer steel,plant,september,trucks,afternoon
evacuation,israeli forces,rebels,bases,israeli airstrikes,cease,baghdad,egypt,widespread famine,disease
saturday,court documents,february,robbers,red sneakers,robbery,kevin schuster,surveillance,rensselaer,rensselaer man
nfl evaluators,freezing,crystal,windy conditions,freezing surfaces,betting odds,temperatures,winter weather,orlando,freezing rain
prevention,weeks,snowpack conditions,record rainfall,princeton university,damaging storm,winter storms,additional snow,significant snow,wet snow
blaze,damage,kaitangata,coal dust,embers,roofing iron,morning,parents,january,firefighters
smoke,wkrc,result,tuesday,interactive radar,fires,critical condition,pleasant ridge,floor,medicine
case,suspicious pasture,fires,john weda,weda,ranches,100 yards,grass,hay bales,circumstances
news,gardaí,ukraine,war,services,families,lanesboro convent,emergency,emergency services,refugees
kilometres,centimetres,10 centimetres,greater victoria,rain forecasts,wind chill,25 centimetres,snowstorm,winter storm,saskatchewan
approach,flood,storm henk,preventative measures,flood protections,floodplain,flooding,existing defences,flood resilience,deteriorating defences
higher projections,dusting,laguardia airport,friday night,steven yablonski,frigid cold,snowstorm,additional snowfall,accumulating snow,measurable snow
trucks,jason fitz,javon bullard,afternoon,nfl evaluators,crystal,precaution,betting odds,orlando,heavy snowfall
browser,subscribe,tuesday,weather,attic,blaze,damage,morning,firefighters,muscatine firefighters
blaze,damage,medical attention,breathing apparatus,breathing,emergency,rescue service,emergency services,firefighters,afternoon
monday,vehicle,sunday,interactive radar,holiday circle,roanoke man,saturday,rescue,firefighters,kaylee shipley
avalanche warnings,icy conditions,afternoon,freeze,friday morning,701 days,degrees,temperatures,728 days,freezing rain
beliefs,sharp condemnation,bases,transgender people,baghdad,transgender,switzerland,iraqi kurdistan,taiwanese voters,gibraltar eagle
child,flames,trailer,hospital,oakland avenue,adults,morning,children,emergency,firefighters
hotter summers,coastal flooding,aaron sutherland,erosion,bigger dykes,floodplain exposure,catastrophes,wildfires,natural catastrophes,insured damage
radio
caution,heating,emergency,temperature,wwl louisiana,flammable material,firefighters,frozen pipes,temperatures,freezing temperatures
traffic,supervisors,offensive comments,israelis,controversy,israeli citizens,president reagan,joyce karam,hamas terrorists,ceasefire
uxbridge police,emergency crews,sunbeam television,degrees,sudbury river,temperatures,crashes,winter temperatures,falling temperatures,stormy weather
cape elizabeth,eric laszlo,rescue crews,mph,windstorm,minutes,50 mph,low tide,choppy waves,beach
conflicts,thursday,streak,armed conflicts,couple,beyoncé,friday,rj davis,alabama basketball,disasters
noon,morning,trimet buses,friday,emergency,estimate,mph,afternoon,25 mph,ice storm
cctv footage,teenagers,camera,security cameras,suspects,motorcycles,doubts,bloodstains,clear doubts,panya khongsaengkham
cases,pipes,families,everybody,drinking water,crash,recovery,floor,frozen equipment,machines
mod fm
evacuation,israeli forces,rebels,bases,cease,israeli airstrikes,baghdad,egypt,widespread famine,disease
garden beds,beach accesses,surveillance,beaches,beach,south australia,queensland floodwaters,floodwaters,western australia,mainland australia
scene,krcg,tuesday,wednesday,interactive radar,bartley lane,thursday,morning,remains,regan mertz
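The commit title mentions DBSCAN clustering of story features, but the clustering code is not part of the rendered diff. As a rough illustration only, the sketch below groups keyword rows like the ones above with a small pure-Python DBSCAN over Jaccard distance. The function names, the choice of Jaccard distance, and the `eps` and `min_samples` values are assumptions made for this example, not the repository's implementation.

```python
from typing import List, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def dbscan(rows: List[Set[str]], eps: float, min_samples: int) -> List[int]:
    """Return one label per row: 0, 1, ... for clusters and -1 for noise."""
    unvisited, noise = -2, -1
    n = len(rows)
    # Precompute eps-neighbourhoods; every row is its own neighbour.
    neighbours = [
        [j for j in range(n) if jaccard_distance(rows[i], rows[j]) <= eps]
        for i in range(n)
    ]
    labels = [unvisited] * n
    cluster = -1
    for i in range(n):
        if labels[i] != unvisited:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = noise  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] != unvisited:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep expanding
    return labels


# Five feature rows copied from the CSV hunk above (hypothetical selection).
lines = [
    "evacuation,israeli forces,rebels,bases,israeli airstrikes,cease,baghdad,egypt,widespread famine,disease",
    "evacuation,israeli forces,rebels,bases,cease,israeli airstrikes,baghdad,egypt,widespread famine,disease",
    "extensive damage,medical attention,breathing apparatus,emergency,damage,afternoon,breathing,rescue service,emergency services,firefighters",
    "blaze,damage,medical attention,breathing apparatus,breathing,emergency,rescue service,emergency services,firefighters,afternoon",
    "temperatures,frost,drizzle,west winds,cold weather,highest temperatures,lowest temperatures,widespread frost,freezing fog,west breezes",
]
rows = [set(line.split(",")) for line in lines]
labels = dbscan(rows, eps=0.5, min_samples=2)
print(labels)
```

Jaccard distance is used here because each row is a set of keywords rather than a numeric vector; a different feature encoding would call for a different metric.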
2 changes: 2 additions & 0 deletions DOTS/output/test_gnews_dots_feats.csv
@@ -0,0 +1,2 @@
escalation,agency inputs,concerns,fatalities,chief minister,87 fatalities,chinese citizens,previous attacks,february,ceasefire
families,famine,extreme suffering,carbohydrates,emergency levels,acute malnutrition,shocks,catastrophic hunger,catastrophe,disease
11 changes: 1 addition & 10 deletions DOTS/pull.py
@@ -40,15 +40,6 @@ def process_hit(hit):
text.append(p.get_text())
return date,loc,title,org,per,theme,text,url


- def process_hit_with_timeout(hit):
- try:
- return process_hit(hit)
- except:
- logging.debug(f"Grabbing the url stalled after 5s, skipping...")
- return None


def process_data(data,fast=1):
articles = []
results=[]
@@ -159,7 +150,7 @@ def pull_data(articles):
except:
df = pd.DataFrame(data, columns=['title','id','url','title2'])
with concurrent.futures.ThreadPoolExecutor() as executor:
- df['text'] = list(tqdm(executor.map(process_url, df['url']), total=len(df['url'])))
+ df['text'] = list(tqdm(executor.map(process_url, df['url']), total=len(df['url']),desc="grabbing text from url"))
return df['text'].values.tolist()
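This commit removes `process_hit_with_timeout`, which swallowed every exception with a bare `except`. If per-URL failure handling is still wanted around the thread pool, one possible shape is sketched below. The helper name `map_skip_failures` and its parameters are hypothetical, and the 5 second default only mirrors the removed log message.

```python
import concurrent.futures
import logging
from typing import Any, Callable, Iterable, List, Optional


def map_skip_failures(
    func: Callable[[Any], Any],
    items: Iterable[Any],
    timeout: float = 5.0,
    max_workers: int = 8,
) -> List[Optional[Any]]:
    """Apply func to each item in a thread pool, keeping input order.

    Items whose call raises, or whose result is not ready within `timeout`
    seconds of being awaited, yield None instead of aborting the whole run.
    """
    items = list(items)
    results: List[Optional[Any]] = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(func, item) for item in items]
        for item, future in zip(items, futures):
            try:
                results.append(future.result(timeout=timeout))
            except concurrent.futures.TimeoutError:
                logging.debug("Timed out waiting for %r, skipping", item)
                results.append(None)
            except Exception as exc:  # narrower than a bare except
                logging.debug("Failed on %r: %s", item, exc)
                results.append(None)
    # Caveat: leaving the with-block still waits for running threads, so a
    # truly stalled call also needs its own timeout (e.g. in the HTTP client).
    return results
```

A call such as `map_skip_failures(process_url, df['url'])` would keep the result list aligned with the input URLs, with `None` marking the ones that failed.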


126 changes: 30 additions & 96 deletions DOTS/scrape.py
@@ -10,11 +10,15 @@
os_url = os.getenv('OS_TOKEN')
lobstr_key = os.getenv('LOBSTR_KEY')

- def get_OS_data(n):
+ def get_OS_data(n=20):
bash_command = f"""
- curl -X GET "{os_url}/emergency-management-news/_search" -H 'Content-Type: application/json' -d '{{
+ curl -X GET "{os_url}/emergency-management-news/_search?scroll=1m" -H 'Content-Type: application/json' -d '{{
"_source": ["metadata.GDELT_DATE", "metadata.page_title","metadata.DocumentIdentifier", "metadata.Organizations","metadata.Persons","metadata.Themes","metadata.text", "metadata.Locations"],
"size": {n},
+ "slice": {{
+ "id": 0,
+ "max": 10
+ }},
"query": {{
"bool": {{
"must": [
@@ -29,30 +33,16 @@ def get_OS_data(n):
data = json.loads(output)
return data
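The query above asks for slice `0` of `10`, so on its own it reads only one partition of the scroll. A minimal sketch of building the same kind of request body as a Python dict, one per slice, is shown below; it avoids the doubled `{{ }}` escaping that the f-string needs. The helper names are hypothetical, and `match_all` is a placeholder because the diff elides the actual `must` clause.

```python
import json
from typing import Any, Dict, List

# Field list copied from the "_source" entry in get_OS_data above.
SOURCE_FIELDS = [
    "metadata.GDELT_DATE", "metadata.page_title", "metadata.DocumentIdentifier",
    "metadata.Organizations", "metadata.Persons", "metadata.Themes",
    "metadata.text", "metadata.Locations",
]


def build_sliced_query(size: int, slice_id: int, max_slices: int) -> Dict[str, Any]:
    """Build one sliced-scroll search body (hypothetical helper)."""
    if not 0 <= slice_id < max_slices:
        raise ValueError("slice_id must satisfy 0 <= slice_id < max_slices")
    return {
        "_source": SOURCE_FIELDS,
        "size": size,
        "slice": {"id": slice_id, "max": max_slices},
        # Placeholder query: the real "must" clause is not visible in the diff.
        "query": {"bool": {"must": [{"match_all": {}}]}},
    }


def build_all_slices(size: int, max_slices: int) -> List[Dict[str, Any]]:
    """Return one body per slice so that every partition gets requested."""
    return [build_sliced_query(size, i, max_slices) for i in range(max_slices)]


if __name__ == "__main__":
    # json.dumps produces the string that would follow curl's -d flag.
    print(json.dumps(build_sliced_query(20, 0, 10), indent=2))
```

Each body returned by `build_all_slices` would be sent as its own scroll request; whether the pipeline needs all slices or just a sample is a decision the diff does not settle.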

- def get_gnews_data(n):
- bash_command = f"""
- curl -X GET "{os_url}/test-google-news-index/_search" -H 'Content-Type: application/json' '{{
- "_source": ["metadata.link", "metadata.title"],
- "size": {n},
- "query": {{
- "bool": {{
- "must": [
- {{"match_all": {{}}}}
- ]
- }}
- }}
- }}'
- """
- process = subprocess.run(bash_command, shell=True, capture_output=True, text=True)
- output = process.stdout
- data = json.loads(output)
- return data

- def get_test_gnews(n):
+ def get_test_gnews(n=20):
bash_command = f"""
- curl -X GET "{os_url}/test-google-news-index/_search" '{{
+ curl -X GET "{os_url}/test-google-news-index/_search?scroll=1m" '{{
"_source": ["metadata.link", "metadata.title"],
"size": {n},
+ "slice": {{
+ "id": 0,
+ "max": 100
+ }},
"query": {{
"bool": {{
"must": [
@@ -67,84 +57,28 @@ def get_test_gnews(n):
data = json.loads(output)
return data
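As rendered, the `curl` command in `get_test_gnews` passes the JSON body without `-H` or `-d`, and curl treats extra positional arguments as additional URLs, so the body may never reach the server. One way to sidestep shell quoting altogether is to assemble an argument list and serialize the body with `json.dumps`. The sketch below only builds the arguments and does not contact any server; `build_curl_args` and its parameters are hypothetical names.

```python
import json
from typing import Any, Dict, List


def build_curl_args(
    base_url: str, index: str, body: Dict[str, Any], scroll: str = "1m"
) -> List[str]:
    """Return a curl argument list that sends `body` as the JSON request data."""
    url = f"{base_url.rstrip('/')}/{index}/_search?scroll={scroll}"
    return [
        "curl", "-sS", "-X", "GET", url,
        "-H", "Content-Type: application/json",
        "-d", json.dumps(body),  # -d marks the JSON as request data, not a URL
    ]


if __name__ == "__main__":
    example_body = {
        "_source": ["metadata.link", "metadata.title"],
        "size": 20,
        "slice": {"id": 0, "max": 100},
        "query": {"bool": {"must": [{"match_all": {}}]}},
    }
    # "https://example.invalid" stands in for the OS_TOKEN value.
    print(build_curl_args("https://example.invalid", "test-google-news-index", example_body))
```

The list can be handed to `subprocess.run(args, capture_output=True, text=True)` without `shell=True`, which also removes the need to escape quotes inside the body.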

+ # def get_npr_news(p):
+ # # Send a GET request to the NPR API
+ # r = requests.get("http://api.=1m.org/query?apiKey="+npr_key[0], params=p)

- def get_massive_OS_data(t=1):
- client = OpenSearch(os_url)
- query = {
- "size": "100",
- "timeout": "10s",
- "slice": {
- "id": 0,
- "max": 10
- },
- "query": {
- "bool": {
- "must": [
- {"match_all": {}},
- ]}
- },
- "_source": ["metadata.GDELT_DATE", "metadata.page_title","metadata.DocumentIdentifier", "metadata.Organizations","metadata.Persons","metadata.Themes","metadata.text", "metadata.Locations"],
- }
- response = client.search(
- scroll=str(t)+'m',
- body=query,
- )

- return response, client

- def get_google_news(theme,n=10000):
- google_news = GNews()

- google_news.period = '7d' # News from last 7 days
- google_news.max_results = n # number of responses across a keyword
- # google_news.country = 'United States' # News from a specific country
- google_news.language = 'english' # News in a specific language
- google_news.exclude_websites = ['yahoo.com', 'cnn.com'] # Exclude news from specific website i.e Yahoo.com and CNN.com
- # google_news.start_date = (2024, 1, 1) # Search from 1st Jan 2020
- # google_news.end_date = (2024, 3, 1) # Search until 1st March 2020

- json_resp = google_news.get_news(theme)
- article=[]

- for i in tqdm(range(len(json_resp)), desc="grabbing directly from GoogleNews"):
- aa=(google_news.get_full_article(json_resp[i]['url']))
- try:
- date=aa.publish_date.strftime("%d-%m-%Y")
- except:
- date=None
- try:
- title=aa.title
- text=aa.text
- except:
- title=None
- text=None
- article.append([title,date,text])

- return article


- def get_npr_news(p):
- # Send a GET request to the NPR API
- r = requests.get("http://api.=1m.org/query?apiKey="+npr_key[0], params=p)

- # Parse the XML response to get the story URLs
- root = ET.fromstring(r.content)
- story_urls = [story.find('link').text for story in root.iter('story')]
+ # # Parse the XML response to get the story URLs
+ # root = ET.fromstring(r.content)
+ # story_urls = [story.find('link').text for story in root.iter('story')]

- # For each story URL, send a GET request to get the HTML content
- full_stories = []
- for url in story_urls:
- response = requests.get(url)
- soup = BeautifulSoup(response.text, 'html.parser')
+ # # For each story URL, send a GET request to get the HTML content
+ # full_stories = []
+ # for url in story_urls:
+ # response = requests.get(url)
+ # soup = BeautifulSoup(response.text, 'html.parser')

- # Find the main content of the story. This will depend on the structure of the webpage.
- # Here, we're assuming that the main content is in a <p> tag. You might need to adjust this depending on the webpage structure.
- story = soup.find_all('p')
+ # # Find the main content of the story. This will depend on the structure of the webpage.
+ # # Here, we're assuming that the main content is in a <p> tag. You might need to adjust this depending on the webpage structure.
+ # story = soup.find_all('p')

- # Extract the text from the story
- full_story = ' '.join(p.text for p in story)
- full_stories.append(full_story)
- return full_stories
+ # # Extract the text from the story
+ # full_story = ' '.join(p.text for p in story)
+ # full_stories.append(full_story)
+ # return full_stories

def scrape_lobstr():
subprocess.run([