0676 spiderchi southwest home equity i #973
base: main
Conversation
Thanks for sticking with this one! Made some comments, let me know if any of them aren't clear
start_urls = ["https://swhomeequity.com/agenda-%26-minutes"]
location = {
    "name": "Southwest Home Equity Assurance office",
    "address": "5334 W. 65th Street in Chicago, Illinois",
If this is static we can add the ZIP code and standardize the formatting a bit so that it's "5334 W 65th St Chicago, IL 60638"
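For instance, a minimal sketch of the static attribute with the standardized address from this comment (assuming the office location never changes between meetings):

```python
location = {
    "name": "Southwest Home Equity Assurance office",
    "address": "5334 W 65th St Chicago, IL 60638",
}
```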
"body/div/div/div/div[2]/div[3]/div/div/section/div/div[2]" | ||
) | ||
|
||
for agenda_node in table.xpath( |
I think this approach makes sense, but their formatting doesn't seem very consistent, so we might have better luck using `defaultdict` to create a mapping of dates to links and then iterating over that to make sure we get everything. Here's an example in chi_lsc_advisory (a rough sketch of applying it to this spider follows the snippet):

city-scrapers/city_scrapers/spiders/chi_lsc_advisory.py, lines 37 to 47 in a144119:
meeting_date_map = defaultdict(list)
meeting_text_map = defaultdict(list)
for item in response.css(".rich-text ul")[:1].css("li"):
    start = self._parse_start(item, year, time_str)
    if not start:
        continue
    meeting_text_map[start].append(" ".join(item.css("*::text").extract()))
    meeting_date_map[start].extend(self._parse_links(item, response))
for start, links in meeting_date_map.items():
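Roughly, that pattern could look like this here; the XPath selector is a placeholder, `_parse_start`, `_get_link`, and `_get_name` are the helpers already defined in this PR, and `_create_meeting` is a hypothetical helper for building the item:

```python
from collections import defaultdict

def parse(self, response):
    """Map each meeting date to all of its document links before yielding items."""
    link_date_map = defaultdict(list)
    for node in response.xpath("//table//div[a]"):  # placeholder selector
        start = self._parse_start(node)
        if not start:
            continue
        url = self._get_link(node)
        if url:
            link_date_map[start].append(
                {"href": url, "title": self._get_name(node)}
            )
    for start, links in link_date_map.items():
        yield self._create_meeting(start, links)  # hypothetical helper
```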
)

if self._get_link(minutes_node):
    minutes_contents = self.parse_pdf(self._get_link(minutes_node))
Parsing the PDF adds a good amount of complexity here, including creating a new request outside of scrapy, which ideally we'd like to avoid so that we can take advantage of scrapy's features like rate limiting. I think it's fine if we just return "Governing Commission" as the title and don't parse the PDF content.
    minutes_contents = None

meeting = Meeting(
    title=(
Mentioned above, but this can return "Governing Commission" for all cases
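Folding in the `time_notes` suggestion below as well, the item construction could simplify to something like this (a sketch; field names follow the `Meeting` item from `city_scrapers_core`, and `COMMISSION` comes from `city_scrapers_core.constants`):

```python
meeting = Meeting(
    title="Governing Commission",  # static title, so no PDF parsing needed
    description="",
    classification=COMMISSION,
    start=self._parse_start(agenda_node),
    end=None,
    all_day=False,
    time_notes="See links for details",
    location=self.location,
    links=self._parse_links([agenda_node, minutes_node]),
    source=response.url,
)
```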
        else None
    ),
    location=self.location,
    links=self._parse_links([agenda_node, minutes_node]),
Using the `defaultdict` strategy would replace this.
    return links

def _get_name(self, node):
    name = node.xpath("string(.)").get()
Since we know that the overwhelming majority of docs will be either Agenda or Minutes, we can check if either of those are present in the string and return "Agenda" or "Minutes" statically if so to keep things clean.
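Something along these lines, keeping the method name from the diff (a sketch, not the exact implementation):

```python
def _get_name(self, node):
    """Return a static label for the two common document types."""
    name = (node.xpath("string(.)").get() or "").strip()
    if "agenda" in name.lower():
        return "Agenda"
    if "minutes" in name.lower():
        return "Minutes"
    return name
```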
def _get_link(self, node):
    url = node.xpath("a/@href").get()
    if url:
        return "https:" + url
Just in case the scheme is ever added we should check if "http" is present before appending
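For example, a sketch based on the existing method:

```python
def _get_link(self, node):
    url = node.xpath("a/@href").get()
    if not url:
        return None
    # Only prepend the scheme when the site omits it
    if not url.startswith("http"):
        url = "https:" + url
    return url
```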
links = []
for node in nodes:
    if node:
        links.append(
It looks like `_get_link` can return `None`, so we'll need to check that before adding to the list, because the `href` property should never be `None`.
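A sketch of the guard, using the standard link dict format:

```python
def _parse_links(self, nodes):
    links = []
    for node in nodes:
        if node is None:
            continue
        url = self._get_link(node)
        if url:  # _get_link can return None; never emit a None href
            links.append({"href": url, "title": self._get_name(node)})
    return links
```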
month_ind = re.search(MONTH_REGEX, text, flags=re.I)
if month_ind is None:
    raise RuntimeError("No month found")
Looks like these are where the `RuntimeError` from above comes from. It's generally easier to read if we return `None` instead of raising an exception, so that would be preferable here if we can also check for it in the function that uses this output.
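For example (a sketch; the surrounding loop in the caller is assumed):

```python
month_match = re.search(MONTH_REGEX, text, flags=re.I)
if month_match is None:
    return None  # caller decides how to handle an unparseable date

# ...and where the result is used:
start = self._parse_start(agenda_node)
if start is None:
    continue  # skip rows we can't parse instead of raising
```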
    start=self._parse_start(agenda_node),
    end=None,
    all_day=False,
    time_notes=(
Because we're assuming the time if we aren't parsing the minutes, we can add something statically like "See links for details"
@egfrank thanks for continuing to work on this! Let me know when you want me to take another look.
Summary
Issue: #676
I worked on this back in February, but the spider got so outdated that I redid some of the logic and tests; here is the new PR.
Checklist
All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.