0676 spiderchi southwest home equity i #973

Open · wants to merge 8 commits into main
Conversation

@egfrank (Contributor) commented Sep 16, 2020

Summary

Issue: #676

I worked on this back in February, but the spider became so outdated in the meantime that I redid some of the logic and tests. Here is the new PR.

Checklist

All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.

  • Tests are implemented
  • All tests are passing
  • Style checks run (see documentation for more details)
  • Style checks are passing
  • Code comments from template removed

@pjsier (Collaborator) left a comment

Thanks for sticking with this one! Made some comments, let me know if any of them aren't clear

    start_urls = ["https://swhomeequity.com/agenda-%26-minutes"]
    location = {
        "name": "Southwest Home Equity Assurance office",
        "address": "5334 W. 65th Street in Chicago, Illinois",
If this is static we can add the ZIP code and standardize the formatting a bit so that it's "5334 W 65th St Chicago, IL 60638"
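For reference, a minimal sketch of that static location (the name is from the existing code, and the formatting and ZIP code come straight from the suggestion above):

    location = {
        "name": "Southwest Home Equity Assurance office",
        "address": "5334 W 65th St Chicago, IL 60638",
    }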

"body/div/div/div/div[2]/div[3]/div/div/section/div/div[2]"
)

for agenda_node in table.xpath(
I think this approach makes sense, but their formatting doesn't seem very consistent, so we might have better luck using defaultdict to create a mapping of dates to links and then iterating over that to make sure we get everything. Here's an example from chi_lsc_advisory:

    # requires: from collections import defaultdict
    meeting_date_map = defaultdict(list)
    meeting_text_map = defaultdict(list)
    for item in response.css(".rich-text ul")[:1].css("li"):
        start = self._parse_start(item, year, time_str)
        if not start:
            continue
        meeting_text_map[start].append(" ".join(item.css("*::text").extract()))
        meeting_date_map[start].extend(self._parse_links(item, response))
    for start, links in meeting_date_map.items():
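Adapted to this spider, the same pattern might look roughly like the following sketch (the inner selector and the _create_meeting helper are placeholders, not actual spider code):

    from collections import defaultdict

    def parse(self, response):
        table = response.xpath(
            "body/div/div/div/div[2]/div[3]/div/div/section/div/div[2]"
        )
        # Collect every document link under its meeting date so that
        # inconsistent formatting can't split one meeting into several items
        meeting_date_map = defaultdict(list)
        for node in table.xpath(".//div[a]"):  # placeholder selector
            start = self._parse_start(node)
            if not start:
                continue
            url = self._get_link(node)
            if url:
                meeting_date_map[start].append(
                    {"href": url, "title": self._get_name(node)}
                )
        for start, links in meeting_date_map.items():
            yield self._create_meeting(start, links)  # hypothetical helper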

    )

    if self._get_link(minutes_node):
        minutes_contents = self.parse_pdf(self._get_link(minutes_node))
Parsing the PDF adds a good amount of complexity here, including creating a new request outside of scrapy, which ideally we'd like to avoid so that we can take advantage of scrapy's features like rate limiting. I think it's fine if we just return "Governing Commission" as the title and don't parse the PDF content.

        minutes_contents = None

    meeting = Meeting(
        title=(
Mentioned above, but this can return "Governing Commission" for all cases
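Folding this and the time_notes suggestion below together, the Meeting call could shrink to something like this sketch (field names assume the standard city_scrapers_core spider template; start and links come from the parse loop):

    from city_scrapers_core.constants import COMMISSION
    from city_scrapers_core.items import Meeting

    meeting = Meeting(
        title="Governing Commission",  # static title; no PDF parsing needed
        description="",
        classification=COMMISSION,
        start=start,
        end=None,
        all_day=False,
        time_notes="See links for details",
        location=self.location,
        links=links,
        source=response.url,
    )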

            else None
        ),
        location=self.location,
        links=self._parse_links([agenda_node, minutes_node]),
Using the defaultdict strategy would replace this

        return links

    def _get_name(self, node):
        name = node.xpath("string(.)").get()
Since we know that the overwhelming majority of docs will be either Agenda or Minutes, we can check whether either of those is present in the string and return "Agenda" or "Minutes" statically to keep things clean.
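Something along these lines (a sketch; falling back to the raw text for unrecognized names is an assumption):

    def _get_name(self, node):
        name = node.xpath("string(.)").get() or ""
        # The overwhelming majority of docs are agendas or minutes,
        # so return a clean static label when we recognize one
        if "agenda" in name.lower():
            return "Agenda"
        if "minutes" in name.lower():
            return "Minutes"
        return name.strip()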

    def _get_link(self, node):
        url = node.xpath("a/@href").get()
        if url:
            return "https:" + url
Just in case the scheme is ever added, we should check whether "http" is already present before prepending it.
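For example (a sketch, assuming the hrefs on the page are currently protocol-relative):

    def _get_link(self, node):
        url = node.xpath("a/@href").get()
        if url:
            # Leave fully qualified URLs alone; only add a scheme to
            # protocol-relative ones like //swhomeequity.com/doc.pdf
            if url.startswith("http"):
                return url
            return "https:" + url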

    links = []
    for node in nodes:
        if node:
            links.append(
It looks like _get_link can return None, so we'll need to check that before adding to the list because the href property should never be None
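A sketch of that check (the {"href": ..., "title": ...} shape follows the usual city_scrapers_core links format, which is an assumption about this spider's code):

    def _parse_links(self, nodes):
        links = []
        for node in nodes:
            if node is None:
                continue
            url = self._get_link(node)
            # _get_link can return None, and href should never be None
            if url:
                links.append({"href": url, "title": self._get_name(node)})
        return links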


    month_ind = re.search(MONTH_REGEX, text, flags=re.I)
    if month_ind is None:
        raise RuntimeError("No month found")
Looks like these are where the RuntimeError from above comes from. It's generally easier to read if we return None instead of raising an exception, so that would be preferable here if we can also check for it in the function that uses this output.
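In other words, something like this sketch inside _parse_start, with the caller treating a falsy start as "skip this item":

    month_ind = re.search(MONTH_REGEX, text, flags=re.I)
    if month_ind is None:
        # caller does `if not start: continue` instead of catching an error
        return None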

        start=self._parse_start(agenda_node),
        end=None,
        all_day=False,
        time_notes=(
Because we're assuming the time if we aren't parsing the minutes, we can set something static like "See links for details".

@pjsier (Collaborator) commented Nov 12, 2020

@egfrank thanks for continuing to work on this! Let me know when you want me to take another look. Looks like isort is failing right now
