0676 spiderchi southwest home equity i #973

Open · wants to merge 8 commits into main
Conversation

@egfrank (Contributor) commented Sep 16, 2020

Summary

Issue: #676

I worked on this back in February, but the spider became so outdated in the meantime that I redid some of the logic and tests. Here is the new PR.

Checklist

All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.

  • Tests are implemented
  • All tests are passing
  • Style checks run (see documentation for more details)
  • Style checks are passing
  • Code comments from template removed

@pjsier (Collaborator) left a comment

Thanks for sticking with this one! Made some comments, let me know if any of them aren't clear

    start_urls = ["https://swhomeequity.com/agenda-%26-minutes"]
    location = {
        "name": "Southwest Home Equity Assurance office",
        "address": "5334 W. 65th Street in Chicago, Illinois",
If this is static we can add the ZIP code and standardize the formatting a bit so that it's "5334 W 65th St Chicago, IL 60638"
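For reference, a minimal sketch of that static location (the name is from the existing code, and the formatting and ZIP code come straight from the suggestion above):

    location = {
        "name": "Southwest Home Equity Assurance office",
        "address": "5334 W 65th St Chicago, IL 60638",
    }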

"body/div/div/div/div[2]/div[3]/div/div/section/div/div[2]"
)

for agenda_node in table.xpath(
I think this approach makes sense, but their formatting doesn't seem very consistent, so we might have better luck using defaultdict to create a mapping of dates to links and then iterating over that to make sure we get everything. Here's an example from chi_lsc_advisory:

    # requires: from collections import defaultdict
    meeting_date_map = defaultdict(list)
    meeting_text_map = defaultdict(list)
    for item in response.css(".rich-text ul")[:1].css("li"):
        start = self._parse_start(item, year, time_str)
        if not start:
            continue
        meeting_text_map[start].append(" ".join(item.css("*::text").extract()))
        meeting_date_map[start].extend(self._parse_links(item, response))
    for start, links in meeting_date_map.items():
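Adapted to this spider, the same pattern might look roughly like the following sketch (the inner selector and the _create_meeting helper are placeholders, not actual spider code):

    from collections import defaultdict

    def parse(self, response):
        table = response.xpath(
            "body/div/div/div/div[2]/div[3]/div/div/section/div/div[2]"
        )
        # Collect every document link under its meeting date so that
        # inconsistent formatting can't split one meeting into several items
        meeting_date_map = defaultdict(list)
        for node in table.xpath(".//div[a]"):  # placeholder selector
            start = self._parse_start(node)
            if not start:
                continue
            url = self._get_link(node)
            if url:
                meeting_date_map[start].append(
                    {"href": url, "title": self._get_name(node)}
                )
        for start, links in meeting_date_map.items():
            yield self._create_meeting(start, links)  # hypothetical helper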

    )

    if self._get_link(minutes_node):
        minutes_contents = self.parse_pdf(self._get_link(minutes_node))
Parsing the PDF adds a good amount of complexity here, including creating a new request outside of scrapy, which ideally we'd like to avoid so that we can take advantage of scrapy's features like rate limiting. I think it's fine if we just return "Governing Commission" as the title and don't parse the PDF content.

        minutes_contents = None

    meeting = Meeting(
        title=(
Mentioned above, but this can return "Governing Commission" for all cases
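Folding this and the time_notes suggestion below together, the Meeting call could shrink to something like this sketch (field names assume the standard city_scrapers_core spider template; start and links come from the parse loop):

    from city_scrapers_core.constants import COMMISSION
    from city_scrapers_core.items import Meeting

    meeting = Meeting(
        title="Governing Commission",  # static title; no PDF parsing needed
        description="",
        classification=COMMISSION,
        start=start,
        end=None,
        all_day=False,
        time_notes="See links for details",
        location=self.location,
        links=links,
        source=response.url,
    )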

            else None
        ),
        location=self.location,
        links=self._parse_links([agenda_node, minutes_node]),
Using the defaultdict strategy would replace this

        return links

    def _get_name(self, node):
        name = node.xpath("string(.)").get()
Since we know that the overwhelming majority of docs will be either Agenda or Minutes, we can check whether either of those is present in the string and return "Agenda" or "Minutes" statically to keep things clean.
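Something along these lines (a sketch; falling back to the raw text for unrecognized names is an assumption):

    def _get_name(self, node):
        name = node.xpath("string(.)").get() or ""
        # The overwhelming majority of docs are agendas or minutes,
        # so return a clean static label when we recognize one
        if "agenda" in name.lower():
            return "Agenda"
        if "minutes" in name.lower():
            return "Minutes"
        return name.strip()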

    def _get_link(self, node):
        url = node.xpath("a/@href").get()
        if url:
            return "https:" + url
Just in case the scheme is ever added, we should check whether "http" is already present before prepending it.
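For example (a sketch, assuming the hrefs on the page are currently protocol-relative):

    def _get_link(self, node):
        url = node.xpath("a/@href").get()
        if url:
            # Leave fully qualified URLs alone; only add a scheme to
            # protocol-relative ones like //swhomeequity.com/doc.pdf
            if url.startswith("http"):
                return url
            return "https:" + url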

    links = []
    for node in nodes:
        if node:
            links.append(
It looks like _get_link can return None, so we'll need to check that before adding to the list because the href property should never be None
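A sketch of that check (the {"href": ..., "title": ...} shape follows the usual city_scrapers_core links format, which is an assumption about this spider's code):

    def _parse_links(self, nodes):
        links = []
        for node in nodes:
            if node is None:
                continue
            url = self._get_link(node)
            # _get_link can return None, and href should never be None
            if url:
                links.append({"href": url, "title": self._get_name(node)})
        return links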


    month_ind = re.search(MONTH_REGEX, text, flags=re.I)
    if month_ind is None:
        raise RuntimeError("No month found")
Looks like these are where the RuntimeError from above comes from. It's generally easier to read if we return None instead of raising an exception, so that would be preferable here if we can also check for it in the function that uses this output.
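In other words, something like this sketch inside _parse_start, with the caller treating a falsy start as "skip this item":

    month_ind = re.search(MONTH_REGEX, text, flags=re.I)
    if month_ind is None:
        # caller does `if not start: continue` instead of catching an error
        return None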

        start=self._parse_start(agenda_node),
        end=None,
        all_day=False,
        time_notes=(
Because we're assuming the time if we aren't parsing the minutes, we can set something static like "See links for details".

@pjsier (Collaborator) commented Nov 12, 2020

@egfrank thanks for continuing to work on this! Let me know when you want me to take another look. Looks like isort is failing right now
