Spider: Chicago Southwest Home Equity Commission I #676

pjsier · 2019-02-03T19:58:15Z

URL: https://swhomeequity.com/agenda-%26-minutes
Spider Name: chi_southwest_home_equity_i
Agency Name: Chicago Southwest Home Equity Commission I

See the contribution guide for information on how to get started

mattpair · 2019-04-05T03:14:20Z

Hi, just cloned the repo. I'm going to make a branch for this spider

mattpair · 2019-04-05T03:34:43Z

@pjsier looks like most info besides the date and meeting type == BOARD is contained in a pdf.

Let's discuss how to proceed but since this is my first, I think I'll start work on a more straightforward one.

pjsier · 2019-04-05T12:47:19Z

@mattpair that approach makes sense to me. Let me know if you need any help finding a more straightforward spider to work on; we have a good number available.

mattpair · 2019-05-08T21:29:51Z

I'll resume work on this one

haidtang · 2019-11-17T21:15:03Z

If this issue is unclaimed, I would like to work on that.

pjsier · 2019-11-18T00:46:08Z

@haidtang sure! In general we like contributors to stick to one issue at a time, so I'll assign you to this one for now and not the other. Let me know if you'd like to switch that though.

haidtang · 2019-11-18T01:10:38Z

@pjsier I got it. Could you please switch me to the other issue #566, I think that I have a better clue on how to deal with that one. Thank you so much.

pjsier · 2019-11-18T01:18:44Z

Sure thing!

egfrank · 2020-01-24T06:11:20Z

Hey is this issue unclaimed? I'd be happy to work on this one if so. Also, it seems like it will involve reading pdfs; has that been done within this project before?

pjsier · 2020-01-24T19:25:19Z

It looks like it's been inactive for more than a month, so it's all yours if you're interested!

egfrank · 2020-02-08T18:12:05Z

Hi @pjsier, most of the information for these meetings is included in PDFs of the minutes and agenda for each meeting. Are there any other spiders that have downloaded / parsed PDFs in this project already?

Just working on my own computer, I'm able to parse the files using the package pdfminer.six, which I found via googling and which was the most highly recommended Python package for reading PDFs that I could find. That package also requires that the files are downloaded, so I'm using tempfile, which should delete the files after the text is extracted from them.

My question though is do you want to introduce that new package into the project? And is the fact that it needs to download files going to be a problem for running the spider on different computers?

pjsier · 2020-02-08T21:08:41Z

@egfrank thanks for checking that out! We're currently using PyPDF2, but we've used pdfminer.six on another project. Here's an example in chi_human_relations:

city-scrapers/city_scrapers/spiders/chi_human_relations.py

Lines 56 to 68 in a6a0ea8

    
    def _parse_schedule_pdf(self, response):
        """Parse dates and details from schedule PDF"""
        pdf_obj = PdfFileReader(BytesIO(response.body))
        pdf_text = pdf_obj.getPage(0).extractText().replace("\n", "")
        # Remove duplicate characters not followed by lowercase (as in 5:00pm)
        clean_text = re.sub(r"([A-Z0-9:])\1(?![a-z])", r"\1", pdf_text, flags=re.M)
        # Remove duplicate spaces
        clean_text = re.sub(r"\s+", " ", clean_text)
        year_str = re.search(r"\d{4}", clean_text).group()
        self._validate_location(clean_text)
        for date_str in re.findall(r"[A-Z]{3,10}\s+\d{1,2}(?!\d)", clean_text):
            self.meeting_starts.append(self._parse_start(date_str, year_str))
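
As a rough illustration of what the two `re.sub` calls and the `re.findall` in the snippet above do, here is a standalone sketch using only the standard library. The sample string and the helper names `clean_schedule_text` and `find_date_strings` are made up for illustration; only the regular expressions are taken from the snippet.

```python
import re

# Hypothetical sample: mimics extracted text where the header characters
# came out doubled, while the "5:00pm" body text did not.
SAMPLE = "22002200  SSCCHHEEDDUULLEE   JANUARY 9   MARCH 12   5:00pm"


def clean_schedule_text(pdf_text):
    """Collapse doubled characters and repeated whitespace."""
    # Collapse a doubled capital, digit, or colon unless a lowercase
    # letter follows it (this protects the "00" in "5:00pm").
    text = re.sub(r"([A-Z0-9:])\1(?![a-z])", r"\1", pdf_text, flags=re.M)
    # Collapse any run of whitespace into a single space.
    return re.sub(r"\s+", " ", text)


def find_date_strings(clean_text):
    """Return 'MONTH DAY' style strings such as 'JANUARY 9'."""
    return re.findall(r"[A-Z]{3,10}\s+\d{1,2}(?!\d)", clean_text)


clean_text = clean_schedule_text(SAMPLE)
print(clean_text)
print(find_date_strings(clean_text))
```

One caveat with this pattern: it would also collapse genuine doubles, such as the "MM" in "COMMISSION" or a day like "11", so it only suits text where every affected character really was duplicated during extraction.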

For now let's see if the parsing will work in PyPDF2, but if you run into more issues I think it would be fine to add pdfminer.six (which is in the Pipfile for city-scrapers-cle).

Related to the tempfile, you should be able to use BytesIO instead since we're using Python 3, and there's an example of that in the chi_human_relations example. BytesIO should work for both PyPDF2 and pdfminer.
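
To make the `tempfile` versus `BytesIO` point concrete, here is a minimal sketch using only the standard library. `read_header` is a hypothetical stand-in for a PDF parser that expects a binary file-like object, and `fake_body` stands in for `response.body`; neither comes from the project.

```python
import tempfile
from io import BytesIO


def read_header(pdf_file):
    """Stand-in for a parser: read the first 5 bytes of a binary file-like object."""
    pdf_file.seek(0)
    return pdf_file.read(5)


def header_via_tempfile(body):
    # Write the bytes to a temporary file on disk first; the file is
    # removed automatically when the with-block exits.
    with tempfile.TemporaryFile() as tmp:
        tmp.write(body)
        return read_header(tmp)


def header_via_bytesio(body):
    # Wrap the bytes in an in-memory file-like object; nothing touches disk.
    return read_header(BytesIO(body))


# Hypothetical placeholder for response.body (not a valid PDF).
fake_body = b"%PDF-1.4\n% placeholder bytes\n"
print(header_via_tempfile(fake_body))
print(header_via_bytesio(fake_body))
```

Both functions hand the stand-in parser the same bytes. The `BytesIO` version never writes to the filesystem, which sidesteps the question of downloading files when the spider runs on different machines.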

Let me know if you run into any issues with this, and thanks again for doing that research!

egfrank · 2020-02-08T21:20:23Z

Oh awesome I don't know why I missed that in the codebase! Sweet okay I'll look at PyPDF2 and BytesIO.

egfrank · 2020-09-16T21:02:26Z

I finally looked back at this and opened up a new PR!
#973

Sorry about the delay - once my branch got out of date it was difficult to get the checks to pass, and it ended up being easier to start fresh.

pjsier added good first issue help wanted new spider needed location: chicago labels Feb 3, 2019

pjsier added the Hacktoberfest label Oct 20, 2019

pjsier added claimed and removed help wanted labels Nov 18, 2019

pjsier added help wanted and removed claimed labels Nov 18, 2019

pjsier added claimed and removed help wanted labels Jan 24, 2020

egfrank mentioned this issue Feb 8, 2020

[Work in Progress] 0676 spiderchi southwest home equity i #946

Closed

5 tasks

pjsier removed the Hacktoberfest label Sep 3, 2020

egfrank mentioned this issue Sep 16, 2020

0676 spiderchi southwest home equity i #973

Open

5 tasks
