Spider: Chicago Southwest Home Equity Commission I #676

Open
pjsier opened this issue Feb 3, 2019 · 14 comments

Comments

@pjsier
Collaborator

pjsier commented Feb 3, 2019

URL: https://swhomeequity.com/agenda-%26-minutes
Spider Name: chi_southwest_home_equity_i
Agency Name: Chicago Southwest Home Equity Commission I

See the contribution guide for information on how to get started

@mattpair
Contributor

mattpair commented Apr 5, 2019

Hi, just cloned the repo. I'm going to make a branch for this spider

@mattpair
Contributor

mattpair commented Apr 5, 2019

@pjsier it looks like most of the info, apart from the date and the meeting type (BOARD), is contained in a PDF.

Let's discuss how to proceed but since this is my first, I think I'll start work on a more straightforward one.

@pjsier
Collaborator Author

pjsier commented Apr 5, 2019

@mattpair that approach makes sense to me. Let me know if you need any help finding a more straightforward spider to work on; we have a good number available.

@mattpair
Contributor

mattpair commented May 8, 2019

I'll resume work on this one

@haidtang

If this issue is unclaimed, I would like to work on that.

@pjsier
Collaborator Author

pjsier commented Nov 18, 2019

@haidtang sure! In general we like contributors to stick to one issue at a time, so I'll assign you to this one for now and not the other. Let me know if you'd like to switch that, though.

@haidtang

@pjsier I got it. Could you please switch me to the other issue, #566? I think I have a better idea of how to deal with that one. Thank you so much.

@pjsier
Collaborator Author

pjsier commented Nov 18, 2019

Sure thing!

@egfrank
Contributor

egfrank commented Jan 24, 2020

Hey is this issue unclaimed? I'd be happy to work on this one if so. Also, it seems like it will involve reading pdfs; has that been done within this project before?

@pjsier
Collaborator Author

pjsier commented Jan 24, 2020

It looks like it's been inactive for more than a month, so it's all yours if you're interested!

@egfrank
Contributor

egfrank commented Feb 8, 2020

Hi @pjsier, most of the information for these meetings is included in PDFs of the minutes and agenda for each meeting. Are there any other spiders in this project that have already downloaded / parsed PDFs?

Just working on my own computer, I'm able to parse the files using the package pdfminer.six, which I found via googling and which was the most highly recommended Python package for reading PDFs that I could find. That package also requires that the files are downloaded, so I'm using tempfile, which should delete the files after the text is extracted from them.

My question though is do you want to introduce that new package into the project? And is the fact that it needs to download files going to be a problem for running the spider on different computers?

@pjsier
Collaborator Author

pjsier commented Feb 8, 2020

@egfrank thanks for checking that out! We're currently using PyPDF2, but we've used pdfminer.six on another project. Here's an example in chi_human_relations:

    def _parse_schedule_pdf(self, response):
        """Parse dates and details from schedule PDF"""
        pdf_obj = PdfFileReader(BytesIO(response.body))
        pdf_text = pdf_obj.getPage(0).extractText().replace("\n", "")
        # Remove duplicate characters not followed by lowercase (as in 5:00pm)
        clean_text = re.sub(r"([A-Z0-9:])\1(?![a-z])", r"\1", pdf_text, flags=re.M)
        # Remove duplicate spaces
        clean_text = re.sub(r"\s+", " ", clean_text)
        year_str = re.search(r"\d{4}", clean_text).group()
        self._validate_location(clean_text)
        for date_str in re.findall(r"[A-Z]{3,10}\s+\d{1,2}(?!\d)", clean_text):
            self.meeting_starts.append(self._parse_start(date_str, year_str))
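
For anyone reading along, here is a minimal sketch of what the regex steps in that snippet do, run on made-up text instead of a real PDF. `SAMPLE_TEXT`, `clean_pdf_text`, and `extract_starts` are hypothetical names for illustration only, and the `strptime` call is just an assumed stand-in for `_parse_start`, whose real implementation isn't shown here.

```python
import re
from datetime import datetime

# Made-up text imitating the doubled characters that the first regex targets.
# The lowercase "pm" suffix is left undoubled, as in the snippet's comment.
SAMPLE_TEXT = (
    "22001199 MMEEEETTIINNGGSS  JJAANNUUAARRYY 1155 5:00pm "
    "MMAARRCCHH 1199 5:00pm"
)


def clean_pdf_text(pdf_text):
    """Collapse doubled uppercase/digit/colon characters, then squeeze spaces."""
    # A pair is collapsed only when it is NOT followed by a lowercase letter,
    # so the "00" in "5:00pm" is left alone.
    deduped = re.sub(r"([A-Z0-9:])\1(?![a-z])", r"\1", pdf_text, flags=re.M)
    return re.sub(r"\s+", " ", deduped)


def extract_starts(pdf_text):
    """Return datetimes for each 'MONTH DAY' found, using the first 4-digit year."""
    clean_text = clean_pdf_text(pdf_text)
    year_str = re.search(r"\d{4}", clean_text).group()
    date_strs = re.findall(r"[A-Z]{3,10}\s+\d{1,2}(?!\d)", clean_text)
    # Assumed stand-in for _parse_start: month name, day, then year.
    return [datetime.strptime(f"{d} {year_str}", "%B %d %Y") for d in date_strs]


if __name__ == "__main__":
    print(clean_pdf_text(SAMPLE_TEXT))
    print(extract_starts(SAMPLE_TEXT))
```

One caveat with this kind of de-duplication: a genuine repeated pair that isn't followed by a lowercase letter (the "00" in "10:00 AM", or the "11" in "MARCH 11") would be collapsed too, so it is worth checking the cleaned text against the actual PDF.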

For now let's see if the parsing will work in PyPDF2, but if you run into more issues I think it would be fine to add pdfminer.six (which is in the Pipfile for city-scrapers-cle).

Related to the tempfile, you should be able to use BytesIO instead since we're using Python 3, and there's an example of that in the chi_human_relations example. BytesIO should work for both PyPDF2 and pdfminer.
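
To make the tempfile vs. BytesIO point concrete, here is a small sketch using only the standard library. `FAKE_PDF_BODY` is a hypothetical stand-in for `response.body` (placeholder bytes, not a real PDF), and both helper functions are made up for illustration.

```python
import tempfile
from io import BytesIO

# Hypothetical stand-in for `response.body`; not a valid PDF document.
FAKE_PDF_BODY = b"%PDF-1.4\n% placeholder bytes only\n%%EOF\n"


def header_via_tempfile(body):
    """Write the bytes to a temporary file on disk, then read them back."""
    with tempfile.TemporaryFile() as tmp:
        tmp.write(body)
        tmp.seek(0)  # rewind before reading what was just written
        return tmp.read(8)


def header_via_bytesio(body):
    """Wrap the bytes in an in-memory, seekable file-like object."""
    stream = BytesIO(body)
    return stream.read(8)


if __name__ == "__main__":
    # Both approaches expose the same file-like interface and the same bytes.
    print(header_via_tempfile(FAKE_PDF_BODY))
    print(header_via_bytesio(FAKE_PDF_BODY))
```

The BytesIO version never touches the filesystem, so there is nothing to clean up, and the resulting stream can be handed to a reader the same way the `PdfFileReader(BytesIO(response.body))` line does in the example above.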

Let me know if you run into any issues with this, and thanks again for doing that research!

@egfrank
Contributor

egfrank commented Feb 8, 2020

Oh awesome I don't know why I missed that in the codebase! Sweet okay I'll look at PyPDF2 and BytesIO.

@egfrank
Contributor

egfrank commented Sep 16, 2020

I finally looked back at this and opened up a new PR!
#973

Sorry about the delay. Once my branch got out of date it was difficult to get the checks to pass, and it ended up being easier to start fresh.
