Add parser of html links #87
Conversation
Viva la SIGIL, I don't know why.
😮 This is going to be the foundation of the PyCZ
Or maybe a part of a generic link-finding plugin. 🤔 But it may be too challenging to develop a plugin architecture. I'd put it in the Spider and think about extracting it to a plugin later.
After thinking more, I concluded that providing a set of the most commonly used scraping tools would make the project much more helpful. The goal of Onigumo is to automate the common ground of scraping and to provide a toolchain that fits in nicely.
For now:
- This functionality does not belong to Onigumo Parser. Let’s move it to some Plugins or Tools namespace and name it more precisely, e.g., LinkExtractor, LinkFinder, Links, or similar.
- Rebase on the current elixir to clean the diff of the Hash-related changes.
After that, I think this will be up to a proper review. Thanks for the excellent work!
I merged in the current master to minimize the number of conflicts. The pull request is now clean of unrelated changes.
You expressed uncertainty about what to do with this pull request. As mentioned in my review, I think the link finder belongs in a standalone module in a toolbox/plugin namespace instead of the Parser. I can't think of an ideal architecture right now, so I won't object to anything that isn't wrong. The Parser's job is to check for downloaded raw data and pass it to the Spider's concrete parser, not to do the parsing itself. Is it clear now? Don't hesitate to raise any questions. The changeset is clean after the merge, and we agreed that we want this piece of code in the application. I will do a proper review.
I am only requesting changes because of the module name. As mentioned before, this functionality does not belong in Onigumo's Parser. All the other comments are only suggestions or questions, and I don't need to block the pull request because of them. We can improve on them in follow-up pull requests.
lib/parser.ex
Outdated
def html_links(document) do
  Floki.parse_document!(document)
  |> Floki.find("a")
  |> Floki.attribute("href")
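For context, a small walk-through of what this pipeline does, assuming the Floki dependency declared in mix.exs; the HTML string and variable names are illustrative only, not taken from the pull request:

```elixir
# Illustrative input, not from the pull request.
html = ~s(<a href="/a">A</a><a href="/b">B</a><a id="nothing"></a>)

links =
  html
  # Parse the raw HTML string into Floki's tuple tree,
  # e.g. [{"a", [{"href", "/a"}], ["A"]}, ...]
  |> Floki.parse_document!()
  # Select every <a> element.
  |> Floki.find("a")
  # Collect the href attribute values; anchors without href are skipped.
  |> Floki.attribute("href")

IO.inspect(links)
```

Under these assumptions, `links` would be `["/a", "/b"]`: the third anchor has only an id, so it contributes no href value.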
Co-authored-by: Glutexo <[email protected]>
- A tool for finding links in HTML is not part of the Onigumo kernel Parser. It is a handy tool for spiders: most spiders will need some basic HTML URL finder, so it makes sense to prepare a basic function for that.
I have only one real objection, marked with
mix.exs
Outdated
{:mox, "~> 1.0", only: :test},

# Toolbox dependencies
{:floki, "~> 0.32.0"}
Suggested change:
- {:floki, "~> 0.32.0"}
+ {:floki, "~> 0.32.1"}
Do we want to specify the latest version at the creation of the pull request or round it down to zero? What is the convention?
Answer: ~> means >= the specified patch version, but == the minor version. I would use 0.32.1 now, but we can bump all versions in a separate pull request.
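The semantics described above can be checked with Elixir's built-in Version module; a small sketch, where the version strings are just examples:

```elixir
# ~> 0.32.1 accepts the given patch version and anything newer
# within the same minor version, but rejects the next minor.
IO.inspect(Version.match?("0.32.1", "~> 0.32.1"))  # true: the version itself
IO.inspect(Version.match?("0.32.9", "~> 0.32.1"))  # true: newer patch release
IO.inspect(Version.match?("0.33.0", "~> 0.32.1"))  # false: minor bump is excluded
```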
Oh, you mentioned that in #161 (comment).
I also manually tested the new find_links function, and it works well:

$ iex -S mix
iex(1)> Spider.HTML.find_links(~s(<a href="http://www.seznam.cz/"></a><a class="link" href="http://www.mapy.cz/"></a><a id="nothing"></a>))
["http://www.seznam.cz/", "http://www.mapy.cz/"]
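For reference, a minimal sketch of what a Spider.HTML.find_links implementation could look like, assuming the Floki dependency from mix.exs; this is my reconstruction, not the exact code from the pull request:

```elixir
defmodule Spider.HTML do
  @moduledoc "Toolbox helper for spiders: link extraction from HTML."

  @doc "Returns the href values of all <a> tags in the given HTML string."
  def find_links(document) do
    document
    |> Floki.parse_document!()
    |> Floki.find("a")
    |> Floki.attribute("href")
  end
end
```

With the input from the iex session above, this returns the two href values and ignores the anchor that has only an id.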
The toolbox is for spiders, so it is better to name it the spider toolbox.
Co-authored-by: Glutexo <[email protected]>
Rename the test to correspond with the change from parser to spider.
Co-authored-by: Glutexo <[email protected]>
Update test text to be more precise
Switch the id "b" to "nothing" to be more precise; "b" looks like a placeholder.
Co-authored-by: Glutexo <[email protected]>
From the documentation of the Floki Elixir library:

iex> Floki.parse_document("<html><head></head><body>hello</body></html>")
{:ok, [{"html", [], [{"head", [], []}, {"body", [], ["hello"]}]}]}

iex> Floki.parse_document("<html><head></head><body>hello</body></html>", html_parser: Floki.HTMLParser.Mochiweb)
{:ok, [{"html", [], [{"head", [], []}, {"body", [], ["hello"]}]}]}
@html ~s(<!doctype html>
<html>
  <head>
    <link href="/media/examples/link-element-example.css" rel="stylesheet">
❤️
A parser that can convert a string document into an HTML structure, find the a tags, and then get the href values.