Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parser of html links #87

Merged
merged 22 commits into from
Feb 5, 2023
Merged

Add parser of html links #87

merged 22 commits into from
Feb 5, 2023

Conversation

nappex
Copy link
Collaborator

@nappex nappex commented Mar 27, 2022

Parser which is able to convert string document to html structure, find the a tag a then get the href value.

@nappex
Copy link
Collaborator Author

nappex commented Mar 27, 2022

Viva la SIGIL , I dont know why

@Glutexo
Copy link
Owner

Glutexo commented Mar 30, 2022

😮 This is going to be the foundation of the PyCZ plugin application/Spider. 👍🏻 I don’t know where this should belong yet, but a plain module may be enough for now.

@Glutexo
Copy link
Owner

Glutexo commented Apr 13, 2022

Or maybe a part of a generic link finding plugin. 🤔 But, it may be too challenging to develop a plugin architecture. I’d put that to the Spider and think about extracting it to a plugin later.

Copy link
Owner

@Glutexo Glutexo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking more, I concluded that providing a set of the most commonly used scraping tools would make the project much more helpful. The goal of Onigumo is to automate the common ground for scraping and to provide a toolchain nicely fits.

For now:

  • This functionality does not belong to Onigumo Parser. Let’s move it to some Plugins or Tools namespace and name it more precisely, e.g., LinkExtractor, LinkFinder, Links, or similar.
  • Rebase on the current elixir to clean the diff of the Hash-related changes.
    After that, I think this will be up to a proper review. Thanks for the excellent work!

@Glutexo
Copy link
Owner

Glutexo commented Jul 20, 2022

I merged in the current master to minimize the number of conflicts. The pull request is now clean of unrelated changes.

@Glutexo
Copy link
Owner

Glutexo commented Jul 20, 2022

You expressed uncertainty on what to do with this pull request. As mentioned in my review, I think the link finder belongs instead of the Parser to a standalone module in a toolbox/plugin namespace.

I can’t think now of an ideal architecture. Thus I won’t object to anything that isn’t wrong. The Parser’s job is to check for the downloaded raw data and pass it to the Spider’s concrete parser, not to do the parsing itself.

Is it clear now? Don’t hesitate to raise any questions.

The changeset is clear after the merge, and we agreed that we want this piece of code in the application. I will do a proper review.

Copy link
Owner

@Glutexo Glutexo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am only requesting changes because of the module name. As mentioned before, this functionality is not Onigumo’s Parser. All other comments are only suggestions or questions, and I don’t need to block the pull request because of them. We can improve on them in follow-up ones.

mix.exs Outdated Show resolved Hide resolved
lib/parser.ex Outdated
def html_links(document) do
Floki.parse_document!(document)
|> Floki.find("a")
|> Floki.attribute("href")

This comment was marked as resolved.

lib/parser.ex Outdated Show resolved Hide resolved
mix.exs Outdated Show resolved Hide resolved
test/parse_test.exs Outdated Show resolved Hide resolved
test/parse_test.exs Outdated Show resolved Hide resolved
test/parse_test.exs Outdated Show resolved Hide resolved
test/parse_test.exs Outdated Show resolved Hide resolved
test/parse_test.exs Outdated Show resolved Hide resolved
mix.exs Outdated Show resolved Hide resolved
Copy link
Owner

@Glutexo Glutexo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have only one real objection, marked with ⚠️. Everything else is only light suggestions. Great job! 💪🏻

lib/spider_html.ex Show resolved Hide resolved
mix.exs Outdated Show resolved Hide resolved
mix.exs Outdated
{:mox, "~> 1.0", only: :test},

# Toolbox dependencies
{:floki, "~> 0.32.0"}
Copy link
Owner

@Glutexo Glutexo Jan 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{:floki, "~> 0.32.0"}
{:floki, "~> 0.32.1"}

Do we want to specify the latest version at the creation of the pull request or round it down to zero? What is the convention?

Answer: ~> means >= the specified patch version, but == the minor version. I would use 0.32.1 now, but we can bump all versions in a separate pull request.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you mentioned that in #161 (comment).

mix.lock Show resolved Hide resolved
test/spider_html_test.exs Outdated Show resolved Hide resolved
test/spider_html_test.exs Outdated Show resolved Hide resolved
test/spider_html_test.exs Show resolved Hide resolved
test/spider_html_test.exs Outdated Show resolved Hide resolved
test/spider_html_test.exs Show resolved Hide resolved
test/spider_html_test.exs Outdated Show resolved Hide resolved
@Glutexo
Copy link
Owner

Glutexo commented Jan 23, 2023

I also manually tested the new find_links function, and it works well:

$ iex -S mix
iex(1)> Spider.HTML.find_links(~s(<a href="http://www.seznam.cz/"></a><a class="link" href="http://www.mapy.cz/"></a><a id="nothing"></a>))        
["http://www.seznam.cz/", "http://www.mapy.cz/"]

nappex and others added 5 commits February 5, 2023 17:03
toolbox are for spider so, it'll be better to name as spider toolbox

Co-authored-by: Glutexo <[email protected]>
rename the test, correspond with change of parser to spider

Co-authored-by: Glutexo <[email protected]>
Update test text to be more precise
Switch id "b" to "nothing" to be more precise, "b" looks like as placeholder

Co-authored-by: Glutexo <[email protected]>
@nappex
Copy link
Collaborator Author

nappex commented Feb 5, 2023

Elixir library Floki uses parse_document or parse_fragment. By default there is no difference between them because parse_fragment is just wrapper of parse_document. But this behaviour is just for default Floki.HTMLParser.Mochiweb, you could use your own HTMLParser which defined different behave of these functions.

From documentation of Floki
parse_document(document, opts \\ []) the opts is for specification of any HTMLParser

See:

iex>Floki.parse_document("<html><head></head><body>hello</body></html>")
iex>{:ok, [{"html", [], [{"head", [], []}, {"body", [], ["hello"]}]}]}

iex>Floki.parse_document("<html><head></head><body>hello</body></html>", html_parser: Floki.HTMLParser.Mochiweb)
iex>{:ok, [{"html", [], [{"head", [], []}, {"body", [], ["hello"]}]}]}

Copy link
Owner

@Glutexo Glutexo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@html ~s(<!doctype html>
<html>
<head>
<link href="/media/examples/link-element-example.css" rel="stylesheet">
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@Glutexo Glutexo merged commit fc4455f into Glutexo:master Feb 5, 2023
@Glutexo Glutexo deleted the links-parser branch February 5, 2023 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants