Add parser of html links #87
Conversation
Viva la SIGIL, I don't know why.
😮 This is going to be the foundation of the PyCZ
Or maybe a part of a generic link-finding plugin. 🤔 But it may be too challenging to develop a plugin architecture. I'd put it in the Spider and think about extracting it to a plugin later.
After thinking more, I concluded that providing a set of the most commonly used scraping tools would make the project much more helpful. The goal of Onigumo is to automate the common ground of scraping and to provide a toolchain that fits in nicely.
For now:
- This functionality does not belong to Onigumo Parser. Let’s move it to some Plugins or Tools namespace and name it more precisely, e.g., LinkExtractor, LinkFinder, Links, or similar.
- Rebase on the current elixir to clean the diff of the Hash-related changes.
After that, I think this will be up to a proper review. Thanks for the excellent work!
I merged in the current master to minimize the number of conflicts. The pull request is now clean of unrelated changes.
You expressed uncertainty about what to do with this pull request. As mentioned in my review, I think the link finder belongs in a standalone module in a toolbox/plugin namespace instead of the Parser. I can't think of an ideal architecture right now, so I won't object to anything that isn't wrong. The Parser's job is to check for downloaded raw data and pass it to the Spider's concrete parser, not to do the parsing itself. Is it clear now? Don't hesitate to raise any questions. The changeset is clean after the merge, and we agreed that we want this piece of code in the application. I will do a proper review.
I am only requesting changes because of the module name. As mentioned before, this functionality does not belong in Onigumo's Parser. All the other comments are only suggestions or questions, and I don't need to block the pull request because of them. We can improve on them in follow-up pull requests.
lib/parser.ex
Outdated
def html_links(document) do
  Floki.parse_document!(document)
  |> Floki.find("a")
  |> Floki.attribute("href")
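For context, a small walk-through of what this pipeline does, assuming the Floki dependency declared in mix.exs; the HTML string and variable names are illustrative only, not taken from the pull request:

```elixir
# Illustrative input, not from the pull request.
html = ~s(<a href="/a">A</a><a href="/b">B</a><a id="nothing"></a>)

links =
  html
  # Parse the raw HTML string into Floki's tuple tree,
  # e.g. [{"a", [{"href", "/a"}], ["A"]}, ...]
  |> Floki.parse_document!()
  # Select every <a> element.
  |> Floki.find("a")
  # Collect the href attribute values; anchors without href are skipped.
  |> Floki.attribute("href")

IO.inspect(links)
```

Under these assumptions, `links` would be `["/a", "/b"]`: the third anchor has only an id, so it contributes no href value.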
Co-authored-by: Glutexo <[email protected]>
- A tool for finding links in HTML is not part of the Onigumo kernel Parser. It is a handy tool for spiders: most spiders will need some basic HTML URL finder, so it makes sense to prepare a basic function for that.
I have only one real objection, marked with
mix.exs
Outdated
{:mox, "~> 1.0", only: :test},

# Toolbox dependencies
{:floki, "~> 0.32.0"}
Suggested change:
- {:floki, "~> 0.32.0"}
+ {:floki, "~> 0.32.1"}
Do we want to specify the latest version at the creation of the pull request or round it down to zero? What is the convention?
Answer: ~> means >= the specified patch version, but == the minor version. I would use 0.32.1 now, but we can bump all versions in a separate pull request.
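The semantics described above can be checked with Elixir's built-in Version module; a small sketch, where the version strings are just examples:

```elixir
# ~> 0.32.1 accepts the given patch version and anything newer
# within the same minor version, but rejects the next minor.
IO.inspect(Version.match?("0.32.1", "~> 0.32.1"))  # true: the version itself
IO.inspect(Version.match?("0.32.9", "~> 0.32.1"))  # true: newer patch release
IO.inspect(Version.match?("0.33.0", "~> 0.32.1"))  # false: minor bump is excluded
```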
Oh, you mentioned that in #161 (comment).
I also manually tested the new find_links function, and it works well:

$ iex -S mix
iex(1)> Spider.HTML.find_links(~s(<a href="http://www.seznam.cz/"></a><a class="link" href="http://www.mapy.cz/"></a><a id="nothing"></a>))
["http://www.seznam.cz/", "http://www.mapy.cz/"]
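For reference, a minimal sketch of what a Spider.HTML.find_links implementation could look like, assuming the Floki dependency from mix.exs; this is my reconstruction, not the exact code from the pull request:

```elixir
defmodule Spider.HTML do
  @moduledoc "Toolbox helper for spiders: link extraction from HTML."

  @doc "Returns the href values of all <a> tags in the given HTML string."
  def find_links(document) do
    document
    |> Floki.parse_document!()
    |> Floki.find("a")
    |> Floki.attribute("href")
  end
end
```

With the input from the iex session above, this returns the two href values and ignores the anchor that has only an id.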
The toolbox is for spiders, so it is better to name it the spider toolbox.
Co-authored-by: Glutexo <[email protected]>
Rename the test to correspond with the change from parser to spider.
Co-authored-by: Glutexo <[email protected]>
Update test text to be more precise
Switch the id "b" to "nothing" to be more precise; "b" looks like a placeholder.
Co-authored-by: Glutexo <[email protected]>
From the documentation of the Floki Elixir library:

iex> Floki.parse_document("<html><head></head><body>hello</body></html>")
{:ok, [{"html", [], [{"head", [], []}, {"body", [], ["hello"]}]}]}

iex> Floki.parse_document("<html><head></head><body>hello</body></html>", html_parser: Floki.HTMLParser.Mochiweb)
{:ok, [{"html", [], [{"head", [], []}, {"body", [], ["hello"]}]}]}
@html ~s(<!doctype html>
<html>
  <head>
    <link href="/media/examples/link-element-example.css" rel="stylesheet">
❤️
A parser that can convert a string document into an HTML structure, find the a tags, and then get the href values.