Download inline images via org-download #28

et2010 · 2018-05-30T08:50:39Z

Thanks for this great package. I would like to download inline images via org-download during capturing but I don't know where to start. Is it possible? If so, could you shed some light on how to achieve this? Thanks!

alphapapa · 2018-05-31T00:04:31Z

Hi,

You're welcome, I'm glad you find it useful.

That's an interesting idea. I think it's possible, but it would take some work. I think it would go something like this:

1. Find images in the captured HTML. This should probably be done before converting to Org format, so you could, e.g., use libxml to parse the HTML and find all img tags.
2. Find any of those images that have relative URLs and make them absolute URLs, using the URL of the captured page.
3. If there are any image URLs to be downloaded, choose a directory to store them in.
4. For each image URL, download it, and change the URL in the HTML to point to the local file, taking into account the eventual location of the Org file the captured page is going to be stored in.
5. Convert the HTML to Org and finish capturing it as usual.
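
The first two steps could be sketched roughly as follows. This is only an illustration, not code from this package: the function name my/html-absolute-image-urls is hypothetical, and it assumes an Emacs built with libxml2 support. It parses an HTML string, collects the src attribute of every img tag, and resolves relative URLs against the URL of the captured page.

```elisp
(require 'dom)
(require 'url-expand)

(defun my/html-absolute-image-urls (html base-url)
  "Return absolute URLs for the src of every <img> tag in HTML.
HTML is a string.  Relative URLs are resolved against BASE-URL.
Hypothetical helper, not part of org-protocol-capture-html."
  (let ((dom (with-temp-buffer
               (insert html)
               ;; Requires Emacs to be built with libxml2 support.
               (libxml-parse-html-region (point-min) (point-max)))))
    (delq nil
          (mapcar (lambda (img)
                    (let ((src (dom-attr img 'src)))
                      ;; Skip <img> tags without a usable src attribute.
                      (when (and src (not (string= src "")))
                        ;; Fully-qualified URLs are returned unchanged.
                        (url-expand-file-name src base-url))))
                  (dom-by-tag dom 'img)))))
```

Downloading each URL and rewriting the src attributes (steps 3 and 4) would still have to be built on top of this. Note that data: URIs are passed through unchanged here, so a real implementation would probably want to filter them out.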

Another possibility might be to use a different package altogether. For example, I haven't tried it yet, but org-board is designed to capture entire pages, including images. Maybe it could be modified to capture parts of pages instead of whole ones, or maybe it could be fed HTML from the Javascript bookmarklet in this package, and then it could handle the downloading and capturing itself. That might be a simpler way to handle it, compared to reimplementing some of org-board in this package. :)

What do you think?

yiufung · 2019-03-18T17:07:02Z

First, thanks for creating the snippet. It's really useful for storing a local copy of important articles.

I dug around for a while and found that pandoc has an option --extract-media:

Extract images and other media contained in or linked from the source document to the path
DIR, creating it if necessary, and adjust the images references in the document so they point
to the extracted files.

So for Kitchin's page, I tried:

pandoc -s http://kitchingroup.cheme.cmu.edu/blog/2014/07/17/Pandoc-does-org-mode-now/ -f html -t org --wrap=none --extract-media=/home/yiufung/test_pandoc > test_pandoc.org

And it works nicely: the image links in the resulting Org file point to the correct local paths.

So it seems to me that if we pass in org-download-image-dir, the picture links should be correct.

I took a look at org-protocol-capture-html.el and org-web-tools; it seems that the data, not the URL, is passed in. This is so that eww can be used to extract the readable parts, right? Do you think there's a good way to combine both?
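
A minimal sketch of that idea, assuming pandoc is on the PATH and reusing the flags from the command above. The names my/pandoc-url-to-org-args and my/pandoc-url-to-org are hypothetical. It falls back to default-directory when org-download-image-dir is unset, and it does not try to reproduce any extra subdirectory layout that org-download itself may apply.

```elisp
(defvar org-download-image-dir)  ; defined by org-download, if loaded

(defun my/pandoc-url-to-org-args (url media-dir)
  "Return the pandoc arguments to convert URL to Org.
Images are extracted to MEDIA-DIR.  Hypothetical helper."
  (list "-s" url "-f" "html" "-t" "org" "--wrap=none"
        (concat "--extract-media=" (expand-file-name media-dir))))

(defun my/pandoc-url-to-org (url &optional media-dir)
  "Insert the Org conversion of URL at point, saving images locally.
MEDIA-DIR defaults to `org-download-image-dir' when that is set,
otherwise to `default-directory'.  Hypothetical command."
  (interactive "sURL: ")
  (let* ((dir (or media-dir
                  (bound-and-true-p org-download-image-dir)
                  default-directory))
         (org (with-temp-buffer
                ;; Collect stdout in the temp buffer and discard stderr.
                (let ((status (apply #'call-process "pandoc" nil '(t nil) nil
                                     (my/pandoc-url-to-org-args url dir))))
                  (unless (eq status 0)
                    (error "pandoc failed for %s (status %s)" url status)))
                (buffer-string))))
    ;; Insert only after pandoc has finished successfully.
    (insert org)))
```

Because pandoc fetches and converts the whole page, this bypasses the eww-readable filtering, which is the part that would still need to be combined with it.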

yiufung · 2019-03-18T17:46:56Z

1. Use pandoc to download the HTML in org-web-tools--get-url, passing in --extract-media.
2. Filter the same eww-readable content from the downloaded HTML, then continue the conversion.

It seems that this way we can get the best of both.

yiufung · 2019-03-30T08:55:13Z

I figured out a snippet to download all image links using org-download:

(defun search-forward-and-org-download-images ()
  "Search forward for HTTP image URLs and replace each one.
Each link is replaced using `org-download-image' to obtain a local copy."
  (interactive)
  ;; In newer Org versions (9.3+), `org-bracket-link-regexp' is an
  ;; obsolete alias for `org-link-bracket-re'.
  (while (re-search-forward org-bracket-link-regexp nil t)
    (let ((beg (match-beginning 0))
          (end (match-end 0))
          ;; Group 1 of the bracket-link regexp is the link target.
          (link (match-string-no-properties 1)))
      ;; Only handle links that look like remote images.
      (when (string-match "^http.*?\\.\\(?:png\\|jpg\\|jpeg\\)\\(.*\\)$" link)
        ;; Use a format string so "%" in the URL is not misinterpreted.
        (message "Downloading image: %s" link)
        (delete-region beg end)
        (org-download-image link)
        (sleep-for 1)))))  ; Some sites dislike frequent requests

So before finishing the capture, I run this command to download all images. Hope that's helpful.

alphapapa · 2019-03-31T08:10:21Z

That code doesn't have error checking, so if org-download-image fails, the link will remain deleted.
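
One way to address that, as a sketch only: call the download function first and delete the old link only if it returns without signalling an error. The function name my/org-download-images-keep-on-error and its DOWNLOAD-FN argument are hypothetical. This only catches errors that are signalled synchronously, so it would not notice a failure that happens later in an asynchronous download.

```elisp
(require 'org)

(defun my/org-download-images-keep-on-error (&optional download-fn)
  "Replace remote image links after point with local copies.
DOWNLOAD-FN is called with the URL, with point just after the old
link, and is expected to insert the new link there.  It defaults to
`org-download-image'.  The old link is deleted only if DOWNLOAD-FN
returns without signalling an error.  Hypothetical command."
  (interactive)
  (let ((download-fn (or download-fn 'org-download-image))
        ;; Prefer the newer variable name when it is available.
        (link-re (if (boundp 'org-link-bracket-re)
                     (symbol-value 'org-link-bracket-re)
                   (symbol-value 'org-bracket-link-regexp))))
    (while (re-search-forward link-re nil t)
      ;; Markers keep pointing at the old link while text is inserted.
      (let ((beg (copy-marker (match-beginning 0)))
            (end (copy-marker (match-end 0)))
            (link (match-string-no-properties 1)))
        (when (string-match-p "\\`http.*?\\.\\(?:png\\|jpg\\|jpeg\\)" link)
          (condition-case err
              (progn
                (goto-char end)
                (funcall download-fn link)
                ;; Reached only if the download function did not fail.
                (delete-region beg end))
            (error
             (goto-char end)
             (message "Keeping link %s: %s"
                      link (error-message-string err)))))
        (set-marker beg nil)
        (set-marker end nil)))))
```

Passing the download function as an argument is just a design choice to make the sketch easy to try with a stub before pointing it at org-download-image.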

yiufung · 2019-03-31T16:46:47Z

Indeed, it's just a snippet as a proof of concept. Do you plan to integrate features like this? I mean, your tool with org-download makes archiving articles really easy, and I think it would be quite useful for many others too.

alphapapa · 2019-04-01T14:41:23Z

Maybe sometime. I haven't been able to use this tool lately because I haven't been able to get my browser to work with the MIME type to connect it to emacsclient. So I've been using org-web-tools to archive pages, which uses wget or archive.is, which downloads images, etc.

Anton-Latukha · 2020-01-14T20:13:39Z

@alphapapa We already supplied a couple of instructions on how to register the org-protocol handler in pull requests:

https://github.com/alphapapa/org-protocol-capture-html/pull/33/files

https://github.com/alphapapa/org-protocol-capture-html/pull/36/files

Maybe those guides would help.

I also used to lose the handler during some updates, and redoing the procedure using my guide helped.

alphapapa · 2020-01-14T21:12:04Z

@Anton-Latukha Just curious, is that the "royal we" or do you represent more than just yourself? :) I'll post on those PRs again.

Anton-Latukha · 2020-01-16T10:28:39Z

I really love myself. I even posted links to two PRs by two different people, both of which share info on the topic you were concerned about. If we merge them, we may even help other people with that. I'm just helping to move this discussion forward by helping you solve #28 (comment).

sbwcwso · 2023-05-20T13:01:25Z

Now the image links are stripped of the website prefix. How can I fix this?
