Download inline images via org-download #28

Open
et2010 opened this issue May 30, 2018 · 11 comments

Comments

@et2010

et2010 commented May 30, 2018

Hi, @alphapapa

Thanks for this great package. I would like to download inline images via org-download during capturing but I don't know where to start. Is it possible? If so, could you shed some light on how to achieve this? Thanks!

@alphapapa
Owner

Hi,

You're welcome, I'm glad you find it useful.

That's an interesting idea. I think it's possible, but it would take some work. I think it would go something like this:

  1. Find images in captured HTML. This should probably be done before converting to Org format, so you could, e.g. use libxml to parse the HTML and find all img tags.
  2. Find any of those images that have relative URLs and make them absolute URLs, using the URL of the captured page.
  3. If there are any image URLs to be downloaded, choose a directory to store them in.
  4. For each image URL, download it, and change the URL in the HTML to point to the local file, taking into account the eventual location of the Org file the captured page is going to be stored in.
  5. Convert the HTML to Org and finish capturing it as usual.
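Steps 1 and 2 could be sketched roughly as follows. This is only an illustration, not code from this package: the function name `my/html-image-urls` is hypothetical, and it assumes an Emacs built with libxml2 support plus the built-in `dom` and `url-expand` libraries.

```elisp
(require 'dom)
(require 'url-expand)

;; Hypothetical helper, not part of org-protocol-capture-html.
(defun my/html-image-urls (html base-url)
  "Return the absolute URLs of all img tags in the string HTML.
Relative src attributes are resolved against BASE-URL, the URL
of the captured page."
  (with-temp-buffer
    (insert html)
    ;; Parse the HTML into a DOM tree (requires libxml2 support).
    (let ((dom (libxml-parse-html-region (point-min) (point-max))))
      (delq nil
            (mapcar (lambda (img)
                      (let ((src (dom-attr img 'src)))
                        ;; Skip img tags that have no src attribute.
                        (when src
                          (url-expand-file-name src base-url))))
                    (dom-by-tag dom 'img))))))
```

Calling `(my/html-image-urls html "http://example.com/blog/post/")` would give the list of URLs to download in steps 3 and 4; rewriting the `src` attributes and choosing the target directory are left out here.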

Another possibility might be to use a different package altogether. For example, I haven't tried it yet, but org-board is designed to capture entire pages, including images. Maybe it could be modified to capture parts of pages instead of whole ones, or maybe it could be fed HTML from the Javascript bookmarklet in this package, and then it could handle the downloading and capturing itself. That might be a simpler way to handle it, compared to reimplementing some of org-board in this package. :)

What do you think?

@yiufung

yiufung commented Mar 18, 2019

First, thanks for creating the snippet. It's really useful for storing local copies of important articles.

I dug around for a while and found that pandoc has an option, --extract-media:

Extract images and other media contained in or linked from the source document to the path
DIR, creating it if necessary, and adjust the images references in the document so they point
to the extracted files.

So for Kitchin's page, I tried:

pandoc -s http://kitchingroup.cheme.cmu.edu/blog/2014/07/17/Pandoc-does-org-mode-now/ -f html -t org --wrap=none --extract-media=/home/yiufung/test_pandoc > test_pandoc.org

It works nicely: the links in the Org output point to the correct paths.

So it seems to me that if we pass in org-download-image-dir as the media directory, the image links should be correct.

I took a look at org-protocol-capture-html.el and org-web-tools, and it seems that the data, not the URL, is passed in. This is so that eww can extract the readable parts, right? Do you think there's a good way to combine both?
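To make the idea concrete, here is a minimal sketch of feeding captured HTML to pandoc with --extract-media from Emacs. The function name `my/html-to-org-with-media` is hypothetical, and the pandoc flags are the same ones used in the command above. One caveat, stated as an assumption: since the HTML arrives on stdin rather than as a URL, pandoc has no page URL to resolve relative image paths against, so only absolute image URLs are likely to be fetched.

```elisp
;; Hypothetical helper; assumes the pandoc executable is on `exec-path'.
(defun my/html-to-org-with-media (html media-dir)
  "Convert the string HTML to Org with pandoc and return the result.
Media referenced by the document are extracted into MEDIA-DIR.
Signal an error if pandoc exits with a non-zero status."
  (with-temp-buffer
    (insert html)
    ;; Send the buffer to pandoc on stdin and replace it with the output.
    (let ((status (call-process-region
                   (point-min) (point-max) "pandoc"
                   t t nil
                   "-f" "html" "-t" "org" "--wrap=none"
                   (concat "--extract-media="
                           (expand-file-name media-dir)))))
      (unless (eq status 0)
        (error "pandoc exited with status %s: %s"
               status (buffer-string)))
      (buffer-string))))
```

With org-download loaded, something like `(my/html-to-org-with-media html org-download-image-dir)` would be the call, provided that variable is set to a directory.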

@yiufung

yiufung commented Mar 18, 2019

  1. Use pandoc to download the HTML in org-web-tools--get-url, passing in --extract-media.
  2. Filter the same eww-readable content from the downloaded HTML, then continue the conversion.

It seems that this way we can get the best of both.

@yiufung
Copy link

yiufung commented Mar 30, 2019

I figured out a snippet to download all linked images using org-download:

(require 'org-download)

(defun search-forward-and-org-download-images ()
  "Replace remote image links after point with local copies.
Search forward for bracket links to HTTP(S) images and replace
each one using `org-download-image'."
  (interactive)
  ;; Newer Org releases name this regexp `org-link-bracket-re'.
  (while (re-search-forward org-bracket-link-regexp nil t)
    (let ((beg (match-beginning 0))
          (end (match-end 0))
          ;; Group 1 of the bracket-link regexp is the link target.
          (link (match-string-no-properties 1)))
      ;; Treat links containing a png/jpg/jpeg extension as images.
      (when (string-match "^http.*?\\.\\(?:png\\|jpg\\|jpeg\\)\\(.*\\)$"
                          link)
        ;; Use a format string, since the URL may contain % characters.
        (message "Downloading image: %s" link)
        (delete-region beg end)
        (org-download-image link)
        (sleep-for 1))))) ; Some sites dislike frequent requests

So before finishing the capture, I run this command to download all the images. Hope that's helpful.

@alphapapa
Owner

That code doesn't have error checking, so if org-download-image fails, the link will remain deleted.
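One way to guard against that, sketched under assumptions: wrap the download in `condition-case` and put the original link back if an error is signalled. The helper name `my/replace-region-with-download` is hypothetical, and the download function is passed in as an argument so the sketch does not depend on org-download. It only catches errors signalled synchronously; whether org-download-image reports failures that way depends on its backend, so an asynchronous download that fails later would not be caught.

```elisp
;; Hypothetical helper; DOWNLOAD-FN stands in for `org-download-image'.
(defun my/replace-region-with-download (beg end link download-fn)
  "Replace the text between BEG and END by calling DOWNLOAD-FN on LINK.
The region is deleted and DOWNLOAD-FN is called with point at BEG.
If DOWNLOAD-FN signals an error, reinsert the original text at
BEG and return nil; otherwise return t."
  (let ((original (buffer-substring beg end)))
    (goto-char beg)
    (delete-region beg end)
    (condition-case err
        (progn (funcall download-fn link) t)
      (error
       (message "Download failed for %s: %s"
                link (error-message-string err))
       ;; Restore the link so that a failed download loses nothing.
       (goto-char beg)
       (insert original)
       nil))))
```

In the loop above, the `delete-region` and `org-download-image` calls could then be replaced with `(my/replace-region-with-download beg end link #'org-download-image)`. After a failure, point is left after the restored link, so the search continues past it.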

@yiufung

yiufung commented Mar 31, 2019

Indeed, just a snippet as POC. Do you plan to integrate features like this? I mean, your tool with org-download makes archiving articles really easy, and I think it will be quite useful for many others too.

@alphapapa
Owner

Maybe sometime. I haven't been able to use this tool lately because I haven't been able to get my browser to work with the MIME type to connect it to emacsclient. So I've been using org-web-tools to archive pages, which uses wget or archive.is, which downloads images, etc.

@Anton-Latukha

Anton-Latukha commented Jan 14, 2020

@alphapapa We already supplied a couple of instructions on how to register the org-protocol handler in pull requests:

https://github.com/alphapapa/org-protocol-capture-html/pull/33/files

https://github.com/alphapapa/org-protocol-capture-html/pull/36/files

Maybe those guides would help.

I also used to lose the handler during some updates; redoing the procedure from my guide fixed it.

@alphapapa
Owner

@Anton-Latukha Just curious, is that the "royal we" or do you represent more than just yourself? :) I'll post on those PRs again.

@Anton-Latukha

Anton-Latukha commented Jan 16, 2020

I really love myself. I even posted two links to two PRs by two different people, which share information on the topic you were concerned about. If we merge them, we might even help other people with that. I'm just helping this discussion progress by helping you solve #28 (comment).

@sbwcwso

sbwcwso commented May 20, 2023

Now the image links are stripped of the website prefix. How can I fix this?

5 participants