Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Frapell enhancements #398

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions config.py
Original file line number Diff line number Diff line change
Expand Up @@ -123,6 +123,9 @@
LOG_TITLES = DIR_TEMP + "/titles.txt"
LOG_LOCALE = DIR_TEMP + "/locale.txt"

# Maximum filename length
IMG_MAX_NAME_LEN = 240

# prefix for URL of local images
IMAGES_URL_PREFIX = "/images/"

Expand Down
1 change: 1 addition & 0 deletions requirements-dev.txt
Original file line number Diff line number Diff line change
Expand Up @@ -9,3 +9,4 @@ progress
pytest
pytest-mock
lxml
python-magic
11 changes: 10 additions & 1 deletion src/images/embed.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,19 +17,28 @@
"""Embed pre-selected images in HTML source."""

import logging
import magic
import os

import bs4

import config

logger = logging.getLogger('images.embed')
mimetype = magic.Magic(mime=True)


def image_is_embeddable(imgpath, imgsize):
"""Decide if given image will be embedded in HTML source."""
result = False
_, ext = os.path.splitext(imgpath)
return ext.lower() == '.svg' and imgsize < 40960
if ext.lower() == '.svg' and imgsize < 40960:
# Do not assume an image is an SVG based only in file extension
if os.path.exists(imgpath):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hay casos donde preguntamos por la imagen y la misma no está?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No debería, pero mimetype.from_file levanta una excepción si no encuentra el archivo y tenés que arrancar todo el proceso de nuevo... Me parece mas prudente dejar que termine y después uno decide si la imagen que faltó (y que no debería) amerita empezar la corrida nuevamente.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Te cambio el punto de vista de la pregunta.

Esa imagen siempre está, no vale la pena preguntar si existe. Si esa imagen no está, realmente tenemos un bug en otro lado (fijate de donde se llama a esta función). Entonces, no preguntes si la imagen está, porque estás escondiendo un potencial problema.

mt = mimetype.from_file(imgpath)
if mt.startswith('image/svg'):
result = True
return result


class _EmbedImages:
Expand Down
19 changes: 19 additions & 0 deletions src/images/extract.py
Original file line number Diff line number Diff line change
Expand Up @@ -252,6 +252,25 @@ def replace(tag):
logger.warning("Unsupported image with GET args. Won't be included: %s", dsk_url)
return None, None

# Make sure dsk_url lenght is below filesystem limit
limit = config.IMG_MAX_NAME_LEN
basedir, filename = os.path.split(dsk_url)
name, ext = os.path.splitext(filename)
if len(filename) > limit:
# We cannot simply get the [:limit] part of the name, since we
# cannot know if we will have conflicts with other image names,
# so we'll split the filename into subfolders.
# superbigfilename.png would be super/bigfi/lename.png
logger.debug("Filename too long for %r", dsk_url)
new_split_name = []
for i in range((len(name)//limit)+1):
new_part = name[i*limit:(i+1)*limit]
new_part.replace('.', '').replace('-', '').replace('/', '')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No entiendo el por qué de estos replaces...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

El - no recuerdo por qué está ahí... pero el . y el / son por las dudas que puedieran generar algun conflicto en algún sistema de archivos, pero los puedo sacar, si te parece.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No agreguemos código "por las dudas". Si no podés crear un test en test_extract.py que "explote" eso, entonces no hay un caso que amerite...

new_split_name.append(new_part)
new_dir = os.path.join(*new_split_name)
dsk_url = os.path.join(basedir,new_dir)
dsk_url += ext

logger.debug("web url: %r, dsk_url %r", web_url, dsk_url)

# Replace the width and height by a querystring in the src of the image
Expand Down
2 changes: 1 addition & 1 deletion src/images/scale.py
Original file line number Diff line number Diff line change
Expand Up @@ -105,7 +105,7 @@ def run(verbose, src):
logger.debug("Rescaling to %d%% image %s", scale, dskurl)
if scale == 100:
done_now[dskurl] = 100
if embed_enabled and image_is_embeddable(dskurl, imgsize):
if embed_enabled and image_is_embeddable(frompath, imgsize):
# don't copy image, leave it out of image blocks, it will
# be embedded from original location (without any reduction)
images_embed.add(dskurl)
Expand Down
3 changes: 2 additions & 1 deletion src/scraping/css.py
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,7 @@
Reference: https://www.mediawiki.org/wiki/API:Styling_content
"""

import html
import logging
import functools
import os
Expand Down Expand Up @@ -163,7 +164,7 @@ def _module_names(self):

unique_names = set()
for link in raw_links:
url = urllib.parse.urlparse(link)
url = urllib.parse.urlparse(html.unescape(link))
query = dict(urllib.parse.parse_qsl(url.query))
names = query.get('modules')
if not names:
Expand Down
7 changes: 6 additions & 1 deletion src/scraping/scraper.py
Original file line number Diff line number Diff line change
Expand Up @@ -130,11 +130,16 @@ def fetch_html(url):
try:
req = urllib.request.Request(url, headers=REQUEST_HEADERS)
resp = urllib.request.urlopen(req, timeout=60)
compressedstream = io.BytesIO(resp.read())
resp_content = resp.read()
compressedstream = io.BytesIO(resp_content)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

¿este cambio para qué es?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Es porque corté el branch de antes que se mergee #397 suponia que luego de mergearse no iba a mostrar el diff, no se por que lo sigue mostrando...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

En tu máquina local, contra master, no ves esta parte del diff?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fijate que este PR dice que está "outdated contra master", quizás te falta actualizar tu branch, y por eso se sigue viendo acá?

gzipper = gzip.GzipFile(fileobj=compressedstream)
html = gzipper.read().decode('utf-8')
return html

except gzip.BadGzipFile:
# response content is uncompressed
return resp_content.decode('utf-8')

except Exception as err:
if isinstance(err, urllib.error.HTTPError) and err.code == 404:
raise FetchingError("Failed with HTTPError 404 on url %r", url)
Expand Down