Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

safety: safeify_url() function can't handle (invalid) URLs with [hostname] placeholder #2644

Open
dgw opened this issue Nov 17, 2024 · 4 comments · May be fixed by #2646
Open

safety: safeify_url() function can't handle (invalid) URLs with [hostname] placeholder #2644

dgw opened this issue Nov 17, 2024 · 4 comments · May be fixed by #2646
Labels
Bug Things to squish; generally used for issues

Comments

@dgw
Copy link
Member

dgw commented Nov 17, 2024

If someone sends a link while safety is active in the channel, and the URL contains a placeholder hostname in square brackets, Sopel will spit out an "Unexpected ValueError" message. Note: Seems to happen only on Python 3.11 or higher.

This is from the safeify_url() function added in #2279, which uses urllib.parse.urlparse() to make sanitizing URL parts easier, which in turn uses ipaddress.ip_address() to raise an error for bracketed IPv4 addresses—and trips on this weird edge case:

def safeify_url(url: str) -> str:
"""Replace bits of a URL to make it hard to browse to."""
parts = urlparse(url)
scheme = "hxx" + parts.scheme[3:] # hxxp
netloc = parts.netloc.replace(".", "[.]") # google[.]com and IPv4
netloc = netloc.replace(":", "[:]") # IPv6 addresses (bad lazy method)
return urlunparse((scheme, netloc) + parts[2:])

Simple examples using the Python console:

>>> # Python 3.10
>>> safety.safeify_url('http://[Target-IP]/cgi-bin/account_mgr.cgi')
'hxxp://[Target-IP]/cgi-bin/account_mgr.cgi'
>>> # Python 3.11
>>> safety.safeify_url('http://[Target-IP]/cgi-bin/account_mgr.cgi')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dgw/github/sopel-irc/sopel/sopel/builtins/safety.py", line 129, in safeify_url
    parts = urlparse(url)
            ^^^^^^^^^^^^^
  File "/home/dgw/.pyenv/versions/3.11.10/lib/python3.11/urllib/parse.py", line 395, in urlparse
    splitresult = urlsplit(url, scheme, allow_fragments)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dgw/.pyenv/versions/3.11.10/lib/python3.11/urllib/parse.py", line 500, in urlsplit
    _check_bracketed_host(bracketed_host)
  File "/home/dgw/.pyenv/versions/3.11.10/lib/python3.11/urllib/parse.py", line 446, in _check_bracketed_host
    ip = ipaddress.ip_address(hostname) # Throws Value Error if not IPv6 or IPv4
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dgw/.pyenv/versions/3.11.10/lib/python3.11/ipaddress.py", line 54, in ip_address
    raise ValueError(f'{address!r} does not appear to be an IPv4 or IPv6 address')
ValueError: 'Target-IP' does not appear to be an IPv4 or IPv6 address

The version inconsistency is going to be the worst part of designing a "correct" fix for this. A simple fallback approach (such as return url.replace('http', 'hxxp', 1) if url.startswith('http') else url) will miss more complicated cases that are still handled fine in older Python versions (output using 3.10 shown here):

>>> safety.safeify_url('http://[Target.IP]/cgi-bin/account_mgr.cgi')  # dot gets bracketed
'hxxp://[Target[.]IP]/cgi-bin/account_mgr.cgi'
>>> safety.safeify_url('http://[Target:IP]/cgi-bin/account_mgr.cgi')  # colon gets bracketed
'hxxp://[Target[:]IP]/cgi-bin/account_mgr.cgi'

Do note though that all this is an edge case of an edge case. People must intentionally construct these invalid URLs, and can be trained to simply use another type of bracket for placeholders instead, such as http://<Target-IP>/cgi-bin/account_mgr.cgi.

@dgw dgw added the Bug Things to squish; generally used for issues label Nov 17, 2024
@half-duplex half-duplex linked a pull request Nov 21, 2024 that will close this issue
4 tasks
@half-duplex
Copy link
Member

PR opened to be more graceful about it, but considering http://<foo>/ and http://999.999.999.999/ are parsed successfully, I think this is a urllib bug?

>>> urlparse("http://<test>/")
ParseResult(scheme='http', netloc='<test>', path='/', params='', query='', fragment='')

@dgw
Copy link
Member Author

dgw commented Nov 21, 2024

but considering http://<foo>/ and http://999.999.999.999/ are parsed successfully, I think this is a urllib bug?

Square brackets are special. The reported "error" URL is actually invalid per RFC 3986 § 3.2.2:

A host identified by an Internet Protocol literal address, version 6 [RFC3513] or later, is distinguished by enclosing the IP literal within square brackets ("[" and "]"). This is the only place where square bracket characters are allowed in the URI syntax.

The only thing that's supposed to go in square brackets in a URI is an IPv6 (or later) literal, full stop. You could probably even get away with not "safeifying" these invalid links at all, since they can't be followed anyway.

And yes, I know that still leaves us with inconsistent behavior between different Python versions. But Python apparently decided urllib should be stricter about following the URI spec. 🤷‍♂️

@dgw
Copy link
Member Author

dgw commented Jan 9, 2025

Something's wonky, all right. Report came to IRC tonight that a message like CONNECT: [https://www.example.com] also trips this bug.

I'll want to test whether the PR fixes that too. It probably does.

@half-duplex
Copy link
Member

The PR does appear to fix that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Things to squish; generally used for issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants