FILTER_VALIDATE_URL returns false when underscore present in URL #17842

Open
eelkefierstra opened this issue Feb 17, 2025 · 4 comments


@eelkefierstra

Description

The following code:

<?php
var_dump(filter_var('https://sub_domain.example.com', FILTER_VALIDATE_URL));
var_dump(filter_var('https://ex_ample.com', FILTER_VALIDATE_URL));

Resulted in this output:

bool(false)
bool(false)

But I expected this output instead:

string(30) "https://sub_domain.example.com"
string(20) "https://ex_ample.com"

The underscore is a valid character according to RFC 2396, section 2.3:

Unreserved Characters

Data characters that are allowed in a URI but do not have a reserved
purpose are called unreserved. These include upper and lower case
letters, decimal digits, and a limited set of punctuation marks and
symbols.

  unreserved  = alphanum | mark

  mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

Unreserved characters can be escaped without changing the semantics
of the URI, but this should not be done unless the URI is being used
in a context that does not allow the unescaped character to appear.

But this filter fails if an underscore is present in the domain or subdomain portion of the URL.

This RFC is superseded by RFC 3986, but the underscore is still in the unreserved characters:

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
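To narrow down where the underscore matters, the following is a minimal sketch that runs the same filter over underscores in different URL components. Only the first two URLs come from this report; the other two are hypothetical samples added for comparison.

```php
<?php
// Only the first two URLs are from the report; the last two are
// hypothetical samples that place the underscore outside the host.
$urls = [
    'https://sub_domain.example.com',   // underscore in a subdomain label
    'https://ex_ample.com',             // underscore in the domain label
    'https://example.com/some_path',    // underscore in the path
    'https://example.com/?some_key=1',  // underscore in the query
];

$results = [];
foreach ($urls as $url) {
    // filter_var() returns the input string on success and false on failure.
    $results[$url] = filter_var($url, FILTER_VALIDATE_URL);
    printf("%-34s %s\n", $url, $results[$url] === false ? 'rejected' : 'accepted');
}
```

If the underscore were rejected as a character in general, all four URLs would fail; a difference between the host cases and the path/query cases points at the host component specifically.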

PHP Version

PHP 8.4.4

Operating System

No response

@nielsdos
Member

If we look at your URI, it's an absolute URI:

absoluteURI   = scheme ":" ( hier_part | opaque_part )

And it has to match hier_part:

hier_part     = ( net_path | abs_path ) [ "?" query ]

We have a double slash, so it's a net_path:

net_path      = "//" authority [ abs_path ]
authority     = server | reg_name

It seems that the filter extension only checks the server part (i.e. it'll check for hostname), but not reg_name.

@cmb69
Member

cmb69 commented Feb 17, 2025

Blame parse_url(): https://3v4l.org/elAkGk

@eelkefierstra
Author

eelkefierstra commented Feb 17, 2025

@nielsdos Thanks for reading further into the specification than I initially did.

Section 3.2.1 specifies the "Registry-based Naming Authority" with the reg_name. But the PHP flag asks to verify a URL, which is detailed under section 3.2.2 of the RFC:

URL schemes that involve the direct use of an IP-based protocol to a
specified server on the Internet use a common syntax for the server
component of the URI's scheme-specific data:

  <userinfo>@<host>:<port>

Where the server part of authority is specified as

server        = [ [ userinfo "@" ] hostport ]

And hostport is

  hostport      = host [ ":" port ]
  host          = hostname | IPv4address
  hostname      = *( domainlabel "." ) toplabel [ "." ]
  domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
  toplabel      = alpha | alpha *( alphanum | "-" ) alphanum

Which means that strictly following the RFC, domains cannot contain underscores (they work in DNS and on the internet, but not as specified in this RFC).
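The hostname grammar quoted above can be transcribed into a regular expression to check this by hand. This is only a sketch of the RFC 2396 productions, not how the filter extension is implemented; the function name `is_rfc2396_hostname` is made up for illustration.

```php
<?php
// Hypothetical helper: a direct transcription of the RFC 2396 productions
//   hostname    = *( domainlabel "." ) toplabel [ "." ]
//   domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
//   toplabel    = alpha    | alpha    *( alphanum | "-" ) alphanum
function is_rfc2396_hostname(string $host): bool
{
    $domainlabel = '[a-z0-9](?:[a-z0-9-]*[a-z0-9])?';
    $toplabel    = '[a-z](?:[a-z0-9-]*[a-z0-9])?';
    // \z anchors at the very end, so a trailing newline is not accepted.
    $pattern = '/\A(?:' . $domainlabel . '\.)*' . $toplabel . '\.?\z/i';

    return preg_match($pattern, $host) === 1;
}

var_dump(is_rfc2396_hostname('sub_domain.example.com')); // bool(false)
var_dump(is_rfc2396_hostname('ex_ample.com'));           // bool(false)
var_dump(is_rfc2396_hostname('sub-domain.example.com')); // bool(true)
var_dump(is_rfc2396_hostname('example.com.'));           // bool(true)
```

Neither label production includes "_", so any host with an underscore falls outside `hostname`, while hyphens are allowed as long as they are not the first or last character of a label.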

So the PHP filter seems to be correct for filtering valid URLs, strictly following the RFC.
And as @cmb69 shows, if you want a URL to be checked loosely, you can use parse_url().

@cmb69
Member

cmb69 commented Feb 17, 2025

> And as @cmb69 shows, if you want a URL to be checked loosely, you can use parse_url().

To clarify: FILTER_VALIDATE_URL is implemented by first parsing the URL (same implementation as parse_url()), and then applying some additional checks to reject invalid URLs.
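The two stages can be observed separately from userland. The sketch below is an approximation under stated assumptions: `describe_url` is a hypothetical helper, and `FILTER_VALIDATE_DOMAIN` with `FILTER_FLAG_HOSTNAME` is used as a stand-in for the additional host check, since the thread does not say which internal check rejects the underscore.

```php
<?php
// Hypothetical helper, not part of PHP: reports what each stage says.
function describe_url(string $url): array
{
    // Stage 1: lenient parsing, the same parser FILTER_VALIDATE_URL starts from.
    $host = parse_url($url, PHP_URL_HOST);

    // Stage 2 (assumed stand-in): strict hostname validation of the parsed host.
    $hostnameOk = is_string($host)
        && filter_var($host, FILTER_VALIDATE_DOMAIN, FILTER_FLAG_HOSTNAME) !== false;

    return [
        'parsed_host'  => $host,
        'hostname_ok'  => $hostnameOk,
        'validate_url' => filter_var($url, FILTER_VALIDATE_URL) !== false,
    ];
}

var_dump(describe_url('https://sub_domain.example.com'));
var_dump(describe_url('https://sub-domain.example.com'));
```

For the underscore URL, parse_url() still extracts the host, so the rejection comes from the stricter check applied afterwards rather than from parsing itself.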
