Wednesday, November 26, 2008

Squid underscore

http://www.squid-cache.org/mail-archive/squid-users/200208/0565.html
To make Squid accept any characters in host names, comment out the
block in url.c that starts with

if (strspn(host, valid_hostname_chars) != strlen(host)) {

To allow underscores, you can compile Squid with the
--enable-underscores option.
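As a rough illustration of what that check does, here is a minimal
sketch in C. The two character sets and the helper name host_chars_ok
are assumptions made for this example, not copied from Squid's source;
only the strspn()/strlen() comparison mirrors the line quoted above.

```c
#include <string.h>

/* Assumed character sets for illustration; the real definitions live in
 * Squid's source and may differ. */
static const char hostname_chars_strict[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-.";

static const char hostname_chars_underscore[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._";

/* Hypothetical helper: returns 1 if every character of host is in
 * allowed, 0 otherwise. strspn() gives the length of the leading run of
 * allowed characters, so the host is clean only if that run covers the
 * whole string. */
static int host_chars_ok(const char *host, const char *allowed)
{
    return strspn(host, allowed) == strlen(host);
}
```

With the strict set, host_chars_ok("my_host.example.com", ...) fails at
the underscore; with the underscore set it passes, which is roughly the
effect one would expect from building with --enable-underscores.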

Some notes on the standards of hostnames, HTTP, DNS etc:

Note 1: The HTTP specification (RFC2616) is rather strict about the
characters allowed in HTTP URLs:

From RFC2396 "URI" (referenced by RFC2616 "HTTP/1.1"):

URL schemes that involve the direct use of an IP-based protocol to a
specified server on the Internet use a common syntax for the server
component of the URI's scheme-specific data:

<userinfo>@<host>:<port>

...

hostport = host [ ":" port ]
host = hostname | IPv4address
hostname = *( domainlabel "." ) toplabel [ "." ]
domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
toplabel = alpha | alpha *( alphanum | "-" ) alphanum

As seen, this does not allow underscores in Internet URL host names.
Labels may contain only A-Z, 0-9 and "-" (case insensitive), separated
by ".".
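The grammar above can be checked mechanically. The following is a
sketch only: the function names label_ok and rfc2396_hostname_ok are
made up for this example, it covers the "hostname" production but not
"IPv4address", and it assumes the default "C" locale so that isalpha()
and isalnum() match ASCII letters and digits only.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Check one label of length n. A domainlabel starts and ends with an
 * alphanum; a toplabel must start with an alpha. Interior characters
 * may be alphanum or "-". */
static int label_ok(const char *s, size_t n, int is_top)
{
    size_t i;

    if (n == 0)
        return 0;
    if (is_top ? !isalpha((unsigned char)s[0])
               : !isalnum((unsigned char)s[0]))
        return 0;
    if (!isalnum((unsigned char)s[n - 1]))
        return 0;
    for (i = 1; i + 1 < n; i++) {
        unsigned char c = (unsigned char)s[i];
        if (!isalnum(c) && c != '-')
            return 0;
    }
    return 1;
}

/* hostname = *( domainlabel "." ) toplabel [ "." ] */
static int rfc2396_hostname_ok(const char *host)
{
    size_t len = strlen(host);
    size_t start = 0;
    size_t i;

    if (len > 0 && host[len - 1] == '.')
        len--;                      /* optional trailing dot */
    if (len == 0)
        return 0;
    for (i = 0; i <= len; i++) {
        if (i == len || host[i] == '.') {
            /* the label ending at i is the toplabel when i == len */
            if (!label_ok(host + start, i - start, i == len))
                return 0;
            start = i + 1;
        }
    }
    return 1;
}
```

Under these assumptions "www.example.com" and "www.example.com." are
accepted, while "my_host.example.com" is rejected because "_" is neither
an alphanum nor "-".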

Note 2: What restricts hostnames is not the DNS specification, but the
"Requirements for Internet hosts" specification Internet STD 0003,
specifically RFC1123 section 2.1. This too does not allow underscores,
and is from what I can tell identical to the rules above.

Note 3: DNS as such allows any characters in the DNS protocol and from
what I understand has always done so even if the specification has been
a bit ambiguous on this. To DNS a "domain name label" is just a binary
sequence. What RFC2181 clarifies is that the DNS protocol does not by
itself put any restrictions on the labels used within DNS. DNS is just a
protocol for resolving namespaces, not a namespace definition. DNS does
not define the namespace of Internet hosts, only a protocol that can be
used to resolve names within the Internet host namespace and optionally
other namespaces. The fact that DNS allows for "any kind of data" does
not say that it is allowed to use "any kind of data" for Internet host
names.
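To make the "binary sequence" point concrete, here is a small sketch of
how a dotted name is laid out on the wire in DNS: each label is a length
octet followed by that many raw octets, and the name ends with a zero
octet for the root. The function name dns_encode_name is made up for
this example. Note that the encoder never inspects the label contents;
it enforces only the size limits (63 octets per label, 255 per name).

```c
#include <stddef.h>
#include <string.h>

/* Encode a dotted name into DNS wire format. Returns the number of
 * octets written to out, or 0 on error (empty label, label longer than
 * 63 octets, name longer than 255 octets, or buffer too small). */
static size_t dns_encode_name(const char *name, unsigned char *out,
                              size_t cap)
{
    size_t pos = 0;
    const char *p = name;

    while (*p != '\0') {
        const char *dot = strchr(p, '.');
        size_t n = dot ? (size_t)(dot - p) : strlen(p);

        if (n == 0 || n > 63)
            return 0;
        if (pos + 1 + n + 1 > cap)  /* keep room for the root octet */
            return 0;
        out[pos++] = (unsigned char)n;  /* length octet */
        memcpy(out + pos, p, n);        /* label octets, copied as-is */
        pos += n;
        p += n;
        if (*p == '.')
            p++;
    }
    if (pos + 1 > cap || pos + 1 > 255)
        return 0;
    out[pos++] = 0;                     /* root label terminates the name */
    return pos;
}
```

Encoding "my_host.example" this way succeeds, underscore included, which
is consistent with the point above: the restriction on host names comes
from the host name rules, not from the DNS encoding.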

Note 4: There is an ongoing task within IETF to standardise how to use
national characters etc in Internet hostnames, but there still is lots
to define before the goal is reached. The long-term goal has been
defined as moving to UTF8 in all Internet protocols, and I think an
intermediate wrapper in the application layer has also finally been
defined, transforming "national host names" to/from "current hostname
syntax conformant names", to be used until the transition to UTF8 can
be done. As the HTTP specification has not yet been revised to support
UTF8, applications wishing to use "national host names" SHOULD use such
a translation layer when using such host names within HTTP. Also, I am
uncertain about the fate of underscores in this context.

Note 5: Other standard Internet applications such as E-Mail also put
restrictions on the allowable syntax for hostnames within the protocol,
in addition to STD 0003. HTTP is not alone.
