My initial implementation for libdav was to always decode percent-encoded paths
received from a server, and return them to consumers un-encoded. Likewise, paths
passed as arguments were expected to be un-encoded, and libdav would encode them
itself before sending a request to the server. My mindset was “consumers should
not need to worry about percent-encoding URLs, libdav can handle that
internally”.
In the last couple of days, I’ve learnt that my approach was unsound.
I will use the path /path/to/theitem%2Fwithslash.ics as an example. This path
refers to a resource named theitem%2Fwithslash.ics inside the collection
/path/to/. However, decoding this path yields /path/to/theitem/withslash.ics,
which points to the resource withslash.ics in the /path/to/theitem/ collection.
Clearly these are not the same.
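The difference can be reproduced in a few lines. This is only a sketch in Python using the standard library's urllib.parse.unquote, not libdav's actual code; the variable names are made up for illustration.

```python
from urllib.parse import unquote

path = "/path/to/theitem%2Fwithslash.ics"

# Split on the literal "/" separators first, then decode each segment.
# The escaped slash stays inside the resource name.
segments = [unquote(s) for s in path.split("/") if s]

# Decode the whole path first, then split.
# The escaped slash has become a separator and the hierarchy has changed.
naive_segments = [s for s in unquote(path).split("/") if s]

print(segments)        # three segments: the last one contains a "/"
print(naive_segments)  # four segments: "theitem" now looks like a collection
```

Once the path has been decoded as a whole, there is no way to tell which slashes were separators and which were part of a name.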
URLs describe resources which don’t share the same naming limitations as regular
file systems. Resource names may contain any byte sequence, including /
(which needs to be escaped as %2F).
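Going in the other direction, a name containing a slash has to be encoded as a single segment before it is appended to a collection path. A minimal sketch, assuming a hypothetical helper called build_path (not part of libdav) and Python's urllib.parse.quote:

```python
from urllib.parse import quote


def build_path(collection: str, name: str) -> str:
    """Append a resource name to an already-encoded collection path.

    Hypothetical helper for illustration only.
    """
    # quote() leaves "/" untouched by default; safe="" forces it to be
    # escaped as %2F so that it cannot be mistaken for a separator.
    return collection.rstrip("/") + "/" + quote(name, safe="")


href = build_path("/path/to/", "theitem/withslash.ics")
print(href)
```

Only the name is encoded here. The collection path is assumed to be encoded already and is passed through untouched.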
This behaviour is explicitly called out in RFC 3986, section 2.2:
reserved    = gen-delims / sub-delims
gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
[…] URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent. Percent-encoding a reserved character, or decoding a percent-encoded octet that corresponds to a reserved character, will change how the URI is interpreted by most applications. Thus, characters in the reserved set are protected from normalization and are therefore safe to be used by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI.
The next section makes a clarification for unreserved characters:
URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource.
When implementing a client, it’s probably safest to treat the path component of a URL as an opaque string and not alter it in any way. However, checking whether two paths are the same requires normalising them first: percent-encoded unreserved characters are decoded, while every other percent-encoded octet, including reserved characters, is kept in its encoded form.
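A comparison along these lines could look like the sketch below. The function names normalise_path and same_path are made up for illustration and are not libdav's API. Besides decoding unreserved characters, the sketch also upper-cases the hex digits of the escapes it keeps, which RFC 3986 section 6.2.2.1 describes as an equivalent form.

```python
import re

# The unreserved set from RFC 3986, section 2.3.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_PCT_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalise_path(path: str) -> str:
    """Decode escaped unreserved characters; keep all other escapes."""

    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Reserved characters and everything else stay encoded.
        return "%" + match.group(1).upper()

    return _PCT_ESCAPE.sub(replace, path)


def same_path(a: str, b: str) -> bool:
    """Compare two paths after normalising both."""
    return normalise_path(a) == normalise_path(b)


# %74 is "t" (unreserved), so these two are equivalent.
print(same_path("/path/to/%74heitem%2fwithslash.ics",
                "/path/to/theitem%2Fwithslash.ics"))
# %2F is "/" (reserved), so these two are not.
print(same_path("/path/to/theitem%2Fwithslash.ics",
                "/path/to/theitem/withslash.ics"))
```

The normalised form is only used for the comparison. The path that gets sent back to the server is still the original, unaltered string.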
Double-encoding in WebDAV
In WebDAV, resources are listed inside an XML response. In this case, some
characters need to be encoded as XML entities. Specifically, ", ', <, > and &
need to be encoded as &quot;, &apos;, &lt;, &gt; and &amp; respectively. These
will be decoded when parsing the XML, which happens in a separate layer, and is
invisible to HTTP.
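The two layers can be seen side by side in a small sketch. The response body below is invented for illustration, and the parser is Python's xml.etree.ElementTree rather than whatever libdav uses.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote

# A made-up fragment of a multistatus response. The href carries an XML
# entity (&amp;) and a percent-encoded slash (%2F) at the same time.
body = (
    '<multistatus xmlns="DAV:"><response>'
    "<href>/path/to/tom&amp;jerry%2Fpilot.ics</href>"
    "</response></multistatus>"
)

root = ET.fromstring(body)

# The XML parser resolves the entity, but leaves percent-encoding alone.
href = root.find("{DAV:}response/{DAV:}href").text
print(href)

# Percent-decoding is a separate step, applied per segment and only when a
# human-readable name is actually needed.
name = unquote(href.rsplit("/", 1)[1])
print(name)
```

The value of href after XML parsing is what belongs in a follow-up HTTP request. The decoded name is only suitable for display.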