
URLs and percent encoding

2024-12-27 #caldav #http #webdav

My initial implementation for libdav always decoded percent-encoded paths received from a server and returned them to consumers un-encoded. Likewise, paths passed as arguments were expected to be un-encoded, and libdav would encode them itself before sending a request to the server. My mindset was “consumers should not need to worry about percent-encoding URLs, libdav can handle that internally”.

In the last couple of days, I’ve learnt that my approach was unsound.

I will use the path /path/to/theitem%2Fwithslash.ics as an example. This path refers to a resource named theitem%2Fwithslash.ics inside the collection /path/to/. However, decoding this path yields /path/to/theitem/withslash.ics, which points to the resource withslash.ics in the /path/to/theitem/ collection. Clearly these are not the same.
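The difference is easy to reproduce. The following is a minimal sketch using Python's standard urllib.parse (not libdav itself); the variable names are only for illustration. It compares decoding the whole path before splitting it into segments with splitting first and decoding each segment afterwards.

```python
from urllib.parse import unquote

path = "/path/to/theitem%2Fwithslash.ics"

# Decoding the whole path first turns the escaped slash into a separator,
# so the path gains an extra segment.
decoded_then_split = unquote(path).split("/")

# Splitting on the literal separators first keeps the resource name intact;
# each segment can then be decoded on its own for display.
split_then_decoded = [unquote(segment) for segment in path.split("/")]

print(decoded_then_split)  # ['', 'path', 'to', 'theitem', 'withslash.ics']
print(split_then_decoded)  # ['', 'path', 'to', 'theitem/withslash.ics']
```

Once the path has been decoded as a whole, there is no way to tell which slashes were separators and which were part of a name.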

URLs describe resources which don’t share the same naming limitations as regular file systems. Resource names may contain any byte sequence, including / (which needs to be escaped as %2F).

This behaviour is explicitly called out in RFC 3986, section 2.2:

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

[…] URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent. Percent-encoding a reserved character, or decoding a percent-encoded octet that corresponds to a reserved character, will change how the URI is interpreted by most applications. Thus, characters in the reserved set are protected from normalization and are therefore safe to be used by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI.

The next section (2.3) makes a clarification for unreserved characters:

URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource.

When implementing a client, it’s probably safest to treat the path component of a URL as an opaque string and not to alter it in any way. However, checking whether two paths are the same requires normalising them first: percent-encoded unreserved characters are decoded, while percent-encoded reserved characters are kept in their original, encoded form.
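As a rough sketch of such a comparison (again in Python, and not how libdav does it), the hypothetical helpers normalise_path and same_path below decode only the percent-encoded octets that map to unreserved characters. They leave every other escape encoded and only uppercase its hex digits, which RFC 3986 section 6.2.2.1 treats as equivalent. The sketch assumes the input is already a validly encoded path and does not attempt any other kind of normalisation.

```python
import re

# Unreserved characters per RFC 3986 section 2.3:
# ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
PERCENT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def normalise_path(path: str) -> str:
    """Hypothetical helper: decode only percent-encoded unreserved characters."""

    def replace(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Reserved characters and all other octets stay encoded;
        # only the case of the hex digits is normalised.
        return "%" + match.group(1).upper()

    return PERCENT_TRIPLET.sub(replace, path)


def same_path(a: str, b: str) -> bool:
    """Hypothetical helper: compare two paths after normalisation."""
    return normalise_path(a) == normalise_path(b)


print(same_path("/path/to/%7Euser/item.ics", "/path/to/~user/item.ics"))
# True: "~" is unreserved, so both spellings are equivalent
print(same_path("/path/to/theitem%2Fwithslash.ics", "/path/to/theitem/withslash.ics"))
# False: "/" is reserved, so the escape must be preserved
```

The original strings are never modified; the normalised form is only used as a comparison key, so whatever the server sent can still be passed back to it untouched.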

Double-encoding in WebDAV

In WebDAV, resources are listed inside an XML response. In this case, some characters need to be encoded as XML entities. Specifically, ", ', <, > and & need to be encoded as &quot;, &apos;, &lt;, &gt; and &amp; respectively. A path in such a response is therefore encoded twice: it is percent-encoded first, and the result is then XML-escaped. The entities will be decoded when parsing the XML, which happens in a separate layer, and is invisible to HTTP.
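The two layers can be seen by parsing a response with an XML library and inspecting the result. This is a small sketch using Python's xml.etree.ElementTree; the multistatus body is trimmed down and the href value is made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote

# A trimmed-down multistatus body; the href is a made-up example that
# contains both an XML entity (&amp;) and a percent-encoded slash (%2F).
body = """<multistatus xmlns="DAV:">
  <response>
    <href>/path/to/a&amp;b%2Fc.ics</href>
  </response>
</multistatus>"""

root = ET.fromstring(body)

# The XML parser decodes entities, but leaves percent-encoding untouched.
href = root.find("{DAV:}response/{DAV:}href").text
print(href)  # /path/to/a&b%2Fc.ics

# Percent-decoding is a separate step, applied per segment and only
# when a human-readable name is needed.
name = unquote(href.rsplit("/", 1)[-1])
print(name)  # a&b/c.ics
```

The value of href after XML parsing is the path as far as HTTP is concerned, and it is the form that should be sent back to the server.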

Have comments or want to discuss this topic?
Send an email to my public inbox: ~whynothugo/public-inbox@lists.sr.ht.
Or feel free to reply privately by email: hugo@whynothugo.nl.
