Manual of Lua-DataFilter: pctenc

percent_encode and percent_decode - DataFilter algorithms for percent encoding and decoding


These two algorithms are part of the Lua-DataFilter package. See the overview documentation in lua-datafilter(3) for information about how to use them.

Default behaviour of percent_encode

percent_encode escapes every byte in the input which it considers to be 'unsafe', replacing it with a percent sign followed by a two-digit hexadecimal number giving the value of the original byte. Bytes which are safe are output unchanged.

The bytes which are escaped are those which don't occur in a defined set of 'safe' bytes. The default set of safe bytes is the set of US-ASCII characters which RFC 3986 leaves unreserved for use in URIs. These are all uppercase and lowercase letters, digits, and the characters hyphen (-), full stop (.), underscore (_), and tilde (~).

The hexadecimal digits A to F are always written in uppercase.
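The default behaviour described above can be sketched in a few lines of pure Lua. This is only an illustration of the rules, not the package's own implementation, and the function name sketch_percent_encode is hypothetical (it is not part of Lua-DataFilter).

```lua
-- Illustrative sketch only; sketch_percent_encode is a hypothetical name
-- and is not provided by Lua-DataFilter.
local function sketch_percent_encode(input)
    -- Any byte outside the default safe set (letters, digits, - . _ ~)
    -- is replaced by "%" plus two uppercase hexadecimal digits.
    -- The extra parentheses drop gsub's second return value (the count).
    return (input:gsub("[^A-Za-z0-9%-%._~]", function(ch)
        return string.format("%%%02X", ch:byte())
    end))
end

print(sketch_percent_encode("hello world/~user"))  --> hello%20world%2F~user
```

The space (byte 0x20) and the slash (byte 0x2F) are escaped, while the tilde is in the default safe set and passes through unchanged.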

Options for percent_encode

The default set of safe bytes can be overridden by providing the option safe_bytes, which should be a string. Each byte which occurs in this string is treated as safe and left unencoded, and every byte which isn't in the string is encoded.

An exception will be thrown if you try to declare the percent sign (%) to be safe, because that would produce an encoded value which could not be decoded again without corruption.

An exception will also be thrown if you include the same byte twice in the safe_bytes string.
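The rules for a custom safe set can be illustrated with the following pure-Lua sketch. It is not the package's implementation: the function name sketch_percent_encode_with and the error messages are hypothetical, chosen only to show the two checks described above.

```lua
-- Illustrative sketch only; the function name and the error messages
-- are hypothetical and are not taken from Lua-DataFilter.
local function sketch_percent_encode_with(input, safe_bytes)
    -- Build a lookup table of safe byte values, validating as we go.
    local safe = {}
    for i = 1, #safe_bytes do
        local b = safe_bytes:byte(i)
        if b == 37 then  -- 37 is the byte value of "%"
            error("the percent sign cannot be declared safe")
        end
        if safe[b] then
            error("byte listed more than once in safe_bytes")
        end
        safe[b] = true
    end

    -- Copy safe bytes through and escape everything else.
    local out = {}
    for i = 1, #input do
        local b = input:byte(i)
        if safe[b] then
            out[#out + 1] = string.char(b)
        else
            out[#out + 1] = string.format("%%%02X", b)
        end
    end
    return table.concat(out)
end

print(sketch_percent_encode_with("a/b c", "abc/"))  --> a/b%20c
```

Note that in this sketch the string given replaces the default set rather than extending it, so a byte such as a digit is escaped unless it is listed explicitly.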

Default behaviour of percent_decode

This will reverse the encoding of characters performed by percent_encode. It does not matter which set of safe bytes was used for the encoding: each byte is either unescaped (when it was encoded with a % character) or passed through to the output as-is.

percent_decode currently doesn't accept any options.

Whenever a percent sign appears in the input it must be followed by two hexadecimal digits. If one of the two following characters is not a hexadecimal digit, or if there aren't enough characters left at the end of the input, then an exception will be thrown.

The hexadecimal digits may be written with either uppercase or lowercase letters.
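The decoding rules above can be sketched in pure Lua as follows. Again this is only an illustration and not the package's implementation. The name sketch_percent_decode and the error message are hypothetical.

```lua
-- Illustrative sketch only; sketch_percent_decode and its error message
-- are hypothetical and are not taken from Lua-DataFilter.
local function sketch_percent_decode(input)
    local out = {}
    local i, n = 1, #input
    while i <= n do
        local ch = input:sub(i, i)
        if ch == "%" then
            -- Take the next two characters; fewer are returned if the
            -- input ends early, and the pattern check then fails.
            local hex = input:sub(i + 1, i + 2)
            if not hex:match("^%x%x$") then
                error("percent sign not followed by two hexadecimal digits")
            end
            out[#out + 1] = string.char(tonumber(hex, 16))
            i = i + 3
        else
            -- Anything else is passed through as-is.
            out[#out + 1] = ch
            i = i + 1
        end
    end
    return table.concat(out)
end

print(sketch_percent_decode("%41%6a"))  --> Aj
```

The %x pattern class accepts hexadecimal digits in either case, so "%6a" and "%6A" decode to the same byte. Input such as "%4" or "%zz" raises an error instead of being passed through.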