Manual of Lua-DataFilter
Lua-DataFilter - Lua 5.1 module for munging arbitrarily large amounts of data
Overview
This module provides a small selection of algorithms and a simple API for feeding them arbitrarily large amounts of data, and storing arbitrarily large amounts of output.
A complete list of the algorithms provided is given at the bottom of this document. Suggestions for other algorithms which would be appropriate here are welcome.
Loading the module
The DataFilter module doesn't install itself into any global tables, so you can decide what name you want to use to access it. You will probably want to load it like this:
local Filter = require "datafilter"
You can use a variable called something other than Filter if you'd like, or you could assign the table returned by require to a global variable. In this documentation we'll assume you're using a variable called Filter.
The simple API (string in, string out)
The most convenient way to use the algorithms is to simply call the Lua function provided for each one, passing in a string. The algorithm will return its results as a string.
print(Filter.base64_decode("Zm9vYmFy"))
print(Filter.hex_lower(Filter.md5("password")))
The name of the algorithm is the name of the Lua function you can call, which will be provided directly in the table returned by require "datafilter".
Algorithm options
Some algorithms accept options, which you can provide as a table as the second argument to their functions.
local options = { include_padding = true }
print(Filter.base64_encode("frob", options))
options.include_padding = false
print(Filter.base64_encode("frob", options))
local data = ("foobar"):rep(20)
print(Filter.base64_encode(data, { max_line_length = 76 }))
The options you can use for each algorithm are described in its documentation.
Processing large amounts of input
If the input data might be too large to load into a string, or if you want to start processing data before all of it has arrived, you can create a DataFilter object and feed input to it in chunks. Call :new to create the object, passing in the name of the algorithm (which is the same as the name of the simple functions described above). Use the add method to feed it the contents of a string, and addfile to feed in a whole file.
When you're finished adding input, the output is available as a string from the result method. You can call result more than once if necessary, and it will return the same string each time, but once you've called it the processing is finished, so you can't add more input.
local obj = Filter:new("md5")
obj:add("string data")
obj:add("more string data")
obj:addfile("filename")
print(Filter.hex_lower(obj:result()))
The addfile method can take a filename or a Lua file handle which has already been opened for reading. If it's a file handle, it will be read until there is no more data. The DataFilter object won't close the file for you.
local obj = Filter:new("md5")
local fh = assert(io.open("filename", "rb"))
obj:addfile(fh)
fh:close()
print(Filter.hex_lower(obj:result()))
A file handle given to addfile can actually be any object which has a read method, so you can create custom objects which emulate file handles. The read method will be the only thing called by addfile. It will only be called with a number, which indicates the maximum number of bytes it should return. The method should always return one of the following:
- data
A string, no longer than the number specified.
- nil
Indicates end of file. The read method won't be called anymore after it returns this.
- nil, error-message
The error message should be a string. This will cause addfile to throw an exception using the message.
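The read contract above can be sketched with a small object that serves a string in chunks. The helper name new_string_reader is hypothetical and not part of DataFilter. The DataFilter calls are guarded with pcall so the sketch still runs where the module isn't installed.

```lua
-- Hypothetical helper (not part of DataFilter): wraps a string in an
-- object whose read method hands the data out in chunks.
local function new_string_reader (data)
    local pos = 1
    local reader = {}
    function reader:read (maxbytes)
        if pos > #data then return nil end      -- end of input
        -- Never return more than the number of bytes asked for.
        local chunk = data:sub(pos, pos + maxbytes - 1)
        pos = pos + #chunk
        return chunk
    end
    return reader
end

-- Standalone check of the contract, without DataFilter.
local reader = new_string_reader("foobar")
local first = reader:read(4)
local second = reader:read(4)
local third = reader:read(4)

-- Feed the emulated file handle to DataFilter, if it is available.
local ok, Filter = pcall(require, "datafilter")
if ok then
    local obj = Filter:new("md5")
    obj:addfile(new_string_reader("string data"))
    print(Filter.hex_lower(obj:result()))
end
```

To report a failure, a read method would return nil followed by a message string instead of a chunk.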
Producing large amounts of output
Just as you can use the object-oriented interface to provide arbitrary amounts of input data, you can also send data to an output stream to cope with arbitrary amounts of it. The simplest kind of output stream is just a filename, which will be written to whenever the object has more output to send.
The output stream is always the second argument to :new(), after the name of the algorithm.
local obj = Filter:new("base64_encode", "output-filename")
obj:add("input string\n")
obj:addfile("input-filename")
obj:finish()
The finish method indicates that all the input data has now been provided, and causes all the remaining output to be sent. After that the output file will be closed. You can't add more input after calling finish. The finish method will be called automatically when the object is garbage collected, but it's usually a good idea to call it explicitly, because it may take some time before the garbage collector gets round to collecting the object.
You can use a file handle as an output stream instead of a filename, and the file handle can also be an object which emulates a Lua file handle. In that case it must be an object (table or userdata) which has a write method. This method will be called with a string each time more data is ready.
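As a rough sketch of an emulated output handle, the object below collects everything written to it in a table. The names new_collector and contents are hypothetical. Returning true from write mirrors what Lua 5.1 file handles do on success. This manual doesn't say whether DataFilter inspects that return value, so treat it as an assumption.

```lua
-- Hypothetical helper (not part of DataFilter): an object with a write
-- method which stores each chunk of output it is given.
local function new_collector ()
    local collector = { chunks = {} }
    function collector:write (str)
        self.chunks[#self.chunks + 1] = str
        return true     -- assumption: mimic a successful file:write
    end
    function collector:contents ()
        return table.concat(self.chunks)
    end
    return collector
end

-- Standalone check, without DataFilter.
local out = new_collector()
out:write("Zm9v")
out:write("YmFy")

-- Use the object as an output stream, if DataFilter is available.
local ok, Filter = pcall(require, "datafilter")
if ok then
    local sink = new_collector()
    local obj = Filter:new("base64_encode", sink)
    obj:add("input string\n")
    obj:finish()
    print(sink:contents())
end
```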
Finally, the output stream can be a Lua function. This will be called directly with a string when output is ready to be sent.
local function output_callback (str)
    print("more output: " .. str)
end
local obj = Filter:new("base64_encode", output_callback)
obj:add("input string\n")
obj:finish()
Passing options to the OO API
If you're using the object-oriented interface to DataFilter, you can still pass a table of options to the algorithm you're using when you call the constructor:
local obj = Filter:new("base64_encode", "output-filename",
{ max_line_length = 76 })
obj:add("input string\n")
obj:addfile("input-filename")
obj:finish()
If you want to provide options, but not an output stream, you can just give nil as the second argument.
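For instance, the following sketch wraps Base64 output at 76 characters while still collecting the result as a string. The function name encode_wrapped is hypothetical. The code only runs the DataFilter part if the module can be loaded.

```lua
-- Hypothetical helper: encode data with line wrapping, returning a string.
-- Passing nil as the output stream means the result method supplies the
-- output.
local function encode_wrapped (Filter, data)
    local obj = Filter:new("base64_encode", nil, { max_line_length = 76 })
    obj:add(data)
    return obj:result()
end

local ok, Filter = pcall(require, "datafilter")
if ok then
    print(encode_wrapped(Filter, ("foobar"):rep(20)))
end
```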
Algorithms
These are the names of the algorithms provided by the DataFilter package at this time. Each name can be called in the ways described above, either as a simple function in the table returned by require, or passed to the :new() method when using the OO interface.
- base64_decode, base64_encode
Decode ASCII text to binary data or encode binary data as plain text, using the Base64 algorithm given in RFC 4648. See lua-datafilter-base64(3) for details and available options.
- hex_decode
Textual input consisting of hexadecimal numbers is decoded into binary data. This is the reverse of the hex_lower and hex_upper functions. This can be used to handle some encoded binary files stored in PostScript or PDF documents, as well as for decoding large hex numbers such as SHA1 hashes.
Each pair of hexadecimal digits is decoded into one byte. Whitespace characters are ignored. Any other character in the input will cause an error, as will an odd number of hexadecimal characters.
- hex_lower, hex_upper
Binary input data is encoded as a series of hexadecimal digits, using either lowercase or uppercase letters for the digits 10–15. These algorithms don't have any options, and there isn't any input they consider to be invalid. The hex_decode function will reverse the operation of either of these functions.
- percent_decode, percent_encode
Do percent encoding and decoding as defined by RFC 3986. This is also often known as 'URI encoding' or 'URI escaping'. See lua-datafilter-pctenc(3) for details and available options.
- qp_decode, qp_encode
Encode or decode text using the quoted-printable encoding defined in RFC 2045, which makes it safe for transit through baroque email systems. See lua-datafilter-qp(3) for details and available options.
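As an illustration of how the hex algorithms listed above fit together, this sketch encodes a few bytes and decodes them again. The sample data is made up. How invalid input is reported (a thrown error is assumed here) isn't spelled out in this manual, so the last call is wrapped in pcall and its result is only printed.

```lua
-- Made-up sample data containing bytes which aren't printable text.
local binary = "\0\1\254\255"

local ok, Filter = pcall(require, "datafilter")
if ok then
    local hex = Filter.hex_lower(binary)
    print(hex)

    -- hex_decode reverses both the lowercase and uppercase encoders.
    assert(Filter.hex_decode(hex) == binary)
    assert(Filter.hex_decode(Filter.hex_upper(binary)) == binary)

    -- Whitespace between the digits is ignored.
    assert(Filter.hex_decode("00 01\nfe ff") == binary)

    -- An odd number of digits is invalid (assumption: reported by
    -- throwing an error, which pcall catches).
    print(pcall(Filter.hex_decode, "abc"))
end
```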
There are also the following message digest, or hashing algorithms, which all behave in the same basic way. None of them take any options, and they all produce a small amount of binary output. None of them produce any output until all the input data has been read. Usually, you'll want to feed the output into the base64_encode or hex_lower algorithm to get a human-readable result.
- adler32
Returns a 4 byte checksum. The algorithm is given in RFC 1950.
- md5
Returns a 16 byte message digest using the algorithm from RFC 1321.
- sha1
Returns a 20 byte message digest using the algorithm from RFC 3174.
Currently all the message digest algorithms are limited to input which is a multiple of 8 bits long (that is, you can only feed in bytes, not bits).
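A minimal sketch of turning each digest into readable hexadecimal is shown below. The helper name hex_digest and the table digest_sizes are hypothetical. The sizes are the byte counts listed above, so each hex string should be twice as long.

```lua
-- Digest sizes in bytes, as listed in this manual.
local digest_sizes = { adler32 = 4, md5 = 16, sha1 = 20 }

-- Hypothetical helper: hash a string and return lowercase hex digits.
local function hex_digest (Filter, algorithm, data)
    local obj = Filter:new(algorithm)
    obj:add(data)
    return Filter.hex_lower(obj:result())
end

local ok, Filter = pcall(require, "datafilter")
if ok then
    for name, size in pairs(digest_sizes) do
        local hex = hex_digest(Filter, name, "password")
        assert(#hex == size * 2)    -- two hex digits per byte
        print(name, hex)
    end
end
```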
Copyright
This software and documentation is Copyright © 2007–2012 Geoff Richards <geoff at this domain dot co dot uk>. It is free software; you can redistribute it and/or modify it under the terms of the Lua 5.0 license. The full terms are given in the file COPYRIGHT supplied with the source code package, and are also available here: http://www.lua.org/license.html
The MD5 implementation was originally derived from the one in the Lua-MD5 module from the Kepler project, which is by Roberto Ierusalimschy and Marcela Ozro Suarez, and Copyright © 2003–2007 PUC-Rio. This version has been extensively modified to fit into the DataFilter architecture, so any bugs are likely my fault.