Manual of Lua-DataFilter

Lua-DataFilter - Lua 5.1 module for munging arbitrarily large amounts of data

Overview

This module provides a small selection of algorithms and a simple API for feeding them arbitrarily large amounts of data, and storing arbitrarily large amounts of output.

A complete list of the algorithms provided is given at the bottom of this document. Suggestions for other algorithms which would be appropriate here are welcome.

Loading the module

The DataFilter module doesn't install itself into any global tables, so you can decide what name you want to use to access it. You will probably want to load it like this:

local Filter = require "datafilter"

You can use a variable called something other than Filter if you'd like, or you could assign the table returned by require to a global variable. In this documentation we'll assume you're using a variable called Filter.

The simple API (string in, string out)

The most convenient way to use the algorithms is to simply call the Lua function provided for each one, passing in a string. The algorithm will return its results as a string.

print(Filter.base64_decode("Zm9vYmFy"))

print(Filter.hex_lower(Filter.md5("password")))

The name of the algorithm is the name of the Lua function you can call, which will be provided directly in the table returned by require "datafilter".

Algorithm options

Some algorithms accept options, which you can provide as a table as the second argument to their functions.

local options = { include_padding = true }
print(Filter.base64_encode("frob", options))
options.include_padding = false
print(Filter.base64_encode("frob", options))

local data = ("foobar"):rep(20)
print(Filter.base64_encode(data, { max_line_length = 76 }))

The options you can use for each algorithm are described in its documentation.

Processing large amounts of input

If the input data might be too large to load into a string, or if you want to start processing data before all of it has arrived, you can create a DataFilter object and feed input to it in chunks. Call Filter:new to create the object, passing in the name of the algorithm (the same name as the corresponding simple function described above). Use the add method to feed it the contents of a string, and addfile to feed in a whole file.

When you're finished adding input, the output is available as a string from the result method. You can call result more than once if necessary, and it will return the same string each time, but once you've called it the processing is finished, so you can't add more input.

local obj = Filter:new("md5")

obj:add("string data")
obj:add("more string data")

obj:addfile("filename")

print(Filter.hex_lower(obj:result()))

The addfile method can take a filename or a Lua file handle which has already been opened for reading. If it's a file handle, it will be read until there is no more data. The DataFilter object won't close the file for you.

local obj = Filter:new("md5")

local fh = assert(io.open("filename", "rb"))
obj:addfile(fh)
fh:close()

print(Filter.hex_lower(obj:result()))

A file handle given to addfile can actually be any object which has a read method, so you can create custom objects which emulate file handles. The read method will be the only thing called by addfile. It will only be called with a number, which indicates the maximum number of bytes it should return. The method should always return one of the following:

data

A string, no longer than the number specified.

nil

Indicates end of file. The read method won't be called anymore after it returns this.

nil, error-message

The error message should be a string. This will cause addfile to throw an exception using the message.
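
As an illustration of the protocol above, here is a minimal sketch of an object that emulates a file handle by serving data from a string. The name string_reader is made up for this example, and the sketch assumes addfile calls read with method syntax, as it would on a real file handle. The usage at the end is guarded so that it only runs where the module is installed.

```lua
-- Hypothetical helper: wraps a string in an object with a read method
-- that follows the protocol described above.
local function string_reader(data)
    local pos = 1
    local reader = {}

    -- Assumption: called as reader:read(max_bytes), like a real file handle.
    function reader:read(max_bytes)
        if pos > #data then
            return nil                  -- end of input
        end
        -- Never returns more than max_bytes bytes.
        local chunk = data:sub(pos, pos + max_bytes - 1)
        pos = pos + #chunk
        return chunk
    end

    return reader
end

-- Usage, skipped where the module is not available.
local ok, Filter = pcall(require, "datafilter")
if ok then
    local obj = Filter:new("md5")
    obj:addfile(string_reader("string data"))
    print(Filter.hex_lower(obj:result()))
end
```

A reader that wants to signal a failure would return nil followed by a message string instead of a chunk.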

Producing large amounts of output

Just as you can use the object-oriented interface to provide arbitrary amounts of input data, you can also send data to an output stream to cope with arbitrary amounts of it. The simplest kind of output stream is just a filename, which will be written to whenever the object has more output to send.

The output stream is always the second argument to :new(), after the name of the algorithm.

local obj = Filter:new("base64_encode", "output-filename")

obj:add("input string\n")
obj:addfile("input-filename")

obj:finish()

The finish method indicates that all the input data has now been provided, and causes all the remaining output to be sent. After that the output file will be closed. You can't add more input after calling finish. The finish method will be called automatically when the object is garbage collected, but it's usually a good idea to call it explicitly, because it may take some time before the garbage collector gets round to collecting the object.

You can use a file handle as an output stream instead of a filename, and the file handle can also be an object which emulates a Lua file handle. In that case it must be an object (table or userdata) which has a write method. This method will be called with a string each time more data is ready.
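
For example, an object which emulates a file handle for output could simply collect whatever it is given in memory. This is only a sketch: new_sink and its contents method are names invented here, and it assumes the write method is called with method syntax, as it would be on a real file handle.

```lua
-- Hypothetical helper: an in-memory output stream with a write method.
local function new_sink()
    local sink = { chunks = {} }

    -- Assumption: called as sink:write(str) each time output is ready.
    function sink:write(str)
        self.chunks[#self.chunks + 1] = str
        return true
    end

    -- Not part of the file handle protocol, just a way to get the data back.
    function sink:contents()
        return table.concat(self.chunks)
    end

    return sink
end

-- Usage, skipped where the module is not available.
local ok, Filter = pcall(require, "datafilter")
if ok then
    local sink = new_sink()
    local obj = Filter:new("base64_encode", sink)
    obj:add("input string\n")
    obj:finish()
    print(sink:contents())
end
```

Collecting the pieces in a table and joining them once at the end avoids repeatedly copying a growing string.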

Finally, the output stream can be a Lua function. This will be called directly with a string when output is ready to be sent.

local function output_callback (str)
    print("more output: " .. str)
end

local obj = Filter:new("base64_encode", output_callback)
obj:add("input string\n")
obj:finish()

Passing options to the OO API

If you're using the object-oriented interface to DataFilter, you can still pass a table of options to the algorithm you're using when you call the constructor:

local obj = Filter:new("base64_encode", "output-filename",
                       { max_line_length = 76 })

obj:add("input string\n")
obj:addfile("input-filename")

obj:finish()

If you want to provide options, but not an output stream, you can just give nil as the second argument.
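
A small sketch of that case, wrapped in a function so the pieces are easy to see. The name encode_wrapped is invented for this example, and its Filter parameter is assumed to be the table returned by require "datafilter".

```lua
-- Hypothetical helper: Base64-encode a string with wrapped lines using
-- the OO interface, with no output stream.
local function encode_wrapped(Filter, data, width)
    -- nil takes the place of the output stream, so the output is kept
    -- by the object and handed back by result().
    local obj = Filter:new("base64_encode", nil, { max_line_length = width })
    obj:add(data)
    return obj:result()
end
```

You might call it as encode_wrapped(require "datafilter", data, 76).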

Algorithms

These are the names of the algorithms provided by the DataFilter package at this time. Each name can be called in the ways described above, either as a simple function in the table returned by require, or passed to the :new() method when using the OO interface.

base64_decode, base64_encode

Decode ASCII text to binary data or encode binary data as plain text, using the Base64 algorithm given in RFC 4648. See lua-datafilter-base64(3) for details and available options.

hex_decode

Textual input consisting of hexadecimal numbers is decoded into binary data. This is the reverse of the hex_lower and hex_upper functions. This can be used to handle some encoded binary files stored in PostScript or PDF documents, as well as for decoding large hex numbers such as SHA1 hashes.

Each pair of hexadecimal digits is decoded into one byte. Whitespace characters are ignored. Any other character in the input will cause an error, as will an odd number of hexadecimal characters.

hex_lower, hex_upper

Binary input data is encoded as a series of hexadecimal digits, using either lowercase or uppercase letters for the digits 10–15. These algorithms don't have any options, and there isn't any input they consider to be invalid. The hex_decode function will reverse the operation of either of these functions.

percent_decode, percent_encode

Do percent encoding and decoding as defined by RFC 3986. This is also often known as 'URI encoding' or 'URI escaping'. See lua-datafilter-pctenc(3) for details and available options.

qp_decode, qp_encode

Encode text using the quoted-printable encoding defined in RFC 2045, to make it safe for transit through baroque email systems, or decode text which has been encoded that way. See lua-datafilter-qp(3) for details and available options.
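
To make the rules given above for the hex algorithms concrete, here is a plain-Lua sketch of the same behaviour. It is not the module's own code, and the names hex_encode_sketch and hex_decode_sketch are made up for this example. It assumes whitespace may appear anywhere in the input to the decoder, including between the two digits of a pair.

```lua
-- Hypothetical sketch of hex_lower / hex_upper: every byte becomes two digits.
local function hex_encode_sketch(data, uppercase)
    local fmt = uppercase and "%02X" or "%02x"
    return (data:gsub(".", function (c)
        return fmt:format(c:byte())
    end))
end

-- Hypothetical sketch of hex_decode.
local function hex_decode_sketch(text)
    local digits = {}
    for i = 1, #text do
        local c = text:sub(i, i)
        if c:match("%x") then
            digits[#digits + 1] = c
        elseif not c:match("%s") then
            -- Anything other than a hex digit or whitespace is an error.
            error("invalid character in hex input at position " .. i)
        end
    end
    if #digits % 2 ~= 0 then
        error("odd number of hexadecimal digits")
    end

    -- Each pair of digits is decoded into one byte.
    local bytes = {}
    for i = 1, #digits, 2 do
        bytes[#bytes + 1] = string.char(tonumber(digits[i] .. digits[i + 1], 16))
    end
    return table.concat(bytes)
end

print(hex_encode_sketch("foo"))          --> 666f6f
print(hex_decode_sketch("66 6F 6f"))     --> foo
```

For real work, use the module's functions, which can also cope with input too large to hold in a string.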

There are also the following message digest, or hashing algorithms, which all behave in the same basic way. None of them take any options, and they all produce a small amount of binary output. None of them produce any output until all the input data has been read. Usually, you'll want to feed the output into the base64_encode or hex_lower algorithm to get a human-readable result.

adler32

Returns a 4 byte checksum. The algorithm is given in RFC 1950.

md5

Returns a 16 byte message digest using the algorithm from RFC 1321.

sha1

Returns a 20 byte message digest using the algorithm from RFC 3174.

Currently all the message digest algorithms are limited to input which is a multiple of 8 bits long (that is, you can only feed in bytes, not bits).
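
To show what the simplest of these algorithms computes, here is a plain-Lua sketch of the Adler-32 calculation from RFC 1950. It is not the module's own code, and the name adler32_sketch is made up for this example. The byte order of the result (most significant byte first) is an assumption based on how RFC 1950 stores the checksum; this manual doesn't say which order the module uses.

```lua
-- Hypothetical sketch of Adler-32 as defined in RFC 1950.
local function adler32_sketch(data)
    local a, b = 1, 0
    for i = 1, #data do
        a = (a + data:byte(i)) % 65521   -- 65521 is the largest prime below 2^16
        b = (b + a) % 65521
    end
    -- The checksum is b * 65536 + a, returned here as 4 binary bytes.
    -- Assumption: most significant byte first.
    return string.char(math.floor(b / 256), b % 256,
                       math.floor(a / 256), a % 256)
end
```

Like the module's digest functions, this returns binary data, so you'd normally pass the result through something like hex_lower before printing it.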

Copyright

This software and documentation is Copyright © 2007–2012 Geoff Richards <geoff at this domain dot co dot uk>. It is free software; you can redistribute it and/or modify it under the terms of the Lua 5.0 license. The full terms are given in the file COPYRIGHT supplied with the source code package, and are also available here: http://www.lua.org/license.html

The MD5 implementation was originally derived from the one in the Lua-MD5 module from the Kepler project, which is by Roberto Ierusalimschy and Marcela Ozório Suarez, and Copyright © 2003–2007 PUC-Rio. This version has been extensively modified to fit into the DataFilter architecture, so any bugs are likely my fault.