Lexical ordering of Hawaiian words

Published 11 August 2013 at 21:42

In order to update Wiktionary's index of Hawaiian words I tried to figure out how they should be sorted. It turns out it's a bit tricky to find out about this, so I'll make these notes for the future.

Peculiarities of Hawaiian orthography

The Hawaiian alphabet (ka pīʻāpā Hawaiʻi) has one consonant which isn't in the English alphabet, the ʻokina representing a glottal stop. Like any Hawaiian consonant it's always followed by a vowel, and it can appear at the start of a word (although I haven't quite figured out how to pronounce it there).

Ideally the character to use for ʻokina is U+02BB MODIFIER LETTER TURNED COMMA (so it's a letter as far as Unicode is concerned, meaning things like text selection will include it in a word). For various reasons it often gets written as the ASCII apostrophe U+0027 or ASCII grave/backtick character U+0060, and not all fonts will have the proper symbol for it.
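Here's a minimal sketch of how that cleanup might look in Perl. The function name normalize_okina is made up for illustration, and it assumes that every apostrophe-like character in the input really was meant as an ʻokina, which won't be true for mixed English and Hawaiian text.

```perl
use strict;
use warnings;
use utf8;

# Replace the usual stand-ins (ASCII apostrophe, backtick, curly single
# quotes) with U+02BB. Assumption: all of them were meant as ʻokina.
sub normalize_okina {
    my ($text) = @_;
    # The replacement list is shorter than the search list, so tr///
    # repeats its last character for the rest.
    $text =~ tr/'`‘’/ʻ/;
    return $text;
}
```

So normalize_okina("Hawai'i") should come back spelled with U+02BB in place of the apostrophe.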

Hawaiian texts of the nineteenth and early twentieth centuries rarely represented the glottal stop, except to resolve ambiguities. It was later made a full-fledged letter, and is now almost always spelled out explicitly. In English loanwords from Hawaiian the ʻokina is usually omitted, as in the word Hawaiian itself, or in English muumuu from Hawaiian muʻumuʻu, or English ahi from Hawaiian ʻahi. The native name of the islands is Hawaiʻi, but in English it can be spelled Hawaii or Hawai'i (with a normal English apostrophe character).

By the way, the ʻokina is unicameral (has only one symbol, not separate ones for upper and lowercase). The conventional way to capitalize the first letter of a Hawaiian word when it starts with an ʻokina is to instead uppercase the vowel after it.
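That capitalization convention is easy to get wrong with a plain ucfirst, so here's a small sketch of it in Perl. The name capitalize_hawaiian is hypothetical, and it assumes the ʻokina has already been normalized to U+02BB.

```perl
use strict;
use warnings;
use utf8;

# Capitalize a word, skipping over a leading ʻokina if there is one.
# Assumes the ʻokina is written as U+02BB.
sub capitalize_hawaiian {
    my ($word) = @_;
    # \u uppercases only the next character of the replacement,
    # which here is the first letter after the optional ʻokina.
    $word =~ s/\A(ʻ?)(.)/$1\u$2/;
    return $word;
}
```

With this, ʻokina becomes ʻOkina and ʻāina becomes ʻĀina, while a word without a leading ʻokina is capitalized as usual.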

The other main special consideration is the long vowels. There are five short vowels in Hawaiian, and each can be lengthened with a kahakō (a macron, a straight line above the letter). Historically the kahakō also weren't written down, but usually are now.
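For comparing words with the kahakō ignored, one option is to decompose the text and drop the combining macron. This is only a sketch (strip_kahako is a made-up name), using the core Unicode::Normalize module. It has the side benefit of handling both precomposed vowels like ā and a plain vowel followed by U+0304.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD NFC);

# Return the text with every kahakō removed.
sub strip_kahako {
    my ($text) = @_;
    my $decomposed = NFD($text);      # ā becomes a + U+0304
    $decomposed =~ s/\x{0304}//g;     # drop U+0304 COMBINING MACRON
    return NFC($decomposed);          # recompose anything else
}
```

The ʻokina is untouched by this, since U+02BB has no decomposition.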

Finally, although Hawaiian only really has eight consonants, there are a few loanwords and older spellings which use other letters of the English alphabet. Most of the distinctions in sound these were supposed to represent turned out to be allophones, and would now be represented with the eight phonemic consonants, but if they turn up in a list of words they need to be sorted somewhere.

Dictionary order: Hawaiian Dictionary

There seem to be two main Hawaiian dictionaries that could provide a standardized sorting order, although unfortunately they use two different ones. The first is the well-known Hawaiian Dictionary by Mary Kawena Pukui and Samuel H. Elbert. This seems to be the less popular sorting method, but since I've mostly figured out the rules I might as well document them.

The letters in this dictionary are sorted by the English alphabetical order, so that vowels, Hawaiian consonants and non-Hawaiian consonants are intermingled. Looking at the edition from 1986 on the Google Books preview, pages 140–141 have this sequence:

ke
kē
kea
keʻa
keʻaawaileia
kēʻae
keʻahakahaka
keahi

So it looks like words are first sorted with the ʻokina and kahakō ignored. Then, if two words are otherwise the same, one with an ʻokina goes after one without, and if they're still the same then one with a kahakō goes after one without. I've also seen the sequence (on page 9) aʻiaʻi followed by ʻaiʻai, which suggests later ʻokina go before earlier ones. I haven't found an example that would show whether the same happens with kahakō in different positions.
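A rough sketch of these rules in Perl, as a three-level sort key: the bare letters first, then where the ʻokina fall, then where the kahakō fall. The names pe_key and pe_sort are made up for illustration. Treating kahakō positions the same way as ʻokina positions is purely a guess on my part, since I have no example to check it against. It assumes the ʻokina is written as U+02BB and long vowels are precomposed.

```perl
use strict;
use warnings;
use utf8;

# Build the three-level key for one word:
#   [ letters only, ʻokina flags, kahakō flags ]
# Each flag string has one digit per letter: 1 if that letter is
# preceded by an ʻokina (or carries a kahakō), 0 otherwise.
sub pe_key {
    my ($word) = @_;
    my ($base, $okina, $kahako) = ("", "", "");
    my $after_okina = 0;
    for my $c (split //, lc $word) {
        if ($c eq "ʻ") {
            $after_okina = 1;
            next;
        }
        (my $short = $c) =~ tr/āēīōū/aeiou/;
        $base   .= $short;
        $okina  .= $after_okina;
        $kahako .= ($short ne $c ? 1 : 0);
        $after_okina = 0;
    }
    return [$base, $okina, $kahako];
}

# Sort a list of words by comparing the key levels in turn.
sub pe_sort {
    my @keyed = map { [pe_key($_), $_] } @_;
    return map { $_->[1] }
           sort { $a->[0][0] cmp $b->[0][0]
               || $a->[0][1] cmp $b->[0][1]
               || $a->[0][2] cmp $b->[0][2] } @keyed;
}
```

Comparing the flag strings as text means a word with no ʻokina sorts before one with, and a later ʻokina sorts before an earlier one, which matches both the sequence from pages 140–141 and the aʻiaʻi, ʻaiʻai pair.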

Dictionary order: Māmaka kaiao

The other dictionary is the more modern (and I think continuously updated) Māmaka kaiao, published by the Hawaiian Lexicon Committee (Kōmike Huaʻōlelo). There's an introductory section which explains the ordering they use. They start with the vowels, then the Hawaiian consonants, then other consonants. Each group internally is sorted in English alphabetical order, with ʻokina the last of the Hawaiian consonants (so for example ahi would be near the start of a dictionary, but ʻahi not far from the end).

So the full ordering is as follows:

Letter	Name
A	ʻā
E	ʻē
I	ʻī
O	ʻō
U	ʻū
H	hē
K	kē
L	lā
M	mū
N	nū
P	pī
W	wē
ʻ	ʻokina
B	—
C	—
D	—
F	—
G	—
J	—
Q	—
R	—
S	—
T	—
V	—
X	—
Y	—
Z	—
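The table can be turned into a lookup quite directly. This is just a sketch of the idea, not how the program further down does it: the names $ORDER, %RANK and compare_by_rank are hypothetical, and it assumes lowercase input with kahakō already removed and U+02BB used for the ʻokina.

```perl
use strict;
use warnings;
use utf8;

# The letters in dictionary order: vowels, Hawaiian consonants
# (ʻokina last), then the other consonants.
my $ORDER = "aeiouhklmnpwʻbcdfgjqrstvxyz";

# Map each letter to its position in that ordering.
my %RANK;
my $i = 0;
$RANK{$_} = $i++ for split //, $ORDER;

# Compare two words letter by letter using the ranks.
# Assumes every character of both words appears in $ORDER.
sub compare_by_rank {
    my ($x, $y) = @_;
    my @x = split //, $x;
    my @y = split //, $y;
    while (@x && @y) {
        my $cmp = $RANK{ shift(@x) } <=> $RANK{ shift(@y) };
        return $cmp if $cmp;
    }
    # One word is a prefix of the other: the shorter one goes first.
    return @x <=> @y;
}
```

Used as sort { compare_by_rank($a, $b) } @words, this puts ahi ahead of hale, and ʻahi after everything starting with w.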

The kahakō aren't considered separate letters in this ordering, but affect the ordering when two words are otherwise identical. I haven't been able to find examples in the dictionary of what happens when there are more than two spellings only differentiated by kahakō, but it seems logical that a kahakō earlier in the word should make a bigger difference. That's what I've assumed in the program below anyway.

Implementation

So I knocked up a rough-and-ready Perl program to implement this. It takes one or more input filenames as arguments, or reads from stdin. The results are printed to stdout. It ignores any character that isn't one of the letters in the table above (or a long vowel), and treats a bunch of likely characters as alternatives for ʻokina. The -u option removes duplicate lines on output.

#!/usr/bin/perl
# hawsort, version 1.0.

use warnings;
use strict;
use utf8;
use Getopt::Std;

sub usage {
    my ($code) = @_;
    print STDERR "Usage: $0 [-uh] [input filenames...]\n";
    exit $code;
}

sub proper_okina {
    my ($c) = @_;
    # I'll assume these were all meant to be ʻokina.
    $c =~ s/['`‘’]/ʻ/;
    return $c;
}

sub char_type {
    my ($c) = @_;
    return 1 if $c =~ /\A[aeiouāēīōū]\z/;
    return 2 if $c =~ /\A[hklmnpwʻ]\z/;
    return 3 if $c =~ /\A[bcdfgjqrstvxyz]\z/;
    return 0;
}

sub shorten_vowel {
    my ($c) = @_;
    return "a", 1 if $c eq "ā";
    return "e", 1 if $c eq "ē";
    return "i", 1 if $c eq "ī";
    return "o", 1 if $c eq "ō";
    return "u", 1 if $c eq "ū";
    return $c, 0;
}

# Reduce a line to the letters that matter for sorting: lowercased,
# with ʻokina alternatives unified and anything else dropped.
sub sort_letters {
    my ($line) = @_;
    return grep { char_type($_) != 0 }
           map { proper_okina(lc $_) } split //, $line;
}

# This uses ord() to compare characters by codepoint to avoid any
# interference from whatever locale is in play.
# (Global locales are a crock.)
sub hawaiian_order {
    my @a = sort_letters($a);
    my @b = sort_letters($b);
    my $kahako_order = 0;

    while (@a && @b) {
        my $c1 = shift @a;
        my $c2 = shift @b;
        my $type1 = char_type($c1);
        my $type2 = char_type($c2);

        # Vowels first, then Hawaiian consonants, then other consonants.
        return $type1 <=> $type2 if $type1 != $type2;
        if ($type1 != 1) {
            next if $c1 eq $c2;
            return ord($c1) <=> ord($c2);
        }

        # Ignore difference between long and short vowels if possible.
        my ($c1_short, $c1_is_long) = shorten_vowel($c1);
        my ($c2_short, $c2_is_long) = shorten_vowel($c2);
        return ord($c1_short) <=> ord($c2_short)
            if $c1_short ne $c2_short;

        # Remember the earliest kahakō difference as a tie-breaker.
        # Doing it here keeps ignored characters from throwing off
        # the letter positions.
        $kahako_order ||= $c1_is_long <=> $c2_is_long;
    }

    return -1 if @b;
    return +1 if @a;

    # If they're otherwise equal, short vowels before kahakō.
    return $kahako_order;
}

my $uniq;
{
    my %opt;
    getopts("uh", \%opt) or usage(1);
    usage(0) if $opt{h};
    $uniq = $opt{u};
}
binmode STDOUT, ":utf8" or die;

# It would be nice just to use the magic <>, but I'm not sure
# how to do that with the binmode :utf8 in effect.
my @lines;
if (@ARGV) {
    for my $filename (@ARGV) {
        open my $fh, "<", $filename
            or die "error opening input file '$filename': $!\n";
        binmode $fh, ":utf8" or die;
        while (<$fh>) {
            chomp;
            push @lines, $_;
        }
    }
}
else {
    binmode STDIN, ":utf8" or die;
    while (<STDIN>) {
        chomp;
        push @lines, $_;
    }
}

@lines = sort hawaiian_order @lines;

my $prev_line;
for (@lines) {
    next if $uniq && defined $prev_line && $_ eq $prev_line;
    print "$_\n";
    $prev_line = $_;
}