Sometimes data are entered incorrectly because the person entering data mishears a word (most often a name); for example, enters Awbrey instead of Aubrey. This is especially true for English language, which has complex spelling rules. There are algorithms may help you to find all words that are pronounced closely to a sample. Given an English word, such an algorithm calculates a code that represents how the word sounds. Take a look at the following sample of an algorithm called Soundex:
Here A160 is the calculated code of Aubrey. You can see that similar names like Awbrey or Abra have the same code. Such algorithms are called phonetic or phonetic indexing algorithms; the word “index” reflects the fact that the algorithm takes a single word and calculates a phonetic code, which is same for all words that sound alike. (The latter is not always true because algorithms aren't perfect). The code can be used on its own for searching, sorting, grouping and so on. In FileMaker terms this means the calculated code can be stored and used for fast searching and making relationships:
Other algorithms that can make a fuzzy comparison work differently: they take two words and compare their similarity. This isn't as convenient as having an index of a word; every time you need to search for words similar to the given one you must check each record in a table. In FileMaker terms this means the calculation will be unstored and slow.
Most known of phonetic indexing algorithms are: the Soundex family, Daitch-Motokotoff Soundex, NYSIIS, Caverphone, and Metaphone; I wish I could be able to write a custom function for each. For now let's start from the simplest and the most famous Soundex and Miracode.
The post is lengthy and even contains some historical background, so if you're interested primarily in the code, please scroll down to the bottom of the page.
A short history of Soundex and Miracode
Soundex was invented by Robert C. Russel and Margaret K. Odell, who received U.S. patents No. 1,261,167 (1918, Russel) and No. 1,435,663 (1922, Russel and Odell) for a method of indexing based on the way a name was pronounced rather than the way it was spelled. Initially the method was used to handle census and immigration data; now it also became a part of search algorithms, database management software, spell checking engines and so on.
Soundex was invented when computers didn't exist, so it's simple and the code can be applied manually. The principle is to separate letters into several phonetic classes:
0: vowels (“oral resonants”) a, e, i, o, u, y;
1: labials and labio-dentals b, f, p, v;
2: gutterals and sibilants c, g, k, q, s, z;
3: dental-mutes d, t;
4: palatal-fricative l;
5: labio-nasal m;
6: the dental or lingua-nasal n;
7: dental fricative r.
Original Soundex retained the 1st letter of a word, merged consecutive letters of the same class together, discarded trailing gh, s, or z and all vowels, except the first. Later classes 6 and 7 (m and n) were merged and the rules has changed. As of 2005 the most known variants are Simplified Soundex and American Soundex (or Miracode); there's also a number of non-standard realizations.
Simplified Soundex
This variant became one of most famous after it had been described by Donald E. Knuth in Art of Programming. I cite here a very clear description made by R. Parson:
Remember the 1st letter. Replace letters with digits as follows:
aehiouw → 0, bfpv → 1, cgjkqsxz → 2, dt → 3, l → 4, mn → 5, r → 6.
Replace rows of same digit with single such a digit. Replace the 1st digit with the remembered letter. Remove all zeros. If the result has more than four symbols, keep only the 1st four; if the result has less than four symbols, add trailing zeros.
As a result you'll get a four-character code of one letter and three digits. The four-letter codes can be written together or the 1st letter may be separated by a dash: W252 or W-252.
Examples:
A235: Acton, Ashdown, Ashton, Astin, Aston, Austen, Austin, Austine, Axten.
R360: Reader, Reeder, Rider, Ritter, Rothera, Rothra, Ruder, Rutter, Ryder.
Of course, Soundex isn't perfect. Sometimes it “matches” names that don't sound same, for example, Wace, Waugh, and Wookey (W200) and fails to match Abbot (A130) and Abbots (A132).
American Soundex or Miracode
This algorithm is exactly the same as Simplified Soundex with an addition of HW-rule: single h и w that separate consonants are discarded, so if the consonants belong to the same class they are merged. For example, Ashcroft is A226 in Simplified Soundex and A261 in Miracode.
On my test data this rule affected only a tiny 0,1% of all names (12 of about 14,700).
These variants may be specifically important because they were used in U.S. National Archives. Most archive data were encoded with Miracode, but there are some entries encoded with Simplified Soundex. The HW-rule was documented as a standard in 1910, but actually data of 1880, 1900 and 1910 censuses were encoded with mixed methods.
Using Soundex to search these data adds a number of other rules, irrelevant to “pure” phonetical matching, but important to the subject of people names. For example, some prefixes must be discarded (Con, D', De, dela, Di, du, La, Le, van, Von, etc.) or expanded (St. becomes Saint). Researches must also be aware that these rules might have been used incorrectly and sometimes it's unclear which part of a multi-part name were encoded; for example, a nun Sister Veronica could be indexed under the “surname” Sister (S236).
Non-standard variants
When it's not necessary to conform to U.S. National Archives usage, for example, in a spell-checking program that uses Soundex to suggest replacement for a word which isn't found in the spelling database, some other variants of Soundex may be used. For example, some non-standard variants don't put the 1st letter in the code, but use a digit, or don't limit the length of the code. A nonstandard Soundex implementation is used in Microsoft SQL Server.
Typical implementation errors
Though the algorithm is simple its implementations often contain errors. A typical error is that the 1st letter isn't merged with the following letter of the same class: Lloyd is encoded as L430, while the correct code is L300, Pfister is encoded as P123, while the correct code is P236.
R. Parson says that another typical error is that instead of merging several letters of the same class only exactly same letters are merged. This error is very rough and is unlikely to be common.
The provided FileMaker implementation of Soundex and Miracode
The functions I present implement Soundex and Miracode with the following features bugs peculiarities:
No name-specific functionality. These functions don't discard or expand any prefixes, nor they standardize the input in any other name-specific way.
No fixed length. These functions don't produce a fixed-length codes of exactly 4 characters. This is done intentionally to leave room for experiments. If you need the classic Soundex or Miracode encoding, you can easily make it yourself:
Left( Soundex( text ) & "0000", 4 )
or with a dash:
Let( code = Left( Soundex( text ) & "0000", 4 ); Left( code, 1 ) & "-" & Right( code, 3 ) )
Consecutive H's and W's are merged. It's difficult to tell whether or not the HW-rule (see American Soundex above) applies only to single H's and W's or to any number of H's and W's between two consonants. On one hand the NARA page says that:
“If "H" or "W" separate two consonants that have the same soundex code, the consonant to the right of the vowel is not coded”which seems to imply that only single H and W letters are a subject to the rule. On other hand other sources (Yep. More on this below.) have descriptions of the algorithms or samples clearly showing that any number of consecutive H or W letters are merged together to nothing; for example, Bradandcathy.com provides a sample Swhgler (which doesn't seem to be a name actually; if you try to google it, you'll find only Soundex test data) and claims that the correct code for it is S460, which shows that S and G are merged even if they are separated by WH.
“Other sources”
I used only Internet sources and have found a number of discrepancies and errors in descriptions of both the Soundex history and algorithms themselves (I've even found wrong test data!). Though I tried to do my best to find correct descriptions, I could still make errors; if you find any, please let me know.
If you're interested in the subject, here are some of the sources that have additional information about Soundex:
U.S. National Archives and Records Administration page on Soundex contains a short introductory essay. Interested users may order a free leaflet on using the Census Soundex (see at the bottom of the page).
Soundex — the true story by Rick Parson provides a good and seemingly solid background of Soundex as well as an implementation in C.
Overview of Soundex by Bradley Mohr and Kathleen Whalen contains a short description of Soundex with a few notes that may be important to researchers.
Matchmaker, Mitchmoker — is that a Match? Or De-mystifying De-duplication by Andrew Coates contains a broad overview of the problem of de-duplication record as well as a description of Soundex, and a FoxPro implementation of another phonetic indexing algoritm, NYSIIS.
Understanding Classic SoundEx Algorithms by Creativyst software provides a description of Soundex, Miracode, Creativyst own enhanced version, implementations in C, JavaScript, Perl and Visual Basic and an online form powered by the Perl encoder.
A SoundEx implementation in .NET by Richard Birkby contains a short description and implementation of 4 Soundex variants (Miracode, Simplified, Knuth, and SQLServer) in C#.
SOUNDX/X - Soundex and Metaphone ActiveX Control by Mabry Software implements two variants of Soundex and Metaphone.
Surname to Soundex Calculator has an online calculator and a number of tips and bits of Soundex history.
At the time I'm writing this (January 2006) I don't recommend the Wikipedia article on Soundex; it's rather inaccurate and the description of the algorithm contains an error. I'll try to correct it myself.
An auxiliary function Translate Character()
This function makes the Soundex and Miracode functions more readable. It takes three arguments:
Translate Character( character, source, target )
where character is a single letter, source and target are strings of same length. The 1st character of target is the code of the 1st character of source and so on.
Let( position = Position( source; character; 1; 1 ); Case( position ≠ 0; Middle( target; position; 1 ) ) )
The function returns character translated from source to target:
Let( [ source = "ABC"; target = "DEF" ]; Assert Equals( Translate Character( "A", source, target ), "D" ) & Assert Equals( Translate Character( "B", source, target ), "E" ) & Assert Equals( Translate Character( "C", source, target ), "F" ) & Assert Equals( Translate Character( "D", source, target ), "" ) & Assert Equals( Translate Character( "", source, target ), "" ) )
Soundex
Takes single argument:
Soundex( text )
If( IsEmpty( text ); ""; Let( [ current char = Upper( Left( text; 1 ) ); rest of the string = Soundex( Right( text; Length( text ) - 1 ) ) ]; current char & Case( not IsEmpty( rest of the string ); Let( [ LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; CODES = "01230120022455012623010202"; current char code = Translate Character( current char; LETTERS; CODES ); next char code = Translate Character( Left( rest of the string; 1 ); LETTERS; CODES ) ]; Case( current char code ≠ next char code and next char code ≠ 0; next char code ) & Right( rest of the string; Length( rest of the string ) - 1 ) ) ) ) )
Returns the Soundex code of text:
Assert Equals( Soundex( "A" ), "A" ) & Assert Equals( Soundex( "B" ), "B" ) & Assert Equals( Soundex( "C" ), "C" ) & Assert Equals( Soundex( "D" ), "D" ) & Assert Equals( Soundex( "E" ), "E" ) & Assert Equals( Soundex( "F" ), "F" ) & Assert Equals( Soundex( "G" ), "G" ) & Assert Equals( Soundex( "H" ), "H" ) & Assert Equals( Soundex( "I" ), "I" ) & Assert Equals( Soundex( "J" ), "J" ) & Assert Equals( Soundex( "K" ), "K" ) & Assert Equals( Soundex( "L" ), "L" ) & Assert Equals( Soundex( "M" ), "M" ) & Assert Equals( Soundex( "N" ), "N" ) & Assert Equals( Soundex( "O" ), "O" ) & Assert Equals( Soundex( "P" ), "P" ) & Assert Equals( Soundex( "Q" ), "Q" ) & Assert Equals( Soundex( "R" ), "R" ) & Assert Equals( Soundex( "S" ), "S" ) & Assert Equals( Soundex( "T" ), "T" ) & Assert Equals( Soundex( "U" ), "U" ) & Assert Equals( Soundex( "V" ), "V" ) & Assert Equals( Soundex( "W" ), "W" ) & Assert Equals( Soundex( "X" ), "X" ) & Assert Equals( Soundex( "Y" ), "Y" ) & Assert Equals( Soundex( "Z" ), "Z" ) & Assert Equals( Soundex( "AEHIOW" ), "A" ) & Assert Equals( Soundex( "BPFV" ), "B" ) & Assert Equals( Soundex( "CGJKQSXZ" ), "C" ) & Assert Equals( Soundex( "DT" ), "D" ) & Assert Equals( Soundex( "L" ), "L" ) & Assert Equals( Soundex( "MN" ), "M" ) & Assert Equals( Soundex( "R" ), "R" ) & Assert Equals( Soundex( "ADAD" ), "A33" ) & Assert Equals( Soundex( "ADAD" ), "A33" ) & Assert Equals( Soundex( "ADHD" ), "A33" ) & Assert Equals( Soundex( "ADWD" ), "A33" ) & Assert Equals( Soundex( "Ashcroft" ), "A22613" ) & Assert Equals( Soundex( "Tymczak" ), "T522" ) & Assert Equals( Soundex( "Jackson" ), "J25" ) & Assert Equals( Soundex( "Pfister" ), "P236" ) & Assert Equals( Soundex( "Gutierrez" ), "G362" ) & Assert Equals( Soundex( "Lee" ), "L" ) & Assert Equals( Soundex( "Washington" ), "W25235" ) & Assert Equals( Soundex( "Williams" ), "W452" ) & Assert Equals( Soundex( "Baragwanath" ), "B6253" ) & Assert Equals( Soundex( "Donnell" ), "D54" ) & Assert Equals( Soundex( "Lloyd" ), "L3" ) & Assert Equals( Soundex( "Woolcock" ), "W422" )
Miracode
Takes single argument:
Miracode( text )
If( IsEmpty( text ); ""; Let( [ current char = Upper( Left( text; 1 ) ); next char = Upper( Middle( text; 2; 1 ) ); rest of the string = Miracode( Right( text; Length( text ) - If( ( next char = "H" or next char = "W" ); 2; 1 ) ) ) ]; current char & Case( not IsEmpty( rest of the string ); Let( [ LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; CODES = "01230120022455012623010202" ; this char code = Translate Character( current char; LETTERS; CODES ); next char code = Translate Character( Left( rest of the string; 1 ); LETTERS; CODES ) ]; Case( this char code ≠ next char code and next char code ≠ 0; next char code ) & Right( rest of the string; Length( rest of the string ) - 1 ) ) ) ) )
Returns the American Soundex (Miracode) code of text:
Assert Equals( Miracode( "A" ), "A" ) & Assert Equals( Miracode( "B" ), "B" ) & Assert Equals( Miracode( "C" ), "C" ) & Assert Equals( Miracode( "D" ), "D" ) & Assert Equals( Miracode( "E" ), "E" ) & Assert Equals( Miracode( "F" ), "F" ) & Assert Equals( Miracode( "G" ), "G" ) & Assert Equals( Miracode( "H" ), "H" ) & Assert Equals( Miracode( "I" ), "I" ) & Assert Equals( Miracode( "J" ), "J" ) & Assert Equals( Miracode( "K" ), "K" ) & Assert Equals( Miracode( "L" ), "L" ) & Assert Equals( Miracode( "M" ), "M" ) & Assert Equals( Miracode( "N" ), "N" ) & Assert Equals( Miracode( "O" ), "O" ) & Assert Equals( Miracode( "P" ), "P" ) & Assert Equals( Miracode( "Q" ), "Q" ) & Assert Equals( Miracode( "R" ), "R" ) & Assert Equals( Miracode( "S" ), "S" ) & Assert Equals( Miracode( "T" ), "T" ) & Assert Equals( Miracode( "U" ), "U" ) & Assert Equals( Miracode( "V" ), "V" ) & Assert Equals( Miracode( "W" ), "W" ) & Assert Equals( Miracode( "X" ), "X" ) & Assert Equals( Miracode( "Y" ), "Y" ) & Assert Equals( Miracode( "Z" ), "Z" ) & Assert Equals( Miracode( "AEHIOW" ), "A" ) & Assert Equals( Miracode( "BPFV" ), "B" ) & Assert Equals( Miracode( "CGJKQSXZ" ), "C" ) & Assert Equals( Miracode( "DT" ), "D" ) & Assert Equals( Miracode( "L" ), "L" ) & Assert Equals( Miracode( "MN" ), "M" ) & Assert Equals( Miracode( "R" ), "R" ) & Assert Equals( Miracode( "ADAD" ), "A33" ) & Assert Equals( Miracode( "ADAD" ), "A33" ) & Assert Equals( Miracode( "ADHD" ), "A3" ) & Assert Equals( Miracode( "ADWD" ), "A3" ) & Assert Equals( Miracode( "Ashcroft" ), "A2613" ) & Assert Equals( Miracode( "Tymczak" ), "T522" ) & Assert Equals( Miracode( "Jackson" ), "J25" ) & Assert Equals( Miracode( "Pfister" ), "P236" ) & Assert Equals( Miracode( "Gutierrez" ), "G362" ) & Assert Equals( Miracode( "Lee" ), "L" ) & Assert Equals( Miracode( "Washington" ), "W25235" ) & Assert Equals( Miracode( "Williams" ), "W452" ) & Assert Equals( Miracode( "Baragwanath" ), "B6253" ) & Assert Equals( Miracode( "Donnell" ), "D54" ) & Assert Equals( Miracode( "Lloyd" ), "L3" ) & Assert Equals( Miracode( "Woolcock" ), "W422" )
Technorati tags: FileMaker, FileMaker 7, FileMaker 8, Soundex, Miracode.
These CF's look so much more elegant than any I've seen before. However, in my tests both Soundex and Miracode return only the first letter of the name and no numeric code. I'm afraid that I cannot spot my error - and I doubt it anything other than my error.
Let me know if you'd like me send you my sample file.
Chris.
Posted by: Chris Manton | June 18, 2006 at 08:37 PM
Yes, please send me your file. Maybe the page has a typo somewhere in function descriptions. I'll also try to post a sample file with working functions as soon as I can.
Posted by: m.edoshin | June 18, 2006 at 09:22 PM
Have you considered adding a check in the cases of names to handle a hard or soft 'C' if it's the first letter? For instance, I would prefer "Karl" and "Carl" to have the same soundex code. I've modified the soundex routine in dBase to consider 'C' as 'K' if followed by A,O,U, otherwise consider it an 'S'. This seems to work well in a database of about 12,000 names (doesn't elegantly address "Ch' but it works).
Rich
Posted by: guest | September 28, 2006 at 07:32 AM
Hi Rich. You're right, keeping the first letter same this is a known limitation of Soundex. But since Soundex is more or less a standard, I decided not to fiddle with it. There's lots of other, more modern algorithms that address this and other limitation. Maybe I'll write a function for Phonex, which is based on Soundex but is reported to be better.
Posted by: m.edoshin | September 30, 2006 at 08:27 PM
Hi
Thanks for posting this. It sure beats the hodgepodge of calculations I had developed back in the days of FileMaker Pro 4 to calculate the Soundex and Miracode codes for surnames.
The problem referenced by Chris Manton on 18 June seems to have been caused by a missing line in the Soundex function - I added
next char =
Upper( Middle( text; 2; 1 ) );
after current char and before rest of string and it works. The Miracode function worked as posted.
Cheers
Roger
Posted by: theKiwi | May 04, 2007 at 06:26 AM
Mikhail,
Thinking of doing a blog post on this - great technique - might throw together some sample files to demo an implementation, all credit to you off course.
Do you mind?
Cheers.
Posted by: Alex | February 23, 2010 at 06:41 PM
Whoops, typo in the site name :(
Posted by: Alex | February 23, 2010 at 06:43 PM
Hi Alex,
Sure, feel free to. I've subscribed to the blog, looking forward to the post :)
Posted by: m.edoshin | February 24, 2010 at 07:01 PM