Hi folks, hope you all are doing well.
Please excuse long post, wanted to share this, some might find
it handy given a certain context. Must run, I'm very behind in
my work (hey I'm always running behind!)
# metaphone.awk: Michael Sanders - 2024
#
# example invocation:
#
# echo "texas taxes taxi" | awk -f metaphone.awk -v find=texas
#
# notes:
#
# ever notice when you search for (say):
#
# 'i went to the zu'
#
# & your chosen search engine suggests something like:
#
# 'did you mean i went to the zoo'
#
# the metaphone algorithm handles such cases pretty well actually...
#
# Metaphone is a phonetic algorithm, published by Lawrence Philips in
# 1990, for indexing words by their English pronunciation. It
# fundamentally improves on the Soundex algorithm by using information
# about variations and inconsistencies in English spelling and
# pronunciation to produce a more accurate encoding, which does a
# better job of matching words and names which sound similar.
# https://en.wikipedia.org/wiki/Metaphone
#
# english only (sorry)
#
# not extensively tested, nevertheless a solid start, if you
# improve this code please share your results
#
# other implementations...
#
# gist: https://gist.github.com/Rostepher/b688f709587ac145a0b3
#
# BASIC: http://aspell.net/metaphone/metaphone.basic
#
# C: http://aspell.net/metaphone/metaphone-kuhn.txt
# check if a character is a vowel (matches uppercase A, E, I, O, U only)
function isvowel(c) {
    return c ~ /[AEIOU]/
}
porkchop@invalid.foo (Mike Sanders) writes:
> Hi folks, hope you all are doing well.
> Please excuse long post, wanted to share this, some might find
> it handy given a certain context. Must run, I'm very behind in
> my work (hey I'm always running behind!)
Using a word list, I found some odd matches. For example:
$ echo "drunkeness indigestion" | awk -f metaphone.awk -v find=texas
drunkeness
indigestion
Are these really metaphone matches for "texas"? It's possible (I don't
know the algorithm at all well) but I found it surprising.
> # C: http://aspell.net/metaphone/metaphone-kuhn.txt
I wanted a "reference" implementation I could try, but this is not a
useful C program. It's in an odd dialect (it uses void but has K&R
function definitions) and has loads of undefined behaviours (strcpy of
overlapping strings, use of uninitialised variables etc).
> Using a word list, I found some odd matches.
# Metaphone Algorithm in AWK v2: Michael Sanders - 2024
# entry point...
{
    for (x = 1; x <= NF; x++) {
        if (similarity(metaphone($x, 10), find_code) >= 80)
            print find " : " $x
    }
}
Ben Bacarisse <ben@bsb.me.uk> wrote:
> Using a word list, I found some odd matches. For example:
> $ echo "drunkeness indigestion" | awk -f metaphone.awk -v find=texas
> drunkeness
> indigestion
> Are these really metaphone matches for "texas"? It's possible (I don't
> know the algorithm at all well) but I found it surprising.
Ben, give this a try when you can. Finally starting to wrap my mind around
its usage a little more...
I don't know what you are asking for as this (your latest AWK) is not
just an implementation of the metaphone algorithm. With the extra
Levenshtein test, "texas" matches only a few words.
However, if I remove the extra condition (that levenshtein($x, find) <=
2) your AWK code matches a different set of words to the C
implementation. Looking a bit deeper, your AWK code gives the code TKSS
to the word "texas" but the C code assigns it "TKS".
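For readers who want to see what the extra condition levenshtein($x, find) <= 2 filters on: the thread does not show the levenshtein() function from the v2 script, so the following is only a hypothetical stand-in, a textbook edit-distance routine in AWK. The wrapper name levenshtein_demo and the output format are made up for illustration.

```shell
# Sketch only: a textbook Levenshtein distance in AWK, NOT the code from the
# v2 script discussed in the thread (that code is not shown there).
levenshtein_demo() {
  awk -v find="$1" '
  # Edit distance between a and b; insert, delete and substitute each cost 1.
  # Names after the extra spaces are locals (d is a local array).
  function levenshtein(a, b,    la, lb, i, j, cost, v, d) {
    la = length(a); lb = length(b)
    for (i = 0; i <= la; i++) d[i, 0] = i
    for (j = 0; j <= lb; j++) d[0, j] = j
    for (i = 1; i <= la; i++)
      for (j = 1; j <= lb; j++) {
        cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
        v = d[i-1, j] + 1                                    # deletion
        if (d[i, j-1] + 1 < v) v = d[i, j-1] + 1             # insertion
        if (d[i-1, j-1] + cost < v) v = d[i-1, j-1] + cost   # substitution
        d[i, j] = v
      }
    return d[la, lb]
  }
  # Print every input word with its distance to the word given as find.
  { for (x = 1; x <= NF; x++) print $x, levenshtein($x, find) }
  '
}

echo "texas taxes taxi" | levenshtein_demo texas
```

With this routine "taxes" is at distance 2 from "texas" and "taxi" at distance 3, so a cutoff of <= 2 on the raw spelling would keep the first and drop the second regardless of what the phonetic codes say.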
Ben Bacarisse <ben@bsb.me.uk> wrote:
> However, if I remove the extra condition (that levenshtein($x, find) <=
> 2) your AWK code matches a different set of words to the C
> implementation. Looking a bit deeper, your AWK code gives the code TKSS
> to the word "texas" but the C code assigns it "TKS".
Just differing metaphone variants, witness...
Texas = Tex[ess] (if phonetically pronounced - almost slurred sounding)
just in case...
not sure it's wise to use 'm += var' with digits:
    m += string # valid
    m += "7" # may be invalid if it's a digit (even if quoted)
In article <va3k5u$3n2um$1@dont-email.me>,
Mike Sanders <porkchop@invalid.foo> wrote:
> just in case...
> not sure it's wise to use 'm += var' with digits:
>     m += string # valid
>     m += "7" # may be invalid if it's a digit (even if quoted)
In AWK, these are very different things.
+= is always arithmetic; it is not string concatenation at all.
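A minimal sketch of that difference, for anyone who wants to try it (the wrapper name add_vs_concat and the labels are made up for illustration): in AWK, += converts both operands to numbers, while concatenation is written by simply placing two values next to each other.

```shell
# Sketch: "+=" is numeric addition in AWK; concatenation is juxtaposition.
add_vs_concat() {
  awk 'BEGIN {
    m = "TK"; m += "S"      # both strings convert to 0, so m becomes 0
    print "add text:", m
    m = "TK"; m = m "S"     # juxtaposition concatenates the strings
    print "concat text:", m
    m = "4"; m += "7"       # numeric strings are added as numbers
    print "add digits:", m
    m = "4"; m = m "7"      # concatenation keeps the digits as text
    print "concat digits:", m
  }'
}

add_vs_concat
```

This prints "add text: 0", "concat text: TKS", "add digits: 11" and "concat digits: 47", so building up a code string needs the m = m "S" form whether or not the piece being appended looks like a digit.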
Ben Bacarisse <ben@bsb.me.uk> wrote:
> I don't know what you are asking for as this (your latest AWK) is not
> just an implementation of the metaphone algorithm. With the extra
> Levenshtein test, "texas" matches only a few words.
Not seeking/asking for anything Ben, just enjoy the ride =)
As for my Metaphone take... in fact it is one. Several Metaphone variants
use Levenshtein, and an implementation can mix any of the three Metaphone
versions. Seems that's the way it is in the wild...
There are certainly variants (and developments of the original) but I
thought you were implementing the same algorithm as the code you
referenced. Sorry for the confusion.