If you are considering indexing a sequence of text documents with a view to applying some sort of search facility then there is a lot to be said for stemming.
Stemming is the process of reducing words to their stem or root such that related words are reduced to the same stem. The result may not be the true etymological stem – all stemming algorithms look for is consistency. The benefit is that searches based upon such stems can find references for related words and this reduces the pressure on search users to guess the optimal words or phrases for a given requirement.
Some examples might be:
caresses => caress
ponies => poni
cats => cat
matting => mat
meetings => meet
Perhaps the best known stemming algorithm is the “Porter Stemmer” by Martin Porter. This algorithm was “frozen” in the past and now new developments in such techniques are based upon the newer Snowball stemmer.
C# Snowball* stemmers for a range of human languages can be found on Codeplex with the warning that the code has been ported from Java (a brief review more or less confirmed that). Martin Porter’s stemming site presents his algorithm in a range of languages but again it is clear from the structure that the C# versions are translated.
I took one of the C# Porter Stemmer versions and ran it against the 23,531 word test set and reassured by a perfect score decided to then “translate” the code into something just a little easier to follow and possibly tweak. Now my version (presented below) is not a Porter Stemmer as the output varies from the standard for a handful of words (I just have to mess with things).
I preferred:
advertise, advertised and advertising being stemmed to advert rather than advertis
merchandise becomes merchand and not merchandis
etc.
You probably get the picture – I have treated “ise” the same as “ize” (which works better in GB English I think) but it would not take anyone 5 minutes to undo if you wanted to be a purist but alternately you might fancy tweaking things further.
The code presented here is almost certainly much slower than the original but it is way easier to follow and debug. I have also added a cache so that words previously stemmed can be retrieved without re-processing although I will have to watch cache growth if the process is long running.
*More details of the Snowball algorithm can be found here http://snowball.tartarus.org/
<Edit>Fixed the colour errors in the demo code. The cause (something to do with text size) is under investigation</Edit>
Stemming is the process of reducing words to their stem or root such that related words are reduced to the same stem. The result may not be the true etymological stem – all stemming algorithms look for is consistency. The benefit is that searches based upon such stems can find references for related words and this reduces the pressure on search users to guess the optimal words or phrases for a given requirement.
Some examples might be:
caresses => caress
ponies => poni
cats => cat
matting => mat
meetings => meet
Perhaps the best known stemming algorithm is the “Porter Stemmer” by Martin Porter. This algorithm was “frozen” in the past and now new developments in such techniques are based upon the newer Snowball stemmer.
C# Snowball* stemmers for a range of human languages can be found on Codeplex with the warning that the code has been ported from Java (a brief review more or less confirmed that). Martin Porter’s stemming site presents his algorithm in a range of languages but again it is clear from the structure that the C# versions are translated.
I took one of the C# Porter Stemmer versions and ran it against the 23,531 word test set and reassured by a perfect score decided to then “translate” the code into something just a little easier to follow and possibly tweak. Now my version (presented below) is not a Porter Stemmer as the output varies from the standard for a handful of words (I just have to mess with things).
I preferred:
advertise, advertised and advertising being stemmed to advert rather than advertis
merchandise becomes merchand and not merchandis
etc.
You probably get the picture – I have treated “ise” the same as “ize” (which works better in GB English I think) but it would not take anyone 5 minutes to undo if you wanted to be a purist but alternately you might fancy tweaking things further.
The code presented here is almost certainly much slower than the original but it is way easier to follow and debug. I have also added a cache so that words previously stemmed can be retrieved without re-processing although I will have to watch cache growth if the process is long running.
/// <summary>
/// AditStemmer is a stemmer algorithm based upon the Porter stemmer algorithm
/// but has some slight but deliberate differences in output
/// see http://tartarus.org/~martin/PorterStemmer/ for more details and test data
/// This software is completely free for any purpose but is not warrented to be defect free
/// </summary>
class AditStemmer
{
private Dictionary<string, string> cache = new Dictionary<string, string>(); // could initialise this to avoid known issues (eg universal, university, universe)
public string stemWord(string word)
{
string stemWord;
if (!cache.TryGetValue(word, out stemWord))
{
stemWord = tryStem(word);
cache.Add(word, stemWord);
}
return stemWord;
}
private string tryStem(string word)
{
if (word.Length > 2)
{
// following the steps of giants - step 1 remove plurals, -ed, -ing
word = step1(word);
word = step2(word);
word = step3(word);
word = step4(word);
word = step5(word);
word = step6(word);
}
return word;
}
private string step1(string word)
{
int wordLen = word.Length;
if (word.EndsWith("s"))
{
if (word.EndsWith("sses") || word.EndsWith("ies"))
{
wordLen -= 2;
}
else if (!word.EndsWith("ss"))
{
wordLen--;
}
word = word.Substring(0, wordLen);
}
if (word.EndsWith("eed"))
{
if (countConsonants(word, wordLen - 1) > 0)
{
wordLen--;
}
}
else
{
if ((word.EndsWith("ed") && hasVowelInStem(word, wordLen - 2)) || (word.EndsWith("ing") && hasVowelInStem(word, wordLen - 3)))
{
int checkTo = wordLen - (word.EndsWith("ed") ? 2 : 3);
string stem = word.Substring(0, checkTo);
if (stem.EndsWith("at") || stem.EndsWith("bl") || stem.EndsWith("iz") || stem.EndsWith("is"))
{
word = stem + "e";
}
else if (isDoubleCon(word, checkTo - 1))
{
if (!(stem.EndsWith("l") || stem.EndsWith("s") || stem.EndsWith("z")))
{
word = stem.Substring(0, checkTo - 1);
}
else { word = stem; }
}
else if (countConsonants(word, checkTo) == 1 && isConVowCon(word, checkTo - 1))
{
word = stem + "e";
}
else
{
word = stem;
}
wordLen = word.Length;
}
}
return word.Substring(0, wordLen);
}
private string step2(string word)
{
// not too exciting just changes y to i if another vowel in the stem - but that comes in useful later
if (word.EndsWith("y") && hasVowelInStem(word, word.Length - 1))
{
word = word.Substring(0, word.Length - 1) + "i";
}
return word;
}
private string step3(string word)
{
if (word.EndsWith("ational")) { return conditionalReplace(word, "ational", "ate"); }
if (word.EndsWith("tional")) { return conditionalReplace(word, "tional", "tion"); }
if (word.EndsWith("enci")) { return conditionalReplace(word, "enci", "ence"); }
if (word.EndsWith("anci")) { return conditionalReplace(word, "anci", "ance"); }
if (word.EndsWith("izer")) { return conditionalReplace(word, "izer", "ize"); }
if (word.EndsWith("bli")) { return conditionalReplace(word, "bli", "ble"); }
if (word.EndsWith("alli")) { return conditionalReplace(word, "alli", "al"); }
if (word.EndsWith("entli")) { return conditionalReplace(word, "entli", "ent"); }
if (word.EndsWith("eli")) { return conditionalReplace(word, "eli", "e"); }
if (word.EndsWith("ousli")) { return conditionalReplace(word, "ousli", "ous"); }
if (word.EndsWith("ization")) { return conditionalReplace(word, "ization", "ize"); }
if (word.EndsWith("isation")) { return conditionalReplace(word, "isation", "ize"); }
if (word.EndsWith("ation")) { return conditionalReplace(word, "ation", "ate"); }
if (word.EndsWith("ator")) { return conditionalReplace(word, "ator", "ate"); }
if (word.EndsWith("alism")) { return conditionalReplace(word, "alism", "al"); }
if (word.EndsWith("iveness")) { return conditionalReplace(word, "iveness", "ive"); }
if (word.EndsWith("fulness")) { return conditionalReplace(word, "fulness", "ful"); }
if (word.EndsWith("ousness")) { return conditionalReplace(word, "ousness", "ous"); }
if (word.EndsWith("aliti")) { return conditionalReplace(word, "aliti", "al"); }
if (word.EndsWith("iviti")) { return conditionalReplace(word, "iviti", "ive"); }
if (word.EndsWith("biliti")) { return conditionalReplace(word, "biliti", "ble"); }
if (word.EndsWith("logi")) { return conditionalReplace(word, "logi", "log"); }
return word;
}
private string step4(string word)
{
if (word.EndsWith("icate")) { return conditionalReplace(word, "icate", "ic"); }
if (word.EndsWith("ative")) { return conditionalReplace(word, "ative", ""); }
if (word.EndsWith("alize")) { return conditionalReplace(word, "alize", "al"); }
if (word.EndsWith("alise")) { return conditionalReplace(word, "alise", "al"); }
if (word.EndsWith("iciti")) { return conditionalReplace(word, "iciti", "ic"); }
if (word.EndsWith("ical")) { return conditionalReplace(word, "ical", "ic"); }
if (word.EndsWith("ful")) { return conditionalReplace(word, "ful", ""); }
if (word.EndsWith("ness")) { return conditionalReplace(word, "ness", ""); }
return word;
}
private string step5(string word)
{
int truncAt = word.Length;
if (word.EndsWith("al")) { truncAt -= 2; }
if (word.EndsWith("ance") || word.EndsWith("ence")) { truncAt -= 4; }
if (word.EndsWith("er")) { truncAt -= 2; }
if (word.EndsWith("ic")) { truncAt -= 2; }
if (word.EndsWith("able") || word.EndsWith("ible")) { truncAt -= 4; }
if (word.EndsWith("ant")) { truncAt -= 3; }
if (word.EndsWith("ement"))
{
truncAt -= 5;
}
else if (word.EndsWith("ment"))
{
truncAt -= 4;
}
else if (word.EndsWith("ent"))
{
truncAt -= 3;
}
if (word.EndsWith("ion") && word.Length > 4)
{
if (word.ElementAt(word.Length - 4) == 't' || word.ElementAt(word.Length - 4) == 's')
{
truncAt -= 3;
}
}
if (word.EndsWith("ou")) { truncAt -= 2; }
if (word.EndsWith("ism")) { truncAt -= 3; }
if (word.EndsWith("ate")) { truncAt -= 3; }
if (word.EndsWith("iti")) { truncAt -= 3; }
if (word.EndsWith("ous")) { truncAt -= 3; }
if (word.EndsWith("ive")) { truncAt -= 3; }
if (word.EndsWith("ize")) { truncAt -= 3; }
if (word.EndsWith("ise")) { truncAt -= 3; }
if (countConsonants(word, truncAt) > 1)
{
word = word.Substring(0, truncAt);
}
return word;
}
private string step6(string word)
{
// mostly removes trailing e
if (word.EndsWith("e"))
{
int cCount = countConsonants(word, word.Length - 1);
if (cCount > 1 || cCount == 1 && !isConVowCon(word, word.Length - 2))
{
word = word.Substring(0, word.Length - 1);
}
}
// reduces ll to l where appropriate
if (word.EndsWith("ll") && countConsonants(word, word.Length - 1) > 1)
{
word = word.Substring(0, word.Length - 1);
}
return word;
}
private string conditionalReplace(string word, string replace, string with)
{
// the condition is that there is at least one preceeding consonant sequence
int stemLength = word.Length - replace.Length;
if (countConsonants(word, stemLength) > 0)
{
word = word.Substring(0, stemLength) + with;
}
return word;
}
private int countConsonants(string word, int to)
{
//counts sequences of consonants before to but after any initial consonant sequence
int cCount = 0;
bool wasVowel = false;
for (var ci = 0; ci < to; ci++)
{
if (hasConsonantAt(word, ci))
{
if (wasVowel)
{
wasVowel = false;
cCount++;
}
}
else
{
wasVowel = true;
}
}
return cCount;
}
private bool isDoubleCon(string word, int at)
{
// is double consonant
if (at < 1) { return false; }
if (word.ElementAt(at) != word.ElementAt(at - 1))
{
return false;
}
return hasConsonantAt(word, at);
}
private bool isConVowCon(string word, int to)
{
/* is true <=> i-2,i-1,i has the form consonant - vowel - consonant
and also if the second c is not w,x or y. this is used when trying to
restore an e at the end of a short word. e.g.
cav(e), lov(e), hop(e), crim(e), but
snow, box, tray.
*/
if (to < 2 || !hasConsonantAt(word, to) || hasConsonantAt(word, to - 1) || !hasConsonantAt(word, to - 2))
{
return false;
}
if (word.ElementAt(to) == 'w' || word.ElementAt(to) == 'x' || word.ElementAt(to) == 'y')
{
return false;
}
return true;
}
private bool hasVowelInStem(string word, int to)
{
for (var vi = 0; vi < to; vi++)
{
if (!hasConsonantAt(word, vi))
{
return true;
}
}
return false;
}
private bool hasConsonantAt(string word, int at)
{
switch (word.ElementAt(at))
{
case 'a':
case 'e':
case 'i':
case 'o':
case 'u':
return false;
case 'y':
return (at == 0) ? true : !hasConsonantAt(word, at - 1); // y can act as consonant and vowel
default:
return true;
}
}
}
*More details of the Snowball algorithm can be found here http://snowball.tartarus.org/
<Edit>Fixed the colour errors in the demo code. The cause (something to do with text size) is under investigation</Edit>
No comments:
Post a Comment