Monday, September 20, 2010

Lesson 9. Tries

Lesson 9. Tries

NOTE: For whatever reason, on this post Blogger does not show it when I cout an endl, so if you see a single less than sign where there's supposed to be a stream operator, it's probably supposed to be printing an endl there.



So, you just learned about binary search trees, which can store data of enumerated types like ints and chars and cut the search time down to logarithmic runtime. The next question, though, is how do we make a tree like that for strings? Well, if we think of strings as being made of characters, then that would mean that we could possibly have a data structure that would let us reTRIEve a string from a tree that searches for the string one character at a time. That data structure is called a trie, and its nodes have this struct:


struct tNode

{
    vector links;
    string isString;
    char letter;
};


Each tNode has a char, letter, which stores the char that it carries.



Its string isString is an empty string if the char it stores is not the last char of a string THAT IS IN THE TREE (I’ll get back to this point later.)


Links is where it gets interesting. Since each tNode stores one letter of a string, we have a large possible number of pointers, one pointing to a node holding each char. Now, let’s say we’re only working with the 26 letters of the alphabet. That means we would have an array of 26 pointers to tNodes. That is highly space-inefficient and time-inefficient to have to make and initialize all these pointers that may or may not even point to a tNode.


Because of this, instead of having an array of tNode*s, we have a VECTOR of tNode*s, which grows and shrinks depending on the number of tNodes that each tNode has pointers to. (This tutorial will use the vector’s member functions, which are available at cplusplus.com, but they are for the most part self-explanatory due to how they are named).


Now that you know how a tNode works, I will show you the tNode class and its constructor:

class Trie

{
    public:
        tNode* root;
        Trie();
        ~Trie();
        bool isEmpty();
        void remove(string word, unsigned int index, tNode* ptr);
        bool contains(string word, unsigned int index, tNode* ptr);
        void insert(string word, unsigned int index, tNode* ptr);
        void levelOrder(tNode* ptr);
    private:
        void clear(tNode* ptr);
        queue nodes;
};
Trie::Trie()
{
    root = new tNode;
}

One thing you should know about tries is that in my implementation of tries, the root is a dummy node; its letter and isString values are not initialized since this dummy node is only used for its vector of pointers to the tNodes containing the first characters of the strings stored in the trie. However, since that vector is the only useful part of the root node, the rest of the node doesn’t have to be used in an implementation of a trie.



Before we continue, I am going to tell you how to read my drawings of tries.


1. Each tNode has two halves. The top half contains the node’s char (not in quotation marks) if its isString is empty (in other words, the node holds a char that ISN’T the last char of a string that is in the trie). The top half contains the node’s string in quotation marks if isString is not blank (in other words, the node holds a character that is also the last character of a string that IS in the trie). The bottom half represents the vector of pointers to tNodes, with the chars in the vector representing the chars of the tNodes that each pointer points to.


2. In the root node, since the root node’s letter and isString are uninitialized due to the fact that they are not used in the trie, the root node’s top half has a nullset in it.


3. If a tNode’s vector is just a box with a nullset in it, that means that the tNode has an empty vector and therefore does not have any pointers pointing to other tNodes.






Now that you know how to read this drawing, here’s a drawing of a trie holding ten words:

 
Notice here that each level of the trie holds a char at a different position in whatever strings it holds. Also, notice that there are no pointers to NULL; the vectors take care of the pointers for us by completely removing any pointers we don’t need, so we don’t need to have any pointers initialized to point to NULL.







Because the root node is always in the trie, in a trie the isEmpty function looks a little different from other isEmpty functions:

bool Trie::isEmpty()
{
    return root->links.empty();

}

Since a trie’s root node only contains a vector of POINTERS TO the tNodes that contain the first chars of every string in the trie, if the root node’s vector is empty, that means that the root node does not contain any pointers.  Therefore, if the root node’s vector is empty, then the trie does not hold any strings, so the trie is empty.  This is what an empty trie looks like:
Now, you will see the contains function, which checks to see if the trie contains the string we are searching for.
//For contains to work, in the first call of the function pass 0 as index and pass root as ptr.
bool Trie::contains(string word, unsigned int index, tNode* ptr)
{
    if(index < word.length())//1   
    {
        for(unsigned int i = 0; i < ptr->links.size(); i++)//2
        {
            if(ptr->links.at(i)->isString.compare(word) == 0)
                return true;//3
            if(ptr->links.at(i)->letter == word.at(index))
                return contains(word, index+1, ptr->links.at(i));//4
        }
    }
    return false;//5
}

1/5. If the index is less than the length of the word we are searching for, then we look through the current node’s vector for a pointer to a node holding the character of the string at index’s position (ex in the first recursion, when index is 0, we look for a pointer to a node holding the first letter of the string). If the index is at the length of the word, then we are on the node holding the last character of the string, but it is not holding the string we are searching for, so the function returns false.



2. At each node, we iterate through its vector looking for a pointer to a node holding the string’s character at index’s position. For the iterator we use an unsigned int so that is compatible with ints returned from member functions of vectors and strings.


3/4. When the iterator reaches a pointer that points to a node holding the string’s character at index’s position…


If index is less than length - 1, then that pointer brings us to a new recursion of the function with index incremented by one; in that next recursion we will be looking for a pointer to a node with the next character.


If index is length - 1, then the node we found contains the last character of the string as its letter. Because of this, whether or not the trie contains the string we are looking for depends on this node’s isString; if isString contains the string we are looking for, the function returns true. Otherwise, another recursion starts on this node, but index is equal to length so the function returns false.


5. This line is not an else because false is supposed to be returned either if index is at length (we are on a recursion at the node containing the last character of the string but its isString is blank) or another return statement did not start the next recursion of the function (after iterating through the node’s vector, the function could not find a node containing the next char).


Now, here is a picture of the contains function. Please note that that in each drawing, for each recursion the node that ptr points to is the top node of the recursion ring. The node the arrow points to is the node that ptr will point to in the next recursion.

Now, I will show you the insert function for a trie:
//For insert to work properly, in the first call of insert pass 0 as index and root as ptr.

void Trie::insert(string word, unsigned int index, tNode* ptr)
{
    unsigned int loc = -1;//1
    for(unsigned int i = 0; i < ptr->links.size(); i++)
    {
        if(ptr->links.at(i)->letter==word.at(index))
            loc = i;
    }
    if(loc == -1)//2
    {
        loc = ptr->links.size();
        tNode* add = new tNode;
        add->letter = word.at(index);
        if(index == word.length()-1)//3
            add->isString = word;
        else
            add->isString = "";
        ptr->links.push_back(add);
        if(index < word.length() - 1)//4
            insert(word, index+1, ptr->links.back());
    }
    else
    {
        if(index < word.length() - 1)//5
            insert(word, index+1, ptr->links.at(loc));
        else//6
        {
            if(word.compare(ptr->links.at(loc)->isString) == 0)
                cout<< "Error: This word is already in the trie"<
            else
                ptr->links.at(loc)->isString = word;
        }
    }
}

1. For whatever node we are currently on, we iterate through its vector. If we find a pointer to a node containing the character at index’s position, then the pointer’s position in the vector is recorded in loc over its dummy value of -1.



2/3. If loc remains at -1 after the iteration (either the vector does not contain a pointer to a node with the character at index’s position or the vector is empty), then we create a pointer to a node containing the character at index’s position. If this character is the last character of the string, then we set its isString to the string we are inserting; otherwise we set this value to an empty “”. We then push this pointer into the vector of the node we are currently on.


4. If the pointer that we just created does not point to the node holding the last character of the string, then that means there are more characters to insert, so we start a new recursion of the function by incrementing index by 1 and passing the pointer we just inserted as ptr.


5. If loc is not -1, then that means that in our iterations through the current node’s vector we found a pointer to a node containing the char at index’s position. If the current index is not length-1, then we will start a new recursion at the node it that the pointer points to with index incremented by 1.


6. If we are at position length - 1, then if the node that the pointer at position loc of the vector points to currently holds the string that we want to insert, then that means that the trie already contains the string we want to insert. Otherwise, we set the isString of this node equal to the string we are trying to insert.

Now I will show you the remove function.
//Invariant: Remove is not to be used on an empty trie.

//For remove to work, in the first call of the function pass 0 as index and pass root as ptr.
void Trie::remove(string word, unsigned int index, tNode* ptr)
{
    int hasChar = -1;//1
    for(unsigned int i = 0; i < ptr->links.size(); i++)
    {
        if(ptr->links.at(i)->letter == word.at(index))//2
        {
            if(word.compare(ptr->links.at(i)->isString) == 0)
            {
                hasChar = i;
                ptr->links.at(i)->isString = "";//3
            }
            else
            {
                if(index < word.length() - 1)//4
                {
                    hasChar = i;
                    remove(word, index + 1, ptr->links.at(i));
                }
                else//5
                    cout<<"Error: This word is not in the trie!"<
            }
        }
    }
    if(hasChar != -1)//6
    {
        if(ptr->links.at(hasChar)->isString.compare("") == 0 && ptr->links.at(hasChar)->links.empty())
        {
            delete ptr->links.at(hasChar);
            ptr->links.erase(ptr->links.begin()+hasChar);
        }
    }
    else//7
        cout<<”Error: This word is not in the trie!”<
}

1/2. Like with loc in insert, in remove we have an integer variable hasChar. In each recursion, we iterate through the current node’s vector to try and find a pointer to a node holding the character at index’s position; if we do, then hasChar is set to that pointer’s position in the vector. Like loc, hasChar is initialized to a dummy value of -1.




3. If that pointer’s isString is the same string as the word we want to remove, then we set that pointer’s isString to be an empty string.


4. Otherwise, if index is still less than the length of the string we want to remove minus one, then we start another recursion of the function using the pointer we found as ptr and incrementing index by one.


5. However, if index is equal to the length of the string we want to remove minus one, then we have already searched every character of the string and found that the trie does not contain the string we want to remove.


6. After this, in each recursion of the function we still have our pointers that we used for starting the next recursion in place.


To delete unused space, if the vector of a node being pointed to (by a pointer in that recursion’s node’s vector) is empty and that node’s isString is blank, then the node is not in use since only nodes with strings can have empty vectors, so we delete that node, and in the vector that contains the pointer to that node, the pointer is erased.


Since each recursion is halted at the call of the next recursion until that recursion ends, this recursive deletion of nodes happens with the last node deleted first in reverse order, causing a deletion chain reaction until we reach a node that is in use either due to having a non-empty vector or due to being the last character of a string.


7. If hasChar is -1, then the trie does not contain the letter of the string at index’s position, so the trie does not contain the string we want to remove.

Now, we will learn how to do a level-order traversal of a trie.



A trie’s level-order traversal is very similar to a binary tree’s level-order traversal. The only real difference is that now we need to account for the fact that different nodes have different numbers of children. However, since we can iterate through the vector of pointers to each node’s children, that isn’t too hard.

//For levelOrder to work, pass root as ptr.

void Trie::levelOrder(tNode* ptr)
{

    for(unsigned int i = 0; i < ptr->links.size(); i++)//1
        nodes.push(ptr->links.at(i));
    if(ptr != root)//2
    {
        if(ptr->isString.compare("")==0)//3
            cout<letter<<"# ";
        else//4
        {
            for(unsigned int i = 0; i < ptr->isString.length(); i++)
                cout<isString.at(i);
            cout<<"! ";
        }
        nodes.pop();
    }
    if(!nodes.empty())//5
        levelOrder(nodes.front());
    else
        cout<
}

1. In each node, we iterate through its vector to find every child of that node, and we push the pointer to each child into the queue.




2. The code in this if statement is for printing the data of each node. If the node we are currently on is the root, we do not execute this section of the function because otherwise the first node is popped before its data gets printed.


3. If the node that ptr is pointing to has a blank string at isString, then we print the pointer’s letter and a # sign.


4. If the node that ptr is pointing to is holding a string, then we print the pointer’s word followed by an exclamation point (note that because strings are incompatible with cout, the string has to be printed one character at a time.)


5. Pretty much the same as the end of the BST level-order traversal function; if the queue is not empty, then start a recursion on the front element of the queue, otherwise print an endl.


The result is that it prints in order of which position each char is (look at the results below, then follow along with the complete trie at the top of the tutorial), printing the nodes of the first chars first, then the nodes of the second chars, and so on.


For example, this is the result of a level-order traversal of the trie at the beginning of this tutorial (the printing was separated so that you can see the chars being printed as the first chars and one-char strings at the first level, the second chars and two-char strings at the second level, and so on).


(First characters)T# A! (Second characters)H# A# AT! L# (Third characters)THE! A# I# L# ALL! (Fourth characters)THEM! THEY! THAT! THIS! TALL! ALLY!


Finally, I will show you the clear function for a trie, as well as the class destructor.

void Trie::clear(tNode* ptr)

{
    for(unsigned int i = 0; i < ptr->links.size(); i++)
    {
        clear(ptr->links.at(i));
    }
    delete ptr;
}
Trie::~Trie()
{
    clear(root);
}

Essentially what happens is that we start at the root of the trie and iterate through its vector, call recursions of the function on each of the root’s children, and then do the same with its children on each recursion. In each recursion, after clear is called on each of a node’s children, that node is deleted. This therefore deletes every node of the trie, including the root node, which is why it is that clear is only to be called in the trie class destructor. The destructor-clear duo is essentially the same for tries as it is for BSTs, and as you will find in the next tutorial, it also works the same way in Patricia trees.




That’s it for tries. In the next tutorial, you will be learning about Patricia trees, the infamous data structure that is actually pretty awesome when you know how to implement them. How is it awesome? Well, you really get down and dirty with every string by looking not just at the character level but the BINARY level!

No comments :

Post a Comment