Monday, September 20, 2010

Lesson 10. PATRICIA trees

Lesson 10. Patricia trees


Another type of trie is the PATRICIA tree. PATRICIA stands for Practical Algorithm To Retrieve Information Coded In Alphanumeric. The Patricia Tree is based on using C-strings as its data, and its nodes look something like this:

struct pNode

{
    char* data;//A C-string
    bool isLeaf;
    int index;
    pNode* left;
    pNode* right;
};

Leaves are the only nodes capable of storing strings; if isLeaf is false, then data is set to “”, an empty C-string. Index means the index of the first difference at the binary levels between the strings inserted.







To understand Patricia trees, one must break down strings/C-strings based on what we know.


A string is essentially an array of characters plus one final character called the null terminator.


A character is essentially an array of eight bits.


Bits are binary digits, which can only be 0 or 1, much like bools can only be false or true.


If a bool is set to the int value 0 it is evaluated as false. If it is set to the int value 1 it is evaluated as true.


So since bools only have two possible values, much like bits,


We can think of a character as an array of eight bools.


So we can think of a string as an array of bools that is eight times the number of characters in the string plus eight 0’s to represent a null terminator at the end of the string.


Now prepare to see the Noobtricia Tree’s nodes.


struct pNode

{
    string data;
    bool bits[40];
    bool isLeaf;
    int index;
    pNode* zero;
    pNode* one;
};

A Noobtricia tree is a version of a Patricia tree that I made up as a way to teach Patricia trees. Instead of dealing with the C-string and its bits as part of the same data, Noobtricia trees deal with them separately, making it so that we have a regular string instead of a C-string as our data, and I also added an array of 40 bools to represent the bits of a string with a length up to four characters (first 8, 16, 24, or 32 bools) plus a null terminator (the eight bits after the character bits; if the string is less than four characters long, then the first eight-bit character to be 00000000 is the null terminator and all bits after that are not used). This makes it so that the Noobtricia tree removes confusion by dealing with the string and its bits separately.



In addition, since nodes that are not leaves are used to separate nodes at the first bit in which there is a difference between them (this bit is stored in that node as index), the pointers left and right were renamed zero and one; if a string’s bool at the current node’s index is false or 0, the string is stored in the node’s zero/left subtree. If that bool is true or 1, it is stored in a node within the current node’s one/right subtree.


If a node is a leaf, then we will give the node an index of -1 because by definition, leaves do not have subtrees, so their pointers point to NULL.


If a node is not a leaf, then we will give the node an empty string because only leaves store strings.






pNodes look like this in Noobtricia trees, and in Patricia trees they look like that except without the array of bits:

However, since if isLeaf is true we know that data will be a string and index will be -1 but if isLeaf is false data will be an empty string and index will be an int, in all diagrams I will use a shorthand drawing in which in leaves only the data will be shown and in non-leaves only the index will be shown.







So, the tree has two pointers and it is used to do a binary search for strings. Sound familiar? Well, if it reminds you of binary search trees there’s a good reason why.


In binary search trees, if you insert an int, to find the subtree it is inserted into, the int can only be one thing or the other; it can either be greater than the current node’s data or less than the current node’s data.


In Patricia trees, if you insert a string and its array of bools, then by the definition of a bool, at each index, the string’s array can only have one value or the other; it can only be false/zero or true/one. Because of this property, Patricia trees make it so that they are essentially the binary search trees of strings.


Interestingly enough, in addition to giving the Patricia tree a binary search property, looking at strings as an array of eight-bit characters ending with an eight-bit null terminator means that the null terminator makes it so that no string is a prefix of another string. For example, THE as a C-string is


THE\0


and THEM is


THEM\0


THE\0 is not a substring of THEM\0 because THE’s fourth character is a null terminator while THEM’s fourth character is an M, with the fifth character being its null terminator. If we look at those words in the form of their bits, we see this:


01010100(T),01001000(H),01000101(E),00000000(\0),00000000(\0) THE


01010100(T),01001000(H),01000101(E),01001101(M), 00000000(\0) THEM


Look at the fourth set of eight bits, the one that corresponds to the first \0 on THE and the M on THEM. You will see that in the second character of that set, which is also position 25 of the whole array, THE has a 0 and THEM has a 1. This means that the first difference between THE\0 and THEM\0 is in the character at position 3 and the bit at position 25. They would be separated by a node with 25 as its index, with THE in its zero subtree and THEM in its one subtree.






Now that the information about bits is established, it’s time to see what the Noobtricia Tree’s class and constructor looks like:

class pTree

{
    private:
        queue nodes;
    public:
    pNode* root;
        pTree();
        ~pTree();
        void clear(pNode* ptr);
        bool isEmpty();
        void insert(pNode* ptr, string word, bool wordBits[]);
        void remove(pNode* ptr, string word, bool wordBits[]);
        void levelOrder(pNode* ptr, bool isRoot);
        bool contains(pNode* ptr, string word, bool wordBits[]);
};
pTree::pTree()
{
    root = NULL;
}

Just as a preview, I will show you the whole Patricia tree that we will be working with, which has every word inserted into the tree.
(Note that we are working with the same ten words from the trie tutorial)
That’s a big tree there.  Yep, looks like now you’re playing in the big leagues!  You’ll also realize you’re in the big leagues because you’re working with this bad boy, a chart of every word in the tree translated to binary:
So how do we code all the functions of this data structure on steroids? Let’s find out!


We’ll start out with one of the easy functions:

bool pTree::isEmpty()

{
    return root == NULL;
}

Okay, no surprises there… it’s just like any other isEmpty function of a linked data structure.  Now you will see another familiar-looking function, contains.
//For contains to work, pass head as ptr in the first function call. Also, if you are passing wordBits to a function, what that means is that you are passsing the array of bools that corresponds to the string you are passing.

bool pTree::contains(pNode* ptr, string word, bool wordBits[])
{
    if(isEmpty())//1
        return false;
    if(!(ptr->isLeaf))//2
    {
        if(wordBits[ptr->index]==false)
        {
            return contains(ptr->zero, word, wordBits);
        }
        else
        {
            return contains(ptr->one, word, wordBits);
        }
    }
    else//3
    {
        if(word.compare(ptr->data) == 0)
            return true;
        else
            return false;
    }
}

1. If the tree is empty, then the tree does not contain the string you are looking for, for obvious reasons.



2. Only leaves contain strings, so the function recurses to get to a leaf. If our string’s bool at the index of the current node (ex if the current node has 1 as its index we check the second bool in our bit array) is 0/false, we call the next recursion in the zero/left subtree of the current node. If the bool is 1/true, we call the next recursion in the one/right subtree of the current node.


3. When we reach a leaf, we check if the leaf’s data is the same string we are searching for. If it is, then the tree contains the string so the function returns true. Otherwise, the tree does not contain the string so the function returns false.






You are now about to learn the really hard functions: insert and remove (which, by the way, wasn’t even in the Patricia trees we used when I took COMP 15). Between these functions, there are 213 lines of code. But before we start that, there’s a function I’d like to introduce to you: firstDiff.


In the insert function, we will be inserting, and one part of the insert function is a recursion section in which we recurse until we reach a leaf. Once we reach that leaf, we compare the string in the leaf to the string that we are trying to insert to find the first bit which is different between the two strings.


We could do that by simply comparing the bit arrays, but in a real Patricia tree, where a bit array is never passed, we first find the first character that is different between the two strings and then we find the first bit that is different.


This is a confusing process since we are using a new operator called the bitwise &, but to simplify the process while still starting the search for the first different bit with a search for the first different character, we use the function firstDiff to find the first different character:

int firstDiff(string wordA, string wordB)//1

{
    if(wordA.compare(wordB)==0)//2
        return -1;
    else
    {
        bool foundDiff = false;//3
        unsigned int diff = 0;
        while(!foundDiff)
        {
            if(diff==wordA.length()||diff==wordB.length())//4
                foundDiff = true;
            else
                if(wordA.at(diff) != wordB.at(diff))//5
                    foundDiff = true;
            if(!foundDiff)//6
            diff++;
        }
    return diff;
    }
}

1. The first thing I would like you to note is that this is not a member function of the pTree class; it can be used on any two strings in and out of the class.



2. If the two strings are the same, then -1 is returned, signifying that the two strings are the same.


3. foundDiff is the value that tells us whether or not we have found the difference between the two strings. diff is the iterator that keeps increasing until we have found the first difference between the two strings, changing foundDiff to true.


4. If diff is equal to the length of one of the words, then one of the words does not have a character at position diff (analogous to the word’s null terminator), so foundDiff is true.


5. If the character at position diff is not the same in both words, then foundDiff is true.


6. If there is no difference between the strings at position diff, then diff increases by one. Otherwise, diff is returned as the position at which we see the first difference between the two strings.


//For insert to work, pass root as ptr in the function call.

void pTree::insert(pNode* ptr, string word, bool wordBits[])
{
    if(isEmpty())//1
    {
        root = new pNode;
        root->data = word;
        root->isLeaf = true;
        root->index = -1;
        for(int i = 0; i < 40; i++)//2
        {
            root->bits[i] = wordBits[i];
        }
        root->zero = NULL;
        root->one = NULL;
    }
    else
    {
        if(!(ptr->isLeaf))//3
        {
            if(wordBits[ptr->index] == false)
                insert(ptr->zero, word, wordBits);
            else
                insert(ptr->one, word, wordBits);
        }
        else
        {
            if(word.compare(ptr->data) == 0)//4
                cout<<"Error: This word is already in the tree!”;
            else
            {
                int bitLoc = 8 * firstDiff(word, ptr->data);//5
                int firstDiff = -1;
                for(int i = bitLoc; i < bitLoc + 8; i++)
                    if(wordBits[i] != ptr->bits[i] && firstDiff == -1)
                firstDiff = i;
                pNode* diffNode = new pNode;//6
                diffNode->data = "";
                diffNode->isLeaf = false;
                diffNode->index = firstDiff;
                diffNode->zero = NULL;
                diffNode->one = NULL;
                if(wordBits[firstDiff] == false)//7
                {
                    diffNode->zero = new pNode;
                    diffNode->zero->data = word;
                    diffNode->zero->isLeaf = true;
                    diffNode->zero->index = -1;
                    diffNode->zero->zero = NULL;
                    diffNode->zero->one = NULL;
                    for(int i = 0; i < 40; i++)
                        diffNode->zero->bits[i] = wordBits[i];
                }
                else
                {
                    diffNode->one = new pNode;
                    diffNode->one->data = word;
                    diffNode->one->isLeaf = true;
                    diffNode->one->index = -1;
                    diffNode->one->zero = NULL;
                    diffNode->one->one = NULL;
                    for(int i = 0; i < 40; i++)
                        diffNode->one->bits[i] = wordBits[i];
                }
                if(root->isLeaf)//8
                {
                    if(wordBits[firstDiff] == false)
                        diffNode->one = root;
                    else
                        diffNode->zero = root;
                    root = diffNode;
                }
                else
                {
                    pNode* before = root;//9
                    bool placed = false;
                    while(!placed)//10
                    {
                        if(wordBits[before->index] == false)
                        {
                            if(before->zero->index < firstDiff && before->zero != ptr)
                                before = before->zero;
                            else
                                placed = true;
                        }
                        else
                        {
                            if(before->one->index < firstDiff && before->one != ptr)
                                before = before->one;
                            else
                                placed = true;
                        }
                    }
                    if(before == root && firstDiff < root->index)//11
                    {
                        if(diffNode->zero == NULL)
                            diffNode->zero = root;
                        else
                            diffNode->one = root;
                        root = diffNode;
                    }
                    else//12
                    {
                        if(wordBits[before->index] == false)
                        {
                            if(diffNode-> zero == NULL)
                                diffNode->zero = before->zero;
                            else
                                diffNode->one = before->zero;
                            before->zero = diffNode;
                        }
                        else
                        {
                            if(diffNode-> zero == NULL)
                                diffNode->zero = before->one;
                            else
                                diffNode->one = before->one;
                            before->one = diffNode;
                        }
                    }
                }
            }
        }
    }
}

Now for the explanation of all this code. Here goes!


1. If the tree is empty, then we make a new leaf node at root and initialize its values. Its pointers both point to NULL, its string is the string we passed in the function, its isLeaf is true, and its index is -1.

2. We iterate through the bit array we passed as wordBits to set the node’s bits one by one. You’ll be seeing that section of code a lot here.

3. Much like with the contains function, there is a recursion section of the function in which we recurse through the tree until we find a leaf node, which works the same way as the recursions in the contains function.

4. When we do find a leaf, we check if the leaf’s data matches the string that we are trying to insert into the tree. If it does, then we are inserting a duplicate value, and the function tells us this and then ends.

5. If we are not making a duplicte entry, then we now check to find the first bit at which the string we are entering and the string at the cureent node differ. To do that, first we find the first character at which they differ, which is found by using the firstDiff function. By definition, a character is a set of eight bits, so if there are two different characters, they can’t have the same eight bits. Therefore, the first bit to differ in the two strings must be one of the eight bits of the first character to differ.

So first, we find the index of the first character to differ between the two strings.

Then, we iterate through the eight bits of that character. To do this, we first create a variable equal to eight times the index of the character. This variable is bitLoc.

Then, we do a for loop to find the first bit of this character to differ. Since that character is eight bits long, we iterate through the bits ranging from the index of the first bit of the character, bitLoc, to the last bit of the character, bitLoc + 7.

The first time we find a bit that differs, we set firstDiff equal to that index. After that, since firstDiff no longer equals the dummy value of -1, any other differences between the bits are not taken into account.

6. We then create a node diffNode. This node is not a leaf, so its isLeaf is set to false, its string is set to “”, we don’t even bother initializing its bit array, its index is set to firstDiff, and for now its pointers will both be set to NULL, but not for long.

7. If the bit of the word we are inserting at the index of firstDiff is a zero/false, that means that the new node is supposed to be to the left/zero of diffNode, so its values are then initialized. If that bit is a one/true, then we do the same thing to the right.

8. If root is a leaf, then that means that the string we compared to the string we are inserting is the only string in the tree. Because of this, whichever pointer of diffNode is not occupied points to this node; at diffNode’s index, if the bit array of the string being inserted has a zero then the bit array of this other string has to have a one and vice versa.

9. Otherwise, we need to have a pointer pointing to the node that is to be before diffNode.

We know that the index of any node that is before diffNode must be lower than diffNode’s index.

Therefore, we have before iterate through the tree following the same path that was followed by ptr’s recursions until the node it is about to reach has an index greater than diffNode’s index.

10. We keep track of this using a bool called placed. When placed is true, before is placed at the node that has a pointer pointing to where we want diffNode to be. Before iterates through the tree, and the pointer it follows from each node is chosen based on the string’s bit that is at the current node’s index; therefore, before goes toward ptr’s leaf in the same direction ptr went.

However, before we make this pointer point to another node, the node that before is about to point to is checked; if the index of the node before is about to point to is greater than the index of diffNode or the node that before is about to point to is ptr’s leaf, then that means that before is now at the position that is before where we want to insert diffNode, so placed is set to true and the loop ends. Otherwise, placed remains set to false, before now points to the node it was about to point to before being checked, and the loop continues.

11. If before is root, then that means one of two things. Either diffNode is supposed to be the new root or diffNode goes directly to the left/zero or right/one of the root. If before is root and diffNode’s index is less than root’s index, then that means that diffNode MUST be the new root node.

Because of this, if diffNode’s zero pointer is pointing to the node we inserted (the string’s bit at diffNode’s index is a zero), then diffNode’s one pointer now points to the root node and the rest of the tree is diffNode’s right/one subtree. If diffNode’s one pointer is pointing to the node we inserted (the string’s bit at diffNode’s index is a one), then diffNode’s zero pointer now points to the root node and the rest of the tree is now diffNode’s left/zero subtree. After this, we have root’s pointer point to diffNode to make diffNode the new root node.

12. If before does not point to root’s node or root’s index is less than diffNode’s index, then that means that before has to be the node that has one of its pointers pointing to diffNode.

If the bit at before’s index is a zero, then before’s zero node points to diffNode and the pointer in diffNode that isn’t pointing to the leaf node of the string we are inserting is now pointing to before’s zero/left subtree.

If the bit at before’s index is a one, then before’s one node points to diffNode and the pointer in diffNode that isn’t pointing to the leaf node of the string we are inserting is now pointing to before’s one/right subtree.

Confused? Have a picture or four!



 

 
Well, that was a lot of text and pictures.  Here’s  some more for the remove function!
//Invariant: Remove is not to be used on an empty tree.
//To make it so that remove works, pass head as ptr.
void pTree::remove(pNode* ptr, string word, bool wordBits[])
{
    if(root->isLeaf)//1
    {
        if(word.compare(root->data) == 0)
        {
            delete root;
            root = NULL;
        }
        else
        {
            cout<<"Error: This word is not in the tree!”;
        }
    }
    else
    {
        if(word.compare(ptr->zero->data)!=0&&word.compare(ptr->one->data)!=0)//2
        {
            if(wordBits[ptr->index] == false && !(ptr->zero->isLeaf))//3
                remove(ptr->zero, word, wordBits);
            else if(wordBits[ptr->index] == true && !(ptr->one->isLeaf))
                remove(ptr->one, word, wordBits);
            else
                cout<<"Error: This word is not in the tree!";
       
}
        else//4
        {
            if(ptr == root)//5
            {
                if(word.compare(ptr->zero->data)==0)
                {
                    delete ptr->zero;
                    root = ptr-> one;
                    delete ptr;
                }
                else
                {
                    delete ptr->one;
                    root = ptr->zero;
                    delete ptr;
                }
            }
            else
            {
                pNode* before = root;//6
                while(before->zero != ptr && before->one != ptr)//7
                {
                    if(wordBits[before->index] == false)
                        before = before -> zero;
                    else
                        before = before -> one;
                }
                if(before->zero == ptr)//8
                {
                    if(word.compare(ptr->zero->data) == 0)
                    {
                        before->zero = ptr->one;
                        delete ptr->zero;
                        delete ptr;
                    }
                    else
                    {
                        before->zero = ptr->zero;
                        delete ptr->one;
                        delete ptr;
                    }
                }
                else
                {
                    if(word.compare(ptr->zero->data) == 0)
                    {
                        before->one = ptr->one;
                        delete ptr->zero;
                        delete ptr;
                    }
                    else
                    {
                        before->one = ptr->zero;
                        delete ptr->one;
                        delete ptr;
                    }
                }
            }
        }
    }
}

1. If root’s node is a leaf, we check to see if root’s node’s data matches the string we want to remove. If it does, then we delete root’s node and the tree is now empty. Otherwise, the tree does not contain the string we want to remove and the program tells us this.



2/3. We now have a recursion section in which we recurse through the tree until we find the node that is THE PARENT OF the leaf containing the string we want to remove (NOT THE LEAF ITSELF).


4. When we reach that node, we are now in the deletion section.


5. If ptr is pointing to root’s node, then one of root’s node’s pointers is pointing to the leaf containing the string we want to remove. We check which pointer that is and delete its node. We then have root point to the root node’s other child while root’s node is deleted.


6. If ptr is not pointing to root’s node, then we need to use a before node to find the node that is the parent of ptr’s node.


7. Before now iterates through the tree guided by the bit array of thestring we want to remove until before’s node has ptr’s node as one of its children.


8. We then take whichever one of before’s pNode*s is pointing to ptr and make it point to whichever one of ptr’s children is not the node with the string we want to remove. We then delete the node containing the string we want to delete and then we delete ptr as well, completing the removal.





We’re now on the home stretch!  All that’s left is levelOrder and the duo of functions known as clear and ~pTree!  All of them should be familiar to you when you compare them to their counterparts in binary search trees.
//For levelOrder to work, pass head as ptr and pass true for isRoot.
void pTree::levelOrder(pNode* ptr, bool isRoot)
{
    if(isRoot)
        nodes.push(ptr);
    if(ptr -> zero != NULL)
        nodes.push(ptr->zero);
    if(ptr -> one != NULL)
    {
        nodes.push(ptr->one);
    }
    if(nodes.front()->isLeaf)//1
    {
        for(unsigned int i = 0; i < nodes.front()->data.length(); i++)
            cout<data.at(i);
        cout<<' ';
    }
    else
    {
        cout<index << ' ';
    }
    nodes.pop();
    if(!nodes.empty())
        levelOrder(nodes.front(), false);
    else
        cout<<” ”;//This is supposed to be an endl, but endl isn’t compatible with Blogger.
}

1. The only thing in levelOrder that really changed is that now, we are dealing with two types of data to print. This if statement and the else statement make up the printing section of the function. If the current node is a leaf, then we will print its data (by the way, in order to print that string we need to print it one character at a time since string variables are incompatible with cout). Otherwise, we will print its index.



Now, here’s the clear function and the destructor:

//For clear to work properly, pass head as ptr.

void pTree::clear(pNode *ptr)
{
    if(!isEmpty())
    {
        if(ptr -> zero != NULL)
            clear(ptr -> zero);
        if(ptr -> one != NULL)
            clear(ptr -> one);
        delete ptr;
    }
    root = NULL;
}
pTree::~pTree()
{
    clear(root);
}

As you can see, the clear function and destructor are pretty much the same as the clear and destructor functions in binary search trees, doing a postorder traversal and deletion of every node in the tree.  Also, please note that the traversal only happens if the tree is not empty so that if the tree is empty a segfault doesn't happen due to accessing a node that isn't there.




You now know how to implement a Noobtricia tree. While this is not a real Patricia tree, every function of the real thing works the same way as the corresponding function of a Noobtrica tree, with a few differences you will see when learning about Patricia trees such as:


-Instead of passing a string and a corresponding array of boolean variables, you are passing a C-string and accessing its bits with the bitwise & operator (not covered in Stranded In Halligan).


-Since you are passing a C-string, you are not limited by array size anymore.


-You won’t be totally clueless when you learn about the real thing since you will have already learned just about every concept of Patricia trees in Noobtricia trees.






CONGRATULATIONS! You now know all of the concepts of the hardest data structure taught in COMP 15!

1 comment :

  1. should A's ascii code be 0x41 (0100 0001)? But in your article it's 0x30 (0011 0000).

    ReplyDelete