# 6.897 Advanced Data Structures Spring

6.897: Advanced Data Structures

Spring 2005

Lecture 18 — April 12, 2005
Prof. Erik Demaine Scribe: Igor Ganichev

1

Overview

In this lecture we are starting a sequence of lectures about string data structures. Today’s lecture will be on the string matching problem. In particular, we will consider Su?x Trees and their various applications. In the string matching problem we are given an alphabet Σ, text T , and a pattern P , and ask various questions such as: – Is there a substring of T matching P ? – How many substrings of T match P ? – Where are ?rst/any k occurrences of P in T ? – Where are all occurrences of P in T ? There are two di?erent approaches to solving such problems. Algorithmic approach solves the a new instance of problem every time the algorithm is run. There are a number of well known algorithms that achieve linear time in the size of T such as Karp-Rabin and Knuth-Morris-Pratt. In the data structural approach we can preprocess T , and then we can answer a query involving a pattern P much faster. We will see today how to answer queries in O(|P |) time using O(|T |) space and O(|T |) preprocessing time. It is interesting to note that space is linear in the number of words, not the number of bits, as one might hope for. We will discuss how to achieve linear space in the number of bits during the lecture on succinct data structures.

2

Relevant Concepts

Tries. A trie is a tree with children branches labeled with distinct letters from Σ. The branches are ordered alphabetically. A trie can be built for any set of strings for any alphabet. The branching factor of every internal node is |Σ|. For convenience, we will append a dollar sign, \$, to the end of all strings. We can reuse an old idea of coalescing non-branching paths to reduce the number of edges to be at most twice the number of leaves (which is the number of strings we build the trie on). This yields a compact trie. Figure 1 gives an example trie for the set of strings {ana, ann, anna, anne}, as well as its compact version.

1

a n a \$
ana

an n a\$ a e \$
anna ana

n \$ a\$ e\$
anna anne

\$
ann

\$
anne

ann

Trie
(Pointers to nulls are not shown)

Compacted Trie

Figure 1: Trie and Conmpacted Trie Examples

Su?x Arrays. A Su?x Array A of T is just a sorted array of su?xes of T . To avoid quadratic space we store only the indices of the su?xes, instead of full su?xes of T . For example, if T is equal to “banana\$”, the su?x array is [6, 5, 3, 1, 0, 4, 2] — corresponding to the alphabetical ordering of the su?xes { \$, a\$, ana\$, anana\$, banana\$, na\$, nana\$ }. We can search for occurrences of P directly on the su?x array using binary search in O(|P | lg |T |) time. This can be improved to O(|P | + lg |T |). Longest Common Pre?x Array. We can consider an LCP array of size |T | ? 1, in which the ith element is the length of the longest common pre?x of A(i) and A(i + 1), where A is the su?x array of T . For the “banana” example above this array will be [0 1 3 0 0 2]. Su?x Trees. The su?x tree of text T is a compacted trie on all the su?xes of T . For example, if our text is “banana\$”, the su?x tree is a compacted trie built for the set { banana\$, anana\$, nana\$, ana\$, na\$, na\$, a\$, \$ }. It is common to refer to a su?x by the index (starting at 0) of its ?rst character. To get linear space, all we have to do is not store the labels on edges explicitly, but, instead, store two indices: the position in T of the ?rst and last characters of the label. Note that this is possible because each label is a substring of a su?x of T , thus is a substring of T . This way we have constant space for each edge, and the total complexity of the su?x tree is linear in the size of T (in words). Su?x trees are important because they are very easy to query. Given a pattern P , all occurrences of P in T can be found and reported in time O(|P | + output). This is done by simply walking 2

down the su?x tree, always taking the edge that corresponds to the next character in P . Note that edges below any node di?er in their ?rst character, so it is easy to select which edge we want. Then, we can compare the succesive characters in the pattern with the label. When we’ve explored the path R starting from the root such that labels on its edges give us P when concatenated, all the occurrences of P in T are in the subtree whose root is lowest node in R.

3

Constructing Su?x Trees

We now show how to construct su?x trees in linear time. There are many known algorithms for this, starting with classic solutions by Weiner and McCreight in the mid 70s. We will look at a recent construction algorithm found by K¨rkk¨inen and Sanders [1], which is surprisingly short and a a clean. This algorithm actually builds su?x arrays, but we show how to build su?x trees from the su?x array and LCP array.

3.1

Construction of The Su?x Tree from The LCP Array and The Su?x Array

Let’s start by building up intuition on how the su?x tree S, the LCP array L, and the su?x array A are related to each other. First, notice that to get A from S, we can simply do an in-order walk of A. Furthermore, notice that zeros in L correspond to visits of the root in this in-order walk. But what do the other numbers in L mean? They are the “letter depth” of the internal nodes of the su?x tree. (By “letter depth of a node” we mean its depth in an non-compacted su?x tree, i.e. the number of letters that the path to it form the root contains). Finally, once we have the internal nodes of the su?x tree, we can distribute the edge labels in one pass of in-order walk since we know the order of the su?xes. With the intuition in place, here is how we build the su?x tree from the LCP array and the su?x array. In a previous lecture, we saw how to build a Cartesian tree from an array in linear time (by adding nodes in sorted order one by one and walking from the bottom to the place where the new node should be attached). It remains to realize that S is the Cartesian tree of L! Adding labels is straightforward. See Figure 1.

3.2

Construction of the LCP and Su?x Arrays

Let us introduce some notations. T [i :] denotes the su?x of T starting at index i; <> denotes an array, (a, b) denotes an ordered pair, and (a, b, c) denotes an ordered triple. We will use ? sign to = denote di?erent representations of the same data. How to transform one representation to another will be always obvious. The algorithm below will construct the LCP and Su?x arrays in O(|T |+ sort(Σ)) time. We explain each step of the algorithm. 1. Sort Σ. In the ?rst iteration use any sorting algorithm, leading to the O(sort(Σ)) term. In the following iterations use radix sort to sort in linear time (see below). 2. Replace each letter in the text with its rank among the letters in the text. Note that the rank of the letter depends on the text. For example, if the text contains only one letter, no matter what

3

6 5 3 1 0 4 2

\$ 0 a\$ 1 ana\$ 3 anana\$ 0 banana\$ 0 na\$ 2 nana\$
Correspond to the letter depth of internal nodes in the in-order walk of the suffix tree Correspond to visits to the voot in the in-order walk of the suffix tree

Suffix Tree built from LCP array using Cartesian Trees

\$

000
a b

na

\$
1
\$

2

banana\$
\$ na

na\$

na\$
na\$

nana\$

a\$
\$

3

ana\$
Suffix Array Sorted suffixes of T LCP Array Letter depth

anana\$
We get the tree from LCP Array and the labels on the edges from Suffix Array

Figure 1: Suffix Array and LCP Array examples and their relation to Suffix Tree

letter it is, it will be replaced by 1. Note that this is safe to do, because it does not change any relations we are interested in. 3. Divide the text T into 3 parts and consider triples of letters to be one letter, i.e. change the alphabet. More formally, form T0 , T1 , and T3 as follows: T0 = < (T [3i ], T [3i + 1], T [3i + 2]) for for for i = 0, 1, 2, . . . > i = 0, 1, 2, . . . > i = 0, 1, 2, . . . >

T1 = < (T [3i + 1], T [3i + 2], T [3i + 3]) T2 = < (T [3i + 2], T [3i + 3], T [3i + 4])

Note that Ti ’s are just texts with n/3 letters of the new Σ3 alphabet. 4. Recurse on < T0 , T1 >. Since our alphabet is just of cubic size, radix sorting the new alphabet will take linear time in the next recursion. When this recursive call returns, we have all the su?xes of T0 and T1 sorted in a su?x array. Then all we need is to sort the su?xes of T2 , and merge them with the old su?xes, because Suffixes(T ) ? Suffixes(T0 ) ∪ Suffixes(T1 ) ∪ Suffixes(T2 ) = If we do this sorting and merging in linear time, we get a recursion formula T (n) = T (2/3n)+ O(n), which gives linear time as desired. 5. Sort su?xes of T2 using radix sort. This is straight forward to do once we note that T2 [i :] ? T [3i + 2 :] ? (T [3i + 2], T [3i + 3 :]) ? (T [3i + 2], T1 [i + 1 :]). = = = 4

Thus, the radix sort is just on two coordinates, and the second one is already sorted. After sorting, we can create an LCP array for T2 . To do this, simply check if each of T [3i + 2] is equal to the ?rst letter of the preceeding su?x. If they are distinct, the LCP between them is 0. Otherwise, we just lookup the LCP of corresponding T1 [i + 1 :]’s and add 1 to it. 6. Merge the sorted su?xes of T0 , T1 , and T2 . We use standard linear merging. The only problem is ?nding a way to compare su?xes in constant time. Remember that su?xes of T0 and T1 are already sorted together, so comparing a su?x from T0 and a su?x from T1 takes constant time. To compare against a su?x from T2 , we will again decompose it to get a su?x from either T0 or T1 . There are two cases: ? Comparing T0 against T2 : T0 [i :] ? T [3i :] = (T [3i], T [3i + 1 :]) (T [3i], T1 [i :]) vs vs vs vs T2 [j :] T [3j + 2 :] (T [3j + 2], T [3j + 3 :]) (T [3j + 2], T0 [j + 1 :])

? = ? =

So we just compare the ?rst letter and then, if needed, compare already sorted su?xes of T0 and T1 . ? Comparing T1 against T2 : ? T [3i + 1 :] = (T [3i + 1], T [3i + 2], T [3i + 3 :]) (T [3i + 1], T [3i + 2], T0 [i + 1 :]) T1 [i :] vs vs vs vs T2 [j :] T [3j + 2 :] (T [3j + 2], T [3j + 3], T [3j + 4 :]) (T [3j + 2], T [3j + 3], T1 [j + 1 :])

? = ? =

So we just compare the ?rst two letters and then, if needed, compare already sorted su?xes of T0 and T1 . Finally, we have to take care of the LCPs. This can be done exactly as above. Compare the ?rst two characters of neighboring su?xes. If either the ?rst pair or the second pair are di?erent, the LCP is 0 or 1. If both are equal, then get the already computed LCP of the su?xes starting from third letter and add 2 to it. To summarize at a high level, given text, we conpute its su?x and LCP arrays using the algorithm from above, then we construct the su?x tree using a Cartesian tree on the LCP array. Finally, we answer queries by walking down the su?x tree and returning all the su?xes located in the subtree rooted where the search ended.

4

Applications

There are quite a number of applications and variations of su?x trees: ? if we are intersted in counting the number of occurrences of P , we can augment su?x trees by storing subtree-sizes at every node 5

? if we want to know the longest repeated substring in the text, we just need to ?nd the deepest (in the “letter depth” sense) internal node in the su?x tree. ? multiple documents can be easily combined by introducing indexed dollar sings between texts: T1 \$1 T2 \$2 . . . ? if we want the longest common substring between two documents, we combine them as above, and ?nd the deepest node with both \$1 and \$2 in its subtree. Document Retrieval. In the document retrieval problem (see [2]), we have a collection of documents and a pattern P and we want to ?nd all distinct documents that contain P . We want a time bound of O(|P | + k), where k is the number of documents to report. This cannot be achieved through a simple search, because we may be reporting many matches inside the same document. Nonetheless, we can still solve the problem using su?x trees, in combination with range minimum queries (RMQ). As above, we concatenate documents separating them with indexed dollar signs, and build a common su?x tree S. Consider a su?x array A of the concatenated texts (we can get it by an in-order walk of the su?x tree). When we search for P in S, we get a subtree, which corresponds to an interval [i, j] in A. Notice that the documents that contain P are exactly the documents such that [i, j] contains at least one dollar sign with its index. Let each dollar sign store a pointer (an index into the su?x array) to the previous dollar sign of its type. Our goal now is to ?nd the dollar signs in [i, j] with a pointer before i; these are the ?rst occurences of each type, so they do not contain repeats. To accomplish this, we ?nd the position k of the minimum element in [i, j]. If the element points to something ≥ i, we are done. Otherwise, we output that document, and recurse on [i, k ? 1] and [k + 1, j]. Longest Palindrome. Other interesting applications can be found if we combine su?x trees with lowest common ancestor (LCA) queries. Notice that the LCA of two leaves is the longest pre?x match of the two su?xes. Using this, we can ?nd the longest palindrome in the string in O(|T |) time. The longest common pre?x of T [i :] and reverse(T )[?i :] gives the longest palindrome centered at i. This can be computed in constant time using an LCA query, so we can ?nd the longest palindrome overall in linear time.

References
[1] Juha K¨rkk¨inen, Peter Sanders, Simple linear work su?x array construction, In Proc. 30th Ina a ternational Colloquium on Automata, Languages and Programming (ICALP’03). LNCS 2719, Springer, 2003, pp. 943-955 [2] S. Muthukrishnan: E?cient algorithms for document retrieval problems. SODA 2002: 657-666

6

6.897 Advanced Data Structures Spring.pdf
6.897 Advanced Data Structures Spring_专业资料。In the last lecture we discussed the unified property and the unified structure. We started to prove that...
lec15Advanced Data Structures_IT/计算机_专业资料。计算机算法 6.897: Advanced Data Structures Spring 2005 Lecture 15 March 31, 2005 Prof. Erik Demaine...
Advanced Data Structures Spring Semester, 19934.pdf
Advanced Data Structures Spring Semester, 19934_专业资料。The Stratified--Tree...6页 免费 6.897 Advanced Data St... 暂无评价 6页 免费 ...
6.851 Advanced Data Structures Spring.pdf
6.851 Advanced Data Structures Spring_专业资料。Although sorting isn’t a ...6.897 Advanced Data St... 暂无评价 6页 免费 Advanced Data Structur......
2 A Lower Bound on Dynamic Connectivity 2.1 The Cel....pdf
6.897: Advanced Data Structures Spring 2
Link-cut_Trees - 6.851: Advanced Data Structures Spring 2012 Lecture 19 May 1, 2012 Prof. Eri...
COMP282 Advanced Data Structures_图文.pdf
COMP282 Advanced Data Structures Lecture 04 Graphs Implementation and Traversal...Designing Data Structu... 12页 免费 6.897 Advanced Data St... 暂无...
lec02.succinct data structures.pdf
Advanced Data Structures Succinct Data Structures Arbitrary Ordered Trees ?? ...6 7 8 9 10 11 12 13 14 15 16 17 Rank/Select on a bit vector ...
6.897 Advanced Data Structures Spring.pdf
6.897 Advanced Data Structures Spring_专业
6.897 Advanced Data Structures Spring.pdf
6.897 Advanced Data Structures Spring_专业
6.851 Advanced Data Structures Spring.pdf
6.851 Advanced Data Structures Spring_专业资料。Last lecture we discussed the...暂无评价 9页 免费 6.897 Advanced Data St... 暂无评价 6页 免费 ...
6.851 Advanced Data Structures Spring.pdf
6.851 Advanced Data Structures Spring_专业资料。In the last lecture we ...6.897 Advanced Data St... 暂无评价 6页 免费 2018 Baidu |由 百度云...

(Advanced Data Structures) B-树(B-Trees) 二项式...6. 矩阵本征值计算的 QR 算法 195961 年,伦敦...8. 快速 Fourier 变换 1965 年,IBM 的 T. J. ...

Spring.pdf
University of Maryland Center for Advanced Computer...Data Structures 3 JOURNAL ARTICLES Under Review or...(Digital Forensics, Spring 2004); Meenal Jadhav ...
TEACHING EXPERIENCE.pdf
(Advanced Placement C++, BASIC), with 4300+ ...C++ programming and introductory data structures. Stonehill...(Spring 2002; Prof. Lenhart Schubert), Design ...
6.851 Advanced Data Structures Spring.pdf
6.851 Advanced Data Structures Spring_专业资料。In the last lecture we ...6.897 Advanced Data St... 暂无评价 8页 免费 spring6种配置datasourc......

IELTS 述 至少 6.5 分 课程信息 Major Courses ...Advanced Techniques Computer Ethics Data Structures ...8 学校类型:公立 - 基础类 所在地:加拿大 魁北克...

(Advanced Data Structures and Algorithms
metaphor.doc
spring finger The fingers of a bit The guide ...S. Oxford Advanced Learner’s English-Chinese ...Linguistic and Literary Structures of the Chinese ...