Commit 1f0bd3a

author
Hamid Gasmi
committed
Suffix Array + Exact and Approximate Patterns Match
1 parent 4e269e4 commit 1f0bd3a

1 file changed

+78
-11
lines changed

README.md

+78-11
@@ -3150,12 +3150,12 @@
31503150
## String Processing and Pattern Matching Algorithms
31513151

31523152
<details>
3153-
<summary>Trie: Multiple Pattern Matching</summary>
3153+
<summary>Trie: Multiple Exact Pattern Matching</summary>
31543154

3155-
- Multiple Patterns Matching:
3156-
- Where do billions of string patterns (reads) match a string Text (reference genome)?
3155+
- Multiple Exact Pattern Matching:
3156+
- Where do billions of string patterns (reads) match a string *Text* (reference genome)?
31573157
- Input: A set of strings Patterns and a string Text
3158-
- Output: All positions in Text where a string from Patterns appears as a substring
3158+
- Output: All positions in *Text* where a string from *Patterns* appears as a substring
31593159
- Implementation, Time Complexity and Operations:
31603160
- For a collection of strings *Patterns*, *Trie*(*Patterns*) is defined as follows:
31613161
- The trie has a single root node with indegree 0
@@ -3227,7 +3227,7 @@
32273227
</details>
32283228

32293229
<details>
3230-
<summary>Suffix Trie: Multiple Pattern Matching</summary>
3230+
<summary>Suffix Trie: Multiple Exact Pattern Matching</summary>
32313231

32323232
- It's denoted ***SuffixTrie(Text)***
32333233
- It's the trie formed from all suffixes of *Text*
@@ -3290,7 +3290,7 @@
32903290
</details>
32913291

32923292
<details>
3293-
<summary>Suffix Tree: Multiple Pattern Matching</summary>
3293+
<summary>Suffix Tree: Multiple Exact Pattern Matching</summary>
32943294

32953295
- It's a compression of suffix-trie
32963296
- From the suffix-trie above, transform each branch into its word
@@ -3320,7 +3320,7 @@
33203320
- Exact Pattern Matches:
33213321
- Time Complexity: **O(|*Text*| + |*Patterns*|)**
33223322
- 1st we need O(|*Text*|) to build the suffix tree
3323-
- 2nd for each pattern *Pattern* in *Patterns* we need additional O(|*Pattern*|) to match this pattern against the Text
3323+
- 2nd for each pattern *Pattern* in *Patterns* we need additional O(|*Pattern*|) to match this pattern against the *Text*
33243324
- The total time for all the patterns is: O(|*Patterns*|),
33253325
- The overall running time: O(|*Text*|+|*Patterns*|)
33263326
- Space Complexity:
@@ -3352,9 +3352,9 @@
33523352
- It's also called **block-sorting compression**
33533353
- It rearranges a character string into runs of similar characters
33543354
- It's useful for compression
3355-
- Text <---> BWT-Text = BWT(Text) <---> Compression(BWT-Text)
3355+
- *Text* <---> BWT-Text = BWT(*Text*) <---> Compression(*BWT-Text*)
33563356
- BWT:
3357-
- From Text to BWT: Text ---> BWT-Text ---> Compressed BWT-Text
3357+
- From *Text* to BWT: *Text* ---> *BWT-Text* ---> Compressed *BWT-Text*
33583358
- Forming All Cyclic Rotations of a text ---> Sorting Cyclic Rotations ---> String last column
33593359
- E.g. `AGACATA$`:
33603360
- v
@@ -3452,6 +3452,7 @@
34523452
<details>
34533453
<summary>Burrows-Wheeler Transform: Pattern Matching</summary>
34543454

3455+
- On its own, BW matching reports whether (and how many times) *Pattern* occurs in *Text*; it doesn't return the positions in *Text* where *Pattern* matches
34553456
- BW Matching:
34563457
- E.g. BWT: `ATG$C3A` (run-length form of `ATG$CAAA`; original text: `AGACATA$`):
34573458
- Let's search for `ACA`
@@ -3539,10 +3540,76 @@
35393540
</details>
35403541

35413542
<details>
3542-
<summary>Suffix Arrays</summary>
3543+
<summary>Suffix Arrays: Pattern Matching</summary>
35433544

3544-
- Implementation, Time Complexity and Operations:
3545+
- **Suffix Array**: It holds the starting position in *Text* of each suffix, listed in the sorted order of the suffixes (one entry per row of the BW matrix)
3546+
- E.g. `AGACATA$`:
3547+
- Suffix Array: BW Matrix:
3548+
7 $AGACATA
3549+
6 A$AGACAT
3550+
2 ACATA$AG
3551+
0 AGACATA$
3552+
4 ATA$AGAC
3553+
3 CATA$AGA
3554+
1 GACATA$A
3555+
5 TA$AGACA
3556+
- Implementation: DFS traversal of the corresponding Suffix Tree
3557+
- Space and Time Complexities:
3558+
- DFS traversal of a suffix tree: O(|*Text*|) time and ~20 x |*Text*| space (see Suffix Tree section, above)
3559+
- Manber-Myers algorithm (1990): O(|*Text*| x log |*Text*|) worst-case time and ~4 x |*Text*| space
3560+
- Memory footprint is still large (for human genome, particularly)!
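The definition above can be checked on the running example with a minimal Python sketch. It assumes a naive construction that simply sorts the suffixes, which is neither the suffix-tree DFS nor the Manber-Myers algorithm mentioned above; the function names are hypothetical and chosen only for illustration.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the starting positions of the suffixes of `text` in sorted order.

    Naive sketch: comparing suffix slices makes this roughly O(n^2 log n)
    in the worst case, so it only suits small inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text: str, suffix_array: list[int]) -> str:
    """Read the BWT off the suffix array: the symbol preceding each suffix."""
    # For the suffix starting at 0, text[-1] is the sentinel "$".
    return "".join(text[i - 1] for i in suffix_array)


text = "AGACATA$"
suffix_array = build_suffix_array(text)
print(suffix_array)                               # [7, 6, 2, 0, 4, 3, 1, 5]
print(bwt_from_suffix_array(text, suffix_array))  # ATG$CAAA
```

The sketch relies on `$` sorting before the letters, which holds for ASCII.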
3561+
- Exact Pattern Matching:
3562+
- E.g. BWT: `ATG$C3A` (original text: `AGACATA$`) and Pattern: `ACA`
3563+
- Suffix Array:
3564+
7 top $1-----A1
3565+
6 \ A1-----T1
3566+
2 --> A2-----G1
3567+
0 / A3-----$1
3568+
4 bott A4-----C1
3569+
3 C1-----A2
3570+
1 G1-----A3
3571+
5 T1-----A4
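The top/bottom search pictured above can be sketched in Python as follows. This is only an illustration under stated assumptions: the BWT is given uncompressed (`ATG$CAAA`), the full suffix array is available, and `build_tables` and `bw_match_positions` are hypothetical helper names, not part of the course material.

```python
def build_tables(bwt: str):
    """Return (first_occurrence, count) for the First-Last mapping.

    first_occurrence[c]: first row whose first-column symbol is c.
    count[c][i]: number of occurrences of c in bwt[:i].
    """
    alphabet = sorted(set(bwt))
    first_occurrence, total = {}, 0
    for c in alphabet:
        first_occurrence[c] = total
        total += bwt.count(c)
    count = {c: [0] for c in alphabet}
    for ch in bwt:
        for c in alphabet:
            count[c].append(count[c][-1] + (c == ch))
    return first_occurrence, count


def bw_match_positions(bwt: str, suffix_array: list[int], pattern: str) -> list[int]:
    """Return the sorted positions in the text where `pattern` occurs exactly."""
    first_occurrence, count = build_tables(bwt)
    top, bottom = 0, len(bwt) - 1
    # Process the pattern from its last symbol to its first one.
    for symbol in reversed(pattern):
        if symbol not in first_occurrence:
            return []
        top = first_occurrence[symbol] + count[symbol][top]
        bottom = first_occurrence[symbol] + count[symbol][bottom + 1] - 1
        if top > bottom:
            return []
    # Rows top..bottom all start with the pattern; the suffix array gives positions.
    return sorted(suffix_array[top:bottom + 1])


print(bw_match_positions("ATG$CAAA", [7, 6, 2, 0, 4, 3, 1, 5], "ACA"))  # [2]
```

The `count` table here stores one list per symbol, which is simple but memory hungry; it is meant to mirror the diagram, not to be space efficient.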
3572+
- To reduce the memory footprint:
3573+
- 1st. We could keep in the suffix array values that are multiples of some integer *K*
3574+
- 2nd. Use First-Last Property to find the position of the pattern
3575+
- E.g. BWT: `ATG$C3A` (original text: `AGACATA$`), Pattern: `ACA`, and `K = 5`
3576+
- Suffix Array:
3577+
_ top $1-----A1 4. Not in Suffix Array but Pos($1) = Pos(A1) + 1
3578+
_ \ A1-----T1 5. Not in Suffix Array but Pos(A1) = Pos(T1) + 1
3579+
_ --> A2-----G1 1. Not in Suffix Array but we know that Pos(A2) = Pos(G1) + 1
3580+
0 / A3-----$1 3. Pos(A3) = 0 is stored (0 is a multiple of K), so the walk could already stop here; equivalently Pos(A3) = Pos($1) + 1
3581+
_ bott A4-----C1
3582+
_ C1-----A2
3583+
_ G1-----A3 2. Not in Suffix Array but Pos(G1) = Pos(A3) + 1
3584+
5 T1-----A4 6. Pos(T1) = 5
3585+
Pos(T1) = 5
3586+
Pos(A1) = Pos(T1) + 1 = 6
3587+
Pos($1) = Pos(A1) + 1 = 7
3588+
Pos(A3) = Pos($1) + 1 = 8 mod 8 = 0
3589+
Pos(G1) = Pos(A3) + 1 = 1
3590+
Pos(A2) = Pos(G1) + 1 = 2
3591+
- Space Complexity: ~4/K x |*Text*| space with Manber-Myers algorithm
3592+
- Matching Pattern running Time:
3593+
- It's multiplied by up to *K*, since locating each match may take up to *K* - 1 extra First-Last steps
3594+
- But since *K* is a constant, the asymptotic running time is unchanged
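The position lookup with a sampled suffix array can be sketched as follows. This is a hedged illustration, assuming the uncompressed BWT `ATG$CAAA` and a dictionary keyed by row for the kept entries; `build_tables`, `sample_suffix_array` and `locate` are hypothetical names introduced for the example.

```python
def build_tables(bwt: str):
    """Return (first_occurrence, count) for the First-Last mapping."""
    alphabet = sorted(set(bwt))
    first_occurrence, total = {}, 0
    for c in alphabet:
        first_occurrence[c] = total
        total += bwt.count(c)
    # count[c][i]: number of occurrences of c in bwt[:i]
    count = {c: [0] for c in alphabet}
    for ch in bwt:
        for c in alphabet:
            count[c].append(count[c][-1] + (c == ch))
    return first_occurrence, count


def sample_suffix_array(suffix_array: list[int], k: int) -> dict[int, int]:
    """Keep only the entries whose value is a multiple of k, keyed by row."""
    return {row: pos for row, pos in enumerate(suffix_array) if pos % k == 0}


def locate(row: int, bwt: str, sampled: dict[int, int], first_occurrence, count) -> int:
    """Recover the text position of `row` by walking back to a sampled row."""
    steps = 0
    # Position 0 is always sampled (0 is a multiple of any k), so the walk ends.
    while row not in sampled:
        symbol = bwt[row]
        # First-Last property: jump to the row that starts with this same symbol.
        row = first_occurrence[symbol] + count[symbol][row]
        steps += 1
    return sampled[row] + steps


bwt = "ATG$CAAA"
first_occurrence, count = build_tables(bwt)
sampled = sample_suffix_array([7, 6, 2, 0, 4, 3, 1, 5], k=5)
print(sampled)                                          # {3: 0, 7: 5}
print(locate(2, bwt, sampled, first_occurrence, count)) # 2
```

Each step moves one position to the left in the text, so a stored multiple of *K* is reached after at most *K* - 1 steps.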
3595+
- Approximate Pattern Matching:
3596+
- Input: A string *Pattern*, a string *Text*, and an integer *d*
3597+
- Output: All positions in *Text* where the string *Pattern* appears as a substring with at most *d* mismatches
3598+
- Multiple Approximate Pattern Matching:
3599+
- Input: A set of strings *Patterns*, a string *Text*, and an integer *d*
3600+
- Output: All positions in *Text* where a string from *Patterns* appears as a substring with at most *d* mismatches
3601+
- E.g. BWT: `ATG$C3A` (original text: `AGACATA$`) and *Pattern*: `ACA` and *d*: 1
3602+
- Mismatch # Mismatch # Mismatch # Suffix Array
3603+
$1------A1 $1------A1 $1------A1 7
3604+
t ->A1------T1 1 A1------T1 A1------T1 6_
3605+
A2------G1 1 A2------G1 t->A2------G1 0 2 \
3606+
A3------$1 1 A3------$1 A3------$1 1 0 | Approx. Match
3607+
b ->A4------C1 0 A4------C1 b ->A4------C1 1 4_ / at {0, 2, 4}
3608+
C1------A2 t ->C1------A2 0 C1------A2 3
3609+
G1------A3 G1------A3 1 G1------A3 1
3610+
T1------A4 b ->T1------A4 1 T1------A4 5
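The mismatch-tolerant search pictured above can be sketched in Python as a backward search that branches on every symbol and carries a mismatch count per range. This is one possible sketch, not necessarily the course's implementation; it assumes the uncompressed BWT `ATG$CAAA`, a `$` sentinel, and the hypothetical helper names `build_tables` and `approximate_match_positions`.

```python
def build_tables(bwt: str):
    """Return (first_occurrence, count) for the First-Last mapping."""
    alphabet = sorted(set(bwt))
    first_occurrence, total = {}, 0
    for c in alphabet:
        first_occurrence[c] = total
        total += bwt.count(c)
    # count[c][i]: number of occurrences of c in bwt[:i]
    count = {c: [0] for c in alphabet}
    for ch in bwt:
        for c in alphabet:
            count[c].append(count[c][-1] + (c == ch))
    return first_occurrence, count


def approximate_match_positions(bwt: str, suffix_array: list[int],
                                pattern: str, d: int) -> list[int]:
    """Return sorted positions where `pattern` occurs with at most d mismatches."""
    first_occurrence, count = build_tables(bwt)
    positions = set()
    # Each entry: (index of the next pattern symbol, top, bottom, mismatches so far).
    stack = [(len(pattern) - 1, 0, len(bwt) - 1, 0)]
    while stack:
        i, top, bottom, mismatches = stack.pop()
        if i < 0:  # whole pattern consumed: every row in the range is a match
            positions.update(suffix_array[top:bottom + 1])
            continue
        for symbol in first_occurrence:
            if symbol == "$":  # a match must not run across the end of the text
                continue
            cost = mismatches + (symbol != pattern[i])
            if cost > d:
                continue
            new_top = first_occurrence[symbol] + count[symbol][top]
            new_bottom = first_occurrence[symbol] + count[symbol][bottom + 1] - 1
            if new_top <= new_bottom:
                stack.append((i - 1, new_top, new_bottom, cost))
    return sorted(positions)


print(approximate_match_positions("ATG$CAAA", [7, 6, 2, 0, 4, 3, 1, 5], "ACA", 1))
# [0, 2, 4]
```

With `d = 0` the search reduces to exact matching and returns `[2]` for the same input.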
35453611
- Related Problems:
3612+
- [Construct the Suffix Array of a String](https://github.com/hamidgasmi/training.computerscience.algorithms-datastructures/issues/160)
35463613
- For more details:
35473614
- UC San Diego Course: [Suffix Arrays](https://github.com/hamidgasmi/training.computerscience.algorithms-datastructures/blob/master/5-string-processing-and-pattern-matching-algorithms/2-burrows-wheeler-suffix-arrays/02_bwt_suffix_arrays.pdf)
35483615
