|
3150 | 3150 | ## String Processing and Pattern Matching Algorithms
|
3151 | 3151 |
|
3152 | 3152 | <details>
|
3153 |
| -<summary>Trie: Multiple Pattern Matching</summary> |
| 3153 | +<summary>Trie: Multiple Exact Pattern Matching</summary> |
3154 | 3154 |
|
3155 |
| -- Multiple Patterns Matching: |
3156 |
| - - Where do billions of string patterns (reads) match a string Text (reference genome)? |
| 3155 | +- Multiple Exact Patterns Matching: |
| 3156 | + - Where do billions of string patterns (reads) match a string *Text* (reference genome)? |
3157 | 3157 | - Input: A set of strings Patterns and a string Text
|
3158 |
| - - Output: All positions in Text where a string from Patterns appears as a substring |
| 3158 | + - Output: All positions in *Text* where a string from *Patterns* appears as a substring |
3159 | 3159 | - Implementation, Time Complexity and Operations:
|
3160 | 3160 | - For a collection of strings *Patterns*, *Trie*(*Patterns*) is defined as follows:
|
3161 | 3161 | - The trie has a single root node with indegree 0
|
|
3227 | 3227 | </details>
|
3228 | 3228 |
|
3229 | 3229 | <details>
|
3230 |
| -<summary>Suffix Trie: Multiple Pattern Matching</summary> |
| 3230 | +<summary>Suffix Trie: Multiple Exact Pattern Matching</summary> |
3231 | 3231 |
|
3232 | 3232 | - It's denoted ***SuffixTrie(Text)***
|
3233 | 3233 | - It's the trie formed from all suffixes of *Text*
|
|
3290 | 3290 | </details>
|
3291 | 3291 |
|
3292 | 3292 | <details>
|
3293 |
| -<summary>Suffix Tree: Multiple Pattern Matching</summary> |
| 3293 | +<summary>Suffix Tree: Multiple Exact Pattern Matching</summary> |
3294 | 3294 |
|
3295 | 3295 | - It's a compression of suffix-trie
|
3296 | 3296 | - From the suffix-trie above, transform each branch to its word
|
|
3320 | 3320 | - Exact Pattern Matches:
|
3321 | 3321 | - Time Complexity: **O(|*Text*| + |*Patterns*|)**
|
3322 | 3322 | - 1st we need O(|*Text*|) to build the suffix tree
|
3323 |
| - - 2nd for each pattern *Pattern* in *Patterns* we need additional O(|*Pattern*|) to match this pattern against the Text |
| 3323 | + - 2nd for each pattern *Pattern* in *Patterns* we need additional O(|*Pattern*|) to match this pattern against the *Text* |
3324 | 3324 | - The total time for all the patterns is: O(|*Patterns*|),
|
3325 | 3325 | - The overall running time: O(|*Text*|+|*Patterns*|)
|
3326 | 3326 | - Space Complexity:
|
|
3352 | 3352 | - It's also called **block-sorting compression**
|
3353 | 3353 | - It rearranges a character string into runs of similar characters
|
3354 | 3354 | - It's useful for compression
|
3355 |
| - - Text <---> BWT-Text = BWT(Text) <---> Compression(BWT-Text) |
| 3355 | + - *Text* <---> BWT-Text = BWT(*Text*) <---> Compression(*BWT-Text*) |
3356 | 3356 | - BWT:
|
3357 |
| - - From Text to BWT: Text ---> BWT-Text ---> Compressed BWT-Text |
| 3357 | + - From *Text* to BWT: *Text* ---> *BWT-Text* ---> Compressed *BWT-Text* |
3358 | 3358 | - Forming All Cyclic Rotations of a text ---> Sorting Cyclic Rotations ---> String last column
|
3359 | 3359 | - E.g. `AGACATA$`:
|
3360 | 3360 | - v
|
|
3452 | 3452 | <details>
|
3453 | 3453 | <summary>Burrows-Wheeler Transform: Pattern Matching</summary>
|
3454 | 3454 |
|
| 3455 | +- On its own, BW Matching doesn't return the positions in *Text* where *Pattern* matches |
3455 | 3456 | - BW Matching:
|
3456 | 3457 | - E.g. BWT: `ATG$C3A` (original text: `AGACATA$`):
|
3457 | 3458 | - Let's search for `ACA`
|
|
3539 | 3540 | </details>
|
3540 | 3541 |
|
3541 | 3542 | <details>
|
3542 |
| -<summary>Suffix Arrays</summary> |
| 3543 | +<summary>Suffix Arrays: Pattern Matching</summary> |
3543 | 3544 |
|
3544 |
| -- Implementation, Time Complexity and Operations: |
| 3545 | +- **Suffix Array**: It holds the starting position in *Text* of the suffix that begins each row of the sorted BW matrix |
| 3546 | + - E.g. `AGACATA$`: |
| 3547 | + Suffix Array: BW Matrix: |
| 3548 | + 7 $AGACATA |
| 3549 | + 6 A$AGACAT |
| 3550 | + 2 ACATA$AG |
| 3551 | + 0 AGACATA$ |
| 3552 | + 4 ATA$AGAC |
| 3553 | + 3 CATA$AGA |
| 3554 | + 1 GACATA$A |
| 3555 | + 5 TA$AGACA |
| 3556 | + - Implementation: DFS traversal of the corresponding Suffix Tree |
| 3557 | + - Space and Time Complexities: |
| 3558 | + - DFS traversal of a suffix tree: O(|*Text*|) time and ~20 x |*Text*| space (see Suffix Tree section, above) |
| 3559 | + - Manber-Myers algorithm (1990): O(|*Text*| log |*Text*|) worst-case time and ~4 x |*Text*| space |
| 3560 | + - Memory footprint is still large (for human genome, particularly)! |
| 3561 | +- Exact Pattern Matching: |
| 3562 | + - E.g. BWT: `ATG$C3A` (original text: `AGACATA$`) and Pattern: `ACA` |
| 3563 | + Suffix Array: |
| 3564 | + 7 top $1-----A1 |
| 3565 | + 6 \ A1-----T1 |
| 3566 | + 2 --> A2-----G1 |
| 3567 | + 0 / A3-----$1 |
| 3568 | + 4 bott A4-----C1 |
| 3569 | + 3 C1-----A2 |
| 3570 | + 1 G1-----A3 |
| 3571 | + 5 T1-----A4 |
| 3572 | + - To reduce the memory footprint: |
| 3573 | + - 1st. Keep only the suffix array values that are multiples of some integer *K* |
| 3574 | + - 2nd. Use First-Last Property to find the position of the pattern |
| 3575 | + - E.g. BWT: `ATG$C3A` (original text: `AGACATA$`), Pattern: `ACA`, and `K = 5` |
| 3576 | + - Suffix Array: |
| 3577 | + _ top $1-----A1 4. Not in Suffix Array but Pos($1) = Pos(A1) + 1 |
| 3578 | + _ \ A1-----T1 5. Not in Suffix Array but Pos(A1) = Pos(T1) + 1 |
| 3579 | + _ --> A2-----G1 1. Not in Suffix Array but we know that Pos(A2) = Pos(G1) + 1 |
| 3580 | + 0 / A3-----$1 3. Not in Suffix Array but Pos(A3) = Pos($1) + 1 |
| 3581 | + _ bott A4-----C1 |
| 3582 | + _ C1-----A2 |
| 3583 | + _ G1-----A3 2. Not in Suffix Array but Pos(G1) = Pos(A3) + 1 |
| 3584 | + 5 T1-----A4 6. Pos(T1) = 5 |
| 3585 | + Pos(T1) = 5 |
| 3586 | + Pos(A1) = Pos(T1) + 1 = 6 |
| 3587 | + Pos($1) = Pos(A1) + 1 = 7 |
| 3588 | + Pos(A3) = Pos($1) + 1 = 8 = 0 |
| 3589 | + Pos(G1) = Pos(A3) + 1 = 1 |
| 3590 | + Pos(A2) = Pos(G1) + 1 = 2 |
| 3591 | + - Space Complexity: ~4/K x |*Text*| space with Manber-Myers algorithm |
| 3592 | + - Matching Pattern running Time: |
| 3593 | + - It's multiplied by a factor of at most *K* |
| 3594 | + - But since *K* is a constant, the asymptotic running time is unchanged |
| 3595 | +- Approximate Pattern Matching: |
| 3596 | + - Input: A string *Pattern*, a string *Text*, and an integer *d* |
| 3597 | + - Output: All positions in *Text* where the string *Pattern* appears as a substring with at most *d* mismatches |
| 3598 | +- Multiple Approximate Pattern Matching: |
| 3599 | + - Input: A set of strings *Patterns*, a string *Text*, and an integer *d* |
| 3600 | + - Output: All positions in *Text* where a string from *Patterns* appears as a substring with at most *d* mismatches |
| 3601 | + - E.g. BWT: `ATG$C3A` (original text: `AGACATA$`) and *Pattern*: `ACA` and *d*: 1 |
| 3602 | + Mismatch # Mismatch # Mismatch # Suffix Array |
| 3603 | + $1------A1 $1------A1 $1------A1 7 |
| 3604 | + t ->A1------T1 1 A1------T1 A1------T1 6_ |
| 3605 | + A2------G1 1 A2------G1 t->A2------G1 0 2 \ |
| 3606 | + A3------$1 1 A3------$1 A3------$1 1 0 | Approx. Match |
| 3607 | + b ->A4------C1 0 A4------C1 b ->A4------C1 1 4_ / at {0, 2, 4} |
| 3608 | + C1------A2 t ->C1------A2 0 C1------A2 3 |
| 3609 | + G1------A3 G1------A3 1 G1------A3 1 |
| 3610 | + T1------A4 b ->T1------A4 1 T1------A4 5 |
3545 | 3611 | - Related Problems:
|
| 3612 | + - [Construct the Suffix Array of a String](https://github.com/hamidgasmi/training.computerscience.algorithms-datastructures/issues/160) |
3546 | 3613 | - For more details:
|
3547 | 3614 | - UC San Diego Course:[Suffix Arrays](https://github.com/hamidgasmi/training.computerscience.algorithms-datastructures/blob/master/5-string-processing-and-pattern-matching-algorithms/2-burrows-wheeler-suffix-arrays/02_bwt_suffix_arrays.pdf)
|
3548 | 3615 |
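
As a cross-check of the suffix array table this change adds for `AGACATA$`, here is a minimal Python sketch. It sorts the suffixes naively, which is neither the suffix-tree DFS nor the Manber-Myers algorithm named in the diff, and it would be far too slow for a genome-sized text. The function names `suffix_array` and `bwt_from_suffix_array` are illustrative assumptions, not identifiers from the repository.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive approach: sort positions by the suffix starting there.
    Assumes text ends with '$', which sorts before the letters A, C, G, T.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text, sa):
    """Return the last column of the sorted rotation matrix.

    The last symbol of the rotation starting at position i is text[i - 1];
    for i == 0 Python's negative indexing gives text[-1], the '$' symbol.
    """
    return "".join(text[i - 1] for i in sa)


text = "AGACATA$"
sa = suffix_array(text)
print(sa)                               # [7, 6, 2, 0, 4, 3, 1, 5]
print(bwt_from_suffix_array(text, sa))  # ATG$CAAA
```

The printed array matches the left column of the table in the diff, and the printed string is the expanded form of the run-length notation `ATG$C3A`.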
|
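
The memory-reduction idea in the diff (keep only suffix array values that are multiples of *K*, then use the First-Last Property to recover the rest) can be sketched as follows. This is a hedged illustration under stated assumptions: the helper names (`build_index`, `match_range`, `sample_suffix_array`, `locate`) are hypothetical, the count table is stored in full rather than as checkpoints, and the walk stops at the first sampled row it reaches, so it may take fewer steps than the hand-worked trace.

```python
def build_index(bwt):
    """Build first_occurrence[c] (first row of c in the first column) and
    counts[c][i] (number of occurrences of c in bwt[:i])."""
    alphabet = sorted(set(bwt))
    first_occurrence, total = {}, 0
    for c in alphabet:
        first_occurrence[c] = total
        total += bwt.count(c)
    counts = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            counts[c][i + 1] = counts[c][i] + (1 if c == ch else 0)
    return first_occurrence, counts


def match_range(pattern, bwt, first_occurrence, counts):
    """Return (top, bottom) rows whose suffixes start with pattern, or None."""
    top, bottom = 0, len(bwt) - 1
    for symbol in reversed(pattern):
        if symbol not in first_occurrence:
            return None
        top = first_occurrence[symbol] + counts[symbol][top]
        bottom = first_occurrence[symbol] + counts[symbol][bottom + 1] - 1
        if top > bottom:
            return None
    return top, bottom


def sample_suffix_array(sa, k):
    """Keep only rows whose suffix array value is a multiple of k."""
    return {row: pos for row, pos in enumerate(sa) if pos % k == 0}


def locate(pattern, bwt, sampled_sa):
    """Return sorted text positions of pattern using a sampled suffix array."""
    first_occurrence, counts = build_index(bwt)
    rows = match_range(pattern, bwt, first_occurrence, counts)
    if rows is None:
        return []
    positions = []
    for row in range(rows[0], rows[1] + 1):
        steps = 0
        # Each Last-to-First step moves one position backwards in the text.
        while row not in sampled_sa:
            symbol = bwt[row]
            row = first_occurrence[symbol] + counts[symbol][row]
            steps += 1
        positions.append((sampled_sa[row] + steps) % len(bwt))
    return sorted(positions)


bwt = "ATG$CAAA"                                           # BWT of AGACATA$
sampled = sample_suffix_array([7, 6, 2, 0, 4, 3, 1, 5], 5)  # {3: 0, 7: 5}
print(locate("ACA", bwt, sampled))                          # [2]
```

Because position 0 is always a multiple of *K*, every walk reaches a sampled row within fewer than *K* steps, which is why the matching time grows by at most a factor of *K*.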
|
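
The approximate matching example in the diff (*Pattern* `ACA`, *d* = 1, matches at {0, 2, 4}) can be checked with a brute-force scan. Note that this sketch deliberately swaps in a plain sliding-window comparison instead of the BWT-based search traced in the table; it only confirms the expected output, and the name `approximate_matches` is an assumption made for illustration.

```python
def approximate_matches(pattern, text, d):
    """Return start positions where pattern occurs in text
    as a substring with at most d mismatches (brute force)."""
    positions = []
    for start in range(len(text) - len(pattern) + 1):
        window = text[start:start + len(pattern)]
        mismatches = sum(1 for a, b in zip(pattern, window) if a != b)
        if mismatches <= d:
            positions.append(start)
    return positions


# Text is given without the trailing '$' sentinel.
print(approximate_matches("ACA", "AGACATA", 1))  # [0, 2, 4]
```

This runs in O(|*Text*| x |*Pattern*|) time per pattern, so it is only useful as a reference against which a BWT-based implementation can be tested.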