C++ Coding Test Course, DNA Password

1. Problem Definition

Recently, a biologist found themselves in a situation where they needed to create a specific password using DNA sequences. This DNA sequence consists of four types of nucleotides: A, C, G, and T. The password is generated from a combination of A, C, G, and T selected from the given DNA sequence.

Problem Description

Write a program to find the substring of length K among all substrings of the given string S where the frequency of each nucleotide is minimized, and determine the number of valid passwords among them. If all frequencies are the same, select the smallest lexicographic string among those with the minimum frequency. The length of the substring must be less than or equal to the given string.

Input

  • The first line contains the length N of the DNA string S (1 ≤ N ≤ 100,000).
  • The second line contains the integer K (1 ≤ K ≤ N).
  • The third line contains the string S.

Output

Print the number of valid passwords and the smallest lexicographic password among them.

2. Algorithm Analysis

To solve this problem, we will use the Sliding Window technique. This method allows us to move through substrings of a specific length K while counting the frequency of each character (A, C, G, T) to solve the problem.

Procedure

  1. Select the first substring of length K and count the frequency of characters.
  2. Then, adjust the frequency as the sliding window moves, adding the new character and removing the old one.
  3. Check the frequency of nucleotides in each K-length substring and find the smallest frequency.
  4. After inspecting all substrings, print the count of valid passwords and the smallest lexicographic password.

3. C++ Coding Example

Below is a C++ code example to solve this problem:

        #include <iostream>
        #include <string>
        #include <unordered_map>
        #include <vector>
        #include <algorithm>

        using namespace std;

        int main() {
            int N, K;
            string S;
            cin >> N >> K;
            cin >> S;

            unordered_map freq;
            for (int i = 0; i < K; i++) {
                freq[S[i]]++;
            }

            // Initial result setup
            int min_freq = INT_MAX;
            vector passwords;

            // Sliding window to check all K-length substrings
            for (int i = 0; i <= N - K; i++) {
                if (i > 0) {
                    freq[S[i - 1]]--;
                    freq[S[i + K - 1]]++;
                }

                // Finding the smallest frequency
                int cur_min = INT_MAX;
                for (const auto &entry : freq) {
                    cur_min = min(cur_min, entry.second);
                }

                // Adding valid password
                if (cur_min <= (K / 4)) { // Frequency of all nucleotides is less than or equal to K/4
                    passwords.push_back(S.substr(i, K));
                }
            }

            // Finding the smallest lexicographic password
            sort(passwords.begin(), passwords.end());
            string smallest_password = (passwords.empty() ? "" : passwords[0]);
            
            // Output result
            cout << passwords.size() << " " << smallest_password << endl;

            return 0;
        }
    

4. Code Explanation

The above code works as follows:

  • First, the DNA string and its length, along with the length K of the substring, are input.
  • An unordered_map is used to count the frequency of the first K characters.
  • The Sliding Window technique is used to move through each K-length substring, modifying the frequency.
  • The minimum frequency is calculated for each substring, adding to the valid password list only when conditions are met.
  • Finally, the valid password list is sorted to find the smallest lexicographic password.

5. Test Cases

Now, let’s look at some test cases to solve this problem:

Test Case 1

        Input:
        8 4
        ACGTTGCA

        Output:
        1 ACGT
    

Test Case 2

        Input:
        12 5
        ACGTTACGATC

        Output:
        3 ACCTA ACTAC TCGAT
    

Test Case 3

        Input:
        6 3
        ACGTAC

        Output:
        2 ACG ACT
    

6. Conclusion

In this tutorial, we learned an algorithm utilizing the sliding window technique through the DNA password problem. This technique can be very useful in solving string-related problems and can be applied to various modified problems. It serves as a solid foundation that can be beneficial in coding tests.