Many people are solving algorithm problems to prepare for coding tests. In this course, we will explore the DNA password problem and explain step by step how to solve it.
Problem Description
The DNA password consists of a specific pattern of combinations. The problem is to find how many substrings from the given DNA string can become passwords. Here are the details of the problem.
Problem Definition
From the given DNA string, count the number of distinct substrings that contain only the characters “””ACGT””” and are of length K or more.
Input
- The first line contains the DNA string. (1 ≤ string length ≤ 1000)
- The second line contains the value of K. (1 ≤ K ≤ 100)
Output
Output the total number of distinct substrings.
Problem Solving Process
To solve this problem efficiently, we will follow these steps:
Step 1: Generate Substrings
Generate all substrings of length K or more from the given DNA string. To do this, we will iterate over the starting indexes of the string and extract substrings of length K or more from each starting index.
Step 2: Store Unique Substrings
Store the generated substrings in a set. Since a set does not allow duplicates, only distinct substrings will be stored. We can utilize the set
data type in this process.
Step 3: Output Result
Finally, we count the number of substrings stored in the set and output the result. Now let’s convert these processes into code.
Python Code
def count_unique_dna_substrings(dna: str, k: int) -> int:
unique_substrings = set() # Set to store substrings
n = len(dna)
# Extract substrings for all starting positions in the string
for start in range(n):
for end in range(start + k, n + 1): # Only consider lengths >= K
substring = dna[start:end]
unique_substrings.add(substring)
return len(unique_substrings) # Return the count of unique substrings
# Input
if __name__ == "__main__":
dna_string = input("Please enter the DNA string: ")
k_value = int(input("Please enter the length K: "))
# Calculate and print the number of unique substrings
result = count_unique_dna_substrings(dna_string, k_value)
print(f"The number of distinct substrings is: {result}")
Code Explanation
The function count_unique_dna_substrings
used in the above code takes two parameters: the dna
string and the k
value. This function first creates an empty set to prepare to store substrings. It then extracts substrings through a loop that will be explained in the next step.
Loop Explanation
The first loop iterates over the starting index (start
) of the DNA string. The second loop extracts substrings of length K or more starting from the start
index, thus determining the end
index. Each substring is added to the set, only storing distinct cases.
Example for Understanding
For example, if the input DNA string is ACGTACGT
and the value of K is 2, the possible substrings that can be generated are as follows:
- AC, ACG, ACGT, ACGTA, ACGTAC, ACGTACGT
- CG, CGT, CGTA, CGTAC, CGTACG
- GT, GTA, GTAC, GTACG
- TA, TAC, TACG
- AC, ACG, ACGT
In total, there will be 14 distinct substrings.
Performance Considerations
The above code checks all substring lengths for every starting index, resulting in a worst-case time complexity of O(n^3). Performance issues may occur when the string length reaches 1000. To enhance performance in extreme cases, we can consider applying the Sliding Window
technique.
Application of Sliding Window Technique
Using the Sliding Window technique allows us to easily calculate the number of substrings with length >= K in a single pass. It means storing information about the length of the string during one pass to facilitate exploration. This technique can reduce time complexity to O(n^2) or O(n).
Conclusion
In this course, we learned how to handle substrings of a string and resolve duplication issues using sets through the DNA password problem. Such problems frequently appear in coding tests, so it is recommended to be well-versed in basic string handling methods. In the next course, we will cover another useful algorithm problem.