
Word frequency distribution under the restriction avoidance

Wednesday 12th December 2001 - 14:30 to 15:30
INI Seminar Room 1
Session Title: Immunology, Ecology and Epidemiology (main themes)
Restriction enzymes of bacteria cut unmodified recognition sites in DNA, thereby protecting bacteria from infection by phage. This imposes a strong selective pressure on phage genome sequences. For example, the genome of Bacillus phage phi1 contains far fewer recognition sequences of Bacillus restriction enzymes than expected at random. To study the evolution of phage genomes under selection by bacterial restriction enzymes, a simple model for the evolution of binary or nucleotide sequences is proposed. The model shows that not only the frequency of the restricted sequence itself, but also the frequencies of other subsequences (words), deviate strongly from random expectation. The pattern of the word frequency distribution depends sensitively on the type of restricted sequence. If the restricted sequence is of the singlet repeat type (e.g. 000), the frequency of a 3-letter word is largely explained by its Hamming distance from the restricted sequence: the greater the Hamming distance, the more abundant the word. However, a quite unexpected word frequency distribution arises if the restricted sequence is of another type (e.g. 001 or 010). The abundance of a word is then strongly influenced by the vulnerability to restriction of the partially overlapping adjacent words. For example, when 001 is the restriction sequence, both 000 and 100 become quite rare, with frequencies approaching 0 as the genome size increases, whereas 101, which is only one step away from the restricted word, becomes quite common. A new theoretical framework is proposed to explain this peculiar distribution of words under restriction avoidance; it enables us to count the exact frequency of any word in the 'feasible' sequences. A sequence is called feasible here if it contains no restriction site at any position (i.e. the feasible sequences are those that survive under strong selection).
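The word counts described above can be explored by brute force for short binary sequences. The sketch below is only an illustration, not the theoretical framework of the talk: it assumes that every feasible sequence of length n is weighted equally, and the function names (`feasible_sequences`, `word_counts`) are hypothetical. It enumerates all binary sequences that avoid the restricted word and tallies every 3-letter word found in them.

```python
from collections import Counter
from itertools import product


def feasible_sequences(n, restricted="001", alphabet="01"):
    """Yield every sequence of length n that contains no copy of `restricted`."""
    for letters in product(alphabet, repeat=n):
        seq = "".join(letters)
        if restricted not in seq:
            yield seq


def word_counts(n, restricted="001", k=3, alphabet="01"):
    """Count all k-letter words over all feasible sequences of length n.

    Assumption: feasible sequences are weighted uniformly, which may differ
    from the stationary distribution of the evolutionary model in the talk.
    """
    counts = Counter()
    for seq in feasible_sequences(n, restricted, alphabet):
        for i in range(n - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


if __name__ == "__main__":
    for n in (6, 10, 14):
        counts = word_counts(n)
        total = sum(counts.values())
        # Print the relative frequency of each 3-letter word at this length.
        freqs = {w: round(c / total, 3) for w, c in sorted(counts.items())}
        print(n, freqs)
```

Running it for increasing n lets one compare the relative frequencies of 000, 100 and 101 when 001 is restricted; passing `restricted="000"` gives the singlet repeat case for comparison.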
The fraction of feasible sequences among all random sequences decays exponentially towards zero as the genome size increases (e.g. if the restriction sequence is 001, the number of feasible sequences among all binary sequences of length n equals the (n+3)rd term of the Fibonacci series minus 1, so the fraction of feasible sequences decays with n as (0.81)^n). In the light of these results, I discuss the optimal recognition sequence for a restriction enzyme to fight against phage, as well as the mutation and substitution loads and the fitness landscape of phage subject to the selection pressure imposed by a restriction enzyme.
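The Fibonacci count for the restricted word 001 can be checked numerically for small n. The following is a minimal sketch, assuming the indexing convention F_1 = F_2 = 1 (the abstract does not state which convention is used); the helper names `count_feasible` and `fibonacci` are hypothetical. It compares a brute-force count with F_(n+3) - 1 and prints the ratio of successive feasible fractions, which should approach half the golden ratio, about 0.809.

```python
from itertools import product


def count_feasible(n, restricted="001"):
    """Brute-force count of binary sequences of length n avoiding `restricted`."""
    return sum(
        1
        for bits in product("01", repeat=n)
        if restricted not in "".join(bits)
    )


def fibonacci(k):
    """Return the k-th Fibonacci number, assuming F_1 = F_2 = 1."""
    a, b = 1, 1
    for _ in range(k - 1):
        a, b = b, a + b
    return a


if __name__ == "__main__":
    previous_fraction = None
    for n in range(1, 16):
        feasible = count_feasible(n)
        # Check the closed form quoted in the abstract for restricted word 001.
        assert feasible == fibonacci(n + 3) - 1
        fraction = feasible / 2 ** n
        if previous_fraction is not None:
            # Ratio of successive fractions estimates the per-site decay rate.
            print(n, feasible, round(fraction / previous_fraction, 4))
        previous_fraction = fraction
```

Brute force is only practical for short sequences, since the number of candidates doubles with each added site; it is meant as a sanity check of the closed form rather than a way to reach genome-scale n.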