%% JUKESCANTOR Example of Jukes-Cantor model of a sequence evolution
%%
% This demo belongs to the collection of Case Studies in Computational
% Genomics, mostly based on classic papers, and mostly based on the contents
% of the book
%
% Introduction to Computational Genomics, A Case Studies Approach
% Cambridge University Press, 2006
% Nello Cristianini and Matthew Hahn
%
% The demos of the other case studies, pointers to software and papers that
% are available on-line can be found on the website:
%
% www.computational-genomics.net
%
% This demo is also available on-line at
web('http://www.computational-genomics.net/case_studies/juckescantor_demo.html');
% Demo by Elisa Ricci.
%% Introduction
% One of the crucial tasks in computational genomics is represented by the
% estimation of the genetic distance between two homologous sequences, that
% is the number of of substitutions that have accumulated between them since
% they diverged from their common ancestor.
% The problem is not easy since it means not simply count the number of
% position at which the two sequence differ. This underestimate the true
% genetic distance due to multiple substitutions occurred at the same site.
% To solve this problem, a common approach is to correct the observed genetic
% distance between two sequences by using a probabilistic model.
% Several models have been developed in the past. The simplest (the well known
% Jukes-Cantor model) assumes that each substitution has the same probability.
% In this demo we show the effect of the JC correction.
%% Observed genetic distance
% Starting with a random sequence of length 1000, you must perform one
% random substitution at each of the 2000 epochs.
L=1000;
M=2000;
%%
% In order to get statistics you may average all over 10 experiments.
R=10;
%%
% A random substitution is performed for each experiment every epoch and
% the difference between the current sequence and the original one are
% calculated.
for n=1:R
seq=randseq(L);
seq1=seq;
for i=1:M
j=floor(rand*(L-1))+1;
seq1(j)=randseq(1);
v(i,n)=L-length(find((seq-seq1)==0)); %OBSERVED differences
end
end
%%
% Now you can perform some statistics: the mean and the standard deviation
% of the differences are calculated.
mm=mean(v');
ss=std(v');
%% Jukes Cantor correction
% You must apply the JC correction to the observed differences between the
% two sequences. An estimate of the genetic distance and of its standard
% deviation is obtained.
p=mm/L;
E=sqrt(p.*(1-p)./(L*((1-(4/3).*p).^2)));
d = -3/4 * log(1 - p * 4/3);
%%
% The error bars for the two cases are computed.
figure;
subplot(2,1,1), errorbar(mm,ss)
title('Simulating the Effect of Multiple Substitutions')
xlabel('True Number of Substitutions');
ylabel('Observed Num. of Substitutions')
subplot(2,1,2), errorbar(d*L,E)
xlabel('True Number of Substitutions');
ylabel('Estimated Num. of Substitutions')
%%
% From the plot you may notice that after 2000 substitutions, less than
% half are actually observable. Moreover applying the JC correction it
% turns out an almost linear relation between mean and std deviation. This
% is due to the fact that the JC model consider equal probability of
% substitution among all symbols.