%% LAMBDAPHAGE Example of sequence statistics and segmentation with MATLAB
% This demonstration looks at some statistics about the DNA content of the
% Lambda Phage and shows an example of segmentation of a sequence.
%%
% This demo belongs to the collection of Case Studies in Computational
% Genomics, mostly based on classic papers, and mostly based on the contents
% of the book
%
% Introduction to Computational Genomics, A Case Studies Approach
% Cambridge University Press, 2006
% Nello Cristianini and Matthew Hahn
%
% The demos of the other case studies, pointers to software and papers that
% are available on-line can be found on the website:
%
% www.computational-genomics.net
%
% This demo is also available on-line at
web('http://www.computational-genomics.net/case_studies/lambdaphage_demo.html');
% Demo by Elisa Ricci.
%% Introduction
% Phages are viruses that infect bacteria, and Bacteriophage lambda infects
% the bacterium Escherichia coli, a very well studied model system.
% Bacteriophage lambda was the one of the first viral genomes to be
% completely sequenced (1982). It contains about 48502 bases. The Genome
% repository at the NCBI contains more interesting information about it.
web('http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genome&cmd=Retrieve&dopt=Overview&list_uids=10119');
%%
% The dna sequence can be obtained from the GenBank database with the
% accession number NC_001416. Using the *getgenbank* function with the
% 'SequenceOnly' flag just the nucleotide sequence is loaded into the
% MATLAB workspace.
BLambda = getgenbank('NC_001416','SequenceOnly',true);
%%
% You can download directly the data in .mat format from the website
% www.computational-genomics.net and load them into the MATLAB workspace
load BLambda
%%
% The MATLAB *whos* command gives information about the size of the
% sequence.
whos BLambda
%%
%The total length of the Bacteriophage Lambda genome is 48502 bp.
%% Change-point analysis
% The local fluctuations in the frequencies of nucleotides provide
% interesting information. The local base composition by a sliding
% window of variable size can be measured. In the following the window
% size is assumed 2000 bp, 3000 bp and 4000 bp respectively.
ntdensity(BLambda,'window',2000)
%%
ntdensity(BLambda,'window',3000)
%%
ntdensity(BLambda,'window',4000)
%%
% The analysis of the plots reveals that the phage genome is composed of
% two halves with completely different GC content: the first GC rich, the
% second AT rich. This is an example of change point in a genome.
%% Segmentation with Hidden Markov Model
% You can use an HMM to segment the Lambda Phage genome into blocks of
% these two states. You can start generating random transition and emission
% matrices as input to the Expectation Maximization (EM) algorithm that
% better estimates those parameters.
T=rand(2,2);
E=rand(2,4);
% Normalize matrices
T(1,:) = T(1,:) ./ (norm(T(1,:),1));
T(2,:) = T(2,:) ./ (norm(T(2,:),1));
E(1,:) = E(1,:) ./ (norm(E(1,:),1));
E(2,:) = E(2,:) ./ (norm(E(2,:),1));
%%
% The nucleotide 'A', 'C', 'G', and 'T' are encoded by 1, 2, 3 and 4, respectively.
seq=nt2int(BLambda);
[estT, estE] = hmmtrain(seq,T,E)
%%
% With the Viterbi algorithm and the matrices previously calculated
% the sequence can be segmented.
estimatedStates = hmmviterbi(seq,estT,estE);
% You can plot nucleotide density and change points together
ntdensity(BLambda);
hold on
plot(estimatedStates-1,'k--') % for visualization the states are coded as -1/1
hold off
%%
% Now you can compare this with the segmentation obtained with the initial
% guesses of matrices.
BADestimatedStates = hmmviterbi(seq,T,E);
figure
ntdensity(BLambda);
hold on
plot(BADestimatedStates-1,'k--') %for visualization the states are coded as -1/1
hold off