Match Count Statistics In Bacterial Genomes

This web page contains links to supplementary materials for the paper Statistics of k-Neighbor Match Counts in Completely Sequenced Genomes by O. Michael Melko and Arcady R. Mushegian (to appear). For definitions of terms, see wordStatsTechReport2003.pdf.

Disclaimer: The authors do not assume any legal liability or responsibility for the accuracy, completeness, or usefulness of the programs provided below, or any information derived from them.

Stowers Institute Technical Report #0002

The following technical report, entitled Match Count Statistics in Bacterial Genomes and its Application to Molecular Probe Design, contains mathematical proofs of formulae summarized in the corresponding paper.

wordStatsTechReport2003.pdf

Mathematica Notebooks

Evaluated Mathematica notebooks are provided in support of the empirical observations made in the paper and technical report. These notebooks can be used as templates for experimenting with the sample data provided below. Note that the file paths in certain input cells will have to be changed.

Note: Most of these notebooks are used to analyze output produced by the FindMatches program with the exact_counts flag. Other flags will produce more detailed output files.

Distribution of Words by GC-Content

The following notebooks show the distribution of words by GC-content in ten bacterial genomes for different word lengths. Any one of them can be used interactively with the data in bacterialGenomes.zip.

words3-CG.nb
words10-CG.nb
words17-CG.nb
words25-CG.nb
words33-CG.nb
words40-CG.nb

Generating Random Words of Given GC-content

The following notebook was used to generate strings that are the concatenation of words of length m with fixed GC-content. The output files it produces were then used with the FindMatches program to determine the k-distance match counts in various bacterial genomes.

randomText.nb
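The notebook itself is written in Mathematica. As a rough illustration of the idea only, the C++ sketch below builds a string by concatenating random words of length m, assuming that "fixed GC-content" means each word contains exactly gc_count letters drawn from {G, C}. The function names random_gc_word and random_gc_text, and the uniform choice between G/C and between A/T, are assumptions made for this example and are not taken from randomText.nb.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <string>

// Hypothetical helper: one random word of length m with exactly
// gc_count positions holding G or C and the rest holding A or T.
std::string random_gc_word(std::size_t m, std::size_t gc_count,
                           std::mt19937& rng) {
    assert(gc_count <= m);
    std::bernoulli_distribution coin(0.5);
    std::string word(m, 'A');
    for (std::size_t i = 0; i < m; ++i) {
        if (i < gc_count) {
            word[i] = coin(rng) ? 'G' : 'C';  // the GC positions
        } else {
            word[i] = coin(rng) ? 'A' : 'T';  // the AT positions
        }
    }
    // Shuffle so the GC positions are spread randomly over the word.
    std::shuffle(word.begin(), word.end(), rng);
    return word;
}

// Hypothetical helper: concatenation of n_words such words.
std::string random_gc_text(std::size_t n_words, std::size_t m,
                           std::size_t gc_count, std::mt19937& rng) {
    std::string text;
    text.reserve(n_words * m);
    for (std::size_t w = 0; w < n_words; ++w) {
        text += random_gc_word(m, gc_count, rng);
    }
    return text;
}
```

For example, seeding std::mt19937 with any fixed value and calling random_gc_text(1000, 25, 10, rng) would give a string of 25000 characters in which every consecutive block of 25 letters has GC-content 10/25.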



K-Distance Match Counts

This notebook was used to study the distribution of k-distance match counts produced by the method described above.

k_dist_matches.nb

The r_1 Statistic of k-Neighbor Ratios
The following file shows a plot of the r_1 statistic described in the paper.

extremes.nb

Finding Source Specific Probes
The following notebooks use output from FindMatches to derive ratios of k-distance match counts between pairs of genomes. Probe candidates are then extracted from the ratio data. Specifically, saure_paeru_oligos.nb shows there are 307 words of length 25 in a 1% sample of S. aureus that have distance of at least 8 from P. aeruginosa, and paeru_saure_oligos.nb shows there are 1057 words of length 25 in the 1% sample of P. aeruginosa that have distance greater than 9 from S. aureus. The files provided in bacterialGenomes.zip and sampleOutputData.zip are required to reevaluate cells.
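The distance criterion quoted above can be read off a table of exact k-distance match counts. The C++ sketch below assumes that the distance of a word from a target genome means the smallest k for which at least one k-distance match exists, and that counts[k] holds the number of target positions with exactly k mismatches. The function names are hypothetical, and the sketch does not reproduce the ratio computation performed in the notebooks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Smallest k with counts[k] > 0, i.e. the number of mismatches of the
// closest target occurrence. Returns counts.size() if every count is zero.
std::size_t nearest_distance(const std::vector<std::size_t>& counts) {
    for (std::size_t k = 0; k < counts.size(); ++k) {
        if (counts[k] > 0) return k;
    }
    return counts.size();
}

// Hypothetical filter: keep a word as a probe candidate only if no
// target word lies within fewer than min_distance mismatches of it.
bool is_probe_candidate(const std::vector<std::size_t>& counts,
                        std::size_t min_distance) {
    assert(!counts.empty());
    return nearest_distance(counts) >= min_distance;
}
```

Under these assumptions, "distance of at least 8" corresponds to is_probe_candidate(counts, 8), and "distance greater than 9" corresponds to is_probe_candidate(counts, 10).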

MathReader

For those who do not have access to Mathematica, a free notebook viewer, called MathReader, is available for download from Wolfram Research at the following link: http://www.wolfram.com/products/mathreader/

C++ Source Code for Find Matches Program

The purpose of this program is to take a sample S of words of length m = pattern_length from the source sequence (source_file), and, for each word in S, count the number of occurrences of m-words in the target sequence (target_file) that have precisely k mismatches with it, for 0 <= k <= m.
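As a point of reference, the counting task for a single word can be sketched with a direct scan of the target. This is a naive illustration of the quantity being computed, not the algorithm used in FindMatches; the function name exact_match_counts is an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns counts, where counts[k] is the number of length-m windows of
// `target` that differ from `word` in exactly k positions, 0 <= k <= m.
// Runs in O(m * |target|) time per word.
std::vector<std::size_t> exact_match_counts(const std::string& word,
                                            const std::string& target) {
    const std::size_t m = word.size();
    assert(m > 0);
    std::vector<std::size_t> counts(m + 1, 0);
    for (std::size_t i = 0; i + m <= target.size(); ++i) {
        std::size_t mismatches = 0;
        for (std::size_t j = 0; j < m; ++j) {
            if (target[i + j] != word[j]) ++mismatches;
        }
        ++counts[mismatches];
    }
    return counts;
}
```

The entries of the returned vector always sum to the number of m-words in the target, which is |target| - m + 1 when the target is at least m characters long.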

The original intention was to write functions that randomly sample the source in such a way that disjoint sample sets can be scanned by separate, independent processes. The current version uses a stub to achieve this goal. The stub works as follows: a file of locations in the source (location_list) is loaded into memory as a list of integers l. Each element in l is meant to correspond to a location in the source (indexing begins at 0). The locations in the source to be read are given by l[start_loc + jump_size * j], for 0 <= j <= session_length. To save space, the location list is not provided, but it is easily generated.
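The indexing scheme can be sketched as follows. The function name session_locations is hypothetical, and stopping quietly when the index runs past the end of the list is an assumption of this example rather than documented behavior of the program.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Picks the source locations l[start_loc + jump_size * j] for
// 0 <= j <= session_length, following the formula given in the text.
std::vector<long> session_locations(const std::vector<long>& l,
                                    std::size_t start_loc,
                                    std::size_t jump_size,
                                    std::size_t session_length) {
    assert(jump_size > 0);
    std::vector<long> picked;
    for (std::size_t j = 0; j <= session_length; ++j) {
        const std::size_t idx = start_loc + jump_size * j;
        if (idx >= l.size()) break;  // assumption: stop at the end of the list
        picked.push_back(l[idx]);
    }
    return picked;
}
```

With jump_size = p and start_loc = 0, 1, ..., p - 1, the p sessions read disjoint subsets of the location list, which is what allows them to run as separate, independent processes.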

Here is a summary of the parameters described above, as they appear in the program argument list (see also the test script in findMatches.zip). Note that mismatch_threshold (used for finding k-neighbor matches) is a "dead" parameter if the exact_counts flag is used.

argv[0]  =  find_matches          program name
argv[1]  =  start_loc
argv[2]  =  jump_size
argv[3]  =  session_len
argv[4]  =  mismatch_threshold    lower bound for mismatches
argv[5]  =  pattern_length        length of pattern
argv[6]  =  location_list         file containing pattern locations
argv[7]  =  source_file           file containing source text
argv[8]  =  target_file           file containing target text
argv[9]  =  search_results_file   file to write search results
argv[10] =  session_status_file   file to write search status
argv[11] =  flag                  "sup" = suppress target lists in output
                                  "exact_counts" = find k-distance matches

The program was compiled and tested using GNU G++ on PC Linux and Compaq Cxx on Tru64 UNIX. Note that the source code documentation is not up to date, but the code should be more or less self-explanatory. The source code is packaged as a zip archive. Simple test input and output are included, along with a sample script illustrating how to run the program.

Known issues: the program fails to detect an exact match (i.e. k = 0), if it exists, between the first element in the source sample and the beginning of the target. The problem is an initialization error in the function get_match_counts in match.cpp, and is easily fixed.

Data Files

Here, we provide some data files in the format used as input by the FindMatches program and by the Mathematica notebooks described above.

Bacterial Genome Files

The FindMatches program has only been tested and used with data files containing no def lines and no newline characters. For convenience, the nucleotide sequences for Staphylococcus aureus and Pseudomonas aeruginosa are provided in this format in the following archive.

bacterialGenomes.zip
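Other genome files can be converted to this format with a small filter. The C++ sketch below assumes the input is FASTA-style text in which def lines begin with '>'; the function names strip_fasta and strip_fasta_text are hypothetical and are not part of the FindMatches distribution.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Reads FASTA-style text and returns the sequence as one string with
// def lines (assumed to start with '>') and all line breaks removed.
std::string strip_fasta(std::istream& in) {
    std::string sequence;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '>') continue;  // skip def line
        for (char c : line) {
            if (c != '\r') sequence += c;  // drop Windows line-ending residue
        }
    }
    // The result must contain no newline characters.
    assert(sequence.find('\n') == std::string::npos);
    return sequence;
}

// Convenience wrapper for text already held in memory.
std::string strip_fasta_text(const std::string& text) {
    std::istringstream in(text);
    return strip_fasta(in);
}
```

To convert a file, open it with std::ifstream, pass the stream to strip_fasta, and write the returned string to the output file without a trailing newline.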

Sample Output Files

The data in these files were produced using either the FindMatches program or some of the notebooks described above. They are provided to help illustrate the use of the program, and as input for some of the Mathematica notebooks provided above.

In particular, the directory MinMaxTables contains lists of the infimum, median, and supremum ratio distributions (as a function of the distance parameter k) for all source-target pairs considered. The file extremes.dat is used in extremes.nb to produce a graphical summary of the r_1 statistic described in the paper. The directories PseudoPseudo, PseudoStaph, etc., contain examples of output for source-target k-distance match count searches, which are used in paeru_saure_oligos.nb and saure_paeru_oligos.nb to look for "source specific" or "good" probes.

sampleOutputData.zip

Comments or Questions


Please direct any comments or questions about this web page to omm@stowers-institute.org.