| Match Count Statistics In Bacterial Genomes

This web page contains links to supplementary materials for the paper Statistics of k-Neighbor Match Counts in Completely Sequenced Genomes by O. Michael Melko and Arcady R. Mushegian (to appear). For definitions of terms, see wordStatsTechReport2003.pdf.

Disclaimer: The authors do not assume any legal liability or responsibility for the accuracy, completeness, or usefulness of the programs provided below, or any information derived from them.

Stowers Institute Technical Report #0002

The following technical report, entitled Match Count Statistics in Bacterial Genomes and its Application to Molecular Probe Design, contains mathematical proofs of formulae summarized in the corresponding paper.

wordStatsTechReport2003.pdf

Mathematica Notebooks

Evaluated Mathematica notebooks are provided in support of the empirical observations made in the paper and technical report. These notebooks can be used as templates for experimenting with the sample data provided below. Note that the paths to files in certain input cells will have to be changed.

Note: Most of these notebooks are used to analyze output produced by the FindMatches program with the exact_counts flag. Other flags will produce more detailed output files.

Distribution of Words by GC-Content

The following notebooks show the distribution of words by GC-content in ten bacterial genomes for different word lengths. Any one of them can be used interactively with the data in bacterialGenomes.zip.

words3-CG.nb words10-CG.nb words17-CG.nb words25-CG.nb words33-CG.nb words40-CG.nb

Generating Random Words of Given GC-content

The following notebook was used to generate strings that are the concatenation of words of length m with fixed GC-content. The output files it produces were then used with the FindMatches program to determine the k-distance match counts in various bacterial genomes.
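For readers without Mathematica, the idea of concatenating random words of fixed GC-content can be illustrated with a short Python sketch. This is not the code in the notebook. The function names, the `gc_count` parameter (the number of G or C letters per word), and the seeding scheme are assumptions made for the illustration only.

```python
import random


def random_word(m, gc_count, rng):
    """Return a random DNA word of length m with exactly gc_count G/C letters."""
    if not 0 <= gc_count <= m:
        raise ValueError("gc_count must be between 0 and m")
    # Pick the letters first, then shuffle so the G/C positions are random.
    letters = [rng.choice("GC") for _ in range(gc_count)]
    letters += [rng.choice("AT") for _ in range(m - gc_count)]
    rng.shuffle(letters)
    return "".join(letters)


def random_text(num_words, m, gc_count, seed=0):
    """Concatenate num_words random words, each with the same GC-content."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable output
    return "".join(random_word(m, gc_count, rng) for _ in range(num_words))


# Hypothetical usage: four words of length 25, each with 10 G/C letters.
text = random_text(num_words=4, m=25, gc_count=10)
```

A string produced this way could be written to a file and used as a source or target text, provided it matches the input format that FindMatches expects.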
randomText.nb

K-Distance Match Counts

This notebook was used to study the distribution of k-distance match counts produced by the method described above.

k_dist_matches.nb

The r_1 Statistic of k-Neighbor Ratios

The following file shows a plot of the r_1 statistic described in the paper.

extremes.nb

Finding Source Specific Probes

The following notebooks use output from FindMatches to derive ratios of k-distance match counts between pairs of genomes. Probe candidates are then extracted from the ratio data. Specifically, saure_paeru_oligos.nb shows there are 307 words of length 25 in a 1% sample of S. aureus that have a distance of at least 8 from P. aeruginosa, and paeru_saure_oligos.nb shows there are 1057 words of length 25 in the 1% sample of P. aeruginosa that have a distance greater than 9 from S. aureus. The files provided in bacterialGenomes.zip and sampleOutputData.zip are required to reevaluate cells.

MathReader

For those who do not have access to Mathematica, a free notebook viewer, called MathReader, is available for download from Wolfram Research at the following link: http://www.wolfram.com/products/mathreader/

C++ Source Code for Find Matches Program

The purpose of this program is to take a sample S of words of length m = pattern_length from the source sequence (source_file), and, for each word in S, count the number of occurrences of m-words in the target sequence (target_file) that have precisely k mismatches with it, for 0 <= k <= m. The original intention was to write functions that randomly sample the source in such a way that disjoint sample sets can be scanned by separate, independent processes. The current version uses a stub to achieve this goal. The stub works as follows: a file of locations in the source (location_list) is loaded into memory as a list of integers l. Each element in l is meant to correspond to a location in the source (indexing begins at 0).
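The counting task stated at the start of this section can be illustrated with a brute-force Python sketch. It is not the C++ implementation, which may use a different and faster algorithm. The function names and the in-memory string inputs are assumptions made for the illustration. Sampled locations are taken as a plain list of 0-based source positions, as in the stub described above.

```python
def hamming(a, b):
    """Number of positions at which equal-length strings a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def exact_match_counts(word, target):
    """Return counts where counts[k] is the number of m-words in target
    that have exactly k mismatches with word (0 <= k <= m)."""
    m = len(word)
    counts = [0] * (m + 1)
    for i in range(len(target) - m + 1):
        counts[hamming(word, target[i:i + m])] += 1
    return counts


def profile_sample(source, target, locations, m):
    """Map each sampled 0-based source location to its list of k-mismatch counts."""
    profile = {}
    for loc in locations:
        if loc < 0 or loc + m > len(source):
            raise ValueError("location %d does not hold a full word" % loc)
        profile[loc] = exact_match_counts(source[loc:loc + m], target)
    return profile


# Hypothetical toy input: two sampled words of length 4.
source = "ACGTACGT"
target = "ACGTTCGA"
print(profile_sample(source, target, locations=[0, 4], m=4))
```

The sketch costs time proportional to the word length times the target length for each sampled word, which is adequate for toy inputs but slow on whole genomes.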
The locations in the source to be read are given by l[start_loc + jump_size * j], for 0 <= j <= session_len. To save space, the location list is not provided, but it is easily generated.

Here is a summary of the parameters described above, as they appear in the program argument list (see also the test script in findMatches.zip). Note that mismatch_threshold (used for finding k-neighbor matches) is a "dead" parameter if the exact_counts flag is used.

argv[0] = find_matches          program name
argv[1] = start_loc
argv[2] = jump_size
argv[3] = session_len
argv[4] = mismatch_threshold    lower bound for mismatches
argv[5] = pattern_length        length of pattern
argv[6] = location_list         file containing pattern locations
argv[7] = source_file           file containing source text
argv[8] = target_file           file containing target text
argv[9] = search_results_file   file to write search results
argv[10] = session_status_file  file to write search status
argv[11] = flag                 "sup" = suppress target lists in output
                                "exact_counts" = find k-distance matches

The program was compiled and tested using GNU G++ on PC Linux and Compaq Cxx on Tru64 UNIX. Note that the source code documentation is not up to date, but the code should be more or less self-explanatory.

The source code is packaged as a zip archive. Simple test input and output, with a sample script illustrating how to run the program, are included.

Known issues: the program fails to detect an exact match (i.e. k = 0), if it exists, between the first element in the source sample and the beginning of the target. The problem is an initialization error in the function get_match_counts (match.cpp), and is easily fixed.

Data Files

Here, we provide some data files in the format used for input in the
FindMatches program and in Mathematica notebooks described
above. |