Detecting adaptive loci in the genome is important because it helps to establish what proportion of the genome is shaped by natural selection. It also makes it possible to identify regions of the genome involved in adaptation processes. Several methods have been developed to detect loci under selection (see the review in Joost et al. 2007), but identifying the environmental parameters responsible for selection has so far been difficult. This is precisely the task SAM is designed to fulfil.

Approach

Multiple univariate logistic regressions are carried out to test for association between allelic frequencies at marker loci and environmental variables. The method implemented in the SAM software is described in detail in Joost et al. (2007). The geographic coordinates of the place where each animal or plant was sampled are required: they make it possible to retrieve the environmental information that characterizes the sampling location.

Required data are:
The molecular data sets used for analysis are in the form of matrices; each row of the matrix corresponds to a sampled individual, while the columns are organised according to the sampled individual's geographic coordinates and contain binary information (1 or 0) on the status of the genetic marker. For AFLP markers, 1 and 0 indicate the phenotypes « presence of band » and « absence of band » respectively. For microsatellite markers, 1 and 0 indicate the presence or absence of a given allele at the locus in question. The method was also recently applied successfully to SNPs (Pariset, Joost, Ajmone and Valentini 2009). For microsatellites and SNPs an encoding phase is necessary, while AFLP data are ideal for logistic regression because they already provide binomial information.

Two files have to be downloaded:
The zip file contains the following elements :
The main component, "matSAM.exe", was developed with Matlab ©1994-2007 The MathWorks Inc. The advantage of Matlab is that it processes a large number of models simultaneously in a fast and efficient way. Until November 2007 the code of the SAM program was freely distributed, but users needed a Matlab license to run it. The present version of SAM is compiled and can be run without purchasing Matlab. It still requires a few Matlab components (mainly libraries), which are provided by the Matlab Component Runtime. The drawback is that this component is large (175 MB), but it can be freely distributed for non-profit academic use.

Register to download SAM components

When preparing the input matrix, it is important to sort both the environmental variables and the genetic markers according to some criterion, and then to number them in that order. A good option is typically to sort and number the markers according to their frequency among the sampled animals or plants, producing a matrix of genetic markers with low frequencies on the left and high frequencies on the right. Environmental variables can be placed in thematic or alphabetical order. The input matrix has to be a text file (.txt) delimited with spaces. The initial row (title line with the names of the environmental variables and of the loci or alleles) and the initial column (names of animals or samples) must be removed before the file can be processed by the "matSAM.exe" program. Keep the original Excel file containing the column and row titles: you will refer to this file to:
Analysis of the results (SAMAnalysis.xls)

"SAMAnalysis.xls" helps you to deal with the many p-values produced by the G and Wald statistical tests, and to identify the most significant associations. Open the file and run the macro (click the "Run SAM Analysis" button): you have to indicate the number of environmental variables and the number of loci or alleles used (refer to your original input file). This allows the program to prepare the structure of the analysis table (called "rejection table" because it lets you reject the models that are not significant at the significance level you choose). To interrupt the macro, click the "Cancel" button in the Input Window, and then the "End" button.

The table containing the results is made of 15 different groups of statistical data. Each group consists of n rows, where n is the number of environmental variables. Columns correspond to genetic markers. These groups contain the following information:

1 Log likelihood
2

The next 3 groups constitute the dynamic section of the rejection table:

13 Dynamic null hypothesis analysis for G and Wald Beta 1: Null hypothesis for G

We will focus on these 3 groups to carry out the analysis. The other groups contain the basic statistics and are made available in case it is necessary to refine the analysis.

The number of environmental variables and the number of loci or alleles that you gave to the program are used to generate formulas, which are stored at the bottom of the file. These formulas make it possible to set up a dynamic rejection table, whose results change according to the significance level you choose. Copy and paste each formula into the appropriate cell (do not forget to add "=" before the formula), then drag the selection to extend the series to the last column on the right (corresponding to the last locus or allele).
Then select the whole row (all cells to which a formula was added) and drag the selection down to extend the series to the last row (corresponding to the last environmental variable). Repeat the same operation with the other 2 formulas corresponding to groups 14 and 15.
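For readers who want to check the spreadsheet results outside Excel, the logic of the three dynamic groups can be sketched in Python. This is only an illustrative sketch and not the code of the SAMAnalysis macro: the function names, the strict "<" comparison, and the assumption that the number of models equals the number of environmental variables times the number of markers are all hypothetical choices. The threshold is the chosen level divided by the number of models (Bonferroni correction).

```python
def bonferroni_threshold(alpha, n_models):
    """Chosen level divided by the number of models (Bonferroni correction)."""
    return alpha / n_models


def rejection_groups(p_g, p_wald, alpha):
    """Build three 0/1 matrices from two p-value matrices.

    p_g and p_wald are lists of rows: one row per environmental variable,
    one column per genetic marker. A cell is 1 when the null hypothesis
    is rejected at the corrected threshold, 0 otherwise.
    """
    # Assumption: one model per (environmental variable, marker) pair.
    n_models = sum(len(row) for row in p_g)
    threshold = bonferroni_threshold(alpha, n_models)

    reject_g = [[1 if p < threshold else 0 for p in row] for row in p_g]
    reject_wald = [[1 if p < threshold else 0 for p in row] for row in p_wald]
    # Cumulated test: 1 only when both G and Wald reject the null hypothesis.
    reject_both = [
        [g & w for g, w in zip(row_g, row_w)]
        for row_g, row_w in zip(reject_g, reject_wald)
    ]
    return threshold, reject_g, reject_wald, reject_both


# Made-up p-values for 2 environmental variables and 2 markers.
example_p_g = [[0.001, 0.2], [0.04, 0.0001]]
example_p_wald = [[0.002, 0.3], [0.5, 0.02]]
threshold, rej_g, rej_wald, rej_both = rejection_groups(
    example_p_g, example_p_wald, alpha=0.05
)
```

With these made-up values the corrected threshold is 0.05 / 4 = 0.0125, so only the first model is retained by both tests.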
In these 3 groups, cells display a "1" when the null hypothesis is rejected at the chosen significance level, and a "0" when it is not rejected (the investigated variable does not explain significantly more variance than a model with a constant only). In the last group (15), cells show a "1" only when both tests (G and Wald) reject the null hypothesis (the reason is statistical robustness, see Joost et al. 2007). On the 3 matrices (13, 14, 15), apply Excel conditional formatting with a given color when cells contain a 1, and no color when cells contain a 0 (see figure 6). This way, significant models are dynamically highlighted when you change the significance threshold.

One more formula has to be inserted before the analysis table becomes dynamic. Copy the last formula at the bottom of the file into the appropriate cell. This formula corrects the significance level to take the multiple hypothesis testing context into account. Here we simply apply the Bonferroni correction: the significance level you choose (yellow cell in figure 6) is divided by the number of models (cell B146 in figure 6) and the result is stored in cell A140. This last cell is used to decide whether or not to reject the null hypothesis, which makes the approach very conservative (see arguments in Joost et al. 2007).

Drawing graphs of the models (SAMGraph_v2.xls)

Warning: a compatibility problem exists between Excel 2003 and Excel 2007 and has not been solved yet. SAMGraph functions correctly with Excel 2007 only. This will be corrected as soon as possible.

To draw the graph of the logistic function (sigmoid) corresponding to a given genetic marker and a given environmental variable, it is necessary to use their identification numbers. The identification number corresponds to the initial order used to build the matrix analysed by MatSAM.
A suggestion is to sort the genetic markers according to their frequency among the sampled animals when building this matrix (from lower frequencies on the left to higher frequencies on the right). For environmental variables, simply choose an order and always use the same one.
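The ordering and numbering suggested above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not part of the SAM distribution: the function names are hypothetical, the data are assumed to contain no missing values, and the column layout (environmental variables first, then markers) is an assumption that should be checked against your own input file.

```python
def order_markers_by_frequency(marker_names, marker_rows):
    """Sort marker columns from lowest to highest frequency.

    marker_rows holds one row of 0/1 values per sampled individual.
    Returns a dict {marker name: identification number (starting at 1)}
    and the rows with their columns reordered accordingly.
    """
    n_individuals = len(marker_rows)
    freqs = [
        sum(row[j] for row in marker_rows) / n_individuals
        for j in range(len(marker_names))
    ]
    # sorted() is stable, so markers with equal frequency keep their order.
    order = sorted(range(len(marker_names)), key=lambda j: freqs[j])
    ids = {marker_names[j]: rank for rank, j in enumerate(order, start=1)}
    sorted_rows = [[row[j] for j in order] for row in marker_rows]
    return ids, sorted_rows


def to_space_delimited(env_rows, marker_rows):
    """Return the matrix as space-delimited text, without titles or sample names."""
    lines = [
        " ".join(str(v) for v in list(env) + list(markers))
        for env, markers in zip(env_rows, marker_rows)
    ]
    return "\n".join(lines) + "\n"


# Made-up example: 4 individuals, 2 environmental variables, 3 markers.
names = ["M_a", "M_b", "M_c"]
markers = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [1, 0, 1]]
env = [[12.5, 800], [10.1, 950], [8.3, 1200], [11.0, 870]]

marker_ids, sorted_markers = order_markers_by_frequency(names, markers)
text = to_space_delimited(env, sorted_markers)
```

Keeping the `marker_ids` mapping alongside the original Excel file makes it easier to find the identification number of a marker later on.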
The SAMGraph macro builds the graph according to those series of 4 lines, written one after the other in the "graph.txt" file.

The value "NaN" (upper-case N, lower-case a, upper-case N) can be placed in the input matrix wherever a value is missing, whether it is the value of an environmental variable or the presence/absence of a marker. In Matlab syntax, NaN means "Not-a-Number". The main impact is that the G test cannot be computed when the presence/absence of a marker has a missing value, and "NaN" will appear in the corresponding column of the matrix of results. This does not affect the Wald test. In this case, you will have to assess your results on the basis of the Wald test only, which makes the cumulated test (group 15) mentioned above unusable. The difference comes from the quantities each test uses to compute its statistic. G = -2 ln (likelihood of the initial model with a constant only / likelihood of the new model including the examined variable); this statistic follows a chi-square distribution with a number of degrees of freedom equal to the number of investigated parameters.
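To make the G statistic concrete, the following Python sketch fits one univariate logistic model and computes a G test and a Wald test on the slope. It is an illustration under stated assumptions and not the matSAM code: the Newton-Raphson fit, the function names, and the Wald statistic taken as (slope / standard error)² with 1 degree of freedom are choices made here. It also assumes complete data (no NaN), a marker that is neither always present nor always absent, and an environmental variable that is not constant.

```python
import math


def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))


def _score_and_information(b0, b1, x, y):
    """Gradient of the log-likelihood and Fisher information at (b0, b1)."""
    g0 = g1 = i00 = i01 = i11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        i00 += w
        i01 += w * xi
        i11 += w * xi * xi
    return g0, g1, i00, i01, i11


def _log_likelihood(b0, b1, x, y):
    return sum(
        yi * (b0 + b1 * xi) - math.log1p(math.exp(b0 + b1 * xi))
        for xi, yi in zip(x, y)
    )


def sam_tests(x, y, n_iter=50, tol=1e-10):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) and test the slope b1."""
    p_bar = sum(y) / len(y)
    b0_null = math.log(p_bar / (1.0 - p_bar))  # model with a constant only
    b0, b1 = b0_null, 0.0
    for _ in range(n_iter):  # Newton-Raphson iterations
        g0, g1, i00, i01, i11 = _score_and_information(b0, b1, x, y)
        det = i00 * i11 - i01 * i01
        d0 = (i11 * g0 - i01 * g1) / det
        d1 = (i00 * g1 - i01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    ll_null = _log_likelihood(b0_null, 0.0, x, y)
    ll_model = _log_likelihood(b0, b1, x, y)
    g_stat = -2.0 * (ll_null - ll_model)  # G = -2 ln(L_null / L_model)
    _, _, i00, i01, i11 = _score_and_information(b0, b1, x, y)
    var_b1 = i00 / (i00 * i11 - i01 * i01)  # from the inverse information matrix
    wald = b1 * b1 / var_b1
    return {
        "b0": b0, "b1": b1,
        "loglik_null": ll_null, "loglik_model": ll_model,
        "G": g_stat, "p_G": chi2_sf_1df(g_stat),
        "wald": wald, "p_wald": chi2_sf_1df(wald),
    }


# Made-up data: one environmental variable and one marker, 10 individuals.
env_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
marker = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
result = sam_tests(env_values, marker)
```

Repeating `sam_tests` over every pair of environmental variable and marker gives the set of univariate models whose p-values the rejection table summarises.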
Resp.: Prof. P. Ajmone Marsan, Institute of Zootechnics, UCSC, Piacenza, Italy
Ed.: S. Joost, Institute of Zootechnics, UCSC, Piacenza, Italy