MatSAM: a spatial analysis software to detect candidate loci for selection


The new version of MatSAM is named SamBada
and is available here:

http://lasig.epfl.ch/sambada





E-mail : stephane.joost[at]epfl.ch

The detection of adaptive loci in the genome is important because it helps to understand what proportion of the genome is shaped by natural selection. It also makes it possible to identify regions of the genome involved in adaptation processes. Several methods have been developed to detect loci under selection (see a review in Joost et al. 2007 and other publications), but identifying the environmental parameters responsible for selection has so far been difficult. This is precisely the task SAM is able to fulfil.

Approach

Multiple univariate logistic regressions are carried out to test for association between allelic frequencies at marker loci and environmental variables.
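
As an illustration of the approach, the following is a minimal sketch in Python of one univariate logistic regression fitted by Newton-Raphson, repeated for every pair (environmental variable, marker). It is not the MatSAM code (which relies on Matlab): the function names and the small data set are hypothetical and only serve to show the principle.

```python
import math

def fit_univariate_logistic(x, y, max_iter=50, tol=1e-10):
    """Fit P(y=1 | x) = 1 / (1 + exp(-(b0 + b1*x))) by Newton-Raphson.

    x: values of one environmental variable; y: 0/1 presence of one marker.
    Returns (b0, b1, log_likelihood).
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        # Gradient (g0, g1) and information matrix [[h00, h01], [h01, h11]]
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Solve the 2x2 system H * step = g explicitly
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    loglik = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        loglik += math.log(p) if yi == 1 else math.log(1.0 - p)
    return b0, b1, loglik

def fit_all_pairs(env_columns, marker_columns):
    """One univariate model per (environmental variable, marker) pair."""
    return {(i, j): fit_univariate_logistic(x, y)
            for i, x in enumerate(env_columns)
            for j, y in enumerate(marker_columns)}

# Hypothetical data: 1 environmental variable, 1 marker, 8 individuals
env = [[1, 2, 3, 4, 5, 6, 7, 8]]
markers = [[0, 0, 1, 0, 1, 0, 1, 1]]
b0, b1, loglik = fit_all_pairs(env, markers)[(0, 0)]
```

The sketch does not handle perfectly separated data (where the estimates diverge) or missing values; it only shows how many small models can be fitted in a loop.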

Data required

The method implemented by the SAM software is described in detail in Joost et al. (2007). The geographic coordinates of the place where the animal or plant was sampled are necessary: they make it possible to retrieve environmental information characterizing the sampling location. The required data are:

  1. At least one environmental variable describing the sampling location (in the present version, the precision of any environmental variable must be limited to 2 decimals);
  2. A matrix with the presence (1) or the absence (0) of a given molecular marker at the sampling location.

The molecular data sets used for analysis are in the form of matrices: each row of the matrix corresponds to a sampled individual, while the columns are organised according to the sampled individual’s geographic coordinates and contain binary information (1 or 0) on the status of the genetic marker. For AFLP markers, the numbers 1 and 0 indicate the phenotypes "presence of band" and "absence of band", respectively. For microsatellite markers, the numbers 1 and 0 indicate the presence or absence of a given allele at the locus in question. The method was also recently applied successfully to SNPs (Pariset, Joost, Ajmone and Valentini 2009). For microsatellites and SNPs, an encoding phase is necessary, while AFLP data are ideal for logistic regression because they already provide binary information.
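
The encoding phase for microsatellites or SNPs is not detailed on this page. A possible sketch in Python, assuming one binary column per allele (1 if the individual carries the allele, 0 otherwise), could look as follows. The function name and the genotypes are hypothetical.

```python
def encode_alleles(genotypes):
    """Turn genotypes at one locus into presence/absence columns.

    genotypes: one tuple of allele labels per individual.
    Returns (allele_names, rows) where rows[i][j] is 1 if individual i
    carries allele j, and 0 otherwise.
    """
    allele_names = sorted({allele for g in genotypes for allele in g})
    rows = [[1 if allele in g else 0 for allele in allele_names]
            for g in genotypes]
    return allele_names, rows

# Hypothetical microsatellite locus with alleles 144, 146 and 148
alleles, rows = encode_alleles([("144", "146"), ("146", "146"), ("144", "148")])
# alleles -> ['144', '146', '148']
# rows    -> [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
```

With this encoding a homozygous and a heterozygous carrier of the same allele both receive a 1, which matches the presence/absence coding described above.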

Software components

NEW: We recently developed version 2 of the software, but version 1 is still available (see below).

 

MatSAM version 2

New features implemented within MatSAM are:

  1. The software is able to process qualitative predictors (nominal, ordinal)
  2. MatSAM works with an input parameter file
  3. The input data matrix must contain the name of the variables (different columns)
  4. Output files use these variable names
  5. Error types are documented in output files

For details, please read the new documentation.

Register to download SAM components (access to MatSAM v1 and v2)


 

MatSAM version 1

Two files have to be downloaded:

  1. A zip file "SAMsoftware.zip" (2.3 MB)
  2. The Matlab Component Runtime v. 7.7 (175.5 MB)

The zip file contains the following elements :

  1. matSAM.exe : a Windows executable file (tested only on XP) containing the main procedure: the processing of the many simultaneous logistic regression models, based on the GLMfit function, see McCullagh & Nelder (1989);
  2. matSAM.ctf : a compressed file containing toolboxes and functions to be used by "matSAM.exe". This file is expanded when "matSAM.exe" is first launched (a new folder called "matSAM_mcr" will appear in the directory);
  3. SAMAnalysis.xls (and .xlsm) : an Excel sheet containing a macro that facilitates the analysis of the result matrix (output.txt) produced by "matSAM.exe". There is a ".xlsm" file for people using Excel 2007;
  4. SAMGraph_v2.xls (and .xlsm) : an Excel sheet with a macro to generate graphs of the models you want to display;
  5. test_17env63mark.txt : an example input matrix containing real data. "17env63mark" means that this file contains information on 17 environmental parameters and 63 molecular markers (AFLP data).

The main component "matSAM.exe" was developed with Matlab (1994-2007, The MathWorks Inc.). The advantage of Matlab is that it can process a large number of models simultaneously, quickly and efficiently. Until November 2007, the code of the SAM program was freely distributed, but users needed a Matlab license to run it. The present version of SAM is compiled and can be run without purchasing Matlab. It still requires a few Matlab components (mainly libraries), which are provided by the Matlab Component Runtime. The drawback is that this component is large (175 MB), but it can be freely distributed for non-profit, academic use.

Register to download SAM components (access to MatSAM v1 and v2)


Input matrix and data format

During the preparation of the input matrix, it is important to sort both the environmental variables and the genetic markers according to a chosen criterion, and then to number them according to this criterion. A good idea is typically to sort and number the markers according to their frequency among the sampled animals or plants, producing a matrix of genetic markers with low frequencies on the left and high frequencies on the right. Environmental variables can be sorted in thematic or alphabetical order.
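
The suggested ordering of markers by frequency can be sketched in Python as follows. The function and the example markers are hypothetical; any other ordering is equally valid as long as it is used consistently.

```python
def sort_markers_by_frequency(names, columns):
    """Order markers from the lowest to the highest frequency.

    names: marker names; columns: one list of 0/1 values per marker.
    Returns a list of (number, name, column), numbered from 1.
    """
    def frequency(col):
        return sum(col) / len(col)

    # sorted() is stable: markers with equal frequency keep their order
    order = sorted(range(len(names)), key=lambda j: frequency(columns[j]))
    return [(rank, names[j], columns[j])
            for rank, j in enumerate(order, start=1)]

# Hypothetical markers observed on 4 individuals
numbered = sort_markers_by_frequency(
    ["A", "B", "C"],
    [[1, 1, 1, 0], [0, 0, 1, 0], [1, 1, 0, 0]],
)
# Marker order after sorting: B (0.25), C (0.50), A (0.75)
```

The number attached to each marker is the identification number needed later to locate a model in the results.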

The input matrix has to be a text file (.txt), delimited with spaces. The initial row (title line with the names of the environmental variables and of the loci or alleles) and the initial column (names of the animals or samples) have to be removed before the file can be processed by the "matSAM.exe" program.
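
The preparation of the text file can also be scripted. The sketch below, in Python, assumes that the original table was exported as tab-separated text with a title row and a sample-name column; it removes both, rounds the environmental variables to 2 decimals and writes space-delimited lines. The function name, the separator and the example values are assumptions.

```python
def to_matsam_matrix(table_text, n_env, sep="\t"):
    """Convert an exported table into the layout expected by matSAM.exe.

    table_text: text with one title row and one sample-name column.
    n_env: number of environmental variables (leftmost data columns).
    """
    rows = [line.split(sep) for line in table_text.splitlines() if line.strip()]
    out_lines = []
    for cells in rows[1:]:              # skip the title row
        values = cells[1:]              # skip the sample-name column
        # Environmental variables: 2 decimals, missing values kept as NaN
        env = [v if v == "NaN" else f"{float(v):.2f}" for v in values[:n_env]]
        out_lines.append(" ".join(env + values[n_env:]))
    return "\n".join(out_lines) + "\n"

# Hypothetical table: 2 environmental variables and 2 markers
example = (
    "id\talt\ttemp\tM1\tM2\n"
    "A1\t512.337\t8.5\t1\t0\n"
    "A2\t430\t9.126\t0\t1\n"
)
matrix_text = to_matsam_matrix(example, n_env=2)
# matrix_text == "512.34 8.50 1 0\n430.00 9.13 0 1\n"
```

The result can then be written to a ".txt" file, while the original table with its titles is kept for later reference.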

Keep the original Excel file containing the column and row titles. You will refer to this file to:

  1. Enter the names of the environmental variables and copy the list of genetic markers into the "SAMAnalysis.xls" Excel file for analysis;
  2. Find the number of a given environmental variable and the number of a given genetic marker in order to draw the graph of the models to display.

Figure 1 : Original data file with the first row containing variable names. Both environmental and genetic information are in the same file. In green, on the left of the matrix, environmental variables (in the present version, the precision of any environmental variable must be limited to 2 decimals); in orange, on the right of the matrix, presence or absence of loci or alleles.

Figure 2 : The same matrix ready to be processed by "matSAM.exe", with neither title row nor title column. This figure shows a matrix with 9 environmental variables and 7 markers.

How to proceed

  1. Register to download files "SAMsoftware.zip" and "MCRInstaller.exe";
  2. Install MCRInstaller.exe. The installer will propose to create the folders \MATLAB\MATLAB Component Runtime; accept and wait for the installation to finish. This can take several minutes;
  3. Extract the content of the "SAMsoftware.zip" in a folder (all elements have to be in the same folder);
  4. Open the MS DOS command console ("Start" in the Windows toolbar, then "Run...", "cmd" in the text area, and then "OK".) and run "matSAM.exe";
  5. "matSAM.exe" has to be run from the MS DOS console (double-clicking the "matSAM.exe" file will not work). Before running "matSAM.exe", make sure there are no existing "output.txt" or "graph.txt" files in the current folder: the program does not overwrite the result files it produces;
  6. Write the following command (square brackets are shown here to highlight elements separated by spaces, and are not part of the command). The command takes 4 arguments:
    C:\path...\matSAM.exe [nameFile.txt] [nb env. variables] [nb markers] [1]
    • Element 1 is the name of the input matrix (space separated text file);
    • Element 2 is the number of environmental variables;
    • Element 3 is the number of genetic markers;
    • Element 4 is the type of the function (enter "1" for the sigmoid function, the only one implemented so far; the Gaussian function will be added in a later version).
  7. To run "matSAM.exe" on the example file provided ( _ = space):
    C:\path...\matSAM.exe_test_17env63mark.txt_17_63_1
  8. First, "matSAM.exe" will expand the ".ctf" file (this takes about 10 seconds, see above). Ignore the message about the system locale settings. "matSAM.exe" will then run (model processing) until the prompt with the path is displayed again and 2 files appear in the current folder: "output.txt" and "graph.txt".

Summary
Example with successive commands to enter in the MS DOS command console

First command (= selection of the right volume):
D:

Second command (= go to the right folder):
cd D:\Code\Sam\SAMtorun\

Third command (= launch matSAM and process the right file):
matSAM.exe test_17env63mark.txt 17 63 1
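
For users who prefer to launch the program from a script, the same steps can be sketched in Python. The helper names below are hypothetical; only the 4 arguments and the names of the result files come from the description above.

```python
import os

def build_matsam_command(exe_path, input_file, n_env, n_markers, function_type=1):
    """Return the argument list: executable followed by the 4 arguments."""
    return [exe_path, input_file, str(n_env), str(n_markers), str(function_type)]

def existing_result_files(folder):
    """List result files already present, since matSAM.exe does not overwrite them."""
    return [name for name in ("output.txt", "graph.txt")
            if os.path.exists(os.path.join(folder, name))]

cmd = build_matsam_command("matSAM.exe", "test_17env63mark.txt", 17, 63)
# cmd == ['matSAM.exe', 'test_17env63mark.txt', '17', '63', '1']

# On a Windows machine where the components are installed, the command could
# then be started from the folder containing them, for instance with
# subprocess.run(cmd, cwd=folder, check=True), after checking that
# existing_result_files(folder) returns an empty list.
```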

 

Analysis of the results (SAMAnalysis.xls)

"SAMAnalysis.xls" helps you deal with the many p-values produced by the G and Wald statistical tests, and identify the most significant associations. Open the file and run the macro (click the "Run SAM Analysis" button): you have to indicate the number of environmental variables and the number of loci or alleles used (refer to your original input file). This allows the program to prepare the structure of the analysis table (called "rejection table" because it is used to reject models that are not significant at the significance level you choose). To interrupt a macro, click on the "Cancel" button in the Input Window, and then on the "End" button.

Once you have indicated the names of the different environmental variables and an initial confidence level (default = 0.01, i.e. 99%), a series of grey cells will be drawn: paste the result matrix exported by "matSAM.exe" ("output.txt") into this grey area using "copy/paste special, values" (the grey area is only there to help check the size of the matrix). Default marker names are provided, which you can replace: copy and paste the list of markers from your original input matrix.

Structure of the table

The table containing the results is made up of 15 different groups of statistical data. Each group consists of n rows, where n is the number of environmental variables. Columns correspond to genetic markers. These groups contain the following information:

1 Log Likelihood2
2 Log Likelihood1
3 Degrees of freedom
4 G value
5 P value for G
6 Null hypothesis rejected for G (default confidence level = 99%)
7 Wald for Beta 0
8 Wald for Beta 1
9 P value for Wald Beta 0
10 P value for Wald Beta 1
11 Null hypothesis rejected for Wald Beta 0 (default confidence level = 99%)
12 Null hypothesis rejected for Wald Beta 1 (default confidence level = 99%)

The next 3 groups constitute the dynamic section of the rejection table.

13 Dynamic null hypothesis analysis for G and Wald Beta 1 : Null hypothesis for G
14 Dynamic null hypothesis analysis for G and Wald Beta 1 : Null hypothesis for Wald Beta 1
15 Dynamic null hypothesis analysis for G and Wald Beta 1 : Cumulated test

We will focus on these 3 groups to carry out the analysis. The other groups contain the basic statistics and are made available in case it is necessary to refine the analysis.
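
For those who want to inspect the basic statistics outside Excel, the sketch below (Python) splits a result matrix into its groups. It assumes that "output.txt" contains the 12 static groups stacked in the order listed above, each with n rows and one column per marker, separated by whitespace; this layout is an assumption and should be checked against your own file. Function and group names are hypothetical.

```python
GROUP_NAMES = [
    "Log likelihood 2", "Log likelihood 1", "Degrees of freedom",
    "G value", "P value for G", "H0 rejected for G",
    "Wald Beta 0", "Wald Beta 1",
    "P value for Wald Beta 0", "P value for Wald Beta 1",
    "H0 rejected for Wald Beta 0", "H0 rejected for Wald Beta 1",
]

def read_matrix(path):
    """Read a whitespace-delimited numeric file; 'NaN' is parsed as a float NaN."""
    with open(path) as handle:
        return [[float(v) for v in line.split()] for line in handle if line.strip()]

def split_result_groups(rows, n_env):
    """Split the stacked rows into one block of n_env rows per group."""
    if len(rows) != len(GROUP_NAMES) * n_env:
        raise ValueError("unexpected number of rows for the assumed layout")
    return {name: rows[i * n_env:(i + 1) * n_env]
            for i, name in enumerate(GROUP_NAMES)}

# Hypothetical matrix: 2 environmental variables, 3 markers
rows = [[float(i)] * 3 for i in range(24)]
groups = split_result_groups(rows, n_env=2)
# groups["G value"] -> [[6.0, 6.0, 6.0], [7.0, 7.0, 7.0]]
```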

Dynamic table of analysis

The number of environmental variables and the number of loci or alleles indicated to the program allow it to generate formulas, which are stored at the bottom of the file. These formulas make it possible to set up a dynamic rejection table, whose results change according to the confidence level you choose. Copy and paste each formula into the appropriate cell (do not forget to add "=" before the formula), then drag the selection to extend the series to the last column on the right (corresponding to the last locus or allele). Next, select the whole row (all cells to which a formula was added) and drag the selection down to extend the series to the last row (corresponding to the last environmental variable). Repeat the same operation with the other 2 formulas, corresponding to groups 14 and 15.


Figure 3


Figure 4


Figure 5

In these 3 groups, cells display a "1" when the null hypothesis is rejected for the chosen confidence level, and a "0" when the null hypothesis is not rejected (i.e. the investigated variable does not explain significantly more variance than a model with a constant only). In the last group (15), cells show a "1" only when both tests (G and Wald) reject the null hypothesis (the reason is statistical robustness, see Joost et al. 2007).
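
The logic of these 3 groups can be sketched in Python as follows. The function name and the p-values are hypothetical, and the strict comparison (p-value below the threshold) is an assumption; the formulas in the Excel file remain the reference.

```python
def rejection_tables(p_g, p_wald_beta1, threshold):
    """Build the 3 dynamic groups from two matrices of p-values.

    Each matrix has one row per environmental variable and one column
    per marker. A cell is 1 when the null hypothesis is rejected.
    """
    def reject(p):
        # A NaN p-value compares as False and therefore gives 0
        return 1 if p < threshold else 0

    group_13 = [[reject(p) for p in row] for row in p_g]
    group_14 = [[reject(p) for p in row] for row in p_wald_beta1]
    # Cumulated test: 1 only when both G and Wald reject the null hypothesis
    group_15 = [[g & w for g, w in zip(row_g, row_w)]
                for row_g, row_w in zip(group_13, group_14)]
    return group_13, group_14, group_15

# Hypothetical p-values: 2 environmental variables, 2 markers
p_g = [[0.0001, 0.2], [0.00001, 0.00002]]
p_w = [[0.0002, 0.1], [0.3, 0.00001]]
g13, g14, g15 = rejection_tables(p_g, p_w, threshold=0.001)
# g13 -> [[1, 0], [1, 1]]
# g14 -> [[1, 0], [0, 1]]
# g15 -> [[1, 0], [0, 1]]
```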

On the 3 matrices (13, 14, 15), apply Excel conditional formatting with a given color when cells contain a 1, and no color when cells contain a 0 (see figure 6). This way, significant models are dynamically highlighted when you change the significance threshold.

Figure 6 : With a confidence level of 99% (1E-02 or 0.01), Bonferroni correction included, allele AGA144 is significantly associated with the environmental variable "mnt25", and allele AGA146 is significantly associated with the environmental variables "etptcompc1" and "sradpcompcc1". Allele AGA148-3 is significantly associated with the "hillshade" variable according to the G test only, and not according to the Wald test.

Adapting significance level

Another formula has to be inserted before the analysis table becomes dynamic. Copy the last formula at the bottom of the file into the appropriate cell. This formula corrects the confidence level to take the multiple hypothesis testing context into account. Here we simply apply the Bonferroni correction: the confidence level you choose (yellow cell in figure 6) is divided by the number of models (cell B146 in figure 6) and the result is stored in cell A140. This last cell is used to decide whether or not to reject the null hypothesis, which makes the approach very conservative (see arguments in Joost et al. 2007).
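
As a worked example of this correction, here is a short Python sketch. It assumes that the number of models equals the number of environmental variables multiplied by the number of markers (one univariate model per pair); the function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_env, n_markers):
    """Divide the chosen level by the number of models tested."""
    n_models = n_env * n_markers   # assumption: one model per (variable, marker) pair
    return alpha / n_models

# Example file: 17 environmental variables and 63 markers, i.e. 1071 models
corrected = bonferroni_threshold(0.01, 17, 63)
# corrected is about 9.34e-06
```

A model is then retained only if its p-value falls below this corrected value, which illustrates how conservative the approach becomes when many models are processed.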

Drawing graphs of the models (SAMGraph_v2.xls)

Warning : a compatibility problem exists between Excel 2003 and Excel 2007 and has not been solved yet. SAMGraph works correctly with Excel 2007 only. This will be corrected as soon as possible.

To draw the graph of the logistic function (sigmoid) corresponding to a given genetic marker and a given environmental variable, it is necessary to use their identification numbers. The identification number corresponds to the initial order used to build the matrix analysed by MatSAM. A suggestion is to sort the genetic markers according to their frequency among the sampled animals when building this matrix (from lower frequencies on the left to higher frequencies on the right). For environmental variables, just choose an order and always use the same one.
The file "graph.txt" exported by "matSAM.exe" has the following structure. Each model consists of 4 lines:

  • Line 1 : values of the environmental variable;
  • Line 2 : presence or absence of the genetic marker;
  • Line 3 : subdivision of the X axis (scale given by the statistical distribution of the environmental variable investigated);
  • Line 4 : probability that the genetic marker is present for the corresponding value of the environmental variable.

The SamGraph macro builds the graph according to those series of 4 lines, written one after the other in the "graph.txt" file.
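
The 4-line structure described above can also be read outside Excel. The Python sketch below assumes that "graph.txt" is a whitespace-delimited numeric file; the function name and the example values are hypothetical, and the order in which the models follow each other in the file is not assumed here.

```python
def parse_graph_blocks(text):
    """Group the lines of a graph file into models of 4 lines each."""
    rows = [[float(v) for v in line.split()]
            for line in text.splitlines() if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("the number of lines is not a multiple of 4")
    models = []
    for start in range(0, len(rows), 4):
        env, presence, x_axis, probability = rows[start:start + 4]
        models.append({
            "env": env,                  # line 1: environmental variable values
            "presence": presence,        # line 2: presence/absence of the marker
            "x": x_axis,                 # line 3: subdivision of the X axis
            "probability": probability,  # line 4: probability of presence
        })
    return models

# Hypothetical content with 2 models and 3 sampled individuals
sample = (
    "1.0 2.0 3.0\n0 1 1\n1.0 2.0 3.0\n0.2 0.5 0.8\n"
    "4.0 5.0 6.0\n1 0 0\n4.0 5.0 6.0\n0.7 0.4 0.2\n"
)
models = parse_graph_blocks(sample)
```

Plotting "x" against "probability" for one model gives the sigmoid, and "env" against "presence" gives the observed points.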
To produce the graph, open the "graph.txt" file in Excel and count the number of columns (number of values generated by "matSAM.exe" to create the subdivision of the X axis = number of animals or plants sampled). Then open the "SAMGraph_v2.xls" Excel file and run the "SamGraphA" macro (click on the "Draw matrix area" button). The program will prompt for the number of loci or alleles and for the number of environmental variables, and will draw the cells of the corresponding matrix in grey. Copy the matrix (the content of the "graph.txt" file) and paste it into this grey area (the grey area is a means of checking the size of the matrix).
Then use the "SamGraphB" macro to choose the model whose graph you want to draw. Run the "SamGraphB" macro (click on the "Draw graph for another model" button): the program prompts for the total number of environmental parameters in the analysis, for the number of the wanted locus or allele (see your original input matrix), for the number of the environmental variable (idem), and for the number of columns in the "graph.txt" file, before drawing the corresponding graph (figure 7).


Figure 7

Missing values

The value "NaN" (upper-case N, lower-case a, upper-case N) can be placed in the input matrix wherever a value is missing, whether it is the value of an environmental variable or the presence/absence of a marker. In Matlab syntax, NaN means "Not-a-Number". The main impact is that the G test cannot be computed when the presence/absence of a marker has a missing value, and "NaN" will appear in the corresponding column of the matrix of results. This does not affect the Wald test. In this case, you will have to assess your results on the basis of the Wald test only, which makes the "Cumulated test" mentioned above unusable. This is due to the quantities each test relies on: G = -2 ln (likelihood of the initial model with a constant only / likelihood of the new model including the examined variable); the distribution of this statistic is a chi-square with a number of degrees of freedom equal to the number of investigated parameters.
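
The G statistic defined above can be computed from the two log-likelihoods, as in the following Python sketch. It assumes a univariate model, hence 1 degree of freedom, for which the chi-square tail probability can be written with the complementary error function; the function name and the numerical values are hypothetical.

```python
import math

def g_test(loglik_constant_only, loglik_with_variable):
    """Return (G, p-value) for 1 degree of freedom.

    G = -2 ln(L0 / L1) = 2 * (ln L1 - ln L0). If either log-likelihood
    is NaN, both results are NaN.
    """
    if math.isnan(loglik_constant_only) or math.isnan(loglik_with_variable):
        return float("nan"), float("nan")
    g = 2.0 * (loglik_with_variable - loglik_constant_only)
    # Chi-square survival function with 1 degree of freedom
    p_value = math.erfc(math.sqrt(max(g, 0.0) / 2.0))
    return g, p_value

# Hypothetical log-likelihoods of the two models
g, p = g_test(-5.545, -4.0)
# g is 3.09 and p lies between 0.07 and 0.09
```

With more than one investigated parameter the number of degrees of freedom changes and a general chi-square distribution function is needed instead of this shortcut.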

     
Resp. : Prof. P. Ajmone Marsan, Institute of Zootechnics, UCSC, Piacenza, Italy
Ed. : S.Joost,
GIS Lab, EPFL, Switzerland
Last update of this page : 19.08.2010