bioNMF is a web-based tool designed to provide a set of analysis methods using the Non-negative Matrix Factorization (NMF) model, widely used in the last few years by the biomedical community due to its intuitive nature, since it has the ability to extract highly-interpretable additive parts from datasets, reducing therefore its dimensionality.
This online tool provides a user-friendly interface, combined with a computational efficient parallel implementation of the NMF algorithm to explore data in different analysis scenarios:
The NMF algorithm takes one input matrix (V) and returns two non-negative matrices (W and H) whose product is equal to V (ie. V ∼ W*H). See NMF Algorithms for more information.
If matrix V has N rows and M columns, then dimensions for output matrices W and H will be N × K and K × M respectively, where K is the selected factorization rank. This value can be specified by the user or bioNMF can find the best one within a given input range, by computing the Cophenetic Correlation Coefficient (CCC) for each of these ranks. See Analysis Options for more information.
In addition to the online access, bioNMF also provides the same functionality included in the web site as a public Web Services interface, enabling users with more computer expertise to launch jobs into the bioNMF server from their own scripts and workflows.
IMPORTANT NOTES:
.xls
, .mat
nor .cvs
files are accepted.'A'
-'Z'
, 'a'
-'z'
, '0'
-'9'
, '_'
, '-'
, '.'
, and '+'
.
Therefore, filenames must not have characters such as space, TAB or tilde ('˜'
).By default, bioNMF accepts input text files where data is separated by single TAB characters. It might also contain row labels and/or column headers, as well as a short description string at the beginning of the file (any of these three elements is optional). Each header, label or the description string might be composed by multiple space-separated words and/or numbers. If set, both column-headers and description string must be located at the first line.
Important Notes:
'.'
) as a decimal symbol, not a comma.LF
, '\n'
) and MS Windows (CR
+LF
, '\r\n'
) end-of-line styles are accepted.See examples of valid text-file formats: with labels and without labels (if you use any of these matrices, please use one of the preprocessing methods listed below in order to set all their values as positive).
Under the following conditions, input data might be separated by SINGLE space characters (' '
):
In addition, bioNMF also accepts binary files encoded using IEEE little-endian byte ordering. Data must be written in the following format:
'\n'
) character (in ASCII-text format).'\n'
) character (in ASCII-text format).'\n'
) are mandatory if any of headers, labels or the name fields is set. In addition, only UNIX (LF
) and MS-DOS (CR
+LF
) end-of-line styles are accepted.To select your input file format (text or binary), as well as some additional options, just click on the 'Input file options' menu.
In this on-line version, size and dimensions of acepted input matrices are limited to the following values:
Even if NMF seems to be an especially robust algorithm compared with other clustering methods like hierarchical clustering or SOM, a previous normalization step is may be convenient to make more evident the patterns of interest. Therefore, data normalization is provided as a pre-processing step before computing the factorization. To select a preprocessing method, just click on the 'Preprocessing options' menu.
Before data normalization, you might transpose your input data matrix in case your data items are stored as input data rows (select 'Transpose data for analysis').
Available normalization methods are:
After the normalization step, or if the data originally contains negative values, this tool offers a set of methodologies to make it positive:
bioNMF can be used to perform three types of analysis:
This module performs a non-negative matrix factorization of the input data. It returns the resulting W and H matrices corresponding to the best factorization rank (ie. rank of the most stable clustering) within a given input range.
A quantitative measure of cluster stability is provided by the Cophenetic Correlation Coefficient (CCC). This coefficient varies from 1 (a perfect stability) to 0 (instability).
Note: By default, this module executes the 'Standard' NMF variant.
See a description of CCC, available NMF algorithms and other input parameters in the Analysis Options section.
Factors | matrix W (with labels if present in the input matrix). |
Encoding vectors | matrix H (with labels if present in the input matrix). |
Cophenetic coefficients | Vector of Cophenetic Correlation Coefficients for the given range of factorization ranks (only if a range is supplied). |
See an example of analysis output here.
This analysis method, proposed by Carmona-Saez et al. (BMC Bioinformatics, 2006), is intended mainly for Gene-Expression analysis, although its applications can be extended to other type of data. Taking gene expression as a case of study, this method groups genes and samples based on local features generating sets of samples and genes that are locally related. The result is a set of K biclusters (sub-matrices) encoding modular patterns, where K is the best factorization rank within a given input range. Each bicluster matrix contains the set of genes that are highly associated to a local pattern and samples sorted by its importance in this pattern.
The best factorization rank correspond to the most stable clustering. A quantitative measure of this stability is provided by the Cophenetic Correlation Coefficient (CCC). This coefficient varies from 1 (a perfect stability) to 0 (instability).
Note: By default, this module executes the 'Non-smooth"' NMF variant.
See a description of CCC, available NMF algorithms and other input parameters in the Analysis Options section.
Factors | matrix W (with labels if present in the input matrix). |
Encoding vectors | matrix H (with labels if present in the input matrix). |
Cophenetic coefficients | Vector of Cophenetic Correlation Coefficients for the given range of factorization ranks (only if a range is supplied). |
Bicluster data | Bicluster submatrix |
Bicluster row-indexes | Contains the row indexes of the extracted biclusters |
Bicluster column-indexes | Contains the column indexes of the extracted biclusters |
The bicluster matrix contains the set of genes that are highly associated to a local pattern and samples sorted by its importance in this pattern. An image of the heatmap of each bicluster can be optionally generated (see save output images).
In this example a heatmap with the subset of genes and all samples (sorted by their association to the local pattern) is shown on the bottom part of the image. The plot on the upper part represents the coefficients of all samples in the corresponding row of matrix H. Marked in blue are samples that show the largest coefficient for that factor while in green are marked samples that show largest coefficients in other rows in H. More details can be found in (Carmona-Saez et al. 2006). The consistency value represents the percentage of times that a bicluster is found in all the runs of the algorithm (see number of runs). For this purpose, two biclusters are assumed to be similar if they share at least 75% of their genes. Finally, a bicluster is considered to be enough consistent if it is found in 80% or more of the factorizations.See an example of analysis output here.
This module implements the method proposed by Brunet et al. (PNAS 2004) to determine the most suitable number of sample clusters in a given dataset and to group the data samples into K clusters, being K the best factorization rank within a given input range. Results will be an estimation of the best number of clusters in the data set and the cluster assignments of each experimental condition.
The best factorization rank correspond to the one of the most stable clustering. A quantitative measure of this stability is provided by the Cophenetic Correlation Coefficient (CCC). This coefficient varies from 1 (a perfect stability) to 0 (instability).
Note: By default, this module executes the 'Divergence' NMF variant.
See a description of CCC, available NMF algorithms and other input parameters in the Analysis Options section.
Factors | matrix W (with labels if present in the input matrix). |
Encoding vectors | matrix H (with labels if present in the input matrix). |
Cophenetic coefficients | Vector of Cophenetic Correlation Coefficients for the given range of factorization ranks. |
Cluster IDs | For each factorization rank: a file indicating in which class was assigned each sample (i.e., each column of the input matrix). |
A graphical output of sample classifications can be optionally generated (see show output images). It shows the reordered consensus matrix and cophenetic correlation coefficient computed for each rank used in the analysis.
See an example of analysis output here.
All analysis methods can be controlled by several parameters:
NMF decomposes your data into K clusters, being K (a.k.a. Factorization Rank) the inner dimension of the matrix product: W*H. bioNMF can find the best factorization rank within a given input range, by computing the Cophenetic Correlation Coefficient (CCC) for each of these ranks.
CCC is a quantitative measure of clustering stability. It is based on the Consensus clustering method which exploits the stochastic nature of the NMF algorithm (see Brunet et. al, PNAS 2004). Since the NMF algorithm is non-deterministic, its solutions might vary from run to run when executed with different random initial values for W and H. If the factorization is stable for a given value, K, it is expected that data assignments to these K clusters would vary little from run to run.
CCC values will vary from 1 (a perfect stability) to 0 (instability). The best factorization rank then corresponds to the one with the highest CCC value. This method is probably one of the most used methods in the field to estimate the best factorization rank.
On our own experience, a value of 100 runs per factorization rank (see 'Number of runs' parameter) is normally enough to achieve reasonable results [Carmona-Saez et al. 2006].
This method is always used if a range of factorization ranks is supplied, or if the Sample Classification analysis method is selected.
Notes:
Due to the non-deterministic nature of NMF, it may or may not converge to the same solution on each run depending on the random initial conditions. Therefore, executing the algorithm several times with different random initializations is a good approach for selecting the W and H matrices that best approximates the input matrix. Depending on the problem, more or less runs will be necessary to achieve an optimum solution. However, considering that the computational cost of this algorithm is very high, a limited number of runs is recommended. On our own experience, a value of 100 runs is normally enough to achieve reasonable results [Carmona-Saez et al. 2006].
If you select a range of factorization ranks, bioNMF will try the specified number of runs for each rank in that range.
Notes:
Optionally, you can supply initial values for the output matrices, W and H, instead of initialize them with random values. This will considerably reduce the analysis time since you do not have to run the NMF algorithm more than once.
If you provide these files, the following analysis options will be disabled and set to a fixed value:
IMPORTANT NOTES:
This parameter controls the maximum number of iterations per run to allow algorithm convergence. A maximum of 2000 iterations is enough on most cases.
Note: The maximum number of iterations allowed for this on-line version is 4096.
This parameter controls the algorithm convergence on each run. bioNMF makes use of the convergence method described in [Brunet et. al, PNAS 2004].
Each 10 iterations, a connectivity matrix C of size M × M is computed, where M is the number of columns of matrix H. Each entry Cij in this matrix is set to 1 if column i and j in H have their maximum value for the same factor (ie. on the same row in H), and 0 otherwise.
If the connectivity matrix stop changing after a certain number of iterations (equals to the stopping threshold multiplied by 10), the matrices are considered as having converged and the algorithm stops the current run.
Note: This parameter has no effect if it is greater than the maximum number of iteration divided by 10.
This parameter controls the mode on which output files are saved:
If this option is selected, bioNMF will generate heatmaps graphics and profile plots for input and output matrices.
If the input matrix contains negative data, bioNMF uses a red-green colormap to generate the heatmap. Otherwise, a blue-red scale is used. Heatmaps for output files always uses a blue-red colormap.
If this option is selected, bioNMF will generate output images of the selected analysis-method:
The NMF algorithm takes one input matrix (V) and returns two non-negative matrices (W and H) whose product is equal to V (ie. V ∼ W*H). bioNMF lets you choose between the following variants of the NMF algorithm:
This is the classical algorithm proposed by D. D. Lee and H. S. Seung (Nature, 1999).
Variant proposed by D. D. Lee and H. S. Seung from a divergence cost function (NIPS, 2001).
Variant proposed by A. Pascual-Montano et al. (IEEE TPAMI, 2006; BMC Bioinformatics, 2006). The cost function is derived by introducing an extra smoothness matrix (S) in order to demand sparseness to data.
The positive symmetric matrix S∈R^{K×K} (where K is the current factorization rank) is a smoothing matrix defined as:
If you use this software, please cite the following work: