Last updated: 2011/05/17.
Please refresh your browser's cache (press F5 or Ctrl-R).

   
[bioNMF] [Webservices]

HELP

bioNMF is a web-based tool designed to provide a set of analysis methods using the Non-negative Matrix Factorization (NMF) model, widely used in the last few years by the biomedical community due to its intuitive nature, since it has the ability to extract highly-interpretable additive parts from datasets, reducing therefore its dimensionality.

This online tool provides a user-friendly interface, combined with a computational efficient parallel implementation of the NMF algorithm to explore data in different analysis scenarios:

The NMF algorithm takes one input matrix (V) and returns two non-negative matrices (W and H) whose product is equal to V (ie. VW*H). See NMF Algorithms for more information.
If matrix V has N rows and M columns, then dimensions for output matrices W and H will be N × K and K × M respectively, where K is the selected factorization rank. This value can be specified by the user or bioNMF can find the best one within a given input range, by computing the Cophenetic Correlation Coefficient (CCC) for each of these ranks. See Analysis Options for more information.

In addition to the online access, bioNMF also provides the same functionality included in the web site as a public Web Services interface, enabling users with more computer expertise to launch jobs into the bioNMF server from their own scripts and workflows.


1. UPLOADING DATA

IMPORTANT NOTES:

ASCII-text input files:

By default, bioNMF accepts input text files where data is separated by single TAB characters. It might also contain row labels and/or column headers, as well as a short description string at the beginning of the file (any of these three elements is optional). Each header, label or the description string might be composed by multiple space-separated words and/or numbers. If set, both column-headers and description string must be located at the first line.

Important Notes:

See examples of valid text-file formats: with labels and without labels (if you use any of these matrices, please use one of the preprocessing methods listed below in order to set all their values as positive).

Under the following conditions, input data might be separated by SINGLE space characters (' '):

Binary input files:

In addition, bioNMF also accepts binary files encoded using IEEE little-endian byte ordering. Data must be written in the following format:

Please note that newline characters ('\n') are mandatory if any of headers, labels or the name fields is set. In addition, only UNIX (LF) and MS-DOS (CR+LF) end-of-line styles are accepted.
See an example of how to save a file in this format with this simple Matlab script.

Input file options

To select your input file format (text or binary), as well as some additional options, just click on the 'Input file options' menu.

Upload data

Limited on-line version

In this on-line version, size and dimensions of acepted input matrices are limited to the following values:

Please note the following:






2. DATA PREPROCESSING

Even if NMF seems to be an especially robust algorithm compared with other clustering methods like hierarchical clustering or SOM, a previous normalization step is may be convenient to make more evident the patterns of interest. Therefore, data normalization is provided as a pre-processing step before computing the factorization. To select a preprocessing method, just click on the 'Preprocessing options' menu.

Preprocessing data dialog

Before data normalization, you might transpose your input data matrix in case your data items are stored as input data rows (select 'Transpose data for analysis').

Available normalization methods are:

After the normalization step, or if the data originally contains negative values, this tool offers a set of methodologies to make it positive:

Note: Fold data by rows and Fold data by columns are reserved for Standard-NMF analysis method.






3. ANALYSIS METHOD

bioNMF can be used to perform three types of analysis:

  1. Standard NMF: just executes the selected NMF algorithm.
  2. Bicluster Analysis: Clusters highly-related genes and samples.
  3. Sample Classification: Unsupervised classification method of experimental samples.

3.1 Standard NMF:

This module performs a non-negative matrix factorization of the input data. It returns the resulting W and H matrices corresponding to the best factorization rank (ie. rank of the most stable clustering) within a given input range.

A quantitative measure of cluster stability is provided by the Cophenetic Correlation Coefficient (CCC). This coefficient varies from 1 (a perfect stability) to 0 (instability).

Note: By default, this module executes the 'Standard' NMF variant.

See a description of CCC, available NMF algorithms and other input parameters in the Analysis Options section.

Standard NMF

Output files:

Factors matrix W (with labels if present in the input matrix).
Encoding vectors matrix H (with labels if present in the input matrix).
Cophenetic coefficients Vector of Cophenetic Correlation Coefficients for the given range of factorization ranks (only if a range is supplied).

Note: If the input matrix contains labels, a copy without labels (ie. a numeric-only matrix) of each output matrix is also saved.

See an example of analysis output here.


3.2 Biclustering Analysis:

This analysis method, proposed by Carmona-Saez et al. (BMC Bioinformatics, 2006), is intended mainly for Gene-Expression analysis, although its applications can be extended to other type of data. Taking gene expression as a case of study, this method groups genes and samples based on local features generating sets of samples and genes that are locally related. The result is a set of K biclusters (sub-matrices) encoding modular patterns, where K is the best factorization rank within a given input range. Each bicluster matrix contains the set of genes that are highly associated to a local pattern and samples sorted by its importance in this pattern.

The best factorization rank correspond to the most stable clustering. A quantitative measure of this stability is provided by the Cophenetic Correlation Coefficient (CCC). This coefficient varies from 1 (a perfect stability) to 0 (instability).

Note: By default, this module executes the 'Non-smooth"' NMF variant.

See a description of CCC, available NMF algorithms and other input parameters in the Analysis Options section.

Biclustering Analysis

Output files:

Factors matrix W (with labels if present in the input matrix).
Encoding vectors matrix H (with labels if present in the input matrix).
Cophenetic coefficients Vector of Cophenetic Correlation Coefficients for the given range of factorization ranks (only if a range is supplied).

Additionally, for each bicluster found:
Bicluster data Bicluster submatrix
Bicluster row-indexes Contains the row indexes of the extracted biclusters
Bicluster column-indexes Contains the column indexes of the extracted biclusters

Note: If the input matrix contains labels, a copy without labels (ie. a numeric-only matrix) of each output matrix is also saved.


The bicluster matrix contains the set of genes that are highly associated to a local pattern and samples sorted by its importance in this pattern. An image of the heatmap of each bicluster can be optionally generated (see save output images).

Biclustering Analysis example
In this example a heatmap with the subset of genes and all samples (sorted by their association to the local pattern) is shown on the bottom part of the image. The plot on the upper part represents the coefficients of all samples in the corresponding row of matrix H. Marked in blue are samples that show the largest coefficient for that factor while in green are marked samples that show largest coefficients in other rows in H. More details can be found in (Carmona-Saez et al. 2006). The consistency value represents the percentage of times that a bicluster is found in all the runs of the algorithm (see number of runs). For this purpose, two biclusters are assumed to be similar if they share at least 75% of their genes. Finally, a bicluster is considered to be enough consistent if it is found in 80% or more of the factorizations.

See an example of analysis output here.


3.3 Sample Classification:

This module implements the method proposed by Brunet et al. (PNAS 2004) to determine the most suitable number of sample clusters in a given dataset and to group the data samples into K clusters, being K the best factorization rank within a given input range. Results will be an estimation of the best number of clusters in the data set and the cluster assignments of each experimental condition.

The best factorization rank correspond to the one of the most stable clustering. A quantitative measure of this stability is provided by the Cophenetic Correlation Coefficient (CCC). This coefficient varies from 1 (a perfect stability) to 0 (instability).

Note: By default, this module executes the 'Divergence' NMF variant.

See a description of CCC, available NMF algorithms and other input parameters in the Analysis Options section.

Sample Classification

Output files:

Factors matrix W (with labels if present in the input matrix).
Encoding vectors matrix H (with labels if present in the input matrix).
Cophenetic coefficients Vector of Cophenetic Correlation Coefficients for the given range of factorization ranks.
Cluster IDs For each factorization rank: a file indicating in which class was assigned each sample (i.e., each column of the input matrix).

A graphical output of sample classifications can be optionally generated (see show output images). It shows the reordered consensus matrix and cophenetic correlation coefficient computed for each rank used in the analysis.

Sample Classification example

See an example of analysis output here.






4. ANALYSIS OPTIONS

All analysis methods can be controlled by several parameters:

  1. Range of factorization ranks.
  2. Number of runs per rank.
  3. User initialization files.
  4. Number of iterations per run.
  5. Stopping threshold.
  6. Saving mode of output files.
  7. Heatmaps and profile plots.
  8. Analysis's output images.
  9. NMF Algorithm.

Analysis options

4.1 Range of factorization ranks:

NMF decomposes your data into K clusters, being K (a.k.a. Factorization Rank) the inner dimension of the matrix product: W*H. bioNMF can find the best factorization rank within a given input range, by computing the Cophenetic Correlation Coefficient (CCC) for each of these ranks.

CCC is a quantitative measure of clustering stability. It is based on the Consensus clustering method which exploits the stochastic nature of the NMF algorithm (see Brunet et. al, PNAS 2004). Since the NMF algorithm is non-deterministic, its solutions might vary from run to run when executed with different random initial values for W and H. If the factorization is stable for a given value, K, it is expected that data assignments to these K clusters would vary little from run to run.

CCC values will vary from 1 (a perfect stability) to 0 (instability). The best factorization rank then corresponds to the one with the highest CCC value. This method is probably one of the most used methods in the field to estimate the best factorization rank.

On our own experience, a value of 100 runs per factorization rank (see 'Number of runs' parameter) is normally enough to achieve reasonable results [Carmona-Saez et al. 2006].

This method is always used if a range of factorization ranks is supplied, or if the Sample Classification analysis method is selected.

Notes:


4.2 Number of runs:

Due to the non-deterministic nature of NMF, it may or may not converge to the same solution on each run depending on the random initial conditions. Therefore, executing the algorithm several times with different random initializations is a good approach for selecting the W and H matrices that best approximates the input matrix. Depending on the problem, more or less runs will be necessary to achieve an optimum solution. However, considering that the computational cost of this algorithm is very high, a limited number of runs is recommended. On our own experience, a value of 100 runs is normally enough to achieve reasonable results [Carmona-Saez et al. 2006].

If you select a range of factorization ranks, bioNMF will try the specified number of runs for each rank in that range.

Notes:


4.3 User initialization files

Optionally, you can supply initial values for the output matrices, W and H, instead of initialize them with random values. This will considerably reduce the analysis time since you do not have to run the NMF algorithm more than once.

If you provide these files, the following analysis options will be disabled and set to a fixed value:

IMPORTANT NOTES:


4.4 Number of iterations:

This parameter controls the maximum number of iterations per run to allow algorithm convergence. A maximum of 2000 iterations is enough on most cases.

Note: The maximum number of iterations allowed for this on-line version is 4096.


4.5 Stopping threshold:

This parameter controls the algorithm convergence on each run. bioNMF makes use of the convergence method described in [Brunet et. al, PNAS 2004].

Each 10 iterations, a connectivity matrix C of size M × M is computed, where M is the number of columns of matrix H. Each entry Cij in this matrix is set to 1 if column i and j in H have their maximum value for the same factor (ie. on the same row in H), and 0 otherwise.
If the connectivity matrix stop changing after a certain number of iterations (equals to the stopping threshold multiplied by 10), the matrices are considered as having converged and the algorithm stops the current run.

Note: This parameter has no effect if it is greater than the maximum number of iteration divided by 10.


4.6 Save options:

This parameter controls the mode on which output files are saved:


4.7 Heatmaps and profiles:

If this option is selected, bioNMF will generate heatmaps graphics and profile plots for input and output matrices.

If the input matrix contains negative data, bioNMF uses a red-green colormap to generate the heatmap. Otherwise, a blue-red scale is used. Heatmaps for output files always uses a blue-red colormap.


4.8 Analysis's output images:

If this option is selected, bioNMF will generate output images of the selected analysis-method:

See section 3. Analysis Method for details.


4.9 NMF Algorithms:

The NMF algorithm takes one input matrix (V) and returns two non-negative matrices (W and H) whose product is equal to V (ie. VW*H). bioNMF lets you choose between the following variants of the NMF algorithm:

  1. Standard: The classical algorithm.
  2. Divergence NMF: Variant derived from another cost function.
  3. Non smooth NMF: Reduces smoothness on data.

A) Standard:

This is the classical algorithm proposed by D. D. Lee and H. S. Seung (Nature, 1999).

Standard NMF algorithm


B) Divergence NMF:

Variant proposed by D. D. Lee and H. S. Seung from a divergence cost function (NIPS, 2001).

Divergence NMF algorithm


C) Non-smooth NMF:

Variant proposed by A. Pascual-Montano et al. (IEEE TPAMI, 2006; BMC Bioinformatics, 2006). The cost function is derived by introducing an extra smoothness matrix (S) in order to demand sparseness to data.

Nonsmooth NMF algorithm

The positive symmetric matrix SRK×K (where K is the current factorization rank) is a smoothing matrix defined as:

Smoothing matrix S

where I is the Identity matrix, 1 is a vector of ones, and the parameter θ satisfies 0 ≤ θ ≤ 1.

This parameter controls the sparsity level from 0 (smooth) to 1 (sparse). We recommend to use 0.5.






5. HOW TO CITE bioNMF

If you use this software, please cite the following work: