# HELP

*bioNMF* is a web-based tool designed to provide a set of analysis methods using the *Non-negative Matrix Factorization* (*NMF*) model, widely used in the last few years by the biomedical community due to its intuitive nature, since it has the ability to extract highly-interpretable additive parts from datasets, reducing therefore its dimensionality.

This online tool provides a user-friendly interface, combined with a computational efficient parallel implementation of the NMF algorithm to explore data in different analysis scenarios:

The NMF algorithm takes one input matrix (**V**) and returns two non-negative matrices (**W** and **H**) whose product is equal to **V** (ie. **V** ∼ **W*****H**). See NMF Algorithms for more information.

If matrix **V** has *N* rows and *M* columns, then dimensions for output matrices **W** and **H** will be *N × K* and *K × M* respectively, where *K* is the selected factorization rank. This value can be specified by the user or *bioNMF* can find the best one within a given input range, by computing the *Cophenetic Correlation Coefficient* (*CCC*) for each of these ranks. See Analysis Options for more information.

In addition to the online access, *bioNMF* also provides the same functionality included in the web site as a public Web Services interface, enabling users with more computer expertise to launch jobs into the bioNMF server from their own scripts and workflows.

#### If you use this software, please cite the following work:

### Index:

- Uploading data.
- Data preprocessing.
- Analysis method.
- Analysis options.

# 1. UPLOADING DATA

**IMPORTANT NOTES:**

- Please read instructions below about valid text and binary file formats.
**No **`.xls`

, `.mat`

nor `.cvs`

files are accepted.
- Allowed characters for filenames are:
`'A'`

-`'Z'`

, `'a'`

-`'z'`

, `'0'`

-`'9'`

, `'_'`

, `'-'`

, `'.'`

, and `'+'`

.
**Therefore, filenames must not have characters such as space, TAB or tilde (**`'˜'`

).

### ASCII-text input files:

By default, *bioNMF* accepts input text files where data are separated by **single TAB characters**. It might also contain row labels and/or column headers, as well as a short description string at the beginning of the file (any of these three elements is optional). Each header, label or the description string might be composed by multiple **space-separated** words and/or numbers. If set, both column-headers and description string must be located at the first line.

**Important Notes:**

- Please,
**do NOT use single or double quote characters** on any labels.
**All rows and columns must be filled with data** (either numbers, labels or TAB characters). **Consecutive TAB characters will be processed as empty values** and will be replaced by 0 or empty labels.
- Please
**avoid TAB characters at the end of line**. They will be processed as an empty column.
- Please use a
**dot (**`'.'`

) as a decimal symbol, not a comma.
- Only UNIX (
`LF`

, `'\n'`

) and MS Windows (`CR`

+`LF`

, `'\r\n'`

) end-of-line styles are accepted.

See examples of valid text-file formats: with labels and without labels (if you use any of these matrices, please use one of the preprocessing methods listed below in order to set all their values as positive).

Under the following conditions, input data might be separated by **SINGLE space characters** (`' '`

):

- Each matrix label is a single word (including the description string).
- There must not be any TAB character in the file.
- Consecutive space characters will be processed as empty values (ie. as 0) or empty labels.
- Space characters at the end of line will be processed as an empty column.
- First and second rows must not have empty labels since they will be processed as empty numeric values (ie. as 0). In such case, please select the Data-matrix-contains-numeric-row-labels checkbox to avoid.

### Binary input files:

In addition, bioNMF also accepts binary files encoded using IEEE little-endian byte ordering. **Data must be written in the following format**:

- Number of rows (
*N*): a 32-bits signed integer.
- Number of columns (
*M*): a 32-bits signed integer.
- Data matrix:
*N*M* 64-bits floating-point values stored in row-major order (ie. contiguous elements belong to the same row).
- Optionally, headers, labels or a description string can be written as follows:
- Row labels:
*N* tab-separated strings in ASCII-text format.
- A newline (
`'\n'`

) character (in ASCII-text format).
- Column headers:
*M* tab-separated strings in ASCII-text format.
- A newline (
`'\n'`

) character (in ASCII-text format).
*Name* (the short description string): String in ASCII-text format.

Please note that newline characters (

`'\n'`

) are mandatory if any of headers, labels or the

*name* fields is set. In addition, only UNIX (

`LF`

) and MS-DOS (

`CR`

+

`LF`

) end-of-line styles are accepted.

See an example of how to save a file in this format with this simple

Matlab script.

### Input file options

To select your input file format (text or binary), as well as some additional options, just click on the '**Input file options**' menu.

- For input text-files, select '
**Data matrix contains numeric column headers**' and/or '**Data matrix contains numeric row labels**' in case your data matrix contains numeric headers (including the description string) and/or numeric row labels.

### Limited on-line version

In this on-line version, size and dimensions of accepted input matrices are limited to the following values:

- 32768 rows.
- 1024 columns.
- 4,194,304 items (i.e., 2
^{22}).

Please note the following:

- Your input matrix must not exceed any of the three limits. In this way, maximum valid matrix dimensions ranges from 4096 rows × 1024 columns to 32768 rows × 128 columns.
- These limits are applied
**after** data-matrix transposing (if selected).
- Limit values do not include the optional column and row labels.

# 2. DATA PREPROCESSING

Even if NMF seems to be an especially robust algorithm compared with other clustering methods like hierarchical clustering or SOM, a previous normalization step is may be convenient to make more evident the patterns of interest. Therefore, data normalization is provided as a preprocessing step before computing the factorization. To select a preprocessing method, just click on the '**Preprocessing options**' menu.

Before data normalization, you might transpose your input data matrix in case your data items are stored as input data rows (select '**Transpose data for analysis**').

Available normalization methods are:

**No normalization** (this is the default option, data won't be normalized)
**Subtract global mean**: The global mean of the data matrix is computed and then subtracted from all data items.
**Scale columns, then normalize rows**: This is the approach proposed by Getz, *et al.* (*PNAS 2000*) that first divide each column by its mean and then normalize each row.
**mean=0, std=1 by rows**: Each row of the data matrix is transformed in such a way that its mean will be zero and its standard deviation will be 1.
**mean=0, std=1 by columns**: Each column of the data matrix is transformed in such a way that its mean will be zero and its standard deviation will be 1.
**Subtract mean by rows**: The mean for each row of the data matrix is calculated and then subtracted from all data items of that row.
**Subtract mean by columns**: The mean for each column of the data matrix.is calculated and then subtracted from all data items of that column.
**Subtract mean by rows and then by columns**: The mean for each row of the data matrix is calculated and then subtracted from all data items of that row. In a subsequent step, the mean for each column of the data matrix is calculated and then subtracted from all data items of that column.

After the normalization step, or if the data originally contains negative values, this tool offers a set of methodologies to make it positive:

**Don't do anything**: This is the default option, data are expected to be positive.
**Subtract absolute minimum**: This a very simple method to make positive data. The minimum negative value is subtracted to every single cell of the data matrix.
**Fold data by rows**: This approach was used by Kim and Tidor (*Genome Res. 2003*) for the analysis of log-transformed gene expression data. Every row (item) is represented in two new rows of a new matrix. The first one is used to indicate positive expression (up-regulation) and the second one to indicate a negative expression value (down-regulation). This process doubles the number of rows of the data set.
**Fold data by columns**: Similar to the above case but this option makes the data positive by folding columns (variables).
**Exponential scaling: **Data are exponentially scaled to make them positive. This is an inverse operation of a logarithmic transformation.

**Note:** **Fold data by rows** and

**Fold data by columns** are reserved for

Standard-NMF analysis method.

# 3. ANALYSIS METHOD

*bioNMF* can be used to perform three types of analysis:

- Standard NMF: just executes the selected NMF algorithm.
- Bicluster Analysis: Clusters highly-related genes and samples.
- Sample Classification: Unsupervised classification method of experimental samples.

## 3.1 Standard NMF:

This module performs a non-negative matrix factorization of the input data. It returns the resulting **W** and **H** matrices corresponding to the best factorization rank (ie. rank of the most stable clustering) within a given input range.

A quantitative measure of cluster stability is provided by the *Cophenetic Correlation Coefficient* (*CCC*). This coefficient varies from 1 (a perfect stability) to 0 (instability).

**Note:** By default, this module executes the 'Standard' NMF variant.

See a description of CCC, available NMF algorithms and other input parameters in the Analysis Options section.

### Output files:

Factors |
matrix **W** (with labels if present in the input matrix). |

Encoding vectors |
matrix **H** (with labels if present in the input matrix). |

Cophenetic coefficients |
Vector of Cophenetic Correlation Coefficients for the given range of factorization ranks (only if a range is supplied). |

**Note:** If the input matrix contains labels, a copy without labels (ie. a numeric-only matrix) of each output matrix is also saved.

See an example of analysis output here.

## 3.2 Biclustering Analysis:

This analysis method, proposed by Carmona-Saez *et al.* (BMC Bioinformatics, 2006), is intended mainly for Gene-Expression analysis, although its applications can be extended to other type of data. Taking gene expression as a case of study, this method groups genes and samples based on local features generating sets of samples and genes that are locally related. The result is a set of *K* biclusters (sub-matrices) encoding modular patterns, where *K* is the best factorization rank within a given input range. Each bicluster matrix contains the set of genes that are highly associated to a local pattern and samples sorted by its importance in this pattern.

The best factorization rank correspond to the most stable clustering. A quantitative measure of this stability is provided by the *Cophenetic Correlation Coefficient* (*CCC*). This coefficient varies from 1 (a perfect stability) to 0 (instability).

**Note:** By default, this module executes the 'Non-smooth"' NMF variant.

See a description of CCC, available NMF algorithms and other input parameters in the Analysis Options section.

### Output files:

Factors |
matrix **W** (with labels if present in the input matrix). |

Encoding vectors |
matrix **H** (with labels if present in the input matrix). |

Cophenetic coefficients |
Vector of Cophenetic Correlation Coefficients for the given range of factorization ranks (only if a range is supplied). |

Additionally, for each bicluster found:

Bicluster data |
Bicluster submatrix |

Bicluster row-indexes |
Contains the row indexes of the extracted biclusters |

Bicluster column-indexes |
Contains the column indexes of the extracted biclusters |

**Note:** If the input matrix contains labels, a copy without labels (ie. a numeric-only matrix) of each output matrix is also saved.

The bicluster matrix contains the set of genes that are highly associated to a local pattern and samples sorted by its importance in this pattern. An image of the heatmap of each bicluster can be optionally generated (see save output images).

In this example a heatmap with the subset of genes and all samples (sorted by their association to the local pattern) is shown on the bottom part of the image. The plot on the upper part represents the coefficients of all samples in the corresponding row of matrix

**H**. Marked in blue are samples that show the largest coefficient for that factor while in green are marked samples that show largest coefficients in other rows in

**H**. More details can be found in (

Carmona-Saez et al. 2006). The consistency value represents the percentage of times that a bicluster is found in all the runs of the algorithm (see

number of runs). For this purpose, two biclusters are assumed to be similar if they share at least 75% of their genes. Finally, a bicluster is considered to be enough

*consistent* if it is found in 80% or more of the factorizations.

See an example of analysis output here.

## 3.3 Sample Classification:

This module implements the method proposed by Brunet *et al.* (PNAS 2004) to determine the most suitable number of sample clusters in a given dataset and to group the data samples into *K* clusters, being *K* the best factorization rank within a given input range. Results will be an estimation of the best number of clusters in the data set and the cluster assignments of each experimental condition.

The best factorization rank correspond to the one of the most stable clustering. A quantitative measure of this stability is provided by the *Cophenetic Correlation Coefficient* (*CCC*). This coefficient varies from 1 (a perfect stability) to 0 (instability).

**Note:** By default, this module executes the 'Divergence' NMF variant.

See a description of CCC, available NMF algorithms and other input parameters in the Analysis Options section.

### Output files:

Factors |
matrix **W** (with labels if present in the input matrix). |

Encoding vectors |
matrix **H** (with labels if present in the input matrix). |

Cophenetic coefficients |
Vector of Cophenetic Correlation Coefficients for the given range of factorization ranks. |

Cluster IDs |
For each factorization rank: a file indicating in which class was assigned each *sample* (i.e., each column of the input matrix). |

A graphical output of sample classifications can be optionally generated (see show output images). It shows the reordered consensus matrix and cophenetic correlation coefficient computed for each rank used in the analysis.

See an example of analysis output here.

# 4. ANALYSIS OPTIONS

All analysis methods can be controlled by several parameters:

- Range of factorization ranks.
- Number of runs per rank.
- User initialization files.
- Number of iterations per run.
- Stopping threshold.
- Saving mode of output files.
- Heatmaps and profile plots.
- Analysis's output images.
- NMF Algorithm.

## 4.1 Range of factorization ranks:

NMF decomposes your data into *K* clusters, being *K* (a.k.a. *Factorization Rank*) the inner dimension of the matrix product: **W*****H**. *bioNMF* can find the best factorization rank within a given input range, by computing the *Cophenetic Correlation Coefficient* (*CCC*) for each of these ranks.

CCC is a quantitative measure of clustering stability. It is based on the *Consensus clustering* method which exploits the stochastic nature of the NMF algorithm (see Brunet *et. al*, PNAS 2004). Since the NMF algorithm is non-deterministic, its solutions might vary from run to run when executed with different random initial values for **W** and **H**. If the factorization is stable for a given value, *K*, it is expected that data assignments to these *K* clusters would vary little from run to run.

CCC values will vary from 1 (a perfect stability) to 0 (instability). The best factorization rank then corresponds to the one with the highest CCC value. This method is probably one of the most used methods in the field to estimate the best factorization rank.

On our own experience, a value of 100 runs per factorization rank (see 'Number of runs' parameter) is normally enough to achieve reasonable results [Carmona-Saez et al. 2006].

This method is always used if a range of factorization ranks is supplied, or if the Sample Classification analysis method is selected.

**Notes:**

- The maximum factorization rank allowed for this on-line version is 32 factors.
- If initialization files are provided, only one factorization rank can be selected.

## 4.2 Number of runs:

Due to the non-deterministic nature of NMF, it may or may not converge to the same solution on each run depending on the random initial conditions. Therefore, executing the algorithm several times with different random initializations is a good approach for selecting the **W** and **H** matrices that best approximates the input matrix. Depending on the problem, more or less runs will be necessary to achieve an optimum solution. However, considering that the computational cost of this algorithm is very high, a limited number of runs is recommended. On our own experience, a value of 100 runs is normally enough to achieve reasonable results [Carmona-Saez et al. 2006].

If you select a range of factorization ranks, *bioNMF* will try the specified number of runs for each rank in that range.

**Notes:**

- The maximum number of runs allowed for this on-line version is 128 runs.
- If initialization files are provided, only one factorization rank can be selected.

## 4.3 User initialization files

Optionally, you can supply initial values for the output matrices, **W** and **H**, instead of initialize them with random values. This will considerably reduce the analysis time since you do not have to run the NMF algorithm more than once.

If you provide these files, the following analysis options will be disabled and set to a fixed value:

**IMPORTANT NOTES:**

- Restrictions on input-filename format also apply here.
**Initialization files must follow the same format of your input matrix** (eg. if input matrix is a binary file, initialization files must be binary as well).
**They must be numeric-only matrices.** Initialization files must not have labels.
**Matrix dimensions must be compatible with your input matrix.** If input matrix has *N* rows and *M* columns, then dimensions of initial **W** and **H** matrices must be *N × K* and *K × M* respectively, where *K* is the selected factorization rank.
- If input-matrix dimensions are modified by any normalization or transformation method, Initialization Files will
**NOT** be updated. They must **already** fit new input-matrix dimensions.

## 4.4 Number of iterations:

This parameter controls the maximum number of iterations per run to allow algorithm convergence. A maximum of 2000 iterations is enough on most cases.

**Note:** The maximum number of iterations allowed for this on-line version is 4096.

## 4.5 Stopping threshold:

This parameter controls the algorithm convergence on each run. *bioNMF* makes use of the convergence method described in [Brunet *et. al*, PNAS 2004].

Each 10 iterations, a connectivity matrix **C** of size *M × M* is computed, where *M* is the number of columns of matrix **H**. Each entry *Cij* in this matrix is set to 1 if column *i* and *j* in **H** have their maximum value for the same factor (ie. on the same row in **H**), and 0 otherwise.

If the connectivity matrix stop changing after a certain number of iterations (equals to the stopping threshold multiplied by 10), the matrices are considered as having converged and the algorithm stops the current run.

**Note:** This parameter has no effect if it is greater than the maximum number of iteration divided by 10.

## 4.6 Save options:

This parameter controls the mode on which output files are saved:

**Save best matrices (default):** Saves only the best **W** and **H** matrices in single files. Selected by default if the number of runs per rank is equal to 1.
**Save all matrices:** Saves all **W** and **H** matrices corresponding to all runs in single files. Best output matrices are also saved.
**Combine matrices in a single file:** Combines all **W** matrices and all **H** matrices corresponding to all runs in two files (see Chagoyen *et al.*, 2006). Best output matrices are also saved in single files.

## 4.7 Heatmaps and profiles:

If this option is selected, *bioNMF* will generate heatmaps graphics and profile plots for input and output matrices.

If the input matrix contains negative data, *bioNMF* uses a red-green colormap to generate the heatmap. Otherwise, a blue-red scale is used. Heatmaps for output files always uses a blue-red colormap.

## 4.8 Analysis's output images:

If this option is selected, *bioNMF* will generate output images of the selected analysis-method:

- Sample Classification: It shows the reordered consensus matrix and cophenetic correlation coefficient computed for each rank used in the analysis.
- Biclustering Analysis: It shows a heatmap with the subset of genes and all samples sorted by its association to the local pattern, and a plot representing the coefficients of all samples in the corresponding row of matrix
**H**.

See section

3. Analysis Method for details.

## 4.9 NMF Algorithms:

The NMF algorithm takes one input matrix (**V**) and returns two non-negative matrices (**W** and **H**) whose product is equal to **V** (ie. **V** ∼ **W*****H**). *bioNMF* lets you choose between the following variants of the NMF algorithm:

- Standard: The classical algorithm.
- Divergence NMF: Variant derived from another cost function.
- Non smooth NMF: Reduces smoothness on data.

### A) Standard:

This is the classical algorithm proposed by D. D. Lee and H. S. Seung (Nature, 1999).

### B) Divergence NMF:

Variant proposed by D. D. Lee and H. S. Seung from a divergence cost function (NIPS, 2001).

### C) Non-smooth NMF:

Variant proposed by A. Pascual-Montano *et al.* (IEEE TPAMI, 2006; BMC Bioinformatics, 2006). The cost function is derived by introducing an extra smoothness matrix (**S**) in order to demand sparseness to data.

The positive symmetric matrix **S**∈**R**^{K×K} (where *K* is the current factorization rank) is a *smoothing* matrix defined as:

where

**I** is the Identity matrix,

**1** is a vector of ones, and the parameter θ satisfies 0 ≤ θ ≤ 1.

This parameter controls the sparsity level from 0 (smooth) to 1 (sparse). We recommend to use 0.5.