Last updated: 2011/05/17.

WEB SERVICES USER GUIDE

The Web Service provides an API for using the system's services programmatically. Its functionality mirrors that of the web page, with minor differences that simplify its use. The API is asynchronous: every analysis method launches a job that is registered on the server, and an identifier is returned to the client so that it can query the status of the job and gather the results once the job is done.

The results of a job are provided in two ways. The main one deals only with the essential results of each job; which results are produced depends on the analysis performed and the parameters used. To gather them, you first query the server for a list of result identifiers, and then use these identifiers to retrieve the actual contents of the results. The method descriptions below detail what these results are in each case.

When using the web interface, you are given more than just the essential results, including images used to assess or validate them. Because these extra outputs are usually intended for user viewing rather than for direct consumption by client software, we do not provide them by the same means as the essential results. Instead, we provide a method that lets a client download a file bundle in .tgz format, so these outputs are still accessible via the Web Service.


1. CONNECTING TO THE SERVER

The easiest way to connect to the server is to load bioNMF's WSDL file, located at http://bionmf.dacya.ucm.es/2.0/WebService/BioNMFWS.wsdl. This prepares the driver with all the functions in the API.

For example, this can be done in Ruby as follows:

require 'soap/wsdlDriver'
require 'fileutils'
require 'base64'
require 'yaml'

WSDL_URL = "http://bionmf.dacya.ucm.es/2.0/WebService/BioNMFWS.wsdl"

driver = SOAP::WSDLDriverFactory.new(WSDL_URL).create_rpc_driver

Strictly speaking, you only need the soap/wsdlDriver library to connect to the server. The other modules are required for later steps.






2. DATA FORMAT

IMPORTANT NOTES:

ASCII-text input files:

By default, bioNMF accepts input text files in which data are separated by single TAB characters. A file may also contain row labels, column headers, and a short description string at the beginning of the file (all three elements are optional). Each header, label, or description string may consist of multiple space-separated words and/or numbers. If present, the column headers and the description string must both be on the first line.
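The text format described above can be sketched as a small Ruby helper. This is only an illustration: the helper name format_matrix is hypothetical, and placing the description string as the first cell of the header line is an assumption that should be checked against the example files.

```ruby
# Builds a TAB-separated matrix with a description string, column headers and
# row labels. ASSUMPTION: the description is the first cell of the first line,
# followed by the column headers.
def format_matrix(description, column_headers, row_labels, rows)
  text_cells = [description, *column_headers, *row_labels]
  if text_cells.any? { |cell| cell.to_s =~ /[\t\r\n]/ }
    raise ArgumentError, "text fields must not contain TAB or newline characters"
  end
  raise ArgumentError, "one label per row expected" unless row_labels.size == rows.size
  unless rows.all? { |row| row.size == column_headers.size }
    raise ArgumentError, "one header per column expected"
  end

  lines = [[description, *column_headers].join("\t")]
  rows.each_with_index { |row, i| lines << [row_labels[i], *row].join("\t") }
  lines.join("\n") + "\n" # UNIX (LF) line endings
end

# Example: a 2 x 2 matrix with multi-word labels.
puts format_matrix("Toy data", ["Sample 1", "Sample 2"],
                   ["gene A", "gene B"], [[1.0, 2.5], [0, 3]])
```

Labels may contain spaces because only TAB characters separate the fields.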

Important Notes:

See examples of input text-file formats: with labels and without labels (if you use any of these matrices, please apply one of the preprocessing methods listed below to make all their values positive).

Under the following conditions, input data might be separated by SINGLE space characters (' '):

Binary input files:

In addition, bioNMF also accepts binary files encoded using IEEE little-endian byte ordering. Data must be written in the following format:

Please note that newline characters ('\n') are mandatory if any of the headers, labels, or name fields is set. In addition, only UNIX (LF) and MS-DOS (CR+LF) end-of-line styles are accepted.
See an example of how to save a file in this format with this simple Matlab script.

Limited on-line version

In this on-line version, the size and dimensions of accepted input matrices are limited to the following values:

Please note the following:






3. WEB SERVICE API

Performing an analysis with the web service usually entails uploading and preprocessing a data matrix, calling one of the three analysis methods, and querying the status until the job is finished. After that, you can retrieve the list of results (which contains only their identifiers) and gather those you are interested in, or retrieve the whole set of output files, which includes the results and other useful information for the user.

3.1 Upload input data:

upload_matrix: matrix, binary, column_headers, row_labels, transpose, normalization, positive, suggested_name => matrix_id

Uploads a matrix to the server and performs the selected preprocessing methods (normalization or making data positive).

This method returns a string that represents an identifier for the input matrix. You can use this ID to launch one or more analysis processes on this matrix (each process will use a private copy of the input data).

Input parameters: Returns:

3.2 Analysis methods:

bioNMF can be used to perform three types of analysis:

  1. Standard NMF: just executes the selected NMF algorithm.
  2. Bicluster Analysis: Clusters highly-related genes and samples.
  3. Sample Classification: Unsupervised classification method of experimental samples.

A) Standard NMF:

standardNMF: matrix_id, algorithm, kstart, kend, runs, iterations, stop, sparseness, sup_info, suggested_name => job_id

Launches a job to perform a Non-negative Matrix Factorization on the selected input matrix. Returns a string representing the job's identifier. This identifier can be used to query job status and to retrieve analysis results.

The result of this analysis is two matrices, W and H, corresponding to the best factorization rank (i.e., the rank of the most stable clustering) within a given input range. A quantitative measure of this stability is provided by the Cophenetic Correlation Coefficient (CCC). This coefficient varies from 1 (perfect stability) to 0 (instability). See Analysis options for details.

Input parameters: Returns:

See a detailed description of input parameters and the CCC in the Analysis Options section.


B) Biclustering Analysis:

biclustering: matrix_id, algorithm, kstart, kend, runs, iterations, stop, sparseness, sup_info, suggested_name => job_id

Launches a job to perform a biclustering on the selected input matrix. Returns a string representing the job's identifier. You can use this ID number to query job status and retrieve analysis results.

This analysis method, proposed by Carmona-Saez et al. (BMC Bioinformatics, 2006), is intended mainly for gene-expression analysis, although it can be applied to other types of data. Taking gene expression as a case study, the method groups genes and samples based on local features, generating sets of samples and genes that are locally related. The result is a set of K biclusters (sub-matrices) encoding modular patterns, where K is the best factorization rank within a given input range. Each bicluster matrix contains the set of genes that are highly associated with a local pattern, and the samples sorted by their importance in this pattern.

The best factorization rank corresponds to the most stable clustering. A quantitative measure of this stability is provided by the Cophenetic Correlation Coefficient (CCC). This coefficient varies from 1 (perfect stability) to 0 (instability). See Analysis options for details.

Input parameters: Returns:

See a detailed description of input parameters and the CCC in the Analysis Options section.


C) Sample Classification:

sampleClassification: matrix_id, algorithm, kstart, kend, runs, iterations, stop, sparseness, sup_info, suggested_name => job_id

Launches a job to perform an unsupervised classification of the experimental samples in the selected input matrix. Returns a string representing the job's identifier. You can use this ID to query job status and retrieve analysis results.

This module implements the method proposed by Brunet et al. (PNAS 2004) to determine the most suitable number of sample clusters in a given dataset and to group the data samples into K clusters, where K is the best factorization rank within a given input range. The results are an estimate of the best number of clusters in the data set and the cluster assignment of each experimental condition.

The best factorization rank corresponds to the most stable clustering. A quantitative measure of this stability is provided by the Cophenetic Correlation Coefficient (CCC). This coefficient varies from 1 (perfect stability) to 0 (instability). See Analysis options for details.

Input parameters: Returns:

See a detailed description of input parameters and the CCC in the Analysis Options section.
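Since the three analysis methods share the same argument list, a client can launch any of them through one helper. The following is a minimal sketch: launch_analysis is a hypothetical name, and it assumes the driver built from the WSDL exposes each operation as a regular Ruby method taking the arguments in the documented order. Valid values for algorithm and sup_info are not shown here and must be taken from the parameter descriptions.

```ruby
ANALYSIS_METHODS = %w[standardNMF biclustering sampleClassification].freeze

# Argument order as listed in the three method signatures above.
ANALYSIS_ARGS = %i[matrix_id algorithm kstart kend runs iterations
                   stop sparseness sup_info suggested_name].freeze

# Launches one of the three analyses and returns the job identifier.
# `options` is a Hash with one entry per name in ANALYSIS_ARGS.
def launch_analysis(driver, method, options)
  unless ANALYSIS_METHODS.include?(method.to_s)
    raise ArgumentError, "unknown analysis method: #{method}"
  end
  missing = ANALYSIS_ARGS - options.keys
  raise ArgumentError, "missing options: #{missing.join(', ')}" unless missing.empty?

  # Pass the values positionally, in the documented order.
  driver.public_send(method, *ANALYSIS_ARGS.map { |name| options[name] })
end

# Usage with a real driver (not executed here):
#   job_id = launch_analysis(driver, :biclustering, options)
```

Using named options on the client side avoids mixing up the ten positional arguments.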


3.3 Querying job status and output messages:

With the following functions you can query the status of a job, as well as its output messages.

A) Job status

status: job_id => String

Returns a string with the status of the job.

Other generic states:


B) Information Messages

messages: job_id => Array of strings

These are messages generated by the functions. They are usually verbose descriptions of the states the job goes through. If the job has the "error" status, this function can provide information about the nature of the error.


C) Checking if job is finished

done: job_id => boolean

Returns true if the job has finished with any of the "done", "error", or "aborted" states.

error: job_id => boolean

Returns true if the job has finished with the "error" status.
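Because the API is asynchronous, a client typically polls done until the job reaches a final state. The loop below is a sketch of that pattern: wait_for_job, the polling interval, and the poll limit are hypothetical choices, and the driver is assumed to expose done, error, status, and messages as ordinary methods.

```ruby
# Polls the server until the job is finished, then returns its status string
# ("done", "error" or "aborted"). Raises if the job is still running after
# max_polls attempts.
def wait_for_job(driver, job_id, interval: 5, max_polls: 720)
  max_polls.times do
    if driver.done(job_id)
      # On failure, print the server messages to help diagnose the error.
      warn driver.messages(job_id).join("\n") if driver.error(job_id)
      return driver.status(job_id)
    end
    sleep(interval)
  end
  raise "job #{job_id} did not finish after #{max_polls} polls"
end

# Usage with a real driver (not executed here):
#   final_status = wait_for_job(driver, job_id)
```

A fixed interval keeps the example short; a longer interval is kinder to the server for jobs with many runs.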


D) Job information

info: job_id => YAML structure

This method returns a YAML structure with information about the job, such as its input parameters.
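The structure returned by info can be parsed with Ruby's standard yaml library. This is a sketch only: job_info is a hypothetical helper, it assumes the structure arrives as a YAML string, and the keys in the sample document are made up for illustration because the actual keys are not listed here.

```ruby
require 'yaml'

# Parses the YAML text returned by `info` into plain Ruby objects.
def job_info(driver, job_id)
  YAML.safe_load(driver.info(job_id))
end

# Hypothetical YAML document, for illustration only.
sample = "suggested_name: demo\nkstart: 2\nkend: 4\n"
p YAML.safe_load(sample)
```

safe_load is used instead of load so that only basic types (strings, numbers, arrays, hashes) are created from data received over the network.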


3.4 Getting job results:

After the analysis is finished, you can retrieve the list of results (which contains only their identifiers) and gather those you are interested in, or you can retrieve the whole set of output files, which includes the results and other useful information for the user.

A) Result identifiers

results: job_id => Array of IDs

Given the identifier of a finished job (job_id), this method returns a vector with the identifiers of all analysis outputs. Each of these IDs can be used to retrieve its corresponding output file with the method result.

According to the selected analysis methods, results can be (in this order):

Notes:


B) Retrieving a result

result: res_id => Output data (base64-encoded String!)

Returns the output data referenced by res_id as a base64-encoded string. Depending on the input matrix, output data may be in binary or text format. To ensure that all data can be transferred as printable characters, it is encoded in base64. Therefore, the string returned by this function must be decoded before the data is written to a file. With Ruby's soap4r, the returned data must be explicitly decoded with the Base64 module (see our example client).
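Combining results and result, a client can download every essential output of a job. The sketch below assumes the driver exposes both operations as ordinary methods; save_results and the result_NN file names are hypothetical (the server-side names of the outputs are not used here).

```ruby
require 'base64'
require 'fileutils'

# Downloads every result of a finished job into out_dir and returns the
# list of files written.
def save_results(driver, job_id, out_dir)
  FileUtils.mkdir_p(out_dir)
  driver.results(job_id).each_with_index.map do |res_id, index|
    # The server sends base64 text; decode it back to the raw bytes.
    data = Base64.decode64(driver.result(res_id))
    path = File.join(out_dir, format("result_%02d", index))
    File.binwrite(path, data) # binary-safe for both text and binary outputs
    path
  end
end

# Usage with a real driver (not executed here):
#   files = save_results(driver, job_id, "output_my_matrix")
```

Writing in binary mode avoids newline translation, so the same code works whether the output is text or binary.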


C) Get bundled results

bundle: job_id => tgz bundle of files (base64-encoded String!)

Depending on the analysis performed and the parameters provided, the system produces a number of files in addition to the ones returned by the result method. While these files are not essential results of the task, they may be of interest for exploratory or assessment tasks. All of them can be accessed using this function. Note that, because the bundle is a binary file, its data is encoded in base64 to ensure that it is transferred as printable characters. The client therefore needs to decode this string with an appropriate function before writing it to a file. With Ruby's soap4r, the returned data must be explicitly decoded with the Base64 module (see our example client).

NOTE: This file is not stored on the server. It is generated each time you call this function with all files available at that moment.
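A bundle download can be sketched in the same way. The helper name save_bundle is hypothetical; the check on the first two bytes relies only on the fact that a .tgz file is a gzip-compressed tar archive, so it must start with the gzip signature 0x1f 0x8b.

```ruby
require 'base64'

GZIP_SIGNATURE = [0x1f, 0x8b].freeze

# Downloads the .tgz bundle of a job, verifies that it looks like gzip data,
# writes it to `path` and returns the number of bytes written.
def save_bundle(driver, job_id, path)
  data = Base64.decode64(driver.bundle(job_id))
  unless data.byteslice(0, 2).bytes == GZIP_SIGNATURE
    raise "bundle for job #{job_id} is not gzip data (incomplete download?)"
  end
  File.binwrite(path, data)
end

# Usage with a real driver (not executed here):
#   save_bundle(driver, job_id, "job_output.tgz")
```

The saved file can then be unpacked with any tar tool, for example `tar xzf job_output.tgz`.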


3.5 Clean-up:

After your analysis has finished, and all output data has been downloaded, you can clean up your data files to avoid wasting disk space on our server :-)

A) Clean job files

clean_job_files: job_id

Removes all files from the given job.


B) Clean input matrix

clean_matrix: matrix_id

Removes an uploaded input matrix. Note that you will not be able to use this matrix for future analyses.
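One way to make sure the clean-up calls are not forgotten is to wrap the download step in a block. This is a sketch under stated assumptions: with_server_cleanup is a hypothetical helper, and the driver is assumed to expose clean_job_files and clean_matrix as ordinary methods.

```ruby
# Runs the given block, then removes the job files and the input matrix
# from the server, even if the block raises an exception.
def with_server_cleanup(driver, matrix_id, job_ids)
  yield
ensure
  job_ids.each { |job_id| driver.clean_job_files(job_id) }
  driver.clean_matrix(matrix_id)
end

# Usage with a real driver (not executed here):
#   with_server_cleanup(driver, matrix_id, [job_id]) do
#     # download results and bundle here
#   end
```

Note the trade-off: because of the ensure clause, the server files are also removed when a download fails. If you would rather retry a failed download, call the two clean-up methods only after the block has succeeded.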







4. ANALYSIS OPTIONS

All analysis methods can be controlled by several parameters:

  1. Range of factorization ranks ( [ kstart ... kend ] ).
  2. Number of runs per rank.
  3. Number of iterations per run.
  4. Stopping threshold.
  5. NMF Algorithm.

4.1 Range of factorization ranks ( [ kstart ... kend ] ):

NMF decomposes your data into K clusters, where K (a.k.a. the factorization rank) is the inner dimension of the matrix product W*H. bioNMF can find the best factorization rank within a given input range by computing the Cophenetic Correlation Coefficient (CCC) for each of these ranks.

CCC is a quantitative measure of clustering stability. It is based on the consensus clustering method, which exploits the stochastic nature of the NMF algorithm (see Brunet et al., PNAS 2004). Since the NMF algorithm is non-deterministic, its solutions may vary from run to run when executed with different random initial values for W and H. If the factorization is stable for a given value of K, data assignments to these K clusters are expected to vary little from run to run.

CCC values vary from 1 (perfect stability) to 0 (instability). The best factorization rank is then the one with the highest CCC value. This is probably one of the most widely used methods in the field for estimating the best factorization rank.
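The selection rule itself is simple, as the sketch below shows. The helper best_rank and the CCC values are hypothetical; how the per-rank CCC values are read from the job output is not covered here.

```ruby
# Returns the factorization rank with the highest CCC value.
# `ccc_by_rank` maps each tested rank K to its CCC (between 0 and 1).
def best_rank(ccc_by_rank)
  raise ArgumentError, "no ranks given" if ccc_by_rank.empty?
  ccc_by_rank.max_by { |_rank, ccc| ccc }.first
end

# Hypothetical CCC values for ranks 2 to 4.
puts best_rank({ 2 => 0.98, 3 => 0.91, 4 => 0.995 })
```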

In our experience, a value of 100 runs per factorization rank (see the 'Number of runs' parameter) is normally enough to achieve reasonable results [Carmona-Saez et al. 2006].

This method is always used if a range of factorization ranks is supplied, or if the Sample Classification analysis method is selected.

Note: The maximum factorization rank allowed for this on-line version is 32 factors.


4.2 Number of runs:

Due to the non-deterministic nature of NMF, it may or may not converge to the same solution on each run, depending on the random initial conditions. Therefore, executing the algorithm several times with different random initializations is a good approach for selecting the W and H matrices that best approximate the input matrix. Depending on the problem, more or fewer runs will be necessary to reach an optimum solution. However, since the computational cost of this algorithm is very high, a limited number of runs is recommended. In our experience, a value of 100 runs is normally enough to achieve reasonable results [Carmona-Saez et al. 2006].

If you select a range of factorization ranks, bioNMF will try the specified number of runs for each rank in that range.

Note: The maximum number of runs allowed for this on-line version is 128 runs.


4.3 Number of iterations:

This parameter sets the maximum number of iterations per run allowed for the algorithm to converge. A maximum of 2000 iterations is enough in most cases.

Note: The maximum number of iterations allowed for this on-line version is 4096.


4.4 Stopping threshold:

This parameter controls the algorithm convergence on each run. bioNMF makes use of the convergence method described in [Brunet et al., PNAS 2004].

Every 10 iterations, a connectivity matrix C of size M × M is computed, where M is the number of columns of matrix H. Each entry Cij in this matrix is set to 1 if columns i and j in H have their maximum value for the same factor (i.e., on the same row in H), and to 0 otherwise.
If the connectivity matrix stops changing for a certain number of iterations (equal to the stopping threshold multiplied by 10), the matrices are considered to have converged and the algorithm stops the current run.
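The convergence test described above can be sketched in plain Ruby. This is an illustration rather than bioNMF's actual implementation: H is represented as an array of K rows, and the names connectivity_matrix and ConvergenceCheck are hypothetical. Counting "unchanged" as a number of consecutive identical checks is an assumption about how the threshold is applied.

```ruby
# Builds the M x M connectivity matrix of H (K rows, M columns):
# C[i][j] is 1 when columns i and j have their maximum on the same row.
def connectivity_matrix(h)
  assignment = h.transpose.map do |column|
    column.each_with_index.max_by { |value, _row| value }.last
  end
  assignment.map { |a| assignment.map { |b| a == b ? 1 : 0 } }
end

# Tracks the connectivity matrix across checks (one check every 10 iterations).
class ConvergenceCheck
  def initialize(stop)
    @stop = stop      # stopping threshold
    @last = nil       # connectivity matrix from the previous check
    @unchanged = 0    # consecutive checks without any change
  end

  # Returns true once C has stayed the same for `stop` consecutive checks,
  # i.e. for stop * 10 iterations.
  def converged?(h)
    current = connectivity_matrix(h)
    @unchanged = current == @last ? @unchanged + 1 : 0
    @last = current
    @unchanged >= @stop
  end
end

h = [[0.9, 0.1, 0.8],
     [0.1, 0.9, 0.2]]
p connectivity_matrix(h)
```

In the example, columns 1 and 3 of H peak on the first factor, so they are connected to each other but not to column 2.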

Note: This parameter has no effect if it is greater than the maximum number of iterations divided by 10.
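The limits of the on-line version and the note above can be checked on the client before a job is launched. A minimal sketch follows; check_analysis_options is a hypothetical helper, the limits are the ones stated in sections 4.1 to 4.3, and requiring kstart to be no greater than kend is an assumption.

```ruby
MAX_RANK = 32          # maximum factorization rank (section 4.1)
MAX_RUNS = 128         # maximum number of runs (section 4.2)
MAX_ITERATIONS = 4096  # maximum number of iterations (section 4.3)

# Returns a list of human-readable problems; an empty list means the
# options are within the limits of the on-line version.
def check_analysis_options(kstart:, kend:, runs:, iterations:, stop:)
  problems = []
  problems << "kstart (#{kstart}) is greater than kend (#{kend})" if kstart > kend
  problems << "kend (#{kend}) exceeds the maximum rank #{MAX_RANK}" if kend > MAX_RANK
  problems << "runs (#{runs}) exceeds the maximum #{MAX_RUNS}" if runs > MAX_RUNS
  if iterations > MAX_ITERATIONS
    problems << "iterations (#{iterations}) exceeds the maximum #{MAX_ITERATIONS}"
  end
  if stop > iterations / 10.0
    problems << "stop (#{stop}) has no effect: it is greater than iterations / 10"
  end
  problems
end

p check_analysis_options(kstart: 2, kend: 8, runs: 100, iterations: 2000, stop: 40)
```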


4.5 NMF Algorithms:

bioNMF lets you choose between the following variants of the NMF algorithm:

  1. Standard: The classical algorithm.
  2. Divergence NMF: Variant derived from another cost function.
  3. Non smooth NMF: Reduces smoothness on data.

A) Standard:

This is the classical algorithm proposed by D. D. Lee and H. S. Seung (Nature, 1999).

[Figure: Standard NMF algorithm]
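For readers without access to the figure, the sketch below implements the multiplicative update rules published by Lee and Seung for the Euclidean cost function, using plain Ruby arrays. It is an illustration of the classical algorithm, not bioNMF's own code: the helper names, the small constant EPS that guards against division by zero, and the fixed iteration count (no connectivity-based stopping) are choices made for this example.

```ruby
EPS = 1e-9 # avoids division by zero in the update rules

# Matrix product of two arrays of rows.
def matmul(a, b)
  bt = b.transpose
  a.map { |row| bt.map { |col| row.zip(col).sum { |x, y| x * y } } }
end

# Euclidean (Frobenius) distance between V and W*H.
def frobenius_error(v, w, h)
  wh = matmul(w, h)
  Math.sqrt(v.flatten.zip(wh.flatten).sum { |x, y| (x - y)**2 })
end

# One multiplicative update: H <- H .* (W'V) ./ (W'WH), then
# W <- W .* (VH') ./ (WHH'), element-wise.
def nmf_step(v, w, h)
  wt = w.transpose
  num = matmul(wt, v)
  den = matmul(matmul(wt, w), h)
  h = h.each_with_index.map do |row, a|
    row.each_with_index.map { |x, j| x * num[a][j] / (den[a][j] + EPS) }
  end
  ht = h.transpose
  num = matmul(v, ht)
  den = matmul(w, matmul(h, ht))
  w = w.each_with_index.map do |row, i|
    row.each_with_index.map { |x, a| x * num[i][a] / (den[i][a] + EPS) }
  end
  [w, h]
end

# Factorizes the non-negative matrix V (n x m) into W (n x k) and H (k x m).
def nmf(v, k, iterations: 200, seed: 1)
  rng = Random.new(seed)
  w = Array.new(v.size) { Array.new(k) { rng.rand + 0.1 } }
  h = Array.new(k) { Array.new(v.first.size) { rng.rand + 0.1 } }
  iterations.times { w, h = nmf_step(v, w, h) }
  [w, h]
end

v = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
w, h = nmf(v, 1)
puts frobenius_error(v, w, h)
```

Changing the seed gives a different random start, which is why several runs per rank are used in practice.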


B) Divergence NMF:

Variant proposed by D. D. Lee and H. S. Seung from a divergence cost function (NIPS, 2001).

[Figure: Divergence NMF algorithm]


C) Non Smooth NMF:

Variant proposed by A. Pascual-Montano et al. (IEEE TPAMI, 2006; BMC Bioinformatics, 2006). The cost function is derived by introducing an extra smoothness matrix (S) in order to impose sparseness on the data.

[Figure: Nonsmooth NMF algorithm]

The positive symmetric matrix S ∈ R^(K×K) (where K is the current factorization rank) is a smoothing matrix defined as:

[Figure: Smoothing matrix S]

where I is the Identity matrix, 1 is a vector of ones, and the parameter θ satisfies 0 ≤ θ ≤ 1.

This parameter controls the sparsity level, from 0 (smooth) to 1 (sparse). We recommend a value of 0.5.
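As an illustration, the smoothing matrix can be built in a few lines of Ruby. The sketch assumes the definition given in the cited nsNMF paper, S = (1 − θ)·I + (θ/K)·1·1ᵀ, which matches the symbols described above; smoothing_matrix is a hypothetical helper name.

```ruby
# Builds the K x K smoothing matrix S = (1 - theta) * I + (theta / K) * 1 * 1'
# (ASSUMPTION: definition taken from the cited nsNMF paper).
def smoothing_matrix(k, theta)
  raise ArgumentError, "theta must be between 0 and 1" unless theta.between?(0, 1)
  Array.new(k) do |i|
    Array.new(k) { |j| (i == j ? 1.0 - theta : 0.0) + theta.to_f / k }
  end
end

p smoothing_matrix(2, 0.5)
```

With θ = 0 the matrix is the identity, so the model reduces to plain NMF; with θ = 1 every entry equals 1/K.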






5. EXAMPLE OF USE

Here you can find an example script implemented in Ruby:

http://bionmf.dacya.ucm.es/2.0/WebService/bioNMFWS_client.rb

Usage:

$> ruby bioNMFWS_client.rb <input-matrix_filename>


Output files will be saved to the directory "output_<input-matrix_filename>".

Please consider downloading and using these matrices for testing:

WARNING: This test works ONLY with ASCII-text files with non-numeric labels (or with no labels at all), like the ones above. To use it with binary files, please edit the script and set the second argument of the upload_matrix call to true. To use numeric column or row labels, please set the third and fourth arguments (respectively) of the same call to true.






6. HOW TO CITE bioNMF

If you use this software, please cite the following work: