The Web Service provides an API for using the system's services programmatically. Its functionality mirrors that of the web page, with some minor differences that simplify its use. The API is essentially asynchronous: every analysis method launches a job that is registered on the server and returns an identifier, which the client uses to query the job's status and to gather the results once the job is done.
Job results are provided in two ways. The main one deals only with the essential results of each job, which depend on the analysis performed and the parameters used. To gather them, you first query the server for a list of result identifiers; with these identifiers you can then retrieve the actual contents of each result. The method descriptions below detail what these results are in each case.
The web interface provides more than the essential results, including images used to assess or validate them. Because these extra outputs are usually intended for user viewing rather than for direct consumption by client software, they are not provided by the same means as the essential results. Instead, a separate method lets a client download them as a .tgz file bundle, so these outputs are still accessible via the Web Service.
The easiest way to connect to the server is to load bioNMF's WSDL file, located at http://bionmf.dacya.ucm.es/2.0/WebService/BioNMFWS.wsdl. This prepares the driver with all the functions in the API.
For example, this can be done in Ruby as follows:
require 'soap/wsdlDriver'
require 'fileutils'
require 'base64'
require 'yaml'
WSDL_URL = "http://bionmf.dacya.ucm.es/2.0/WebService/BioNMFWS.wsdl"
driver = SOAP::WSDLDriverFactory.new(WSDL_URL).create_rpc_driver
Strictly speaking, you only need the SOAP::WSDLDriver module to connect to the server. The remaining modules are required for later steps.
IMPORTANT NOTES:
- Neither .xls, .mat nor .csv files are accepted.
- Filenames may contain only the characters 'A'-'Z', 'a'-'z', '0'-'9', '_', '-', '.', and '+'. Therefore, filenames must not contain characters such as space, TAB or tilde ('~').
By default, bioNMF accepts input text files where data is separated by single TAB characters. A file may also contain row labels and/or column headers, as well as a short description string at the beginning of the file (each of these three elements is optional). Each header, label or description string may consist of multiple space-separated words and/or numbers. If set, both the column headers and the description string must be located on the first line.
Important Notes:
- Numbers must use a dot ('.') as a decimal symbol, not a comma.
- Only UNIX (LF, '\n') and MS Windows (CR+LF, '\r\n') end-of-line styles are accepted.
See examples of input text-file formats: with labels and without labels (if you use any of these matrices, please use one of the preprocessing methods listed below in order to make all their values positive).
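As an illustration of the default text format described above, the following Ruby sketch writes a small matrix as TAB-separated text with a description string, column headers and row labels, and checks a filename against the allowed characters. The helper names (valid_filename?, matrix_to_text) and the sample data are hypothetical, and placing the description string in the first cell of the header line is an assumption about the layout.

```ruby
# Characters the notes above list as valid in filenames.
VALID_FILENAME = /\A[A-Za-z0-9_\-.+]+\z/

def valid_filename?(name)
  !!(name =~ VALID_FILENAME)
end

# Serializes a matrix as TAB-separated text with UNIX (LF) line endings.
# Assumption: the description string occupies the first cell of the header line.
def matrix_to_text(rows, row_labels, column_headers, description = "")
  lines = [([description] + column_headers).join("\t")]
  rows.each_with_index do |row, i|
    # Float#to_s always uses '.' as the decimal symbol.
    lines << ([row_labels[i]] + row.map(&:to_s)).join("\t")
  end
  lines.join("\n") + "\n"
end

text = matrix_to_text([[1.5, 2.0], [0.25, 4.0]],
                      ["gene A", "gene B"], ["sample 1", "sample 2"], "toy data")
puts text
puts valid_filename?("my_matrix-v1.txt")  # only allowed characters
puts valid_filename?("my matrix.txt")     # contains a space
```

Labels may contain spaces because fields are delimited by TAB characters; the filename check applies only to the name of the file, not to its content.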
Under the following conditions, input data may be separated by SINGLE space characters (' '):
In addition, bioNMF accepts binary files encoded using IEEE little-endian byte ordering. Data must be written in the following format:
- [...] ('\n') character (in ASCII-text format).
- [...] ('\n') character (in ASCII-text format).
- [...] ('\n') are mandatory if any of the headers, labels or name fields is set. In addition, only UNIX (LF) and MS-DOS (CR+LF) end-of-line styles are accepted.
In this on-line version, the size and dimensions of accepted input matrices are limited to the following values:
Performing an analysis with the web service usually entails uploading and preprocessing a data matrix, calling one of the three analysis methods, and querying the status until the job is finished. After that, you can retrieve the list of results, which are just identifiers, and gather those you are interested in; or you can retrieve the whole set of output files, which includes the results and other useful information for the user.
upload_matrix: matrix, binary, column_headers, row_labels, transpose, normalization, positive, suggested_name => matrix_id
Uploads a matrix to the server and performs the selected preprocessing methods (normalization or making data positive).
This method returns a string that represents an identifier for the input matrix. You can use this ID to launch one or more analysis processes on this matrix (each process will use a private copy of the input data).
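The matrix argument carries the file content rather than a filename. A minimal sketch of preparing it with Ruby's standard Base64 module follows. The helper name encode_matrix_file and the temporary file are hypothetical, and the commented-out call assumes that driver was created as shown earlier and that arguments are passed positionally in the order of the signature above.

```ruby
require 'base64'
require 'tempfile'

# Reads a matrix file in binary mode and returns its content as a Base64 string.
def encode_matrix_file(path)
  Base64.encode64(File.binread(path))
end

# Hypothetical input file, created here only to keep the sketch self-contained.
file = Tempfile.new('matrix')
file.write("\tS1\tS2\nG1\t1.0\t2.0\nG2\t3.0\t4.0\n")
file.close

encoded = encode_matrix_file(file.path)
puts encoded

# Assumed argument order: matrix, binary, column_headers, row_labels,
# transpose, normalization, positive, suggested_name.
# matrix_id = driver.upload_matrix(encoded, false, true, true, false, "No", "No", "")
```

Reading in binary mode keeps the same helper usable for both text and binary matrices.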
Input parameters:
matrix: A Base64-encoded string representing an input matrix in the format described above. Note that this parameter is not a filename: client programs must explicitly open the input file, read its content, encode the data to Base64, and call this function with the resulting string. Base64 encoding ensures that all data is transferred as printable characters, which is particularly useful if your input matrix is in binary format. In Ruby, data can be encoded with the Base64 module (see our example client).
binary: Boolean. Specifies whether the input matrix is in binary format (true) or in ASCII-text format (false, default).
column_headers: Boolean. Forces the system to treat the first line of the matrix as column headers (and/or the description string). The default is false.
row_labels: Boolean. Forces the system to treat the first column of the matrix as row labels. The default is false.
transpose: Boolean. Specifies whether the input matrix must be transposed. The default is false.
normalization: A string specifying the normalization method to apply to the matrix:
  "No" (default): No normalization method is performed.
  "SubGMean": Subtracts the global mean. The global mean of the data matrix is computed and then subtracted from all data items.
  "SColsNRows": Scales columns, then normalizes rows. This is the approach proposed by Getz et al. (PNAS 2000), which first divides each column by its mean and then normalizes each row.
  "SDRows": mean=0, std=1 by rows. Each row of the data matrix is transformed so that its mean is 0 and its standard deviation is 1.
  "SDCols": mean=0, std=1 by columns. Each column of the data matrix is transformed so that its mean is 0 and its standard deviation is 1.
  "SubMRows": Subtracts mean by rows. The mean of each row of the data matrix is calculated and then subtracted from all data items of that row.
  "SubMCols": Subtracts mean by columns. The mean of each column of the data matrix is calculated and then subtracted from all data items of that column.
  "SubMRowsCols": Subtracts mean by rows and then by columns. The mean of each row of the data matrix is calculated and then subtracted from all data items of that row. In a subsequent step, the mean of each column is calculated and then subtracted from all data items of that column.
positive: Method used to make data positive:
  "No" (default): No transformation method is performed.
  "SubMin": Subtracts the absolute minimum. This is a very simple method to make data positive: the minimum negative value is subtracted from every cell of the data matrix.
  "FoldRows": Folds data by rows. This approach was used by Kim and Tidor (Genome Res. 2003) for the analysis of log-transformed gene expression data. Every row (item) is represented by two new rows of a new matrix. The first one indicates positive expression (up-regulation) and the second one indicates a negative expression value (down-regulation). This process doubles the number of rows of the data set.
  "FoldCols": Folds data by columns. Similar to the above case, but this option makes the data positive by folding columns (variables).
  "ExpScal": Exponential scaling. Data is exponentially scaled to make it positive. This is the inverse operation of a logarithmic transformation.
Note: "FoldRows" and "FoldCols" are reserved for the Standard-NMF analysis method.
suggested_name: Preferred matrix ID. This is just a hint. Default: ""
matrix_id: Matrix ID in the following format: "date-time [-suggested_name] [-number]".
bioNMF can be used to perform three types of analysis:
standardNMF: matrix_id, algorithm, kstart, kend, runs, iterations, stop, sparseness, sup_info, suggested_name => job_id
Launches a job to perform a Non-negative Matrix Factorization on the selected input matrix. Returns a string representing the job's identifier. This identifier can be used to query the job status and to retrieve the analysis results.
The result of this analysis is two matrices, W and H, corresponding to the best factorization rank (i.e., the rank of the most stable clustering) within a given input range. A quantitative measure of this stability is provided by the Cophenetic Correlation Coefficient (CCC). This coefficient varies from 1 (perfect stability) to 0 (instability). See Analysis options for details.
Input parameters:
matrix_id: String. Input-matrix ID returned by the upload_matrix function.
algorithm: String. NMF algorithm to be executed. Default: "Standard".
kstart, kend: Integers. Range of factorization ranks. Default values: 2 for the former, and <kstart> for the latter.
runs: Integer. Number of runs per factorization rank.
iterations: Integer. Number of iterations per run. Default value: 2000.
stop: Integer. See Stopping threshold.
sparseness: Floating-point number. Parameter used by the nsNMF algorithm to control the sparsity level, from 0 (smooth) to 1 (sparse). This value is ignored by other NMF algorithms. Default: 0.5.
sup_info: Boolean. Generates supplementary output files, which can be downloaded with the bundle function.
suggested_name: Preferred job ID. This is just a hint. Default: ""
job_id: Job ID in the following format: "date-time [-suggested_name] [-number]".
See a detailed description of input parameters and the CCC in the Analysis Options section.
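A call to standardNMF might look as follows. This is only a sketch: FakeDriver is a hypothetical stand-in that records its arguments so the example runs without a server, the IDs and parameter values are made up for illustration, and passing the arguments positionally in the order of the signature above is an assumption about how the SOAP driver exposes the method.

```ruby
# Hypothetical stand-in for the SOAP driver; it returns a made-up job ID.
class FakeDriver
  attr_reader :last_call

  def standardNMF(*args)
    @last_call = args
    "20240101-120000-demo"
  end
end

# Illustrative values, listed in the order of the documented signature.
# Documented defaults ("Standard", 2000 iterations, sparseness 0.5) are kept.
params = {
  :matrix_id      => "20240101-115900-demo",  # as returned by upload_matrix
  :algorithm      => "Standard",
  :kstart         => 2,
  :kend           => 4,
  :runs           => 10,
  :iterations     => 2000,
  :stop           => 40,
  :sparseness     => 0.5,
  :sup_info       => false,
  :suggested_name => "demo"
}

driver = FakeDriver.new
job_id = driver.standardNMF(*params.values)
puts job_id
```

Keeping the parameters in a named hash makes the ten positional arguments easier to review; the same hash could be reused for the other analysis methods, which share this signature.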
biclustering: matrix_id, algorithm, kstart, kend, runs, iterations, stop, sparseness, sup_info, suggested_name => job_id
Launches a job to perform a biclustering analysis on the selected input matrix. Returns a string representing the job's identifier. You can use this ID to query the job status and retrieve the analysis results.
This analysis method, proposed by Carmona-Saez et al. (BMC Bioinformatics, 2006), is intended mainly for gene-expression analysis, although it can be applied to other types of data. Taking gene expression as a case study, this method groups genes and samples based on local features, generating sets of samples and genes that are locally related. The result is a set of K biclusters (sub-matrices) encoding modular patterns, where K is the best factorization rank within a given input range. Each bicluster matrix contains the set of genes that are highly associated with a local pattern, and the samples sorted by their importance in this pattern.
The best factorization rank corresponds to that of the most stable clustering. A quantitative measure of this stability is provided by the Cophenetic Correlation Coefficient (CCC). This coefficient varies from 1 (perfect stability) to 0 (instability). See Analysis options for details.
Input parameters:
matrix_id: String. Input-matrix ID returned by the upload_matrix function.
algorithm: String. NMF algorithm to be executed. Default: "Standard".
kstart, kend: Integers. Range of factorization ranks. Default values: 2 for the former, and <kstart> for the latter.
runs: Integer. Number of runs per factorization rank.
iterations: Integer. Number of iterations per run. Default value: 2000.
stop: Integer. See Stopping threshold.
sparseness: Floating-point number. Parameter used by the nsNMF algorithm to control the sparsity level, from 0 (smooth) to 1 (sparse). This value is ignored by other NMF algorithms. Default: 0.5.
sup_info: Boolean. Generates supplementary output files, which can be downloaded with the bundle function.
suggested_name: Preferred job ID. This is just a hint. Default: ""
job_id: Job ID in the following format: "date-time [-suggested_name] [-number]".
See a detailed description of input parameters and the CCC in the Analysis Options section.
sampleClassification: matrix_id, algorithm, kstart, kend, runs, iterations, stop, sparseness, sup_info, suggested_name => job_id
Launches a job to perform an unsupervised classification of experimental samples on the selected input matrix. Returns a string representing the job's identifier. You can use this ID to query the job status and retrieve the analysis results.
This module implements the method proposed by Brunet et al. (PNAS 2004) to determine the most suitable number of sample clusters in a given dataset and to group the data samples into K clusters, where K is the best factorization rank within a given input range. The results are an estimate of the best number of clusters in the data set and the cluster assignment of each experimental condition.
The best factorization rank corresponds to that of the most stable clustering. A quantitative measure of this stability is provided by the Cophenetic Correlation Coefficient (CCC). This coefficient varies from 1 (perfect stability) to 0 (instability). See Analysis options for details.
Input parameters:
matrix_id: String. Input-matrix ID returned by the upload_matrix function.
algorithm: String. NMF algorithm to be executed. Default: "Standard".
kstart, kend: Integers. Range of factorization ranks. Default values: 2 for the former, and <kstart> for the latter.
runs: Integer. Number of runs per factorization rank.
iterations: Integer. Number of iterations per run. Default value: 2000.
stop: Integer. See Stopping threshold.
sparseness: Floating-point number. Parameter used by the nsNMF algorithm to control the sparsity level, from 0 (smooth) to 1 (sparse). This value is ignored by other NMF algorithms. Default: 0.5.
sup_info: Boolean. Generates supplementary output files, which can be downloaded with the bundle function.
suggested_name: Preferred job ID. This is just a hint. Default: ""
job_id: Job ID in the following format: "date-time [-suggested_name] [-number]".
See a detailed description of input parameters and the CCC in the Analysis Options section.
With the following functions you can query the status of the current job, as well as the job's output messages.
status: job_id => String
Returns a string with the status of the job.
  "writing": Uploading and saving the input matrix to the server.
  "preprocessing": Launching the preprocessing step.
  "job_queued": Executing the selected function (i.e., the preprocessing step or the selected analysis).
  "postprocess": Generating additional outputs for the selected analysis method.
  "output": Generating the bundle file with additional outputs from the analysis process (see the bundle function).
  "cleaning": Cleaning job or matrix files. See methods for clean-up.
Other generic states:
  "prepared": Job is ready to be submitted.
  "queued": Executing the selected task.
  "aborted": Job has been aborted.
  "error": Job finished with an error.
  "done": Job finished successfully.
messages: job_id => Array of strings
These are messages generated by the functions. They are usually verbose descriptions of the states the job goes through. If the job has the "error" status, this function can provide information about the nature of the error.
done: job_id => boolean
Returns true if the job has finished with any of the "done", "error", or "aborted" states.
error: job_id => boolean
Returns true if the job has finished with the "error" status.
info: job_id => YAML structure
This method returns a YAML structure with information about the job, such as its input parameters.
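Because jobs run asynchronously, a client typically polls done until it returns true and then inspects the final status. The sketch below shows one way to do this. The helper wait_for_job, its interval and timeout parameters, and the FakeDriver class (which lets the example run without a server) are hypothetical; it also assumes that the driver returns Ruby booleans and strings for these calls.

```ruby
# Polls the server until the job reaches a final state and returns that state
# ("done", "error" or "aborted"). Raises if the timeout expires first.
def wait_for_job(driver, job_id, interval = 5, timeout = 3600)
  deadline = Time.now + timeout
  until driver.done(job_id)
    raise "Timed out waiting for job #{job_id}" if Time.now > deadline
    sleep interval
  end
  status = driver.status(job_id)
  # On failure, the job messages may describe what went wrong.
  driver.messages(job_id).each { |m| warn m } if status == "error"
  status
end

# Hypothetical stand-in for the SOAP driver: the job finishes on the third query.
class FakeDriver
  def initialize
    @polls = 0
  end

  def done(_job_id)
    (@polls += 1) >= 3
  end

  def status(_job_id)
    @polls >= 3 ? "done" : "queued"
  end

  def messages(_job_id)
    []
  end
end

puts wait_for_job(FakeDriver.new, "demo-job", 0)
```

Returning the final status rather than a boolean lets the caller distinguish an aborted job from a successful one, since done is true in both cases.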
After the analysis is finished, you can retrieve the list of results, which are just identifiers, and gather those you are interested in; or you can retrieve the whole set of output files, which includes the results and other useful information for the user.
results: job_id => Array of IDs
Given the identifier of a finished job (job_id), this method returns an array with the identifiers of all analysis outputs. Each of these IDs can be used to retrieve its corresponding output file with the result method.
According to the selected analysis method, results can be (in this order):
With the bundle function you can download a numeric copy (i.e., without labels) of such files.
result: res_id => Output data (Base64-encoded String!)
Returns the output data referenced by res_id as a Base64-encoded string.
Depending on the input matrix, output data might be in binary or text format. To ensure that all data can be transferred as printable characters, it is encoded in Base64. Therefore, the string returned by this function must be decoded before the data is written to a file. With Ruby's soap4r, returned data must be explicitly decoded with the Base64 module (see our example client).
bundle: job_id => tgz bundle of files (Base64-encoded String!)
Depending on the analysis performed and the parameters provided, the system produces a number of files in addition to the ones returned by the result method. While these files are not essential results of the task, they might be of interest for exploratory or assessment tasks. All these files can be accessed using this function. Note that, because the bundle is a binary file, its data is encoded in Base64 to ensure that it is transferred as printable characters. The client then needs to decode this string with an appropriate function before writing it to a file. With Ruby's soap4r, returned data must be explicitly decoded with the Base64 module (see our example client).
NOTE: This file is not stored on the server. It is generated each time you call this function, from all the files available at that moment.
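Putting the three retrieval methods together, a client might save every essential result and the bundle as in the sketch below. The helper save_outputs, the output file names (which assume result IDs are usable as filenames), and the FakeDriver stand-in with its canned data are hypothetical; only results, result and bundle come from the API described above.

```ruby
require 'base64'
require 'fileutils'
require 'tmpdir'

# Downloads all results and the bundle of a finished job into out_dir.
def save_outputs(driver, job_id, out_dir)
  FileUtils.mkdir_p(out_dir)
  driver.results(job_id).each do |res_id|
    data = Base64.decode64(driver.result(res_id))
    # Binary mode: output may be text or binary depending on the input matrix.
    File.open(File.join(out_dir, res_id.to_s), 'wb') { |f| f.write(data) }
  end
  bundle = Base64.decode64(driver.bundle(job_id))
  File.open(File.join(out_dir, "#{job_id}.tgz"), 'wb') { |f| f.write(bundle) }
end

# Hypothetical stand-in for the SOAP driver, returning canned Base64 data.
class FakeDriver
  def results(_job_id)
    ["W", "H"]
  end

  def result(res_id)
    Base64.encode64("contents of #{res_id}\n")
  end

  def bundle(_job_id)
    Base64.encode64("not a real tgz")
  end
end

out_dir = File.join(Dir.mktmpdir, "output_demo")
save_outputs(FakeDriver.new, "demo-job", out_dir)
puts (Dir.entries(out_dir) - [".", ".."]).sort.inspect
```

Since the bundle is regenerated on every call, it is usually worth requesting it once, after the job has finished.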
After your analysis has finished, and all output data has been downloaded, you can clean your data files to avoid wasting disk space on our server :-)
clean_job_files: job_id
Removes all files from the given job.
clean_matrix: matrix_id
Removes an uploaded input matrix. Note that you will not be able to use this matrix for future analyses.
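One way to make sure that server-side files are always removed, even when an analysis fails, is to wrap the work in a method with an ensure clause. In this sketch, with_cleanup and FakeDriver are hypothetical; the two clean-up calls are the ones documented above.

```ruby
# Runs the given block, then removes the job files and the matrix,
# regardless of whether the block raised an exception.
def with_cleanup(driver, matrix_id, job_id)
  yield
ensure
  driver.clean_job_files(job_id)
  driver.clean_matrix(matrix_id)
end

# Hypothetical stand-in for the SOAP driver that records clean-up calls.
class FakeDriver
  attr_reader :cleaned

  def initialize
    @cleaned = []
  end

  def clean_job_files(job_id)
    @cleaned << [:job, job_id]
  end

  def clean_matrix(matrix_id)
    @cleaned << [:matrix, matrix_id]
  end
end

driver = FakeDriver.new
begin
  with_cleanup(driver, "demo-matrix", "demo-job") { raise "analysis failed" }
rescue RuntimeError => e
  puts "Caught: #{e.message}"
end
puts driver.cleaned.inspect
```

If the same matrix will be reused for further analyses, only clean_job_files should be called at this point.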
All analysis methods can be controlled by several parameters:
Factorization rank ([ kstart ... kend ]):
NMF decomposes your data into K clusters, where K (a.k.a. the factorization rank) is the inner dimension of the matrix product W*H. bioNMF can find the best factorization rank within a given input range by computing the Cophenetic Correlation Coefficient (CCC) for each of these ranks.
CCC is a quantitative measure of clustering stability. It is based on the consensus clustering method, which exploits the stochastic nature of the NMF algorithm (see Brunet et al., PNAS 2004). Since the NMF algorithm is non-deterministic, its solutions might vary from run to run when executed with different random initial values for W and H. If the factorization is stable for a given value of K, the assignment of data to these K clusters is expected to vary little from run to run.
CCC values vary from 1 (perfect stability) to 0 (instability). The best factorization rank then corresponds to the one with the highest CCC value. This is probably one of the most widely used methods in the field for estimating the best factorization rank.
In our experience, 100 runs per factorization rank (see the 'Number of runs' parameter) are normally enough to achieve reasonable results [Carmona-Saez et al. 2006].
This method is always used if a range of factorization ranks is supplied, or if the Sample Classification analysis method is selected.
Note: The maximum factorization rank allowed for this on-line version is 32 factors.
Due to the non-deterministic nature of NMF, it may or may not converge to the same solution on each run, depending on the random initial conditions. Therefore, executing the algorithm several times with different random initializations is a good approach for selecting the W and H matrices that best approximate the input matrix. Depending on the problem, more or fewer runs will be necessary to achieve an optimum solution. However, since the computational cost of this algorithm is very high, a limited number of runs is recommended. In our experience, 100 runs are normally enough to achieve reasonable results [Carmona-Saez et al. 2006].
If you select a range of factorization ranks, bioNMF will try the specified number of runs for each rank in that range.
Note: The maximum number of runs allowed for this on-line version is 128 runs.
This parameter controls the maximum number of iterations per run allowed for the algorithm to converge. A maximum of 2000 iterations is enough in most cases.
Note: The maximum number of iterations allowed for this on-line version is 4096.
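A client can check its parameters against the on-line limits quoted in the notes above (32 factors, 128 runs, 4096 iterations) before submitting a job. The helper below is a hypothetical sketch; treating kend smaller than kstart as a problem is an assumption, not a documented rule.

```ruby
MAX_RANK       = 32    # maximum factorization rank
MAX_RUNS       = 128   # maximum number of runs
MAX_ITERATIONS = 4096  # maximum number of iterations

# Returns a list of problems; an empty list means the limits are respected.
def check_limits(kstart, kend, runs, iterations)
  problems = []
  problems << "kend (#{kend}) is smaller than kstart (#{kstart})" if kend < kstart
  problems << "rank #{kend} exceeds #{MAX_RANK}" if kend > MAX_RANK
  problems << "#{runs} runs exceed #{MAX_RUNS}" if runs > MAX_RUNS
  problems << "#{iterations} iterations exceed #{MAX_ITERATIONS}" if iterations > MAX_ITERATIONS
  problems
end

puts check_limits(2, 8, 100, 2000).inspect   # within all limits
puts check_limits(2, 40, 200, 2000).inspect  # rank and runs too large
```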
This parameter controls the algorithm convergence on each run. bioNMF makes use of the convergence method described in [Brunet et al., PNAS 2004].
Every 10 iterations, a connectivity matrix C of size M × M is computed, where M is the number of columns of matrix H. Each entry Cij in this matrix is set to 1 if columns i and j of H have their maximum value for the same factor (i.e., on the same row of H), and to 0 otherwise.
If the connectivity matrix stops changing after a certain number of iterations (equal to the stopping threshold multiplied by 10), the matrices are considered to have converged and the algorithm stops the current run.
Note: This parameter has no effect if it is greater than the maximum number of iterations divided by 10.
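The connectivity test described above can be sketched as follows. This illustrates the rule as stated in the text, not bioNMF's actual implementation: connectivity and ConvergenceCheck are hypothetical names, H is represented as an array of rows, and ties for the maximum are resolved in favour of the first row (an assumption).

```ruby
# Builds the M x M connectivity matrix of H (K rows, M columns):
# C[i][j] is 1 when columns i and j reach their maximum on the same row of H.
def connectivity(h)
  cols = h.transpose
  best = cols.map { |col| col.index(col.max) }
  best.map { |bi| best.map { |bj| bi == bj ? 1 : 0 } }
end

# Counts how many consecutive checks left the connectivity matrix unchanged.
class ConvergenceCheck
  def initialize(stop)
    @stop = stop
    @unchanged = 0
    @previous = nil
  end

  # Meant to be called once every 10 iterations;
  # returns true when the current run should stop.
  def converged?(h)
    current = connectivity(h)
    @unchanged = (current == @previous) ? @unchanged + 1 : 0
    @previous = current
    @unchanged >= @stop
  end
end

h = [[0.9, 0.8, 0.1],
     [0.1, 0.2, 0.7]]
p connectivity(h)

check = ConvergenceCheck.new(2)
p [check.converged?(h), check.converged?(h), check.converged?(h)]
```

With a threshold of 2, the run stops once two consecutive checks (20 iterations) leave the connectivity matrix unchanged, which also shows why a threshold larger than the maximum number of iterations divided by 10 can never trigger.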
bioNMF lets you choose between the following variants of the NMF algorithm:
This is the classical algorithm proposed by D. D. Lee and H. S. Seung (Nature, 1999).
Variant proposed by D. D. Lee and H. S. Seung from a divergence cost function (NIPS, 2001).
Variant proposed by A. Pascual-Montano et al. (IEEE TPAMI, 2006; BMC Bioinformatics, 2006). The cost function is derived by introducing an extra smoothing matrix (S) in order to impose sparseness on the data.
The positive symmetric matrix S∈R^{K×K} (where K is the current factorization rank) is a smoothing matrix defined as:
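In the nsNMF paper cited above (Pascual-Montano et al., IEEE TPAMI, 2006), the smoothing matrix takes the following form, where I is the K×K identity matrix, 1 is a column vector of K ones, and θ ∈ [0, 1] controls the degree of smoothing. That θ corresponds to the sparseness parameter of this API is an assumption.

```latex
S = (1 - \theta)\, I + \frac{\theta}{K}\, \mathbf{1}\mathbf{1}^{T}
```

With θ = 0, S is the identity matrix and the model reduces to the standard factorization; as θ approaches 1, every entry of S approaches 1/K.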
Here you can find an example script implemented in Ruby:
http://bionmf.dacya.ucm.es/2.0/WebService/bioNMFWS_client.rb
Usage:
$> ruby bioNMFWS_client.rb <input-matrix_filename>
Output files will be saved to directory: "output_<input-matrix_filename>".
Please consider downloading and using these matrices for testing:
WARNING: This test works ONLY with ASCII-text files with non-numeric labels (or with no labels at all), like the ones above. To use it with binary files, please edit this file and set the second argument of the call to upload_matrix to true. To use numeric column or row labels, please set the third and fourth arguments (respectively) of the same call to true.
If you use this software, please cite the following work: