| Previous: Installation |
Interface
HTML Editor
A convenient interface for any particular machine learning algorithm is the HTML interface. The image below shows a screenshot of the malibu HTML interface, which provides a guided interaction with the underlying machine learning algorithms. Currently, the malibu HTML interface only creates a configuration file; see Configuration File for details.
A configuration file can be obtained by running the program without any arguments and redirecting the output to a file, e.g. ./program > test.cfg. A partial example of a configuration file is shown below.
# Exegete Version 1.0 beta: adtree
# UIC Bioinformatics
# Chicago, Illinois
# Robert Langlois & Hui Lu
# Contact: rlangl1@uic.edu
# Learning Arguments
trainfile: data/train.csv #training set file
testfile: data/test.csv #testing set file
Any line prefixed with a '#' is ignored as a comment. A parameter has the format parameter-name: parameter-value. The configuration file can be used by redirecting the file to the standard input, e.g. ./program < test.cfg.
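The format described above can be read with a few lines of code. The following is a minimal sketch, not malibu's own parser: the function name parse_config is hypothetical, and it assumes that a '#' anywhere on a line starts a comment (as the trailing comments in the example suggest) and that values never contain a '#'.

```python
def parse_config(text):
    """Parse 'parameter-name: parameter-value' lines, ignoring '#' comments."""
    params = {}
    for line in text.splitlines():
        # Assumption: everything from the first '#' onward is a comment.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a parameter line; skipped in this sketch
        params[name.strip()] = value.strip()
    return params


example = """\
# Learning Arguments
trainfile: data/train.csv #training set file
testfile: data/test.csv #testing set file
"""
config = parse_config(example)
print(config)
```

Only the first ':' on a line separates the name from the value, so values that themselves contain ':' are kept whole.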
It is also possible to clone relevant parameters in a configuration file. This is useful when malibu has been updated or when you need to run a similar experiment with another learning algorithm. This is done as follows:
./program -self clone < test_old.cfg > test_new.cfg
Note: any error messages printed while cloning can be ignored.
Command-line
The command-line arguments can be found in the configuration file. Simply prefix the parameter-name with '-' and add the value separated by a space, e.g. -trainfile data/train.csv.
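The correspondence between configuration parameters and command-line arguments can be sketched as follows. This is an illustration only: the helper to_command_line is hypothetical, and it assumes every parameter takes exactly one value.

```python
def to_command_line(params):
    """Turn {name: value} pairs into a flat list of command-line arguments."""
    args = []
    for name, value in params.items():
        # Each parameter becomes '-name' followed by its value.
        args.extend(["-" + name, value])
    return args


args = to_command_line({"trainfile": "data/train.csv",
                        "testfile": "data/test.csv"})
print(" ".join(args))
```

Keeping the arguments as a list, rather than one string, lets them be passed to a process launcher without extra quoting.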
Parameter Types
malibu supports several types of parameters. This is the first layer of validation ensuring every parameter has an appropriate value. The comment indicates the parameter type.
- boolean - expects one of two values
- values: Yes,Y,y,1 or No,N,n,0
- position - expects an integer or position option
- values: integer or Begin,B,b,0 or End,E,e,-1
- option - expects one of several values
(* indicates any value)
- value: Yes:1 --> Yes or 1 is valid
- string - expects any value
- vector - expects a series of tokens, separated by a "," and/or ":"
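As a rough illustration of this first validation layer, the sketch below checks the boolean, position and vector types. The function names are hypothetical, and the sketch accepts only the spellings listed above; it is not a description of how malibu itself validates values.

```python
import re

TRUE_VALUES = {"Yes", "Y", "y", "1"}
FALSE_VALUES = {"No", "N", "n", "0"}


def parse_boolean(value):
    """Accept only the boolean spellings listed above."""
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError("not a boolean value: %r" % value)


def parse_position(value):
    """Map Begin/B/b to 0 and End/E/e to -1; otherwise expect an integer."""
    if value in {"Begin", "B", "b"}:
        return 0
    if value in {"End", "E", "e"}:
        return -1
    return int(value)  # raises ValueError for anything else


def parse_vector(value):
    """Split a series of tokens separated by ',' and/or ':'."""
    return [token for token in re.split(r"[,:]", value) if token]


print(parse_boolean("Y"), parse_position("End"), parse_vector("1,2:3"))
```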
The malibu dataset format is a flexible variation on the standard CSV format. It has two sections: header and data.
The header is the first row in the file and must contain non-numerical entries for AUTO detection to work, as in the example below. Otherwise, the auto-detection parser assumes there is no header; in that case, you can force a header with -header Yes.
ID, Class, Code1, 2Code, Three
g1, dnahh, 0.333, 0.421, 5.000
g2, nothh, 0.823, 4.034, 5.000
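One way to picture the header auto-detection is the sketch below. It rests on an assumption that the text does not state precisely: a first row counts as a header only if none of its entries parses as a number. The helper names are hypothetical, and rows are split on commas, tabs or spaces.

```python
import re


def split_row(line):
    """Split a row on commas, tabs or spaces."""
    return [token for token in re.split(r"[,\t ]+", line.strip()) if token]


def is_number(token):
    """True if the token parses as a floating-point number."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def looks_like_header(line):
    """Assumed rule: a header row has no numerical entries at all."""
    tokens = split_row(line)
    return bool(tokens) and not any(is_number(token) for token in tokens)


print(looks_like_header("ID, Class, Code1, 2Code, Three"))
print(looks_like_header("g1, dnahh, 0.333, 0.421, 5.000"))
```

Under this rule an entry such as 2Code still counts as non-numerical, because it does not parse as a number as a whole.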
The token separating values may be a comma, tab or space. By default, the first column is considered a label and the first column after the label(s) is considered the class attribute.
g1 an1 dnahh 0.333 0.421 5.000 1
g2 an2 nothh 0.823 4.034 5.000 1
g2 an2 nothh 0.223 1.034 3.000 0
The configuration parameters label and class can be used to handle non-standard dataset files like the example above: -label 3 -class 1000. A large value for class sets the class attribute to the last column. Note that the class attribute is always treated as a string. The dataset may also contain sparse attributes (so far only used for LIBSVM). A sparse attribute has the format index:value, e.g. 1:0.3, and sparse indices start at 1.
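The sparse index:value format can be illustrated with a short sketch. The function names are hypothetical, and the expansion to a dense row assumes that indices not listed are zero, which is the usual LIBSVM convention rather than something stated here.

```python
def parse_sparse(tokens):
    """Parse 'index:value' tokens into a {index: value} mapping (1-based)."""
    values = {}
    for token in tokens:
        index_text, sep, value_text = token.partition(":")
        if not sep:
            raise ValueError("not a sparse attribute: %r" % token)
        index = int(index_text)
        if index < 1:
            raise ValueError("sparse indices start at 1: %r" % token)
        values[index] = float(value_text)
    return values


def to_dense(sparse, size):
    """Expand to a dense list; assumes unlisted attributes are 0.0."""
    dense = [0.0] * size
    for index, value in sparse.items():
        dense[index - 1] = value  # shift the 1-based index to 0-based
    return dense


row = parse_sparse(["1:0.3", "4:2"])
print(row, to_dense(row, 4))
```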
The dataset supports two attribute types: nominal and real. Nominal attributes must contain non-numerical characters or be a quoted string. A distance-based algorithm (e.g. LIBSVM) cannot support nominal attributes; by default, running one on a dataset with nominal attributes causes the program to exit with an error. To circumvent this condition, one can set -strict NO; this tells each algorithm to ignore such dataset/algorithm inconsistencies. Keep in mind that using this option is not recommended and may cause a learning algorithm to become unstable.
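The distinction between the two attribute types can be sketched as below. This is an assumption-laden illustration: the function names are hypothetical, the quote characters are assumed to be single or double quotes, and a value is treated as real whenever it parses as a floating-point number (so tokens such as nan or inf would also count as real in this sketch).

```python
def attribute_type(token):
    """Return 'nominal' or 'real' for a single attribute value."""
    token = token.strip()
    # Assumption: a value wrapped in matching quotes is nominal,
    # even if its content looks numerical.
    if len(token) >= 2 and token[0] == token[-1] and token[0] in "'\"":
        return "nominal"
    try:
        float(token)
    except ValueError:
        return "nominal"  # contains non-numerical characters
    return "real"


def has_nominal(row):
    """True if any attribute value in the row is nominal."""
    return any(attribute_type(token) == "nominal" for token in row)


print(attribute_type("0.333"), attribute_type("dnahh"), attribute_type('"5"'))
```

A check like has_nominal is the kind of test a distance-based algorithm would need to run before deciding whether a dataset is compatible with it.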
The malibu output format is a standard flatfile where each section is prefixed with two characters:
VS - Validation set statistics
#LS Training Set> dna20appex
#LS Examples: 289
#LS Attributes: 945
#LS Classes: 2
#LS Missing: 0
#LS Positive: NOT
#LS DNA: 73
#LS NOT: 216
#ETVA ADtree | 10-CV | dna20appex
#SPVA 1a36 0 -3.13177
#SPVA 3cro 0 -1.7068
#SPVA 1tsr 0 -0.706242
1. Dataset description (LS | VS | TS)
The dataset description summarizes the characteristics of the dataset as well as the name and location of the dataset. It includes a summary of the validation method and learning algorithm.
The data characteristics include:
- Dataset type and filename
- number of bags (optional)
- number of examples
- number of attributes
- number of classes
- number of missing attributes
- number of zeros
- number of binary attributes
- number of discrete attributes
- number of numeric attributes
- number of nominal attributes
- identifier of the positive class
- class labels and proportion of examples (bags)
2. Experiment type (ET{type})
This single line describes the experiment type:
- Learning algorithm name
- Validation type
- Dataset name
3. Standard prediction for each run (SP{type})
The standard prediction section minimally describes each example and its prediction.
- Example Label (or index if no label)
- Class index
- Confidence-rated prediction
4. Evaluate type suffix ({type})
A suffix appended to the experiment type (ET) and standard prediction (SP) prefixes to indicate how the model was evaluated.
- TS - test set
- VS - validation set
- SC - self-consistency
- VA - validation algorithms
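Given the prefixes and suffixes described above, the standard prediction lines can be pulled out of an output file with a small parser. The sketch below is not part of malibu: the names Prediction and parse_predictions are hypothetical, and it assumes each prediction line has exactly three whitespace-separated fields after the #SP{type} tag, with no whitespace inside the example label.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    eval_type: str      # suffix such as TS, VS, SC or VA
    label: str          # example label (or index if no label)
    class_index: int
    confidence: float   # confidence-rated prediction


def parse_predictions(text):
    """Collect every '#SP{type} label class confidence' line."""
    predictions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4 or not fields[0].startswith("#SP"):
            continue  # other sections are ignored in this sketch
        tag, label, class_index, confidence = fields
        predictions.append(
            Prediction(tag[3:], label, int(class_index), float(confidence)))
    return predictions


sample = """\
#ETVA ADtree | 10-CV | dna20appex
#SPVA 1a36 0 -3.13177
#SPVA 3cro 0 -1.7068
#SPVA 1tsr 0 -0.706242
"""
preds = parse_predictions(sample)
for pred in preds:
    print(pred)
```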
5. Graphical illustration of model (GM)
Some classifiers have a model that can be illustrated as a graph, e.g. a decision tree. Currently, the alternating decision tree (ADTree) algorithm supports such a graph-based model.
At the end of the output file is a configuration file that describes the experiment that was run. In fact, the entire output file is commented in such a way that it can act as a configuration file for a repeated experiment.
