Previous: Installation  

 

Interface

Parameters

Dataset Format

Output Format

Parameters

 

HTML Editor

 

The HTML interface is a convenient front end to any of the machine learning algorithms. The image below shows a screenshot of the malibu HTML interface, which guides the user through the interaction with the underlying machine learning algorithms. Currently, the malibu HTML interface only creates a configuration file; see Configuration File for details.

Configuration File

 

A configuration file can be obtained by running the program without any arguments and redirecting the output to a file, e.g. ./program > test.cfg. A partial example of a configuration file is shown below.

# Exegete Version 1.0 beta: adtree
# UIC Bioinformatics
# Chicago, Illinois
# Robert Langlois & Hui Lu
# Contact: rlangl1@uic.edu

 

# Learning Arguments
trainfile: data/train.csv #training set file
testfile: data/test.csv   #testing set file

Any line prefixed with a '#' is ignored as a comment. A parameter has the format parameter-name: parameter-value. The configuration file can be used by redirecting the file to the standard input, e.g. ./program < test.cfg.
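The format described above can be read with a few lines of code. The following is a minimal sketch, not malibu's own parser: the function name parse_config is hypothetical, and it assumes that a '#' anywhere on a line starts a comment (as the trailing comments in the example suggest), so a value containing '#' would be truncated.

```python
def parse_config(text):
    """Parse 'parameter-name: parameter-value' lines into a dict."""
    params = {}
    for raw in text.splitlines():
        # Assumption: everything from '#' onward is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or comment-only line
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a parameter line
        params[name.strip()] = value.strip()
    return params


example = """\
# Learning Arguments
trainfile: data/train.csv #training set file
testfile: data/test.csv   #testing set file
"""
config = parse_config(example)
print(config)
```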

 

It is also possible to clone the relevant parameters of a configuration file. This is useful when malibu has been updated or when you need to run a similar experiment with another learning algorithm. This is done as follows:

./program -self clone < test_old.cfg > test_new.cfg

Note: any error messages printed while cloning can be ignored.

 

Command-line

 

The command-line arguments correspond to the parameters in the configuration file. Simply prepend '-' to the parameter-name and add the value separated by a space, e.g. -trainfile data/train.csv.
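As an illustration of this mapping, the sketch below turns a dictionary of parameters into a command-line argument list. The helper name to_command_line is hypothetical, and ./program stands in for whichever malibu executable is being run.

```python
def to_command_line(params):
    """Turn {'trainfile': 'data/train.csv'} into ['-trainfile', 'data/train.csv']."""
    args = []
    for name, value in params.items():
        args.append("-" + name)   # '-' followed by the parameter-name
        args.append(str(value))   # value passed as a separate argument
    return args


args = to_command_line({"trainfile": "data/train.csv",
                        "testfile": "data/test.csv"})
print(" ".join(["./program"] + args))
```

The list form can be passed directly to subprocess.run without going through a shell.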

 

Parameter Types

 

malibu supports several types of parameters. Type checking is the first layer of validation, ensuring that every parameter has an appropriate value. The comment indicates the parameter type.

  • boolean - expects one of two values
    • values: Yes,Y,y,1 or No,N,n,0
  • position - expects an integer or position option
    • values: integer or Begin,B,b,0 or End,E,e,-1
  • option - expects one of several values (* indicates any value)
    • value: Yes:1 --> Yes or 1 is valid
  • string - expects any value
  • vector - expects a series of tokens, separated by a "," and/or ":"
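The accepted values listed above can be checked before a run. The following is a rough sketch of such checks, not malibu's validation code: the function names are hypothetical, and matching is assumed to be case-sensitive and limited to exactly the spellings listed.

```python
import re

TRUE_VALUES = {"Yes", "Y", "y", "1"}
FALSE_VALUES = {"No", "N", "n", "0"}
BEGIN_VALUES = {"Begin", "B", "b"}
END_VALUES = {"End", "E", "e"}


def parse_boolean(value):
    """Map a boolean parameter value to True or False."""
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError("not a boolean value: %r" % value)


def parse_position(value):
    """Map a position value to an integer (Begin is 0, End is -1)."""
    if value in BEGIN_VALUES:
        return 0
    if value in END_VALUES:
        return -1
    return int(value)  # raises ValueError for anything else


def parse_vector(value):
    """Split a vector value on ',' and/or ':' into its tokens."""
    return [tok for tok in re.split(r"[,:]+", value) if tok]


print(parse_boolean("Y"), parse_position("End"), parse_vector("1,2:3"))
```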

Dataset Format

 

The malibu dataset format is a flexible variation on the standard CSV format. It has two sections: header and data.

 

The header is the first row in the file and must contain non-numerical entries for AUTO detection to work (see the example below). Otherwise, the auto-detection parser assumes there is no header; in that case, you can set -header Yes to declare the header explicitly.

ID, Class, Code1, 2Code, Three
g1, dnahh, 0.333, 0.421, 5.000
g2, nothh, 0.823, 4.034, 5.000

The token separating values may be a comma, tab or space. By default, the first column is considered a label and the first column after the label(s) is considered the class attribute.
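A minimal sketch of these defaults is given below. It is not malibu's reader: the function names are hypothetical, a single label column is assumed, and quoted strings that contain a separator are not handled.

```python
import re


def split_row(line):
    """Split a row on commas, tabs, or spaces."""
    return [tok for tok in re.split(r"[,\t ]+", line.strip()) if tok]


def split_example(line):
    """Apply the defaults: first column is the label, next is the class."""
    tokens = split_row(line)
    label, class_value, attributes = tokens[0], tokens[1], tokens[2:]
    return label, class_value, attributes


print(split_example("g1, dnahh, 0.333, 0.421, 5.000"))
print(split_example("g2\tnothh\t0.823\t4.034\t5.000"))
```

Attribute values are kept as strings here because a column may be nominal rather than real.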

g1 an1 dnahh 0.333 0.421 5.000 1
g2 an2 nothh 0.823 4.034 5.000 1
g2 an2 nothh 0.223 1.034 3.000 0

The configuration parameters label and class can be used to deal with non-standard dataset files such as the example above: -label 3 -class 1000. A large value for class sets the class attribute to the last column. Note that the class attribute is always treated as a string. The dataset may also contain sparse attributes (so far only used for LIBSVM). A sparse attribute has the format index:value, e.g. 1:0.3, and the sparse index starts at 1.
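The sparse index:value format can be illustrated with a short sketch. The function names parse_sparse and to_dense are hypothetical, and treating an absent index as zero is an assumption borrowed from the usual LIBSVM convention rather than something stated here.

```python
def parse_sparse(tokens):
    """Parse tokens such as '1:0.3' into a {index: value} dict (1-based)."""
    values = {}
    for token in tokens:
        index_text, sep, value_text = token.partition(":")
        if not sep:
            raise ValueError("not a sparse attribute: %r" % token)
        index = int(index_text)
        if index < 1:
            raise ValueError("sparse index starts at 1: %r" % token)
        values[index] = float(value_text)
    return values


def to_dense(values, n_attributes):
    """Expand a sparse dict to a list; absent indices become 0.0 (assumption)."""
    dense = [0.0] * n_attributes
    for index, value in values.items():
        dense[index - 1] = value  # shift the 1-based index to 0-based
    return dense


sparse = parse_sparse(["1:0.3", "4:1.5"])
print(sparse)
print(to_dense(sparse, 5))
```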

 

The dataset supports two attribute types: nominal and real. Nominal attributes must contain non-numerical characters or be quoted strings. A distance-based algorithm (e.g. LIBSVM) cannot support nominal attributes, and by default this combination causes the program to exit with an error. To circumvent this condition, one can set -strict No; this tells each algorithm to ignore such dataset/algorithm inconsistencies. Keep in mind that using this option is not recommended and may cause a learning algorithm to become unstable.

 

 

Output Format

 

The malibu output format is a standard flat file in which each section is identified by a two-character prefix, optionally followed by a two-character type suffix:

  • LS - Training set statistics
  • VS - Validation set statistics
  • TS - Test set statistics
  • ET{type} - Experiment Type
  • SP{type} - Standard prediction
  • {type} - Types
  • __ - Configuration file

#LS Training Set> dna20appex
#LS Examples: 289
#LS Attributes: 945
#LS Classes: 2
#LS Missing: 0
#LS Positive: NOT
#LS DNA: 73
#LS NOT: 216
#ETVA ADtree | 10-CV | dna20appex
#SPVA 1a36 0 -3.13177
#SPVA 3cro 0 -1.7068
#SPVA 1tsr 0 -0.706242

1. Dataset description (LS | VS | TS)

The dataset description summarizes the characteristics of the dataset as well as the name and location of the dataset. It includes a summary of the validation method and learning algorithm.

 

The data characteristics include:

  • Dataset type and filename
  • number of bags (optional)
  • number of examples
  • number of attributes
  • number of classes
  • number of missing attributes
  • number of zeros
  • number of binary attributes
  • number of discrete attributes
  • number of numeric attributes
  • number of nominal attributes
  • identifier of the positive class
  • class labels and proportion of examples (bags)

2. Experiment Type (ET{type})

This single line describes the experiment type:

  • Learning algorithm name
  • Validation type
  • Dataset name

3. Standard prediction for each run (SP{type})

The standard prediction section minimally describes each example and its prediction.

  • Example Label (or index if no label)
  • Class index
  • Confidence-rated prediction

4. Evaluation type suffix ({type})

The suffix appended to the experiment type (ET) and standard prediction (SP) prefixes:

  • TS - test set
  • VS - validation set
  • SC - self-consistency
  • VA - validation algorithms
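The standard prediction lines and their type suffix can be extracted from an output file with a short script. The sketch below is an assumption-laden illustration, not part of malibu: the names Prediction and parse_predictions are hypothetical, each line is assumed to start with '#' followed by the prefix as in the sample output, and fields are assumed to be separated by whitespace.

```python
from collections import namedtuple

Prediction = namedtuple("Prediction", "kind label class_index confidence")

SUFFIXES = {
    "TS": "test set",
    "VS": "validation set",
    "SC": "self-consistency",
    "VA": "validation algorithms",
}


def parse_predictions(lines):
    """Collect the standard prediction (SP{type}) lines of an output file."""
    predictions = []
    for line in lines:
        if not line.startswith("#SP"):
            continue  # some other section
        fields = line.split()
        suffix = fields[0][3:]  # '#SPVA' -> 'VA'
        if suffix not in SUFFIXES or len(fields) != 4:
            continue  # unexpected layout; skip rather than guess
        label, class_index, confidence = fields[1:]
        predictions.append(
            Prediction(suffix, label, int(class_index), float(confidence)))
    return predictions


sample = [
    "#ETVA ADtree | 10-CV | dna20appex",
    "#SPVA 1a36 0 -3.13177",
    "#SPVA 3cro 0 -1.7068",
]
for p in parse_predictions(sample):
    print(SUFFIXES[p.kind], p.label, p.class_index, p.confidence)
```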

5. Graphical illustration of model (GM)

Some classifiers have a model that can be illustrated as a graph, e.g. a decision tree. Currently, only the alternating decision tree (ADTree) algorithm supports such a graph-based model.

6. Configuration file

At the end of the output file is a configuration file that describes the experiment that was run. In fact, the entire output file is commented in such a way that it can act as a configuration file for a repeated experiment.

 
