Data Mining and Machine Learning in Bioinformatics (BioE 594)
Data mining of biological and medical data is an increasingly popular area of bioinformatics. Biological and medical phenomena are being studied with high-throughput methods that generate large amounts of data. To understand the complicated and highly interactive nature of biology from this data, advanced data mining methods and algorithms are required. This course covers the fundamentals of data mining and how they are applied and adapted to the special characteristics of biological and medical data.
Topics covered:
- Introduction to data mining
- Basic data mining algorithms
- Supervised and unsupervised classification
- Association rule mining
- Bioinformatics applications of data mining
- Medical applications of data mining
- Text mining
Syllabus:
- Lecture 1: Introduction: Themes of course and machine learning problems in bioinformatics
- Lecture 2: Data types, preprocessing, and visualization
- Lecture 3: Bioinformatics data, feature extraction and visualization
- Lecture 4: Simple Machine Learning algorithms: Basic concepts and Decision Trees
- Lecture 5: Evaluation of classifiers: Metrics, overfitting, and model expressiveness
- Lecture 6: More algorithms I: Rule-based classifiers, nearest neighbor, and Naive Bayes
- Lecture 7: More algorithms II: ANN, SVM, and ensemble classifiers
- Lecture 8: Introduction to machine learning workbenches: Malibu, Yale, and Weka
- Lecture 9: Bioinformatics models I: Models using the above algorithms
- Lecture 10: Bioinformatics models II: Models using the above algorithms
- Lecture 11: Unsupervised learning: Clustering algorithms
- Lecture 12: Unsupervised learning: Graph-based clustering
- Lecture 13: Clustering in Bioinformatics
- Lecture 14: Text mining I (X)
- Lecture 15: Text mining II (X)
- Lecture 16: Mining biological literature application papers
- Lecture 17: Advanced topics: Multiple instance learning
- Lecture 18: Advanced topics: SVM with string- and graph-based kernels in sequence analysis
- Lecture 19: Advanced topics: Semi-supervised learning
- Lecture 20: Advanced topics: Bayesian statistics
We will normally post two lectures per week (on Mondays).
Prerequisites:
- College-level math
- Programming knowledge of Java (preferred), Perl, or C++
- A course in algorithms or equivalent knowledge
Textbook:
To be determined. Beyond the lecture slides, we will post additional materials as the lectures progress: online resources, tutorials, papers, and our own written materials.
Grading:
- Homework
  - Worth 100 points; assigned each week (8-12 homework assignments in total).
  - Posted on Wednesday and due the following Wednesday.
  - In some weeks, we may post homework early. This will not affect the due date.
  - Late homework will be accepted until the first Friday following the due date, with a penalty of 20 points per day late. Homework will not be accepted after that Friday.
- Comprehensive project (due on the same day as the final exam)
- Midterm and final exams
