Everybody seems to be doing “knowledge discovery” these days, but few people have a clear grasp of the categories of techniques available. Even fewer know how and why the techniques can best be applied to their particular problems. In this class, we will look at the entire knowledge discovery process. As part of this process, we focus on interesting, cutting-edge data mining techniques that can be used in a wide variety of settings (business, science, web). We will discuss algorithmic details, implementation issues, and advantages and disadvantages, and we will look at many examples.
Upon successful completion of this class, you will have a thorough understanding of the techniques used in data warehousing and data mining applications, along with their advantages and disadvantages. You will have experience working on data mining projects and applying the techniques. You will be able to carry out your own data mining projects and succeed at knowledge discovery.
Recommended Book (but Optional):
Fundamentals of Natural Computing: Basic Concepts, Algorithms, and Applications (Chapman & Hall/CRC Computer and Information Sciences)
by Leandro Nunes de Castro (Author)
Publisher: Chapman & Hall/CRC; 1st edition (June 2, 2006)
ISBN-10: 1584886439
ISBN-13: 978-1584886433
You will be required to take notes in class. The instructor will not hand out class notes.
Links to open source software, interesting projects and other additional materials will be posted on the website.
Students must have basic programming skills. The use of open source and other software is encouraged, as is helping each other solve problems. You should have access to a computer for development and project demonstrations.
Each student will work on a real data mining problem from start to finish. Class time will also be devoted to the projects. However, each student will be responsible for his or her project and will be graded individually.
| Component | Weight |
| --- | --- |
| Knowledge Discovery Project (total) | 50% |
| Project: Problem Choice | |
| Project: Data Preprocessing | 10% |
| Project: Algorithm Proposal | 10% |
| Project: Evaluation Proposal | 10% |
| Project: Project Presentation | 5% |
| Project: Final Project | 15% |
| Software Review | 10% |
| Midterm Exam | 20% |
| Final Exam | 20% |
Letter grades (out of 100 points): 90 or above = A, 80 to 89 = B, 70 to 79 = C, below 70 = U
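As an illustration of how the weights and letter cutoffs combine, here is a minimal sketch in Python. The function names and the component keys are hypothetical, and the sketch assumes that each component is scored out of 100, that the cutoffs are inclusive, and that no rounding is applied; the instructor's actual calculation may differ.

```python
# Weights in percent, copied from the grading table.
# The five project parts add up to the 50% project total.
WEIGHTS = {
    "data_preprocessing": 10,
    "algorithm_proposal": 10,
    "evaluation_proposal": 10,
    "project_presentation": 5,
    "final_project": 15,
    "software_review": 10,
    "midterm_exam": 20,
    "final_exam": 20,
}


def course_score(scores):
    """Return the weighted course score on a 0-100 scale.

    `scores` maps every component name in WEIGHTS to a score out of 100.
    """
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(score):
    """Map a 0-100 score to a letter (assumption: cutoffs are inclusive)."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    return "U"


if __name__ == "__main__":
    example = {name: 85 for name in WEIGHTS}  # made-up scores
    total = course_score(example)
    print(total, letter_grade(total))
```

Because the midterm and final exams together carry 40% of the grade, a strong project alone does not guarantee a passing letter grade.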
You are required to attend all lectures, including student presentations. It is your responsibility to obtain material from a fellow student if you miss a lecture. Office hours are not meant as individual lectures. Class notes will be vital to do well on exams. It will be your responsibility to study both the lecture notes and the chapters in the book.
Cheating during exams or any type of dishonesty will result in a failing grade for this class and will be reported to the University.
Although the use of open source software is allowed, it is the student’s responsibility to acknowledge this resource (with name and origin) and understand the software before using it.
The examinations will be closed book. Exams cannot be taken at a different time (even if the exam time differs from the one on the syllabus) unless permission to do so was requested and received at least two weeks before the exam. Failure to show up for the exam will result in a zero unless there was a documented emergency (doctor’s note, etc.).
Detailed assignments will be handed out in class. The assignments will also be posted on the class website. Assignments will be accepted late; however, the earned grade will be reduced by 20% for each day the assignment is late. Assignments more than 5 days late will not be accepted. If you cannot attend class, it is still your responsibility to ensure your assignment is submitted by the deadline.
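The late penalty can be read in more than one way. The sketch below assumes the simplest reading: the penalty is linear (20% of the earned grade is removed per whole day late, not compounded), so an assignment handed in exactly 5 days late is accepted but earns nothing. The function name is hypothetical and the policy as applied by the instructor takes precedence.

```python
def late_adjusted_grade(earned, days_late):
    """Return the grade after the late penalty, or None if not accepted.

    Assumptions: `days_late` is a whole number of days, and the 20%
    reduction per day is linear rather than compounding.
    """
    if days_late < 0:
        raise ValueError("days_late cannot be negative")
    if days_late > 5:
        return None  # more than 5 days late: not accepted
    remaining_percent = 100 - 20 * days_late
    return earned * remaining_percent / 100


if __name__ == "__main__":
    for days in range(7):
        print(days, late_adjusted_grade(90, days))
```

Under this reading, an assignment that earned 90 and arrives two days late is recorded as 54.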
Tentative course outline
| Date | Topics | Deadlines & Other Info (dates are subject to change) |
| --- | --- | --- |
| Jan 22 | Introduction; Class Overview | |
| Jan 29 | Association Rules & Apriori Algorithm; Project Introduction | Additional Data from Prof. Merolla |
| Feb 5 | Classification/Prediction: Decision Trees (Symbolic) | |
| Feb 12 | Classification/Prediction: Naïve Bayes (Statistical); Discussion of Project Topic | Be prepared to discuss your approach |
| Feb 19 | Classification/Prediction: Neural Networks (Connectionist) | Assignment 1 Due: Problem Choice |
| Feb 26 | Clustering: Classical | |
| Mar 4 | Software Review: Discussion; Review for Midterm Exam | Assignment 2 Due: Software Review |
| Mar 11 | MIDTERM EXAM | |
| Mar 18 | SPRING BREAK | |
| Mar 25 | Clustering: SOM (Connectionist); Project Work Time | Be prepared to discuss your approach |
| Apr 1 | Evaluating Results; Project Evaluation Discussion/Application; Project Work Time | Assignment Due: Data Preprocessing; Assignment 4: Algorithm Proposal |
| Apr 8 | Graph Search (Linear Search); Project Work Time | |
| Apr 15 | Genetic Algorithm (Parallel Search); Project Work Time | Assignment Due: Algorithm Choice; Assignment Due: Evaluation Choice |
| Apr 22 | Project Work Time (or TBA) | |
| Apr 29 | Class Review; Project Work Time | Assignment 6: Project Presentation |
| May 6 | Project Presentations | Assignment Due: Presentations |
| May 13 | FINAL EXAM | Assignment Due: Final Project |