Learning Multi-dimensional Classifiers on High-Volume Datasets

1.Abstract

Multi-dimensional classifiers are increasingly required to tackle classification problems on real-world datasets, specifically complex data consisting of hierarchical levels of attribute links. In this study, several supervised learning methods will be analysed: Bayesian networks, SVMs, decision trees, naive Bayes, and nearest neighbours. The contribution of this study lies in the multi-dimensional character of its datasets (i.e. datasets with multiple class variables and more than two features) and in the construction of multi-dimensional classifiers. This contrasts both with traditional single-label classification, which focuses on a single class variable taking binary or n-ary values, and with most previous research in this problem domain, which uses a multi-label approach (i.e. datasets with multiple class variables restricted to binary values). Parallelisation will also be applied to speed up the training and testing stages of the supervised learning methods.
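To make the data setting concrete, here is a minimal sketch (with hypothetical column names, not the actual transaction data) of how a multi-dimensional dataset differs from the single-label and multi-label settings: each row carries several class variables, and each class variable may take more than two values.

    import pandas as pd

    # Minimal sketch of a multi-dimensional dataset (column names are hypothetical).
    # Two class variables, each n-ary rather than binary, alongside ordinary features.
    data = pd.DataFrame({
        "feature_1":  [0.3, 1.7, 0.9, 2.4],
        "feature_2":  [12, 5, 8, 20],
        "class_type": ["debit", "credit", "transfer", "debit"],  # n-ary class variable 1
        "class_risk": ["low", "high", "medium", "low"],          # n-ary class variable 2
    })

    X = data[["feature_1", "feature_2"]]    # feature matrix
    Y = data[["class_type", "class_risk"]]  # multiple class variables: the multi-dimensional part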

2.Keywords
Multi-dimensional classification, parallelism, data mining
3.Objective

The objective of this research is to improve the learning speed and the performance of the classifiers, e.g. by using a larger number of datasets and by applying parallelisation techniques.

4.Methodology

The performance of several classification methods, namely Bayesian networks, SVMs, decision trees, naive Bayes, and nearest neighbours, on complex pre-processed transaction data will be compared and analysed in this study. Parallelisation will be applied to the construction stage of the Bayesian network classifiers, which is a known NP-hard problem, using IPython parallel architecture components. The constructed networks will then be evaluated and compared against the remaining non-Bayesian classifiers using several performance metrics: statistical significance tests, training and testing time, mean accuracy, and ROC.
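As an illustration of the intended parallelisation, the sketch below distributes one training job per data file across IPython parallel engines, using the 2015-era IPython.parallel API (moved to the separate ipyparallel package in later releases). The file names and the train_on_file routine are hypothetical placeholders, and a simple naive Bayes fit stands in for the far more expensive Bayesian network structure search; it assumes a cluster has already been started with, e.g., "ipcluster start -n 16".

    from IPython.parallel import Client

    def train_on_file(path):
        # Hypothetical per-file training routine; in the real experiment this
        # would run the Bayesian network structure search instead.
        import pandas as pd
        from sklearn.naive_bayes import GaussianNB
        data = pd.read_csv(path, sep="\t")
        X, y = data.iloc[:, :-1], data.iloc[:, -1]
        return path, GaussianNB().fit(X, y)

    client = Client()                    # connect to the running ipcluster
    view = client.load_balanced_view()   # dynamic load balancing over the engines

    files = ["sample_%02d.txt" % i for i in range(22)]  # the 22 pre-processed samples
    results = view.map_sync(train_on_file, files)       # blocks until all jobs finish

A load-balanced view is used rather than a direct view so that slower structure searches do not leave engines idle while faster ones finish.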

5.Team

Iftitahu Ni'mah

6.Computation plan (required processor core hours, data storage, software, etc.)

The previous experiment was carried out with the following working environment and experimental design*:
- a 16-core AMD Opteron machine with 65 GB RAM in total and a 500 GB hard disk
- 22 pre-processed data samples, 500 KB-1 MB per data file (txt), each consisting of 5,000 instances (rows), so the total size of the training data is approx. 22 MB

*) For the record, constructing one classifier model in parallel under this setting takes 4-6 days.

Future plan for this proposal:
- learning and constructing the classifiers from one single training dataset (264 MB on disk, consisting of 1 million instances/rows), or improving the previous experiment by using a larger number of training datasets
- testing the constructed classifiers on a single test dataset (200 MB, consisting of 800,000 instances/rows or fewer); a loading sketch is given below
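Because the 1-million-row training file may not fit comfortably in memory next to the models, the sketch below reads it incrementally with pandas; the file name and separator are assumptions for illustration.

    import pandas as pd

    # Read the large training file in chunks rather than all at once.
    # "train_1M.txt" and the tab separator are assumptions.
    n_rows = 0
    for chunk in pd.read_csv("train_1M.txt", sep="\t", chunksize=100000):
        n_rows += len(chunk)  # replace with per-chunk preprocessing / partial fitting
    print("total instances: %d" % n_rows)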

Software and libraries needed (free):
Python 2.7
IPython
Weka 3.7 (version 3.7.10 or later): http://www.cs.waikato.ac.nz/~ml/weka/downloading.html
Editor (Sublime or any other)
Netica-Java API: https://www.norsys.com/netica-j.html#download
Cytoscape: http://www.cytoscape.org/

Python libraries:
pandas
numpy
scipy
statsmodels, ggplot, matplotlib
scikit-learn or sklearn: http://scikit-learn.org/stable/install.html
BNFinder: https://pypi.python.org/pypi/BNfinder/2.0.4
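For the non-Bayesian classifiers, the comparison can be driven almost entirely by scikit-learn. The sketch below (using the 2015-era sklearn.cross_validation module; the synthetic data and default hyperparameters are assumptions, not the final setup) collects the metrics named in the methodology: training/testing time, mean cross-validated accuracy, and a paired t-test as a simple significance check.

    import time
    from scipy import stats
    from sklearn.datasets import make_classification
    from sklearn.cross_validation import cross_val_score  # moved to model_selection in sklearn >= 0.18
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier

    # Synthetic stand-in for one pre-processed 5,000-instance sample (assumption).
    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

    classifiers = {
        "SVM": SVC(),
        "DecisionTree": DecisionTreeClassifier(),
        "NaiveBayes": GaussianNB(),
        "kNN": KNeighborsClassifier(),
    }

    scores = {}
    for name, clf in classifiers.items():
        start = time.time()
        # 10-fold cross-validated accuracy; scoring="roc_auc" works for binary targets.
        scores[name] = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
        print("%s: mean accuracy %.3f, elapsed %.1fs"
              % (name, scores[name].mean(), time.time() - start))

    # Paired t-test on the per-fold accuracies as a simple significance check.
    t, p = stats.ttest_rel(scores["SVM"], scores["NaiveBayes"])
    print("SVM vs NaiveBayes: t = %.3f, p = %.3f" % (t, p))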

Weka libraries:
Mulan: http://mulan.sourceforge.net/download.html
Meka: http://meka.sourceforge.net/#download
Clus: http://clus.sourceforge.net/doku.php?id=download

 

7.Source of funding
independent research
8.Target/outputs
International journal and/or conference
9.Date of usage
01/09/2015 - 11/12/2015
10.GPU usage
-
11.Supporting files
12.Created at
14/08/2015
13.Approval status
approved