Multi-dimensional classifiers have been increasingly required to tackle classification problems on real-world datasets, in particular complex data with hierarchical levels of attribute links. In this study, several supervised learning methods will be analysed: Bayesian networks, SVMs, decision trees, naive Bayes, and nearest neighbours. The contribution of this study lies in the multi-dimensional character of its data-sets (i.e. data-sets with multiple class variables and more than two features) and in the construction of multi-dimensional classifiers. This contrasts both with traditional single-label classification, which focuses on a single class variable with binary or n-ary values, and with most previous research in this problem domain, which uses a multi-label approach (i.e. data-sets with multiple class variables restricted to binary features). Parallelisation will also be applied to speed up the training and testing stages of the supervised learning methods.
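To illustrate the multi-dimensional setting described above, the sketch below trains one classifier that predicts several class variables at once, using scikit-learn's MultiOutputClassifier. The data here is a synthetic placeholder (random features, two 3-ary class variables), not the transaction data used in this study.

```python
# Minimal sketch of multi-dimensional classification: one model,
# multiple class variables. Shapes and data are synthetic stand-ins.
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 5))            # 100 instances, 5 features
Y = rng.integers(0, 3, (100, 2))    # 2 class variables, each 3-ary

clf = MultiOutputClassifier(DecisionTreeClassifier(random_state=0))
clf.fit(X, Y)
pred = clf.predict(X)               # one prediction column per class variable
print(pred.shape)
```

Single-label classifiers, by contrast, would need one independently trained model per class variable.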
Multi-dimensional classification, parallelism, data mining
The objective of this research is to improve the learning speed and performance of classifiers, e.g. by using a greater number of data-sets and by applying parallelisation techniques.
The performance of several classification methods (Bayesian networks, SVMs, decision trees, naive Bayes, nearest neighbours) on complex pre-processed transaction data will be compared and analysed in this study. Parallelisation will be applied to the construction stage of the Bayesian network classifiers, which is known to be an NP-hard problem, using IPython's parallel architecture components. The constructed networks will then be evaluated and compared to the non-Bayesian classifiers using several performance metrics: statistical significance tests, training and testing time, mean accuracy, and ROC curves.
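The comparison loop can be sketched as follows with scikit-learn. The dataset here is a stand-in (sklearn's built-in iris data) rather than the pre-processed transaction data, and only accuracy and wall-clock training/testing time are measured; significance tests and ROC analysis would be layered on top.

```python
# Sketch: compare several classifiers on accuracy and on
# training/testing time. Dataset is a toy stand-in.
import time
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for name, clf in [("NaiveBayes", GaussianNB()),
                  ("kNN", KNeighborsClassifier()),
                  ("SVM", SVC()),
                  ("DecisionTree", DecisionTreeClassifier(random_state=0))]:
    t0 = time.perf_counter()
    clf.fit(X_tr, y_tr)                     # training time
    train_t = time.perf_counter() - t0
    t0 = time.perf_counter()
    acc = clf.score(X_te, y_te)             # testing time + accuracy
    test_t = time.perf_counter() - t0
    results[name] = (acc, train_t, test_t)
    print(f"{name}: acc={acc:.3f} train={train_t:.4f}s test={test_t:.4f}s")
```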
The previous experiment was carried out with the following working environment and experimental design *):
- 16-core AMD Opteron CPUs with 65 GB RAM in total and 500 GB of hard-disk space
- 22 pre-processed data samples of 500 KB-1 MB per data file (txt), each consisting of 5,000 instances (rows), so the total size of the training data is approx. 22 MB
*) For the record, constructing one model (classifier) in parallel with this setting takes 4-6 days.
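Given run times of several days per model, distributing independent model-construction tasks across the 16 cores is the natural lever. A minimal sketch of the pattern, using Python's standard library as a stand-in for IPython's parallel components, with a toy scoring function in place of the expensive Bayesian-network structure scoring:

```python
# Sketch: farm independent candidate-scoring tasks out to worker
# processes. score_structure is a toy placeholder for expensive
# Bayesian-network structure scoring.
from concurrent.futures import ProcessPoolExecutor

def score_structure(seed):
    # Placeholder for an expensive structure-scoring computation.
    total = 0
    for i in range(1, 1000):
        total += (seed * i) % 7
    return seed, total

if __name__ == "__main__":
    candidates = range(8)                    # e.g. 8 candidate structures
    with ProcessPoolExecutor(max_workers=4) as pool:
        scores = dict(pool.map(score_structure, candidates))
    best = max(scores, key=scores.get)
    print("best candidate:", best)
```

With IPython's parallel architecture the same map-over-candidates shape applies, with engines replacing the local worker processes.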
Future plans for this proposal:
- learn and construct the classifiers from one single training data-set (hard-disk size: 264 MB, consisting of 1 million instances/rows), or improve the previous experiment by using a greater number of training data-sets
- test the constructed classifiers on a single test data-set (size: 200 MB, consisting of 800,000 instances/rows or fewer)
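A 264 MB, 1-million-row training file may not fit comfortably in memory alongside the learners, so out-of-core training is one option. The sketch below uses scikit-learn's partial_fit on successive chunks; the chunks are synthetic stand-ins for slices of the real training file, and SGDClassifier is just one example of an estimator that supports incremental learning.

```python
# Sketch: out-of-core (chunked) training with partial_fit.
# Chunks here are synthetic stand-ins for slices of a large file.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([0, 1])                   # must be declared up front
clf = SGDClassifier(random_state=0)

for _ in range(10):                          # e.g. 10 chunks of 1,000 rows
    X = rng.random((1000, 5))
    y = (X[:, 0] > 0.5).astype(int)          # toy labels
    clf.partial_fit(X, y, classes=classes)

X_test = rng.random((100, 5))
pred = clf.predict(X_test)
print(pred[:5])
```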
Software and libraries needed (all free):
Weka 3.7 (version 3.7.10 or later): http://www.cs.waikato.ac.nz/~ml/weka/downloading.html
Editor (Sublime Text or any other)
Netica-Java API: https://www.norsys.com/netica-j.html#download
statsmodels, ggplot, matplotlib
scikit-learn (sklearn): http://scikit-learn.org/stable/install.html
International journal and/or conference
01/09/2015 - 11/12/2015