Machine Learning Regression Model

1.Abstract

Machine learning regression models are powerful tools for making predictions and revealing patterns in big data (Bzdok et al., 2018). They are scalable, flexible, and capable of splitting processes into smaller chunks that run simultaneously, i.e. parallelisation (Upadhyaya, 2013). However, modelling spatial patterns with these techniques is a significant challenge, as most machine learning regression methods are not designed to deal with spatial data (Santibanez, Lakes, et al., 2015). Moreover, an excellent goodness of fit can be achieved when the data are highly clustered, but this might indicate overfitting (Santibanez, Lakes, et al., 2015). In other words, when the data contain dense clusters, each with its own features that differ from those of the other clusters, the model tries to learn the details of each cluster as a concept, but this concept cannot be applied to new data. This negatively affects the ability of the model to generalize. In addition, noise in the data can induce spatial patterns that might lead to overfitting (Rocha et al., 2018). To handle clustering in the data, a new approach, mixed effects machine learning, has been proposed (Cho, 2010; Hajjem, Bellavance, & Larocque, 2014; Luts, Molenberghs, Verbeke, Van Huffel, & Suykens, 2012; Seok, Shim, Cho, Noh, & Hwang, 2011).
Mixed effects models are well suited to datasets with a clustered structure. Clustered data emerge when the observations can be classified into a number of different groups (Galbraith, Daniel, & Vissel, 2010). The cluster structure can be longitudinal or hierarchical. A longitudinal structure arises when multiple observations are measured within the same cluster, for instance within a bare soil cluster or a forest land cover cluster. A hierarchical structure treats each observation as a separate cluster and then merges clusters that are similar, for instance deciduous forest land cover contained within forest land cover. Each cluster is distinct from the others. The Mixed-Effects Random Forest (MERF) approach showed significant improvements over a global random forest when the random effects are substantial (Hajjem et al., 2014). In addition, a mixed effects support vector machine (MESVM) using a least squares kernel has been proposed for handling longitudinal data and highly unbalanced data (Cho, 2010; Luts et al., 2012; Seok et al., 2011). However, it is noteworthy that no library (code) implementing the MESVM approach for regression is available.
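
The cluster-specific effect described above can be illustrated with a minimal sketch, assuming synthetic data with hypothetical cluster offsets. It is not MERF and not the project's code: a pooled least-squares line stands in for the global model, and the mean training residual per cluster stands in for the random intercept, which a real mixed effects model would estimate with shrinkage.

```python
from statistics import mean
import random

random.seed(0)

# Hypothetical synthetic data: 5 clusters, each shifted by its own offset.
TRUE_SLOPE = 2.0
CLUSTER_OFFSETS = [-4.0, -2.0, 0.0, 2.0, 4.0]  # assumed values, for illustration only

train, test = [], []  # rows are (cluster id, x, y)
for cluster, offset in enumerate(CLUSTER_OFFSETS):
    for i in range(40):
        x = random.uniform(0.0, 10.0)
        y = TRUE_SLOPE * x + offset + random.gauss(0.0, 0.5)
        # First 30 rows of every cluster go to training, the last 10 to testing.
        (train if i < 30 else test).append((cluster, x, y))


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - slope * mx, slope


def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return (sum(e * e for e in errors) / len(errors)) ** 0.5


# Global model: one line for all rows, cluster membership ignored.
intercept, slope = fit_line([x for _, x, _ in train], [y for _, _, y in train])

# Cluster effect: mean training residual of the global model within each cluster.
cluster_effect = {
    c: mean(y - (intercept + slope * x) for cc, x, y in train if cc == c)
    for c in range(len(CLUSTER_OFFSETS))
}

rmse_global = rmse([y - (intercept + slope * x) for _, x, y in test])
rmse_mixed = rmse(
    [y - (intercept + slope * x + cluster_effect[c]) for c, x, y in test]
)
print(f"global RMSE: {rmse_global:.2f}, with cluster effects: {rmse_mixed:.2f}")
```

The correction only helps for clusters that were seen during training; for an unseen cluster, the prediction falls back to the global model.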

2.Keywords
Random Forest, SVM, Mixed Effects, Machine Learning, Spatial
3.Objective

To evaluate a generic machine learning regression model against a mixed effects machine learning regression model using real-world spatial data.

4.Methodology

- 3 real-world datasets
- spatial processing
- machine learning model development
- tuning
- evaluation
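
The model development, tuning, and evaluation steps listed above might look like the following sketch in scikit-learn. The data, the parameter grid, and the choice of grouped cross-validation (GroupKFold, so that all rows of a cluster stay in the same fold) are assumptions made for illustration, not the project's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, GroupKFold

rng = np.random.default_rng(42)

# Hypothetical stand-in for one processed dataset:
# 240 rows, 3 features, 6 clusters of 40 rows each.
n_rows = 240
X = rng.normal(size=(n_rows, 3))
groups = np.repeat(np.arange(6), 40)
y = 3.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=n_rows)

# Hold out one whole cluster for the final evaluation.
test_mask = groups == 5
X_train, y_train, g_train = X[~test_mask], y[~test_mask], groups[~test_mask]
X_test, y_test = X[test_mask], y[test_mask]

# Tuning: grid search scored by RMSE, with folds that never split a cluster.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [20, 50], "max_depth": [3, None]},
    cv=GroupKFold(n_splits=5),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train, groups=g_train)

# Evaluation: RMSE of the refitted best model on the held-out cluster.
pred = search.predict(X_test)
test_rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
print(search.best_params_, round(test_rmse, 3))
```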

5.Team

Afnindar Fakhrurrozi (Research Center for Geotechnology - Indonesian Institute of Sciences (LIPI))
Prof. Dr. Raul Zurita Milla (GIP, ITC - Twente University)
Dr. Rania O Konaidi (GIP, ITC - Twente University)

6.Computation plan (required processor core hours, data storage, software, etc)

all processor cores will be used (n_jobs = -1)
software: scikit-learn, gdal/ogr (proj), anaconda, latest merf, matplotlib, git, python 3.6, pydotplus, graphviz, pandas-profiling, pandas
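
As a minimal sketch of the all-cores setting, assuming a scikit-learn random forest and synthetic stand-in data: `n_jobs=-1` asks scikit-learn to use every available core. With a fixed `random_state`, the serial and parallel fits should give matching predictions, so the setting changes the run time and not the model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 500 rows, 5 features.
X = rng.normal(size=(500, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=500)

# Same forest fitted on one core and on all available cores.
serial = RandomForestRegressor(n_estimators=40, random_state=0, n_jobs=1).fit(X, y)
parallel = RandomForestRegressor(n_estimators=40, random_state=0, n_jobs=-1).fit(X, y)

# Compare predictions up to floating-point rounding.
same_predictions = np.allclose(serial.predict(X), parallel.predict(X))
print("predictions match:", same_predictions)
```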

lesson learned: the datasets contain more than 2 million rows and at least 10 columns (5 years of data), so they need a large amount of RAM. my 16 GB of RAM cannot handle this much data; the process always runs out of memory and crashes. a learning process completed successfully only with 2 features and 1 year of data, and it took 1 day (24 hrs).
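
One way to reduce memory pressure of this kind, sketched here with pandas under stated assumptions: read only the needed columns, use compact dtypes, and process the file in chunks while keeping one year at a time. The column names, years, and chunk size are hypothetical, and a small in-memory CSV stands in for the real files.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for a large CSV; column names are illustrative only.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "year": rng.integers(2013, 2018, size=1000),  # 5 assumed years
        "x": rng.normal(size=1000),
        "y": rng.normal(size=1000),
        "unused": rng.normal(size=1000),
    }
)
csv_text = demo.to_csv(index=False)

# Read only the needed columns, with compact dtypes, 250 rows at a time.
reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["year", "x", "y"],
    dtype={"year": "int16", "x": "float32", "y": "float32"},
    chunksize=250,
)

# Keep a single year from each chunk, so the full table is never loaded at once.
parts = [chunk[chunk["year"] == 2015] for chunk in reader]
subset = pd.concat(parts, ignore_index=True)

print(len(subset), "rows,", int(subset.memory_usage(deep=True).sum()), "bytes")
```

Whether this is enough for a given model depends on how much memory the training step itself needs, which this sketch does not address.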

7.Source of funding
8.Target/outputs
Thesis
9.Date of usage
25/10/2018 - 31/01/2019
10.GPU usage
-
11.Supporting files
12.Created at
24/10/2018
13.Approval status
approved