Curse of Dimensionality in ML
IME provides machine learning and artificial intelligence solutions that handle big data efficiently. Building an ML model is not a straightforward task: it needs engineers experienced in both the business domain and statistics to extract the optimal features from training data and achieve accurate predictions in real time. In this article, we explain the meaning of the ‘Curse of Dimensionality’ problem that ML modellers may face, its effect on model complexity, and how it can be avoided.
Definition of Curse of Dimensionality
The curse of dimensionality refers to non-intuitive properties of data observed when working in a high-dimensional space. The dimension of the data is the number of features that represent an observation or data point, also called the feature length. The dimension space is the set of available values for each dimension.
In other words, the curse of dimensionality covers various phenomena that arise when analyzing and organizing data whose dimension is very high. Its domain includes sampling, optimization, distance functions, nearest neighbor search, anomaly detection, and machine learning.
Curse of dimensionality and ML modeling
Most machine learning techniques use segmentation and clustering in modelling; these methods rely on computing distances between data points to classify them and identify similar ones. Building a better model depends heavily on carefully selecting the most informative set of predictive features.
As shown in Figure 1, the dimension space grows exponentially with the dimension: as the dimension increases from 1 to 2 to 3, the space grows from 10 to 100 to 1000. For higher dimensions, in the hundreds or millions, the situation becomes far harder to visualize and manipulate.
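A minimal sketch (not from the article) can make this concrete: as the dimension grows, distances between random points concentrate, so the nearest and farthest neighbors of a query become almost equally far away, which is exactly why distance-based methods like clustering degrade.

```python
import numpy as np

rng = np.random.default_rng(0)

def min_max_distance_ratio(n_points, dim):
    """Ratio of nearest to farthest neighbor distance from a random query
    to random points in the unit hypercube; a ratio near 1 means distances
    have concentrated and carry little discriminative information."""
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()

for d in (2, 10, 100, 1000):
    print(f"d={d:5d}  min/max distance ratio = {min_max_distance_ratio(1000, d):.3f}")
```

The ratio climbs toward 1 as the dimension increases, illustrating how "nearest" loses its meaning in high-dimensional spaces.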
High-dimension impact on relevant run-time issues
In addition to distances and volumes, the number of dimensions raises practical problems such as memory usage and long execution times. Their complexity is often non-linear, so costs escalate as the number of dimensions increases.
Because of that exponential increase, many optimization methods fail to reach the global optimum and settle for a local one. Optimization must therefore use search-based algorithms, such as gradient descent, genetic algorithms, and simulated annealing, instead of a closed-form solution. The likelihood of correlation among features also grows with dimension, and parameter estimation becomes harder, leading to more difficulties in regression approaches.
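As an illustrative sketch (the data and learning rate here are hypothetical), gradient descent can minimise a least-squares objective iteratively instead of solving the normal equations in closed form, which becomes costly and numerically fragile as the number of features grows:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))          # 200 samples, 5 features (toy data)
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w                              # noiseless targets for clarity

w = np.zeros(5)
lr = 0.01                                   # step size (assumed, tuned for this toy)
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
    w -= lr * grad                          # search step toward the optimum

print(np.round(w, 3))                       # should approach true_w
```

Search-based methods like this scale to problems where inverting the feature covariance matrix, as a closed-form solution requires, is impractical.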
Dealing with High-dimension
Correlation analysis, clustering, information value, the variance inflation factor, and principal component analysis (PCA) are some of the ways in which the number of dimensions can be reduced.
Curse of dimensionality on the scope of Big Data
Variety is one of the components of the Big Data definition; it relates to varied inputs and hence massive multivariate data sets, which in turn need filtering to retain only the effective features. Visualizing a small number of features is simpler and clearer than visualizing a huge number of them.
Some ways to reduce dimensions:
- Removing features with many missing values: these data columns are less likely to carry information.
- Low Variance Filter: similar to the previous point, data columns with little variation carry less information. Thus, all features with variance lower than a predefined threshold are removed. Values must be normalized before filtering.
- High Correlation Filter: data columns with very similar trends or directions are also likely to carry similar information, so one of them is sufficient to feed the machine learning model. Pairs of columns with a correlation coefficient higher than a suitable threshold can be reduced to one. As correlation is scale-sensitive, features must be normalized for an accurate correlation comparison.
- Principal Component Analysis (PCA): a statistical procedure that transforms the original n dimensions of the dataset into m orthogonal (uncorrelated) dimensions, where m < n.
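The filters above can be sketched in a few lines of NumPy. This is a minimal illustration, not production code: the thresholds and the toy dataset are assumptions, and PCA is implemented via an SVD of the centred data.

```python
import numpy as np

def low_variance_filter(X, threshold=1e-3):
    """Keep columns whose variance, after min-max normalization, exceeds threshold."""
    Xn = (X - X.min(axis=0)) / np.ptp(X, axis=0).clip(min=1e-12)
    return X[:, Xn.var(axis=0) > threshold]

def high_correlation_filter(X, threshold=0.95):
    """Drop the later column of each pair whose correlation exceeds threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    upper = np.triu(corr, k=1)              # only compare each pair once
    keep = ~(upper > threshold).any(axis=0)
    return X[:, keep]

def pca(X, m):
    """Project the centred data onto its top-m principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:m].T

# Toy data: 4 informative columns, plus a duplicate and a constant column.
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 4))
X = np.column_stack([X, X[:, 0], np.full(100, 5.0)])

X = low_variance_filter(X)       # removes the constant column
X = high_correlation_filter(X)   # removes the duplicated column
Z = pca(X, 2)                    # n=4 dimensions reduced to m=2
print(X.shape, Z.shape)
```

In practice, libraries such as scikit-learn provide equivalent, better-tested implementations of these steps.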
The correct use of input dimensions is key to a successful ML application. Working with a large number of features, the so-called curse of dimensionality, makes data visualization harder, leaves the data more sparsely covered, and makes the ML model more complex to validate. In addition, redundant features become more likely and can cause overfitting. Thus, accurately selecting the most relevant and discriminative features is an important part of the ML modelling process.
At IME, we build machine learning models for several use cases across many fields: finance, health, education, government, and other sectors. We customize the ML models efficiently according to customer needs and the nature of the field. Making good use of the customer’s historical data, we design a suitable ML model to predict relevant actions and support important decisions based on these predictions.