Complementary information for the paper published at the IEEE International Conference on Fuzzy Systems 2012 (FUZZ-IEEE)

A Preliminary Study on Missing Data Imputation in Evolutionary Fuzzy Systems of Subgroup Discovery

In real-life data mining problems, information is frequently lost because attributes contain missing values. Missing values can arise from problems in manual data entry procedures, equipment errors or incorrect measurements. Their presence affects the results obtained by any knowledge extraction approach. In subgroup discovery specifically, it can reduce the quality of the subgroups obtained, as measured by sensitivity, confidence, significance or unusualness.
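As a rough illustration of three of the quality measures named above, the following sketch computes sensitivity, confidence and unusualness for a single subgroup using their usual crisp definitions. The function name `subgroup_measures` and its boolean-list inputs are hypothetical and chosen only for this example; the fuzzy algorithms in the study may weight examples by their degree of membership, which this sketch ignores.

```python
def subgroup_measures(covered, positive):
    """Crisp quality measures for one subgroup (illustrative sketch only).

    covered[i]  -- True if the subgroup's condition holds for example i
    positive[i] -- True if example i belongs to the target class
    """
    n = len(covered)
    n_cond = sum(covered)    # examples covered by the condition
    n_class = sum(positive)  # examples of the target class
    n_both = sum(1 for c, p in zip(covered, positive) if c and p)

    # Sensitivity: fraction of the target class that the subgroup covers.
    sensitivity = n_both / n_class if n_class else 0.0
    # Confidence: fraction of covered examples that are of the target class.
    confidence = n_both / n_cond if n_cond else 0.0
    # Unusualness (weighted relative accuracy): coverage times the gain
    # in class probability over the whole data set.
    unusualness = (n_cond / n) * (confidence - n_class / n) if n else 0.0

    return {
        "sensitivity": sensitivity,
        "confidence": confidence,
        "unusualness": unusualness,
    }


covered = [True, True, True, False, False, False, False, False]
positive = [True, True, False, True, False, False, False, False]
measures = subgroup_measures(covered, positive)
```

Under these definitions, an attribute value lost from an example can change whether the condition covers it, which is one way missing data may shift all three measures.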

This paper presents an experimental study that analyses the effect of different missing data imputation mechanisms on the subgroup discovery algorithms based on evolutionary fuzzy systems presented in the literature. The analysis is carried out on a large number of data sets obtained from the KEEL repository. Among all the imputation techniques, K-Nearest Neighbour imputation stands out as the best option. In summary, experts who need to analyse a problem with a high percentage of missing values should use this imputation method in order to treat the data correctly and to obtain meaningful descriptive knowledge. In addition, the results show that NMEEF-SD is the evolutionary fuzzy system with the best results in the missing values scenario.
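To make the idea behind nearest-neighbour imputation concrete, here is a minimal sketch for numeric attributes. It is not the implementation used in the study or in KEEL: the function name `knn_impute`, the use of `None` to mark missing values, the distance normalised over shared attributes, and the mean of the neighbours' values are all assumptions made for illustration. Nominal attributes would typically be filled with the most frequent neighbour value instead.

```python
import math


def knn_impute(rows, k=3):
    """Fill None entries with the mean of the k nearest donors (sketch).

    rows -- list of equal-length lists of floats, with None for missing.
    Returns a new list of rows; the input is not modified.
    """
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows in which attribute j is observed.
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                # Compare only attributes observed in both rows.
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            if not candidates:
                continue  # no usable donor: leave the value missing
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


data = [
    [1.0, 2.0],
    [1.1, 2.2],
    [5.0, 10.0],
    [1.05, None],  # second attribute is missing
]
filled = knn_impute(data, k=2)
```

In this toy data set the last row is closest to the first two rows, so its missing value is filled from them rather than from the distant third row. In practice the attributes would usually be normalised first so that no single attribute dominates the distance.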

IV. Experimental Study

The complete results table can be found below: