Development of models for imputation of data from social networks on the basis of an extended matrix of attributes

Authors

Olesia Slabchenko, Valeriy Sydorenko, Xavier Siebert

DOI:

https://doi.org/10.15587/1729-4061.2016.74871

Keywords:

data imputation, extended matrix of attributes, imputation model, ensemble of models

Abstract

Missing data cause various problems in the processing and analysis of real-world datasets. In this paper, we consider the structure of data from social network accounts and introduce imputation models, and ensembles of them, designed for this domain.

We analysed the structure and strength of correlations among data from social network accounts and propose an imputation approach based on an extended matrix of attributes. We justify a preliminary clustering step in missing-data processing, which mitigates the problem of a large number of unique values in the analysed variables. We designed imputation models based on association rules, a random forest, a support vector machine, a neural network, and the EM algorithm, each using preliminary clustering and the extended matrix of attributes. We compared the performance of these models with the most popular imputation method, “Most Common Value” (MCV), which is usually built into statistical packages. The results show that, on the two evaluation criteria used, the MCV method is not well suited to data from social network accounts.
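As a point of reference for the MCV baseline mentioned above, the following is a minimal sketch of most-common-value imputation for one nominal attribute. The record layout (a list of dictionaries with `None` marking a missing value), the function name `impute_mcv`, and the sample attribute `city` are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def impute_mcv(records, column):
    """Return copies of `records` with missing `column` values (None)
    replaced by the most common observed value of that column.

    Hypothetical helper for illustration; the record layout is assumed.
    """
    observed = [r[column] for r in records if r.get(column) is not None]
    if not observed:
        # Nothing observed, so there is no value to impute with.
        return [dict(r) for r in records]
    mcv = Counter(observed).most_common(1)[0][0]
    return [
        {**r, column: mcv} if r.get(column) is None else dict(r)
        for r in records
    ]


# Toy account data (invented for the example).
accounts = [
    {"city": "Kremenchuk"},
    {"city": None},
    {"city": "Kremenchuk"},
    {"city": "Mons"},
]
filled = impute_mcv(accounts, "city")
```

Every missing entry receives the same value regardless of the other attributes of the account, which is one reason such a baseline can struggle when a variable has many unique values.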

Using the proposed models, we developed ensembles of models for imputing nominal and numerical data. We show that, on the evaluation criteria considered, the ensembles handle missing values more effectively and more stably than the single models.
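The abstract does not state how the ensembles combine the outputs of the single models. Purely as an illustration, the sketch below assumes one common scheme: a majority vote for nominal values and an arithmetic mean for numerical values. The function names and sample predictions are hypothetical.

```python
from collections import Counter
from statistics import mean


def combine_nominal(predictions):
    """Majority vote over the values proposed by the single models.

    Assumed combination rule; the paper's actual rule may differ.
    """
    return Counter(predictions).most_common(1)[0][0]


def combine_numerical(predictions):
    """Arithmetic mean of the values proposed by the single models.

    Assumed combination rule; the paper's actual rule may differ.
    """
    return mean(predictions)


# Invented predictions from three hypothetical single models.
city = combine_nominal(["Kremenchuk", "Mons", "Kremenchuk"])
age = combine_numerical([20, 22, 24])
```

Combining several imputers in this way tends to dampen the errors of any one model, which is consistent with the more stable behaviour reported for the ensembles.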

Author Biographies

Olesia Slabchenko, Kremenchuk Mykhailo Ostohradskyi National University, Pershotravneva str., 20, Kremenchuk, Ukraine, 39600

Computer and Information Systems Department

Valeriy Sydorenko, Kremenchuk Mykhailo Ostohradskyi National University, Pershotravneva str., 20, Kremenchuk, Ukraine, 39600

Associate professor, PhD

Computer and Information Systems Department

Xavier Siebert, University of Mons, rue de Houdain, 9, Mons, Belgium, 7000

Associate professor, PhD

Mathematics and Operational Research Department

Published

2016-08-30

How to Cite

Slabchenko, O., Sydorenko, V., & Siebert, X. (2016). Development of models for imputation of data from social networks on the basis of an extended matrix of attributes. Eastern-European Journal of Enterprise Technologies, 4(2(82)), 24–34. https://doi.org/10.15587/1729-4061.2016.74871