Development of models for imputation of data from social networks on the basis of an extended matrix of attributes

Olesia Slabchenko; Valeriy Sydorenko; Xavier Siebert

doi:10.15587/1729-4061.2016.74871

Authors

Olesia Slabchenko, Kremenchuk Mykhailo Ostohradskyi National University, Pershotravneva str., 20, Kremenchuk, Ukraine, 39600, https://orcid.org/0000-0001-5408-7445
Valeriy Sydorenko, Kremenchuk Mykhailo Ostohradskyi National University, Pershotravneva str., 20, Kremenchuk, Ukraine, 39600, https://orcid.org/0000-0002-4449-073X
Xavier Siebert, University of Mons, rue de Houdain, 9, Mons, Belgium, 7000, https://orcid.org/0000-0003-0869-7968

Keywords:

data imputation, extended matrix of attributes, imputation model, ensemble of models

Abstract

Missing data cause various problems when processing and analysing real-world datasets. In this paper, we consider the structure of data from social network accounts and introduce imputation models, and ensembles of these models, designed for this domain.

We analysed the structure and strength of correlations among data from social network accounts and propose an approach to imputation based on an extended matrix of attributes. We justify a preliminary clustering step when processing missing data, which helps overcome the problem of a large number of unique values in the analysed variables. We designed imputation models based on association rules, a random forest, a support vector machine, a neural network, and the EM algorithm, each using preliminary clustering and the extended matrix of attributes. We compared the performance of these models with the most popular imputation method, “Most Common Value” (MCV), which is commonly built into statistical packages. The results show that, on both evaluation criteria considered, the MCV method is not well suited to data from social network accounts.
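For orientation, the MCV baseline mentioned above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation evaluated in the paper: the function name `impute_most_common`, the use of `None` as the missing-value marker, and the sample `cities` column are all assumptions made for the example.

```python
from collections import Counter


def impute_most_common(values, missing=None):
    """Fill missing entries with the most frequent observed value (MCV).

    `missing` is the marker used for absent values; None is assumed here.
    """
    observed = [v for v in values if v is not missing]
    if not observed:
        # Nothing to learn the mode from, so return the column unchanged.
        return list(values)
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is missing else v for v in values]


# Hypothetical "city" attribute taken from five accounts.
cities = ["Kyiv", None, "Mons", "Kyiv", None]
filled = impute_most_common(cities)
```

Because every gap receives the same value regardless of the other attributes of the account, a variable with many unique values is poorly served by this rule, which is consistent with the comparison reported above.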

Using the suggested models, we developed ensembles of models for imputing nominal and numerical data. We show that, under the same evaluation criteria, the ensembles handle missing values more effectively and more stably than the single models.
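The abstract does not state how the individual models are combined. One common way to build such ensembles, shown here purely as an assumption, is a majority vote for nominal attributes and an average for numerical ones. The helper names `combine_nominal` and `combine_numerical` and the sample predictions are hypothetical.

```python
from collections import Counter
from statistics import mean


def combine_nominal(predictions):
    """Majority vote over the values proposed by the individual models."""
    return Counter(predictions).most_common(1)[0][0]


def combine_numerical(predictions):
    """Arithmetic mean of the values proposed by the individual models."""
    return mean(predictions)


# Hypothetical outputs of three single models for one missing cell each.
nominal_guess = combine_nominal(["student", "engineer", "student"])
numerical_guess = combine_numerical([24.0, 26.0, 28.0])
```

Averaging or voting tends to dampen the errors of any single model, which is one plausible reason an ensemble can behave more stably than its members.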

Author Biographies

Olesia Slabchenko, Kremenchuk Mykhailo Ostohradskyi National University Pershotravneva str., 20, Kremenchuk, Ukraine, 39600

Computer and information systems department

Valeriy Sydorenko, Kremenchuk Mykhailo Ostohradskyi National University Pershotravneva str., 20, Kremenchuk, Ukraine, 39600

Associate professor, PhD

Computer and information systems department

Xavier Siebert, University of Mons, Belgium rue de Houdain, 9, Mons, Belgium, 7000

Associate professor, PhD

Mathematics and Operational Research Department
