Development of models for imputation of data from social networks on the basis of an extended matrix of attributes
DOI:
https://doi.org/10.15587/1729-4061.2016.74871
Keywords:
imputation of data, an extended matrix of attributes, an imputation model, an ensemble of models
Abstract
Missing data cause various problems when processing and analysing real-world datasets. In this paper, we consider the structure of data from social networks’ accounts and introduce models of imputation and their ensembles designed for this area.
We have analysed the structure and strength of correlations between data from social network accounts and suggested an approach to imputation based on an extended matrix of attributes. We have justified a preliminary clustering step when processing missing data, which helps overcome the problem of a large number of unique values in the variables being analysed. We have designed imputation models based on association rules, a random forest, a support vector machine, a neural network, and an EM algorithm, each using preliminary clustering and an extended matrix of attributes. We have compared the performance of these models with the most popular imputation method, “Most Common Value” (MCV), which is usually built into statistical packages. The results show that the MCV method is not well suited to data from social network accounts in terms of two evaluation criteria.
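For readers unfamiliar with the MCV baseline mentioned above, the following is a minimal sketch of the general technique: each missing entry in a nominal column is replaced by the most frequent observed value. It is not the authors' implementation. The function name `impute_mcv`, the use of `None` as the missing-value marker, and the sample city names are assumptions made for illustration.

```python
from collections import Counter


def impute_mcv(values):
    """Fill missing entries (None) with the most common observed value.

    Illustrative sketch of "Most Common Value" imputation; the name and
    the None marker are assumptions, not taken from the paper.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to learn from: return a copy unchanged.
        return list(values)
    # most_common(1) returns [(value, count)]; ties go to the value seen first.
    most_common_value = Counter(observed).most_common(1)[0][0]
    return [most_common_value if v is None else v for v in values]


# Hypothetical "city" attribute from a handful of accounts.
cities = ["Kyiv", None, "Lviv", "Kyiv", None]
filled = impute_mcv(cities)
```

Because every gap receives the same value, a baseline of this kind ignores any relationship between the missing attribute and the other attributes of the account.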
Using the suggested models, we have developed ensembles of models for imputing nominal and numerical data. We have shown that the ensembles handle missing values more effectively and more stably than the single models in terms of these evaluation criteria.
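The abstract does not state how the ensembles combine the outputs of their member models. As a rough illustration only, the sketch below assumes two common combination rules: a majority vote for nominal values and an arithmetic mean for numerical values. The function names and the sample predictions are hypothetical.

```python
from collections import Counter
from statistics import mean


def combine_nominal(predictions):
    """Majority vote over the values proposed by several imputation models.

    Assumed combination rule; ties go to the value proposed first.
    """
    return Counter(predictions).most_common(1)[0][0]


def combine_numerical(predictions):
    """Average the values proposed by several imputation models (assumed rule)."""
    return mean(predictions)


# Hypothetical outputs of three single models for one missing cell each.
occupation = combine_nominal(["student", "engineer", "student"])
age = combine_numerical([20.0, 22.0, 24.0])
```

Under these assumptions, a single model's outlying prediction is outvoted or averaged down, which is one way an ensemble can behave more stably than its members.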
License
Copyright (c) 2016 Slabchenko Olesia, Valeriy Sydorenko, Xavier Siebert
This work is licensed under a Creative Commons Attribution 4.0 International License.