Method of outliers removal based on the weighted training samples of w-objects
DOI:
https://doi.org/10.15587/1729-4061.2014.24331
Keywords:
training sample, data filtering, outlier, w-object, decision rule, generating set
Abstract
This paper considers the problem of preprocessing training samples to improve the efficiency of trainable recognition systems. A new method of outlier removal, based on constructing weighted reduced samples of w-objects, is proposed. It builds on the wGridDC method, which constructs a weighted sample of w-objects by superimposing a grid on the feature space and forming the weighted objects of the new sample from the contents of the grid cells.
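The grid idea described above can be illustrated with a minimal Python sketch. The abstract does not give the details of wGridDC, so everything here is an assumption made for illustration: the uniform grid with `bins` intervals per feature, the use of the cell centroid as the position of a w-object, the use of the object count as its weight, and the hypothetical function name `build_w_objects`.

```python
from collections import defaultdict


def build_w_objects(points, labels, bins=4):
    """Replace the objects of each (grid cell, class) pair by one weighted object.

    Assumption: a w-object is (centroid, label, weight), where weight is the
    number of original objects it replaces. This is not the published wGridDC.
    """
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]

    def cell_of(p):
        # Map each coordinate to a grid interval index in [0, bins - 1].
        idx = []
        for d in range(dims):
            span = highs[d] - lows[d]
            if span == 0:
                idx.append(0)
            else:
                i = int((p[d] - lows[d]) / span * bins)
                idx.append(min(i, bins - 1))  # keep the maximum inside the last cell
        return tuple(idx)

    # Group the original objects by grid cell and class label.
    groups = defaultdict(list)
    for p, y in zip(points, labels):
        groups[(cell_of(p), y)].append(p)

    w_objects = []
    for (cell, y), members in sorted(groups.items()):
        n = len(members)
        centroid = tuple(sum(m[d] for m in members) / n for d in range(dims))
        w_objects.append((centroid, y, n))
    return w_objects


# Toy data: three nearby objects of class "a" and one object of class "b".
sample_points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (1.0, 1.0)]
sample_labels = ["a", "a", "a", "b"]
reduced = build_w_objects(sample_points, sample_labels, bins=2)
```

In this toy run the four original objects collapse into two w-objects, and the weights sum to the size of the original sample, so the reduced sample still records how densely each cell was populated.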
Within the proposed method, two outlier removal algorithms are developed. The first constructs the weighted training sample of w-objects while simultaneously removing outliers at a given, user-defined filtering threshold; it is intended for tasks that require not only filtering the original data but also controlling the size of the sample. The second constructs the weighted training sample of w-objects while simultaneously removing outliers with automatic detection of the filtering threshold; it is intended for tasks that require samples providing the highest efficiency of the system.
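The two filtering modes can be sketched roughly as follows. This is an illustrative guess, not the algorithms from the paper: it assumes that a w-object is a `(centroid, label, weight)` tuple, that outliers are w-objects whose weight falls below the threshold, and that "efficiency" is measured as the accuracy of a plain nearest-centroid rule on a held-out validation set. The function names are hypothetical.

```python
def filter_w_objects(w_objects, threshold):
    """Threshold mode: keep only w-objects whose weight reaches the threshold."""
    return [w for w in w_objects if w[2] >= threshold]


def nearest_label(w_objects, x):
    """Stand-in decision rule: label of the w-object with the closest centroid."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(w_objects, key=lambda w: sq_dist(w[0]))[1]


def accuracy(w_objects, points, labels):
    """Share of validation objects classified correctly (0.0 for an empty sample)."""
    if not w_objects:
        return 0.0
    hits = sum(nearest_label(w_objects, p) == y for p, y in zip(points, labels))
    return hits / len(points)


def select_threshold(w_objects, val_points, val_labels):
    """Automatic mode: try every distinct weight as a threshold, keep the best."""
    candidates = sorted({w[2] for w in w_objects})
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        acc = accuracy(filter_w_objects(w_objects, t), val_points, val_labels)
        if acc > best_acc:  # ties keep the smaller threshold (larger sample)
            best_t, best_acc = t, acc
    return best_t, best_acc


# Toy sample: a lone low-weight "b" object sits inside the "a" region.
toy_w_objects = [((0.1, 0.1), "a", 5), ((0.9, 0.9), "b", 4), ((0.2, 0.15), "b", 1)]
val_points = [(0.18, 0.15), (0.05, 0.1), (0.95, 0.9)]
val_labels = ["a", "a", "b"]

kept = filter_w_objects(toy_w_objects, 2)
best_threshold, best_accuracy = select_threshold(toy_w_objects, val_points, val_labels)
```

With a user-defined threshold the caller trades sample size against filtering strength directly, while the automatic mode spends extra evaluations to pick the threshold that scores best on the validation data.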
An analysis of the effectiveness of the proposed method showed that the main advantage of the threshold filtering algorithm is the ability to control the size of the sample. The main advantage of the non-threshold filtering algorithm is the ability to automatically select the filtering threshold value that provides the greatest efficiency of the recognition system as a whole. Thus, the proposed method and both of its constituent algorithms make it possible to obtain samples that provide high efficiency of trainable recognition systems.
References
- Larose, D. T. (2005). Discovering knowledge in data: an introduction to data mining. New Jersey: John Wiley & Sons Inc., 240.
- Giudici, P. (2003). Applied data mining: statistical methods for business and industry. Chichester: John Wiley & Sons Inc., 380.
- Last, M., Klein, Y., Kandel, A. (2000). Knowledge discovery in time series databases. IEEE Transactions on Systems, Man, and Cybernetics, 60–69.
- Pal, S. K., Mitra, P. (2004). Pattern Recognition Algorithms for Data Mining: Scalability, Knowledge Discovery and Soft Granular Computing. Chapman and Hall/CRC, 280.
- Dyulicheva, Yu. Yu. (2006). About Filtering Problems of Training Sample. Artificial Intelligence, 2, 65–71.
- John, G. H. (1995). Robust Decision Trees: Removing Outliers from Databases. Knowledge Discovery and Data Mining, 174–179.
- Zagoruiko, N. G., Borisova, I. A., Dyubanov, V. V., Kutnenko, O. A. (2008). Methods of Recognition Based on the Function of Rival Similarity. Pattern Recognition and Image Analysis, 18 (1), 1–6.
- Volchenko, E. V. (2011). Development of theoretical principles and methods of realization the open trained system of automatic recognition: methods of optimization the training samples and methods of construction the weighted decision rules of classification. Technical Report 0111U007107, 67.
- Volchenko, E. V. (2011). Grid approach to the construction of weighted training samples of w-objects in adaptive recognition systems. Herald of the National Technical University "KhPI". Subject issue: Information Science and Modelling, 36, 12–22.
- Volchenko, E. V. (2012). Method for determining the proximity of objects of weighted training samples. Herald of the National Technical University "KhPI". Subject issue: Information Science and Modelling, 38, 38–45.
License
Copyright (c) 2014 Elena Vladimirovna Volchenko
This work is licensed under a Creative Commons Attribution 4.0 International License.
The assignment of copyright and the conditions for its transfer (identification of authorship) are set out in the License Agreement. In particular, the authors retain authorship of their manuscript and grant the journal the right of first publication of the work under the terms of the Creative Commons CC BY license. They also have the right to enter independently into additional agreements for the non-exclusive distribution of the work in the form in which it was published by this journal, provided that a link to the first publication of the article in this journal is preserved.
A license agreement is a document in which the author warrants that he/she owns all copyright for the work (manuscript, article, etc.).
Authors who sign the License Agreement with TECHNOLOGY CENTER PC retain all rights to the further use of their work, provided that they link to the edition of the journal in which the work was published.
According to the terms of the License Agreement, the publisher TECHNOLOGY CENTER PC does not take away the authors' copyright; it receives the authors' permission to use and disseminate the publication through the world's scientific resources (its own electronic resources, scientometric databases, repositories, libraries, etc.).
If the License Agreement is not signed, or if it lacks identifiers that make it possible to identify the author, the editors have no right to work with the manuscript.
It is important to remember that there is another type of agreement between authors and publishers – when copyright is transferred from the authors to the publisher. In this case, the authors lose ownership of their work and may not use it in any way.