Using clustering and a hybrid approach to fill in missing numeric values

Subject areas: electrical and computer engineering

1 - Shahid Rajaee Teacher Training University

Received: 1396/09/08 Accepted: 1396/09/08 Published: 1396/09/07

Keywords: regression, missing values, nearest neighbors, correlation

Article abstract:

Estimating missing values is an important step in data preprocessing. This paper presents a two-step approach for filling in missing numeric values. In the first step, the data are clustered; in the second step, the missing values within each cluster are estimated using a hybrid of weighted k-nearest neighbors and linear regression. The correlation between attributes in each cluster is used to decide which method fills the missing data. The quality of the filled-in values is measured with the mean squared error criterion. The effect of different parameters on the error of the estimated data is examined. The performance of the proposed method is also evaluated on five datasets. Finally, the proposed method is compared with four methods: filling with the mean value, estimation with a multi-layer perceptron (MLP) neural network, filling with fuzzy c-means clustering, and class-based k-clusters nearest neighbor imputation (CKNNI). The results show that the estimation error of the proposed method is lower than that of the other compared methods.

English abstract:

Estimation of missing values is an important step in data preprocessing. In this paper, a two-step approach is proposed to fill in missing numeric values. In the first step, the data are clustered. In the second step, the missing data in each cluster are estimated using a combination of weighted k-nearest neighbors and linear regression. A correlation measure is used to determine the appropriate method for filling the missing data in each cluster. The quality of the estimated values is evaluated using the root mean squared error (RMSE) criterion. The effect of different input parameters on the error of the estimated values is investigated. Moreover, the performance of the proposed method is evaluated on five datasets. Finally, the proposed method is compared with four other estimation methods: mean estimation, multi-layer perceptron (MLP) based estimation, fuzzy c-means (FCM) based approximation, and class-based k-clusters nearest neighbor imputation (CKNNI). Experimental results show that, in most cases, the proposed method produces less error than the compared methods.
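The two steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the clustering algorithm, the correlation threshold, or the direction of the selection rule, so plain k-means on a mean-filled copy, the `corr_threshold` parameter, and the rule "strong correlation uses regression, otherwise weighted kNN" are all assumptions, as are the function names `impute`, `estimate`, and `rmse`.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either side has no variance."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def kmeans(points, k, iters=20):
    """Plain k-means (assumed clustering step); needs len(points) >= k."""
    step = max(1, len(points) // k)
    centroids = [list(points[i * step]) for i in range(k)]  # deterministic init
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [mean(col) for col in zip(*members)]
    return labels

def estimate(row, j, donors, k_neighbors, corr_threshold, fallback):
    """Estimate row[j] from same-cluster donors: regression or weighted kNN."""
    observed = [a for a, v in enumerate(row) if v is not None and a != j]
    # Keep donors that are observed on every attribute the target row has.
    donors = [d for d in donors if all(d[a] is not None for a in observed)]
    if not donors or not observed:
        return fallback
    ys = [d[j] for d in donors]
    # Attribute with the strongest linear relationship to attribute j.
    corr, best = max((abs(pearson([d[a] for d in donors], ys)), a)
                     for a in observed)
    if corr >= corr_threshold:  # assumed rule: strong correlation -> regression
        xs = [d[best] for d in donors]
        mx, my = mean(xs), mean(ys)
        slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                 / sum((x - mx) ** 2 for x in xs))
        return my + slope * (row[best] - mx)

    def dist(d):
        return math.sqrt(sum((d[a] - row[a]) ** 2 for a in observed))

    # Otherwise: inverse-distance weighted average of the k nearest donors.
    nearest = sorted(donors, key=dist)[:k_neighbors]
    weights = [1.0 / (dist(d) + 1e-9) for d in nearest]
    return sum(w * d[j] for w, d in zip(weights, nearest)) / sum(weights)

def impute(rows, k_clusters=2, k_neighbors=3, corr_threshold=0.8):
    """Two-step imputation; missing cells are marked with None."""
    n_cols = len(rows[0])
    # Assumption: cluster on a mean-filled copy so every row has a position.
    col_means = [mean([r[j] for r in rows if r[j] is not None])
                 for j in range(n_cols)]
    filled = [[col_means[j] if v is None else v for j, v in enumerate(r)]
              for r in rows]
    labels = kmeans(filled, k_clusters)
    result = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is None:
                donors = [r for r, lab in zip(rows, labels)
                          if lab == labels[i] and r[j] is not None]
                result[i][j] = estimate(row, j, donors, k_neighbors,
                                        corr_threshold, col_means[j])
    return result

def rmse(estimates, truths):
    """Root mean squared error between estimated and true values."""
    return math.sqrt(mean([(e - t) ** 2 for e, t in zip(estimates, truths)]))

if __name__ == "__main__":
    data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, None],
            [101.0, 50.0], [102.0, 49.0], [103.0, 51.0], [104.0, None]]
    for filled_row in impute(data):
        print(filled_row)
```

In this toy input the first group has a perfectly linear relationship between the two attributes, so its missing cell is filled by regression, while the second group is only weakly correlated and falls back to the weighted nearest-neighbor average. Evaluating with `rmse` requires hiding known values first and comparing the estimates against them.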
