Anomaly detection with machine learning algorithms in the insurance industry from a strategic management perspective

Date
2020-06-15
Authors
Şahan, Ayşe Nurbanu
Journal Title
Journal ISSN
Volume Title
Publisher
Institute of Science and Technology
Abstract
Fraudulent activities in the insurance industry cause great losses to insurance companies. Automobile and property insurance account for 76% of the total loss costs of the main contracted services offered in the insurance industry. A review of the literature shows that fraud detection in property insurance has been largely neglected, while studies have predominantly been carried out on fraud detection in automobile insurance. Studies aimed at detecting fraud in large insurance data sets make use of data mining techniques. One of the most common techniques used to find fraudulent records is anomaly detection, which aims to identify outliers or anomalies deviating from usual patterns. The purpose of this thesis is to examine, using machine learning algorithms and anomaly detection methods, whether policies with high losses in the field of property insurance can be detected in advance at the underwriting stage. Within the scope of the study, unlabeled property insurance data obtained from an insurance company were examined. Since labeled data were not available, unsupervised learning techniques were used at the anomaly detection stage in line with the anomaly scenarios created, and data labels were obtained. Studies in the literature show that loss ratio values have an effect on fraud detection. While creating the anomaly scenario within the scope of the thesis, provinces and districts with high loss ratios and other variables supporting the detection of anomalies were taken into account. Anomaly detection was finalized by seeking a cluster that brings together observations differing in the variables of this anomaly scenario, and the elements of the cluster obtained with this approach were labeled as anomalies. With the obtained labels, supervised learning techniques were applied to build prediction models for classifying abnormal cases at the underwriting stage. Since the information entered at the underwriting stage is predominantly categorical, the classification algorithms examined were tree-based algorithms, which are known to perform well on data sets dense in categorical variables. Resampling methods were used to improve model performance in detecting anomalies. The Cross-Industry Standard Process for Data Mining (CRISP-DM) approach was adopted in structuring the study and determining the steps to be followed, and the R programming language was used. The study showed that high loss ratios are a distinguishing factor for anomalies. As a result of the study, the best-performing prediction algorithm achieved an F1 score of 0.92 on the data set balanced at a 60:40 ratio with the undersampling resampling approach.
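As an illustration of the loss ratio computation underlying the anomaly scenario, the following is a minimal sketch in base R; the data frame `policies` and its column names are hypothetical assumptions, not identifiers from the thesis.

    # A minimal sketch in base R; `policies` and its columns `province`,
    # `premium`, and `loss` are hypothetical names, not from the thesis.
    by_province <- aggregate(cbind(loss, premium) ~ province,
                             data = policies, FUN = sum)
    # Loss ratio = total incurred losses / total earned premiums
    by_province$loss_ratio <- by_province$loss / by_province$premium
    # Provinces above an illustrative threshold of 1.0 (losses exceed premiums)
    high_loss_provinces <- subset(by_province, loss_ratio > 1.0)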
The main purpose of insurance is to share losses that individuals cannot cover on their own with other insured people in a similar position. Individuals in a similar position come together within a company or institution, and the management of this process is carried out by the relevant institutions: insurance companies. The relevant parties are the individuals who request insurance from insurance companies, the insured, and the company providing the insurance, the insurer (underwriter). The insurance process involves decision making on both the insured's and the insurer's part to reach a policy agreement. First, the insured applies to the insurance company or a credit agency. The application is evaluated by the insurance company, which decides to approve or reject it. A range of premium amounts is offered for the policies approved by the insurance company. The insured pays the premium for the chosen policy, and the insurance contract (policy) is created. If the insured has an accident, the resulting damage is paid for by the insurance company according to the policy. More than a thousand companies operate in the insurance industry worldwide, and each year these companies collect over a trillion dollars in premiums. Automobile and property insurance account for 76% of the total loss costs of the main contracted services provided in the insurance industry. Fraud activities in the insurance industry cause great losses to insurance companies. Insurance fraud occurs when businesses or individuals make false claims in order to obtain compensation from their insurance companies. A review of the literature shows that fraud detection in property insurance has been largely neglected, while studies on fraud detection have predominantly been carried out on automobile insurance. Data mining techniques are used in studies carried out to detect fraud in large sets of insurance data. Anomaly detection is one of the most common techniques used to find fraudulent records; it aims to detect anomalies or outliers deviating from conventional patterns. Skewed data is one of the problems encountered in fraud detection studies, its main source being that abnormal and normal cases are not evenly distributed: in most of the studies examined, the percentage of observations identified as fraud remained below 30% of the total data set. Against this problem of imbalanced class distribution, some methods are used to artificially rebalance the classes. These include resampling methods applied as a preprocessing step before the training phase, such as sampling from the negative class, sampling from the positive class, or creating synthetic positive samples. Another problem encountered in fraud detection is that fraud/anomaly labels are not included in the data sets. In studies carried out to detect fraud in unlabeled data, specialists are employed to label the fraudulent cases, different data groups with similar qualities are combined with master data, or clustering approaches are used. Insurance companies generally carry out their fraud detection work by retrospectively classifying incoming claims; there are also studies that adopt the opposite approach.
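To make the resampling methods mentioned above concrete, the following is a minimal base-R sketch of random undersampling and oversampling; the `train` data frame, its `label` column, and the 60:40 target mix are assumptions for illustration, not the thesis's exact procedure.

    # A minimal base-R sketch; `train` and its binary `label` column
    # (1 = fraud/anomaly, 0 = normal) are hypothetical.
    set.seed(42)
    pos <- train[train$label == 1, ]
    neg <- train[train$label == 0, ]

    # Undersampling: sample from the negative (majority) class,
    # here down to a 60:40 negative-to-positive mix
    neg_down <- neg[sample(nrow(neg), round(1.5 * nrow(pos))), ]
    train_undersampled <- rbind(pos, neg_down)

    # Oversampling: sample the positive (minority) class with replacement
    pos_up <- pos[sample(nrow(pos), nrow(neg), replace = TRUE), ]
    train_oversampled <- rbind(pos_up, neg)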
These studies aim to detect fraud during the underwriting process. In order to clearly understand the business dimension of this study on insurance data, interviews were held with experts working in the insurance industry; sector reports, annual reports, and strategic plans of companies were examined; and studies in the field were surveyed in the literature. As a result of these interviews and reviews, comprehensive information was obtained on which parameters the analysis should focus on. In particular, the indicators preferred for tracking progress toward sector targets were determined, and these indicators were used to create the data set run through the model in the later stages of this study. The Cross-Industry Standard Process for Data Mining (CRISP-DM) approach was adopted in structuring the work and determining the steps to be followed. From a data science project management perspective, CRISP-DM divides the life cycle of a data mining project into six stages: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. This methodology, especially for beginners, guides the process model and how the work should progress, helps structure the project, and makes suggestions for each task involved in the process; a defined checklist prevents any step from being overlooked or omitted. The purpose of this thesis is to examine whether high loss ratios in the field of property insurance can be detected in advance during the underwriting step with anomaly detection methods using machine learning algorithms. Within the scope of the study, unlabeled fire insurance data obtained from insurance companies were examined. Due to the lack of labeled data, data labels were obtained at the anomaly detection stage by using unsupervised learning techniques in line with the anomaly scenarios created. While creating these anomaly scenarios, provinces and districts with high loss ratios and other extracted features supporting the detection of anomalies were taken into account. Anomaly detection was finalized by seeking a cluster that brings together observations differing in the variables of the anomaly scenario, and the elements of the cluster obtained with this approach were labeled as anomalies. With the obtained labels, supervised learning techniques were used to create prediction models for classifying anomalies during the underwriting phase. Since the information entered during the underwriting phase is predominantly categorical, the classification algorithms studied were tree-based algorithms, which are known for noteworthy performance on data sets dense in categorical variables. Resampling methods were used to improve model performance in detecting anomalies. The cluster representing anomalous cases was determined with the K-means clustering approach, as sketched below. Decision Tree, Bagging Tree, Random Forest, XGBoost, LightGBM, and CatBoost are the tree-based algorithms whose performances were evaluated for the model to be developed for detecting anomalies at policy creation. F1 scores were taken into consideration when comparing model performances. The study was designed in five stages to capture the best performance with the selected models.
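As a sketch of how unlabeled observations can be turned into anomaly labels with K-means, the following uses base R's kmeans(); the feature matrix `X`, the choice of two clusters, and the smallest-cluster labeling rule are illustrative assumptions rather than the thesis's exact procedure.

    # A minimal sketch; `X` is a hypothetical numeric matrix built from the
    # anomaly-scenario variables. Scaling keeps no single feature dominant.
    set.seed(42)
    km <- kmeans(scale(X), centers = 2, nstart = 25)

    # Illustrative rule: treat the smaller cluster as the anomaly cluster
    # and label its members as 1
    sizes <- table(km$cluster)
    anomaly_cluster <- as.integer(names(sizes)[which.min(sizes)])
    label <- as.integer(km$cluster == anomaly_cluster)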
First, models were run on the variables used during the underwriting phase. Here, in order to enable the model to respond more quickly, feature engineering was performed on multi-level categorical variables such as province and district: the defined 'same city' variable records whether the insured's city and the risk city are the same, and it was included in the model; the same was done for the district (see the sketch after this paragraph). In the second stage, variables derived to detect excessive increases in production and losses were included in the model. While extracting the features tracking these increases, the total production and loss quantities of one month and of one week before the relevant date were taken into consideration, with the aim of following sudden increases in production or losses. However, with these new features, model performance did not reach the desired levels. In the third stage, daily loss ratios were calculated for some categorical variables (e.g. agency-based daily loss ratios) to improve model performance; with these new variables, the rates at which policies produced within one day converted to losses were followed. Model performance still did not reach the desired levels. In the fourth stage, cross-validated hyperparameter optimization was performed to improve model performance. Considering the sensitivity (recall) values after hyperparameter optimization, the LightGBM algorithm showed the highest performance in detecting true anomalies as anomalies, while by F1 score CatBoost was the most suitable model for the study. However, detection percentages remained at approximately 30%. Since the desired performance levels could not be reached with the first four stages, the final stage was initiated: a resampling approach, chosen because in many studies in the literature imbalanced data are balanced by resampling and model performance improves as a result. The undersampling method was adopted from among the resampling approaches, and the data were rebalanced and studied again at different balance levels. A significant improvement was observed in the results obtained with this method. When the model outputs are analyzed with the same comparison parameters based on F1 scores, the highest-performing algorithms are Random Forest, CatBoost, and Decision Tree when the data set is balanced 50:50; XGBoost, Random Forest, and CatBoost at 60:40; Random Forest, XGBoost, and CatBoost at 70:30; and Random Forest, XGBoost, and Bagging at 80:20. Comparing the F1 scores of these algorithms across all the ratios selected to balance the data set, the highest performance was observed in the XGBoost algorithm at a ratio of 60:40. The most important variables determined by the XGBoost algorithm are taken as the factors that insurance companies should consider while examining whether requests received during the underwriting phase are anomalous and while developing strategies for these situations.
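The 'same city' flag and the agency-based daily loss ratio described above could be derived along the following lines; all column names here are hypothetical placeholders, not identifiers from the thesis.

    # A minimal sketch; `policies` with columns `insured_city`, `risk_city`,
    # `agency`, `policy_date`, `premium`, and `loss` is hypothetical.
    policies$same_city <- as.integer(policies$insured_city == policies$risk_city)

    # Agency-based daily loss ratio: losses / premiums among the policies
    # written by each agency on each day
    daily <- aggregate(cbind(loss, premium) ~ agency + policy_date,
                       data = policies, FUN = sum)
    daily$agency_daily_loss_ratio <- daily$loss / daily$premium
    policies <- merge(policies,
                      daily[, c("agency", "policy_date", "agency_daily_loss_ratio")],
                      by = c("agency", "policy_date"), all.x = TRUE)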
The issues identified for insurance companies to focus on are as follows: the region from which the policy request comes, whether there is a remarkable increase in the number of losses at the agency creating the policy, the daily loss conversion rate in the district from which the policy request comes, and the tariff under which the policy was created. In the study, it was observed that a high loss ratio is a distinguishing factor for anomalies. As a result of the study, the algorithm that performs best among the prediction models achieved an F1 score of 0.92 on the data set balanced with the undersampling approach at a ratio of 60:40. The R programming language was used to complete this study. Decision Tree, Bagging Tree, Random Forest, XGBoost, LightGBM, and CatBoost are the tree-based algorithms used in this study; artificial neural networks and deep learning methods are also recommended for future studies. While K-means was preferred at the clustering stage within the scope of this study, a density-based clustering approach can also be preferred in future work. In this study, classifications were made with a single threshold value in the ensemble methods; by trying a range of threshold values, it is possible to examine which threshold is most appropriate for the problem, and this is another area that can be examined in future studies.
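For completeness, the following is a minimal sketch of fitting a classifier with the R `xgboost` package, scoring it with the F1 measure used in the study, and extracting variable importances; the data objects and hyperparameters are illustrative assumptions, not the thesis's tuned values.

    # A minimal sketch; `X_train`, `y_train`, `X_test`, `y_test` are
    # hypothetical prepared matrices/vectors.
    library(xgboost)
    dtrain <- xgb.DMatrix(as.matrix(X_train), label = y_train)
    params <- list(objective = "binary:logistic", eval_metric = "logloss")
    bst <- xgb.train(params = params, data = dtrain, nrounds = 200)

    # Classify with a single 0.5 threshold, as in the ensemble methods above
    pred <- as.integer(predict(bst, xgb.DMatrix(as.matrix(X_test))) > 0.5)

    # F1 score, the comparison metric used in the study
    tp <- sum(pred == 1 & y_test == 1)
    precision <- tp / sum(pred == 1)
    recall <- tp / sum(y_test == 1)
    f1 <- 2 * precision * recall / (precision + recall)

    # Variable importance: candidate underwriting-time risk factors
    xgb.importance(model = bst)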
Description
Thesis (M.Sc.) -- İstanbul Technical University, Institute of Science and Technology, 2020
Keywords
Insurance industry, Machine learning
Citation