LEE- Büyük Veri ve İş Analitiği Lisansüstü Programı (LEE Graduate Program in Big Data and Business Analytics)
Browsing LEE- Büyük Veri ve İş Analitiği Lisansüstü Programı by author "Ergün, Mehmet Ali"
-
Item: Comparative evaluation of unsupervised fraud detection algorithms with feature extraction and scaling in purchasing domain (Graduate School, 2024-08-21)
Taşoğlu, Yiğit Can ; Ergün, Mehmet Ali ; 528211079 ; Big Data and Business Analytics

The main aim of the research is to evaluate and compare various unsupervised outlier detection methods that do not require labeled data, making them suitable for real-world purchasing data where labels are often unavailable. The thesis highlights the challenges of fraud detection in large datasets, particularly in industries such as finance and purchasing, where fraudulent activities can cause significant financial losses if not identified early. The motivation behind the research lies in the limitations of traditional, rule-based detection methods, which often fail to capture complex fraud patterns. Unsupervised algorithms, which detect anomalies by their deviation from the general behavior of the dataset, offer a proactive approach to fraud detection by identifying previously unseen fraud patterns.

This study applies distance-based, machine-learning-based, and feature-based models, and focuses on enhancing them through feature extraction and scaling techniques. The thesis evaluates several algorithms, such as Local Outlier Factor (LOF), DBSCAN, and Isolation Forest, using performance metrics including accuracy, precision, recall, and F1 score. LOF was identified as the most effective model, achieving the highest accuracy and demonstrating a robust ability to detect irregular patterns in the purchasing data. However, the effectiveness of all algorithms was significantly enhanced by data transformations, particularly scaling. Scaling ensures that features with differing magnitudes, such as quantities and prices, do not distort the results, allowing for more accurate anomaly detection. Feature extraction is equally important, as it helps uncover intricate relationships between data points: extracted features such as the frequency of purchase orders, vendor categories, and purchase amounts provide deeper insight into potential fraud indicators.

Additionally, the study recognizes that integrating multiple models can offset the limitations inherent in individual algorithms, creating a more comprehensive fraud detection framework. By combining different unsupervised methods and leveraging feature extraction, the research offers a more adaptive and reliable approach to identifying fraudulent activities. In conclusion, this study demonstrates that employing a combination of unsupervised outlier detection methods, along with appropriate data preprocessing, significantly improves fraud detection in purchasing systems. These methods not only enhance accuracy but also help businesses reduce financial risks and improve operational efficiency, supporting a more secure and effective fraud prevention strategy.
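As a hedged illustration of the pipeline this abstract describes (not the thesis's actual code or data), the sketch below scales toy purchasing features and flags outliers with scikit-learn's Local Outlier Factor. The feature names, injected anomalies, and parameters are all assumptions made for the sketch.

```python
# Minimal sketch: scale extracted purchasing features, then flag
# outliers with LOF. Feature names and values are illustrative.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# toy features: [order frequency, vendor category id, purchase amount]
X = rng.normal(loc=[12, 3, 1_000], scale=[4, 1, 300], size=(500, 3))
X[:5, 2] *= 25  # inject a few abnormally large purchase amounts

# Scaling keeps the large-magnitude 'amount' feature from dominating
# the distance computations, as the abstract emphasizes.
X_scaled = StandardScaler().fit_transform(X)

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X_scaled)  # -1 = outlier, 1 = inlier
print("flagged rows:", np.where(labels == -1)[0])
```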
-
Item: New proposed methods for synthetic minority over-sampling technique (Graduate School, 2024-08-21)
Korul, Hakan ; Ergün, Mehmet Ali ; 528211075 ; Big Data and Business Analytics

Machine Learning (ML), a branch of artificial intelligence (AI), aims to mimic human learning processes using data and algorithms. Machine learning algorithms are trained on data, which can be labeled or unlabeled, with the aim of finding patterns by considering each data variable and its corresponding label. Algorithms trained on data predict the labels of future data based on patterns learned from previously observed variables. Each algorithm has an error function that evaluates its performance on the dataset, and training iteratively seeks parameter values that achieve a lower error than those obtained previously.

The proliferation of ML algorithms across various fields in the literature has led to the resolution of many challenging problems. These problems can generally be categorized as supervised and unsupervised learning. Supervised learning, where data labels are known, can be further divided into regression and classification. Regression algorithms are used for problems such as predicting stock prices, predicting house prices, and forecasting weather conditions, while classification algorithms are employed for tasks like customer churn prediction, email spam detection, and image recognition. Problems where labels are not available fall under unsupervised learning, which typically includes tasks like customer segmentation.

Training machines properly and achieving high performance relies heavily on data quality. Ideally, the data should be clean, balanced, and representative of real-world conditions; dataset size is also crucial, as larger datasets usually lead to better results. For classification problems in particular, a balanced class distribution is vital. If the class distribution is imbalanced, the result is what is known as an imbalanced dataset, and training on it rather than on balanced classes can produce noticeable differences in performance. For instance, in a balanced problem where 4 out of 10 customers churn, meaning almost a 50% probability, even a model with random guessing can make successful predictions. However, if only 1 out of 100 customers churn, random guessing becomes insufficient, posing an extremely challenging problem for machine learning models.

Various methods have been developed to address this issue. In the example above, where 99 customers do not churn and 1 customer churns, the classes are referred to as the majority and minority classes. The remedies fall into three categories: increasing the number of minority class samples to approach the number of majority class samples (oversampling), decreasing the number of majority class samples to approach the number of minority class samples (undersampling), or using a combination of both methods (hybrid sampling).
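To ground the point about why imbalance is hard, here is a tiny hedged illustration (toy numbers, scikit-learn metrics) of how plain accuracy misleads at a 99:1 class ratio:

```python
# A model that always predicts "no churn" scores 99% accuracy on a
# 99:1 dataset while never catching a single churner.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 99 + [1])      # 99 non-churners, 1 churner
y_pred = np.zeros(100, dtype=int)      # degenerate "always majority" model
print(accuracy_score(y_true, y_pred))                  # 0.99
print(f1_score(y_true, y_pred, zero_division=0))       # 0.0 -- minority ignored
```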
Directly comparing oversampling and undersampling methods can be quite challenging, as they may perform differently on different datasets. However, certain characteristics of the dataset can help expedite the selection of the appropriate method. For example, if the dataset is not very large, undersampling may be unsuitable because it would discard majority class samples and leave a significantly smaller dataset; in such cases, oversampling methods that increase the overall dataset size should be preferred. Oversampling methods can be further divided into random oversampling and SMOTE. The most significant difference between the two lies in how synthetic data are generated: random oversampling duplicates existing minority class samples, whereas SMOTE generates synthetic points close to existing minority class samples. Since random oversampling introduces no new examples to the machine learning algorithm and can lead to overfitting, SMOTE methods generally yield better performance improvements.

In the SMOTE data generation stage, the k nearest neighbors of each minority sample are first determined. To generate a synthetic point, two examples are selected: the minority sample itself and a randomly chosen minority class example among its k nearest neighbors. The difference between the neighbor and the minority sample is calculated, multiplied by a random number between 0 and 1, and added back to the minority sample. This places the newly generated synthetic point between the minority sample and the randomly selected neighbor.

Following the introduction of SMOTE in the literature, efforts have focused on extending the method from various perspectives; Borderline SMOTE and K-Means SMOTE are two such proposals. Borderline SMOTE defines a border between the minority and majority classes and focuses synthetic data generation only on the minority samples on this borderline. K-Means SMOTE uses the K-means technique to divide the training set into k clusters, decides which clusters to use for oversampling, and then applies the SMOTE steps within those clusters.

In my research, I developed three new SMOTE methods. The first, Genetic SMOTE, focuses on feature alteration using the crossover logic of genetic algorithms between a sample and its neighbor: a random subset of the sample's features is selected and taken from the neighbor, while the remaining features are taken directly from the sample, producing new synthetic data of the same dimensionality. The second, Dual Borderline SMOTE, is inspired by Borderline SMOTE. Instead of the ratio of minority samples among the neighbors that Borderline SMOTE considers, which ranges between 0.5 and 1, I adjusted it to range between 0 and 1. The significant difference from Borderline SMOTE is that Borderline SMOTE does not consider whether the neighbors of a borderline minority sample are themselves on the border, whereas in this method both the sample and its neighbor must lie on the newly defined border (both generation rules are sketched below).
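To make the two generation rules concrete, here is a small sketch contrasting the standard SMOTE interpolation step described above with the crossover-style rule of Genetic SMOTE. The 50% crossover rate and the toy data are assumptions; this illustrates the stated rules, not the thesis implementation.

```python
# Two synthetic-sample rules (illustrative; details assumed).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

def smote_point(X_min, k=5):
    """Standard SMOTE: interpolate between a minority sample and one
    of its k nearest minority neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    i = rng.integers(len(X_min))
    j = rng.choice(idx[i][1:])                 # skip idx[i][0] (self)
    gap = rng.random()                         # uniform in [0, 1)
    return X_min[i] + gap * (X_min[j] - X_min[i])

def genetic_smote_point(sample, neighbor):
    """Genetic SMOTE crossover: a random subset of features is taken
    from the neighbor, the rest from the sample (50% rate assumed)."""
    mask = rng.random(sample.shape[0]) < 0.5
    child = sample.copy()
    child[mask] = neighbor[mask]
    return child

X_min = rng.normal(size=(20, 4))               # toy minority samples
print(smote_point(X_min))
print(genetic_smote_point(X_min[0], X_min[1]))
```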
The last method developed is Genetic Dual Borderline SMOTE, which combines the two. It applies the Dual Borderline SMOTE rules for selecting minority samples and their close minority neighbors, meaning that both the selected sample and its neighbor must lie on the defined borderline; but when generating synthetic data from the selected pair, it performs feature alteration as in the Genetic SMOTE steps rather than standard SMOTE interpolation.

In the final section, to compare the performance of the three developed methods with the three existing ones, 8 datasets and 4 machine learning algorithms were employed, so a total of 6 SMOTE methods were evaluated for each dataset and algorithm. During the comparison, various parameter combinations of the machine learning algorithms and SMOTE methods were tested for each dataset, and the parameters yielding the best F1 score were selected for each dataset, model, and SMOTE method. For performance measurement, the F1 score of the minority class was preferred over metrics like accuracy, which are unsuitable for imbalanced datasets, and a 5-fold cross-validation score was used to reduce randomness and produce a more reliable estimate. A total of 32 F1 scores (8 datasets × 4 machine learning algorithms) were obtained for each SMOTE method, and these scores were illustrated in figures both overall and per model. Additionally, the SMOTE methods were ranked from 1 to 6 for each dataset and algorithm, and the rankings were compared in the study.

The newly developed methods were observed to outperform the existing ones. In model-based analyses, while there is no significant difference between SMOTE methods for linear models, in ensemble methods in particular the Genetic Dual Borderline SMOTE method stands out noticeably from the others.
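As a hedged sketch of this evaluation protocol (toy data standing in for the 8 datasets, and only one of the 4 model families), the snippet below fits SMOTE inside each training fold via imbalanced-learn's pipeline and reports the 5-fold cross-validated F1 score of the minority class:

```python
# Oversample only the training folds, then score minority-class F1.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# toy imbalanced dataset: ~5% minority class
X, y = make_classification(n_samples=1_000, weights=[0.95, 0.05],
                           random_state=0)

# imblearn's Pipeline refits SMOTE inside each CV training fold,
# so no synthetic points leak into the validation folds.
pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("model", RandomForestClassifier(random_state=0))])

scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")  # F1 of class 1
print(scores.mean())
```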
-
Item: Overcoming payment behavior challenges: Classifying buy now pay later users with machine learning (Graduate School, 2024-08-08)
Özdoğan, Ömür ; Ergün, Mehmet Ali ; 528211073 ; Big Data and Business Analytics

When the economies of developing countries are examined closely, issues such as high debt rates, limited access to funding, and difficulties in accessing financing stand out. Turkey falls into the category of developing countries, and various steps have recently been taken to strengthen financial stability within the scope of financial tightening policies. These include increasing the risk weights of consumer loans, individual credit cards, and vehicle loans, and reducing or eliminating credit card installments in certain sectors. Consumers' access to finance is thus increasingly restricted. At this point, fintech companies offer alternatives to the restrictions of the traditional financial system. Fintech companies aim to offer low-cost, fast, and innovative solutions to their customers by using modern data analysis techniques and new-generation technologies such as AI, and they aim to reach not only users without access to the banking sector but also people with limited access or those seeking alternatives.

This study first discusses the development and future of fintech companies, explains the study's main product, Buy Now Pay Later (BNPL), in detail, and gives examples from global BNPL providers. Credit risk, and machine learning techniques applied to credit risk, are then reviewed. The analysis uses sample data from approximately 35,000 customers of a BNPL loan product developed by a London-based fintech company for one of Turkey's leading retail market chains. A multi-class problem was designed for the users, and popular machine learning methods were used to predict, across three classes, whether loans would be repaid on time. In the experimental phase, Random Forest, Extreme Gradient Boosting, and deep learning algorithms were tested, and performance tables for the models were prepared; accuracy, F1 score, the ROC curve, and the confusion matrix were the primary comparison metrics.

As a result, machine learning modeling was found to be quite advantageous for classifying credit risk. However, before deciding on a model, companies' strategies and policies should always be taken into consideration, along with the bad-loan risk they are willing to accept.
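For illustration only, a minimal sketch of this kind of three-class comparison on synthetic stand-in data follows; scikit-learn's GradientBoostingClassifier stands in for Extreme Gradient Boosting here, and the features bear no relation to the actual BNPL customer data.

```python
# Compare two model families on a toy 3-class repayment problem,
# reporting the confusion matrix and per-class precision/recall/F1.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_classes=3,
                           n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for model in (RandomForestClassifier(random_state=0),
              GradientBoostingClassifier(random_state=0)):
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    print(type(model).__name__)
    print(confusion_matrix(y_te, y_hat))
    print(classification_report(y_te, y_hat))
```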
-
Item: Predicting customers with higher probability to purchase in telecom industry (Graduate School, 2025-01-22)
Yıldırım, Güzide Nur ; Ergün, Mehmet Ali ; 528211085 ; Big Data and Business Analytics

Understanding customer insights, predicting customers' next steps, and making accurate recommendations have become significant issues for customer service organizations today. With the widespread use of technologies such as Big Data analytics and machine learning, the contribution of artificial intelligence to digital transformation in the industrial world has significantly affected the way businesses work. The use of AI technologies in business processes has improved the success of marketing strategies, business models, and sales statistics. Companies that provide customer services have units responsible for customer relationship management (CRM). The Big Data accumulated and analyzed by CRM units can be modeled with artificial intelligence algorithms for customer-centric use cases, and the outcomes of such an approach guide the customer strategies built by companies. This study focuses on a business-analytics-based understanding of customers with a high probability of purchasing, obtained by training models on CRM data used in the telecom industry.
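As a small illustrative sketch (synthetic data standing in for the telecom CRM dataset, and the model choice an assumption), one way to rank customers by predicted purchase probability:

```python
# Fit a classifier on CRM-style features, then rank customers by
# predicted purchase probability to pick targets for a campaign.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1_000, random_state=0)  # toy CRM data
model = GradientBoostingClassifier(random_state=0).fit(X, y)

proba = model.predict_proba(X)[:, 1]       # P(purchase) per customer
top = np.argsort(proba)[::-1][:100]        # top-100 prospects to target
print(proba[top[:5]])
```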
-
Item: Predicting stock prices in BIST: A reinforcement learning and sentiment analysis approach (Graduate School, 2024-08-08)
Eğe, Şeyma ; Ergün, Mehmet Ali ; 528211080 ; Big Data and Business Analytics

The stock exchange is an environment where financial products are bought and sold through intermediaries. The most traded financial instruments in financial markets are equity shares. Investors aim to enhance their profitability by investing in them, but, due to the high volatility of stock prices, equities are a high-risk financial product. For this reason, researchers from different fields have studied stock price prediction, and many methods have been developed over the years. In the literature, statistical methods such as AutoRegressive (AR), AutoRegressive Integrated Moving Average (ARIMA), AutoRegressive Conditional Heteroskedasticity (ARCH), and Generalized AutoRegressive Conditional Heteroskedasticity (GARCH) have been commonly used; researchers have noted that sudden price fluctuations or unexpected events can disrupt prediction performance when these methods are employed. More recently, numerous prediction models leveraging machine learning and deep learning algorithms have emerged, and in complex scenarios, datasets containing abrupt price changes and unexpected events have sometimes yielded improved results with them. Variables derived from fundamental and technical analyses have been commonly employed, along with sentiment-analysis-based variables; the overarching goal has been to capture the impact of geopolitical, global, and economic indicators on stock prices within these models.

In research on predicting stock prices on Borsa Istanbul, the existing literature is deficient, especially in combining methods such as reinforcement learning, technical indicators, and sentiment analysis in the same study. This study aims to address that gap by investigating predictive indicators and sentiment analysis together with various reinforcement learning techniques. The research involves predicting price changes for specific stocks traded on Borsa Istanbul using deep reinforcement learning techniques (DQN, DDQN, DDDQN), with a buy-and-hold strategy used as a benchmark for comparison. Shares of 8 companies traded on Borsa Istanbul were selected. The data consist of daily information for these stocks between January 2021 and March 2024, together with technical indicators calculated from it. In addition, sentiment analysis of the material event disclosures reported by the companies to the Public Disclosure Platform (KAP) was performed with a DistilRoBERTa model, and a separate dataset was prepared for each company's stock.

When the test period results are examined, profit and reward performance differ by stock. Notably, the DDQN algorithm stood out, securing rewards for six of the eight selected stocks. The results indicate that model performance varies significantly across stocks: while the DDQN and DDDQN models were successful in certain situations, the buy-and-hold strategy proved more effective for some stocks. This suggests that model-based strategies can outperform buy-and-hold under certain conditions, but their effectiveness depends largely on the specific stock in question. Additionally, the impact of the KAP sentiment score on model performance was evaluated: including the KAP score as an input generally enhanced the performance of the DQN, DDQN, and DDDQN models. The results obtained suggest directions for future studies.
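For illustration only, here is a minimal sketch of the kind of update these deep RL techniques use, written as a Double DQN (DDQN) training step with a discrete hold/buy/sell action space. Everything here (state dimension, network sizes, reward design, hyperparameters) is an assumption made for the sketch; the thesis's actual architectures and trading environment are not reproduced.

```python
# Minimal DDQN update sketch (assumed setup, not the thesis code).
# State: a vector of technical indicators plus a KAP sentiment score.
# Actions: 0 = hold, 1 = buy, 2 = sell.
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 16, 3, 0.99   # illustrative values

class QNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, N_ACTIONS),
        )

    def forward(self, x):
        return self.net(x)

online, target = QNet(), QNet()
target.load_state_dict(online.state_dict())
opt = torch.optim.Adam(online.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)               # (s, a, r, s2, done) tuples

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    s, a, r, s2, done = zip(*random.sample(replay, batch_size))
    s, s2 = torch.stack(s), torch.stack(s2)  # states stored as tensors
    a = torch.tensor(a)
    r = torch.tensor(r, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    q = online(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # DDQN: the online net selects the next action and the target
        # net evaluates it, reducing vanilla DQN's overestimation bias.
        best = online(s2).argmax(dim=1, keepdim=True)
        y = r + GAMMA * (1 - done) * target(s2).gather(1, best).squeeze(1)

    loss = nn.functional.mse_loss(q, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

DDDQN (dueling double DQN) would additionally split the network head into separate state-value and advantage streams; the update rule above stays the same.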