LEE - Big Data and Business Analytics Graduate Program
Browsing LEE - Big Data and Business Analytics Graduate Program by subject "büyük veri" (big data)
-
Comperative evaluation of unsupervised fraud detection algorithms with feature extraction and scaling in purchasing domain
(Graduate School, 2024-08-21) Taşoğlu, Yiğit Can ; Ergün, Mehmet Ali ; 528211079 ; Big Data and Business Analytics
The main aim of the research is to evaluate and compare various unsupervised outlier detection methods that do not require labeled data, making them suitable for real-world purchasing data where labels are often unavailable. The thesis highlights the challenges of fraud detection in large datasets, particularly in industries like finance and purchasing, where fraudulent activities can cause significant financial losses if not identified early. The motivation behind the research lies in the limitations of traditional, rule-based detection methods, which often fail to capture complex fraud patterns. Unsupervised algorithms, which detect anomalies based on their deviation from the general behavior of the dataset, offer a proactive approach to fraud detection by identifying previously unseen fraud patterns. This study applies various methods, including distance-based, machine learning-based, and feature-based models, and focuses on enhancing these models through feature extraction and scaling techniques. The thesis evaluates several algorithms, such as Local Outlier Factor (LOF), DBSCAN, and Isolation Forest, using performance metrics like accuracy, precision, recall, and F1 score. LOF was identified as the most effective model, achieving the highest accuracy and demonstrating a robust ability to detect irregular patterns in the purchasing data. However, the effectiveness of all algorithms was significantly enhanced by data transformations, particularly scaling. Scaling ensures that features with differing magnitudes, such as quantities and prices, do not distort the results, allowing for more accurate anomaly detection. The importance of feature extraction is also emphasized, as it helps identify intricate patterns between data points.
Extracted features, such as the frequency of purchase orders, vendor categories, and purchase amounts, provide deeper insights into potential fraud indicators. Additionally, the study recognizes that integrating multiple models can reduce the limitations inherent in individual algorithms, creating a more comprehensive fraud detection framework. By combining different unsupervised methods and leveraging feature extraction, the research offers a more adaptive and reliable approach to identifying fraudulent activities. In conclusion, this study shows that employing a combination of unsupervised outlier detection methods, along with appropriate data preprocessing techniques, significantly improves fraud detection in purchasing systems. These methods not only enhance accuracy but also help businesses reduce financial risks and improve operational efficiency, supporting a more secure and effective fraud prevention strategy.
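As a rough illustration of why scaling matters for distance-based outlier detection, the sketch below scores each purchase order by its mean distance to its two nearest neighbours. This is a deliberately simpler stand-in for LOF, not the method used in the thesis, and the `orders` data, the helper names, and the choice of k = 2 are all hypothetical.

```python
import math
from statistics import mean, pstdev

# Hypothetical purchase orders as (quantity, unit price); the last row has an
# unusual quantity but an ordinary price.
orders = [
    (1, 1000), (2, 1500), (3, 2000), (4, 2500), (5, 3000),
    (1, 3500), (2, 4000), (3, 4500), (4, 5000), (40, 3000),
]

def standardize(rows):
    """Scale every column to zero mean and unit (population) standard deviation.

    Assumes no column is constant, otherwise the division would fail.
    """
    stats = [(mean(col), pstdev(col)) for col in zip(*rows)]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]

def knn_scores(rows, k=2):
    """Outlier score = mean Euclidean distance to the k nearest other rows."""
    scores = []
    for i, a in enumerate(rows):
        dists = sorted(math.dist(a, b) for j, b in enumerate(rows) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

def top_outlier(rows, k=2):
    """Index of the row with the highest outlier score."""
    scores = knn_scores(rows, k)
    return max(range(len(scores)), key=scores.__getitem__)

# On raw values the price column dominates every distance, so the order with
# quantity 40 does not rank first; after scaling it does.
raw_top = top_outlier(orders)
scaled_top = top_outlier(standardize(orders))
```

With this toy data, the unusual order only ranks first once both columns are on the same scale, which mirrors the abstract's point that quantities and prices should not be compared on their raw magnitudes.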
-
Customer lifetime value prediction and segmentation analysis for commercial customers in the banking industry
(Graduate School, 2024-08-12) Bakır Tartar, Feyza ; Tuna, Süha ; 528221082 ; Big Data and Business Analytics
In Türkiye, the banking sector plays a pivotal role in the growth of the national economy and the maintenance of financial stability. Understanding and evaluating the behavior of corporate customers in the banking sector requires an accurate and comprehensive analysis. Customer Lifetime Value is a critical metric for understanding customer attitudes, so its accurate calculation is crucial. In this thesis, data on corporate customers from a company operating as a participation bank in Türkiye were used to create Customer Lifetime Value scores using various analytical techniques. In the first stage of the study, data from the past two years for each customer were collected from relevant databases and organized on a quarterly basis to calculate Customer Lifetime Value scores. The dataset was checked for missing values, inconsistencies, and errors. Outliers were identified using the z-score method during the data-cleaning process: data points with an absolute z-score greater than 3 were considered outliers. This threshold is a commonly used rule of thumb for approximately normally distributed data and helps minimize the impact of extreme values. Such outliers could distort the overall structure of the dataset and negatively affect model performance; thus, they were removed from the dataset. Before proceeding to the modeling phase, standard scaling was applied to standardize the data. Scaling is a preprocessing step performed to eliminate problems arising from data with different units of measurement and to ensure that all data are on the same scale. Following the scaling process, Customer Lifetime Value was predicted using five distinct machine learning algorithms.
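The cleaning and scaling steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the thesis's pipeline: the `balances` values are invented, the function names are hypothetical, and the population standard deviation is assumed for the z-score.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standard scaling: subtract the mean, divide by the population std dev."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def drop_outliers(values, threshold=3.0):
    """Keep only the values whose absolute z-score is at most `threshold`."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) <= threshold]

# Hypothetical quarterly figures: 19 ordinary values and one extreme value.
balances = list(range(90, 109)) + [1000]

cleaned = drop_outliers(balances)   # the extreme value is removed
scaled = z_scores(cleaned)          # zero mean, unit standard deviation
```

Note that the |z| > 3 rule needs a reasonably large sample to trigger at all: with the population standard deviation, no point in a sample of ten or fewer values can reach an absolute z-score of 3.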
In this process, the outcomes of the Random Forest, Light Gradient-Boosting Machine, Extreme Gradient Boosting, Elastic-Net, and Linear Regression algorithms were assessed. To assess the model results, the values of Mean Absolute Error, Mean Squared Error, Root Mean Squared Error, R², and Adjusted R² were used. These metrics were employed to evaluate how well the model predictions aligned with the target variable. Upon analyzing the results, the highest R² score was found to be 0.55 with the Random Forest algorithm, with Light Gradient-Boosting Machine being the second-best algorithm. Following the evaluation of the results, parameter optimization was performed using the Grid Search Cross Validation method to enhance model performance. Grid Search Cross Validation is a technique that explores all possible combinations within a specified parameter set to identify the hyperparameters that deliver the best performance. After parameter optimization, the highest R² value was 0.68 with the Random Forest algorithm, and the second-highest R² value was 0.56 with the Extreme Gradient Boosting algorithm. In this study, the results of the Random Forest algorithm, which provided the highest prediction accuracy, were adopted as the basis for the clustering analysis. The K-means clustering algorithm was used to partition the data into meaningful clusters. Graphical analysis using the elbow method and a detailed examination of cluster properties led to the conclusion that five clusters best represented the customer segments. The aim is to increase satisfaction and loyalty among high-value customers by offering special deals, and to activate low-value customers through incentives and promotions.
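For reference, the five evaluation metrics named above can be computed as in the following sketch. The function name, the toy `y_true` and `y_pred` values, and the single-feature assumption are hypothetical and unrelated to the thesis data; the adjusted R² uses the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE, R2 and adjusted R2 for paired predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    y_mean = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)    # total sum of squares
    r2 = 1 - ss_res / ss_tot
    # Adjusted R2 penalises extra features; requires n > n_features + 1.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse),
            "r2": r2, "adj_r2": adj_r2}

# Toy values for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 10.0]
metrics = regression_metrics(y_true, y_pred, n_features=1)
```

Because adjusted R² shrinks as features are added without improving the fit, it is the more conservative of the two R² figures when comparing models with different numbers of inputs.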
-
Predicting customers with higher probability to purchase in telecom industry
(Graduate School, 2025-01-22) Yıldırım, Güzide Nur ; Ergün, Mehmet Ali ; 528211085 ; Big Data and Business Analytics
Understanding customer insights, predicting customers' next steps, and making accurate recommendations have become significant issues for customer service organizations. With the widespread use of technologies such as Big Data Analytics and Machine Learning, the contribution of Artificial Intelligence to digital transformation in industry has significantly affected the way businesses work. The use of Artificial Intelligence technologies in business processes has increased the success rate of marketing strategies, business models, and sales. Companies that provide customer services have units responsible for CRM, also known as customer relationship management. Big Data accumulated and analyzed by CRM units can be modeled with artificial intelligence algorithms for customer-centric use cases. The outcomes of such an approach then guide the customer strategies built by companies. This article presents a business-analytics-based study that identifies customers with a high probability of purchasing by training models on CRM data used in the telecom industry.