LEE - Büyük Veri ve İş Analitiği Yüksek Lisans (Big Data and Business Analytics Master's Program)
Optimizing backup crew planning in airlines using machine learning (Graduate School, 2024-07-31)
Baraçlı Düzcan, Şuheda ; Ulukuş, Mehmet Yasin ; 528211088 ; Data Engineering and Business Analytics

Due to its inherent nature, the aviation sector entails highly challenging and complicated operational processes. Many internal and external factors affect these processes, and each process step is optimized by addressing it as a separate challenge. Crew planning, which is a significant part of operations and ranks second in terms of expenses, plays a crucial role in maximizing revenue and minimizing costs. Crew planning in the aviation industry is complex, made even more challenging by external factors like weather changes, crowded airspace, and unexpected technical problems. Planning flights, scheduling crews, and matching them with each other are treated as separate, difficult problems. Additionally, ensuring efficient communication and coordination between these components further complicates the task. Taking precautions against disruptions in these processes is essential to prevent them from affecting the entire flight and operation process. Moreover, implementing contingency plans and backup strategies can help minimize the impact of unexpected events, ensuring smoother operations and maintaining overall efficiency in aviation management. Adaptability and quick response are vital in this dynamic environment. In these processes, where dynamism and fast problem-solving are critical, additional effort is required to manage the crew effectively. In this regard, reserve crew planning comes into play; it is essential for the smooth handling of unplanned flights and operational disruptions. As crew absence can impact the entire flight process, it is crucial to optimize the number of crew members to be kept on reserve each day. The optimization of the reserve crew is addressed using solution methods for the crew assignment problem. Using advanced analytics and machine learning, airlines can better understand operational trends and make smarter decisions on the fly. This research explores how these solutions can enhance reserve crew planning. Experiments were conducted on a classification problem using various machine learning models such as K-Nearest Neighbors (KNN), Random Forest, XGBoost, and Logistic Regression, and the performance of each model was compared using several evaluation criteria.
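The abstract does not include code, but a minimal sketch of the kind of classifier comparison it describes could look like the following. The CSV file, the feature table, and the target column `reserve_called` are hypothetical placeholders, all features are assumed numeric, and this is not the thesis implementation:

```python
# Minimal sketch of the classifier comparison described above (hypothetical data and columns).
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

df = pd.read_csv("reserve_crew_history.csv")                 # hypothetical feature table
X, y = df.drop(columns=["reserve_called"]), df["reserve_called"]

models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

for name, model in models.items():
    # F1 is more informative than accuracy when reserve call-ups are rare.
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```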
Overcoming payment behavior challenges: Classifying buy now pay later users with machine learning (Graduate School, 2024-08-08)
Özdoğan, Ömür ; Ergün, Mehmet Ali ; 528211073 ; Big Data and Business Analytics

When the economies of developing countries are examined closely, issues such as high debt rates, limited access to funding, and difficulties in accessing financing stand out. While Turkey falls in the category of developing countries, various steps have recently been taken to strengthen financial stability within the scope of financial tightening policies. These include measures such as increasing the risk weights of consumer loans, individual credit cards, and vehicle loans, and reducing or eliminating credit card installments in certain sectors. As a result, consumers' financial access continues to be increasingly restricted. At this point, Fintech companies offer alternatives to the restrictions of the traditional financial system. Fintech companies aim to offer low-cost, fast, and innovative solutions to their customers by using modern data analysis techniques and new-generation technologies such as AI. In addition to users who do not have access to the banking sector, these companies also aim to reach people whose access is limited or who are looking for alternatives. In this study, the development and future of Fintech companies are first discussed, the main product of the study, Buy Now Pay Later (BNPL), is explained in detail, and examples from global BNPL providers are given. Credit risk and machine learning techniques for credit risk are then covered. During the study, data analysis was performed using sample data from approximately 35,000 customers of the BNPL loan product developed by a London-based Fintech company for one of Turkey's leading retail market chains. A multi-class classification problem was designed for users, and popular machine learning methods were used to classify loans into three classes according to whether they are paid on time. During the experimental phase, Random Forest, Extreme Gradient Boosting, and Deep Learning algorithms were tested, and performance tables for the models were prepared. When comparing model performances, accuracy, F1-score, the ROC curve, and the confusion matrix were used as the priority metrics. As a result, it was determined that machine learning modeling is quite advantageous for classifying credit risk. However, before deciding on model selection, companies' strategies and policies should always be taken into consideration, and the level of bad-loan risk they are willing to take should be evaluated.
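A minimal sketch of the three-class repayment classification and the evaluation metrics mentioned above might look like this; the file name, feature columns, and the label column `repayment_class` are illustrative assumptions, not taken from the thesis:

```python
# Sketch of a three-class repayment classification and its evaluation (hypothetical data).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

df = pd.read_csv("bnpl_customers.csv")            # hypothetical ~35k-row customer table
X = df.drop(columns=["repayment_class"])          # repayment_class in {on_time, late, default}
y = df["repayment_class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=500, class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print(confusion_matrix(y_test, pred))             # per-class error structure
print(classification_report(y_test, pred))        # accuracy, per-class precision/recall/F1
```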
Predicting stock prices in bist: A reinforcement learning and sentimental analysis approach (Graduate School, 2024-08-08)
Eğe, Şeyma ; Ergün, Mehmet Ali ; 528211080 ; Big Data and Business Analytics

The stock exchange is an environment where the buying and selling of financial products takes place through intermediaries. In financial markets, the most traded financial instruments are equity shares. Investors aim to enhance their profitability by investing in this instrument, but due to the high volatility of stock prices, equity shares are a high-risk financial product. For this reason, researchers from different fields have studied stock price prediction, and many methods have been developed for this purpose over the years. In the literature, statistical methods such as AutoRegressive (AR), AutoRegressive Integrated Moving Average (ARIMA), AutoRegressive Conditional Heteroskedasticity (ARCH), and Generalized AutoRegressive Conditional Heteroskedasticity (GARCH) models have been commonly used. Researchers have noted that sudden price fluctuations or unexpected events can disrupt prediction performance when these methods are employed. Recently, numerous prediction models leveraging machine learning and deep learning algorithms have emerged. In complex scenarios, datasets containing abrupt price changes and unexpected events have sometimes yielded improved results with these algorithms. In the existing literature, variables derived from fundamental and technical analyses have commonly been employed, and sentiment-analysis-based variables have also been utilized. The overarching goal has been to capture the impact of geopolitical, global, and economic indicators on stock prices within these models. In research on predicting stock prices in the context of Borsa Istanbul, there are gaps in the existing literature, especially regarding the use of reinforcement learning, technical indicators, and sentiment analysis within the same study. This study aims to address this gap by investigating predictive indicators and sentiment analysis along with various reinforcement learning techniques. Our research involves predicting price changes for specific stocks traded on Borsa Istanbul using deep reinforcement learning techniques (DQN, DDQN, DDDQN). The performance of these techniques was evaluated, and a buy-and-hold strategy was used as a benchmark for comparison. For this study, shares of 8 companies traded on Borsa Istanbul were selected. Daily price information for these stocks between January 2021 and March 2024, together with technical indicators calculated from it, was used as data. In addition, sentiment analysis of the material event disclosures reported by the companies to the Public Disclosure Platform (KAP) was performed with a DistilRoBERTa model. A separate dataset was prepared for each company's stock. When the test-period results are examined, profit and reward performance differs on a stock basis. Notably, the DDQN algorithm stood out in securing rewards for six of the eight selected stocks. The results indicate that model performance varies significantly across different stocks. While the DDQN and DDDQN models were successful in certain situations, the buy-and-hold strategy proved more effective for some stocks.

This suggests that model-based strategies can outperform buy-and-hold under certain conditions, but their effectiveness is largely dependent on the specific stock in question. Additionally, the impact of the KAP sentiment score on model performance was evaluated. It was observed that including the KAP score as an input generally enhanced the performance of the DQN, DDQN, and DDDQN models. The results obtained have led to conclusions regarding future studies.
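As an illustration of the value/advantage decomposition behind the DDDQN (dueling double DQN) variant mentioned above, a minimal dueling Q-network and a double-DQN target could be sketched as follows. The state dimension, hidden size, and the three-action buy/hold/sell setup are assumptions made for illustration; this is not the thesis implementation:

```python
# Minimal sketch of a dueling Q-network with a double-DQN target (PyTorch).
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int = 3, hidden: int = 128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.feature(state)
        v, a = self.value(h), self.advantage(h)
        # Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)

def double_dqn_target(online, target, next_state, reward, done, gamma=0.99):
    # Double DQN: the online network chooses the action, the target network evaluates it.
    with torch.no_grad():
        best_action = online(next_state).argmax(dim=1, keepdim=True)
        next_q = target(next_state).gather(1, best_action).squeeze(1)
        return reward + gamma * (1.0 - done) * next_q
```

In such a setup, the state vector could concatenate daily technical indicators with the KAP sentiment score, which is how the abstract describes the added input.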
Customer lifetime value prediction and segmentation analysis for commercial customers in the banking industry (Graduate School, 2024-08-12)
Bakır Tartar, Feyza ; Tuna, Süha ; 528221082 ; Big Data and Business Analytics

In Türkiye, the banking sector plays a pivotal role in the growth of the national economy and the maintenance of financial stability. Understanding and evaluating the behavior of corporate customers in the banking sector requires accurate and comprehensive analysis. Customer Lifetime Value is a critical metric for understanding customer attitudes; therefore, its accurate calculation is crucial. In this thesis, data on corporate customers of a participation bank operating in Türkiye were used to create Customer Lifetime Value scores using various analytical techniques. In the first stage of the study, data from the past two years for each customer were collected from the relevant databases and organized on a quarterly basis to calculate Customer Lifetime Value scores. The dataset was checked for missing values, inconsistencies, and errors. Outliers were identified using the z-score method during the data-cleaning process: data points with an absolute z-score greater than 3 were considered outliers. This threshold is a commonly used rule when assessing whether data conform to a normal distribution and helps minimize the impact of extreme values. Such outliers are typically data points that could distort the overall structure of the dataset and negatively affect model performance; thus, they were removed from the dataset. Before the modeling phase, standard scaling was applied. Scaling is a preprocessing step performed to eliminate problems arising from features with different units of measurement and to ensure that all features are on the same scale. Following the scaling process, Customer Lifetime Value was predicted using five distinct machine learning algorithms: Random Forest, Light Gradient-Boosting Machine, Extreme Gradient Boosting, Elastic-Net, and Linear Regression. To assess the model results, Mean Absolute Error, Mean Squared Error, Root Mean Squared Error, R2, and Adjusted R2 were used. These metrics were employed to evaluate how well the model predictions aligned with the target variable. Upon analyzing the results, the highest R2 score, 0.55, was obtained with the Random Forest algorithm, with the Light Gradient-Boosting Machine being the second-best algorithm. Following the evaluation of the results, parameter optimization was performed using Grid Search Cross Validation to enhance model performance and achieve the best possible results. Grid Search Cross Validation is a technique that explores all possible combinations within a specified parameter set to identify the hyperparameters that deliver the best performance. After parameter optimization, the highest R2 value, 0.68, was obtained with the Random Forest algorithm, and the second-highest R2 value, 0.56, with the Extreme Gradient Boosting algorithm. The results of the Random Forest algorithm, which provided the highest prediction accuracy, were adopted as the basis for the clustering analysis. The K-means clustering algorithm was used to partition the data into meaningful clusters.

Graphical analysis using the elbow method and a detailed examination of cluster properties led to the conclusion that 5 clusters best represented the customer segments and were the optimal number of clusters. The aim is to increase satisfaction and loyalty among high-value customers by offering special deals and to activate low-value customers through incentives and promotions.
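A compact sketch of the preprocessing and tuning steps described above (z-score filtering at |z| > 3, standard scaling, and Grid Search Cross Validation around a Random Forest) might look like this. The file name, the target column `clv`, and the parameter grid are placeholders, not the thesis configuration; the subsequent K-means segmentation with the elbow method would run on the predicted scores:

```python
# Sketch of outlier filtering, scaling, and grid-search tuning for CLV regression.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("corporate_customers_quarterly.csv")   # hypothetical quarterly feature table

# Drop rows where any numeric feature has |z-score| > 3, as described in the abstract.
numeric = df.select_dtypes(include=np.number)
df = df[(np.abs(stats.zscore(numeric)) <= 3).all(axis=1)]

X = df.drop(columns=["clv"]).select_dtypes(include=np.number)
y = df["clv"]

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("rf", RandomForestRegressor(random_state=42)),
])
grid = GridSearchCV(
    pipe,
    param_grid={"rf__n_estimators": [200, 500], "rf__max_depth": [None, 10, 20]},
    scoring="r2",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```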
Diabetic retinopathy classification with using deep learning (Graduate School, 2024-08-15)
Şahin, Mehmet Alper ; Beyca, Ömer Faruk ; 528211070 ; Big Data and Business Analytics

This study focuses on developing an ensemble deep learning model for the classification of diabetic retinopathy (DR), the most prevalent of the microvascular complications associated with diabetes. Although numerous modeling studies have been conducted for DR classification using various datasets, differences in fundus photographs influenced by genetic factors may affect the effectiveness of these models and complicate their generalization. In addition, these differences may reduce model performance in different regions. Furthermore, given the limited accessibility of health services in developing countries, the significance of conducting modeling studies on DR in these regions becomes even more pronounced. In light of this, the limited-access "Brazilian Multi-Label Ophthalmological Data Set" was chosen as the reference dataset, and all model development was carried out on it. In this dataset, fundus photographs are labelled with the following eye abnormalities: DR, macular edema, scar, nevus, drusen, cup disc increase, myopic fundus, and age-related macular degeneration. In this study, an AlexNet-based model architecture serves as the fundamental framework for detecting the given eye abnormalities. Subsequently, the outputs of each model are leveraged to refine the DR classification. Using randomly selected data points from the dataset, the diagnostic evaluations made by ophthalmologists on these fundus photographs are compared with the prediction results of the developed model. The designed model is aimed at improving the decision-making ability of experienced ophthalmologists in detecting DR. Its main purpose is to reduce the possibility of misdiagnosis, a critical concern in medical evaluations that require a high level of focus and where even small details provide valuable insight. By empowering professionals with more robust diagnostic tools and insights, it aims to increase the accuracy and efficiency of diagnoses, ultimately minimizing erroneous conclusions in the classification of DR. Deep learning models were developed using the AlexNet architecture, one of the CNN structures, to detect 9 different eye anomalies. Training parameters were tuned by comparing the accuracy and recall metrics of the developed models. The accuracy and recall of the diabetic retinopathy model on the test set are 0.76 and 0.73, respectively. To improve DR prediction capability, a system has been proposed that predicts the DR class using a combination of the prediction probabilities of the other eye anomaly models. A fundus photo is first given as input to the DR model, which returns its prediction directly if the predicted probability of DR is less than 0.2 or greater than 0.4; otherwise, if the probability is between 0.2 and 0.4, the fundus photo is also given as input to the other developed models. A final output is produced by weighting the prediction probabilities of the other anomaly models according to the Pearson correlation coefficient of the relevant anomaly with DR. The proposed method yielded a notable increase in recall on the validation set, achieving a value of 0.76, which corresponds to an improvement of 8.57%.

Similarly, the recall on the test set showed a comparable enhancement, with an observed improvement of 5.2%.
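The two-stage decision rule described above can be sketched roughly as follows. The model objects, their `predict_proba` interface, and the normalization of the Pearson weights are assumptions made for illustration, while the 0.2/0.4 thresholds come from the abstract:

```python
# Rough sketch of the thresholded, Pearson-weighted ensemble decision for DR.
import numpy as np

def dr_probability(image, dr_model, anomaly_models, pearson_with_dr):
    """Return a DR probability for one fundus photo.

    dr_model / anomaly_models are placeholder objects exposing predict_proba();
    pearson_with_dr maps each anomaly name to its Pearson correlation with DR.
    """
    p_dr = dr_model.predict_proba(image)           # probability that the photo shows DR

    # Confident band: trust the DR model alone (thresholds from the thesis abstract).
    if p_dr < 0.2 or p_dr > 0.4:
        return p_dr

    # Uncertain band (0.2 <= p_dr <= 0.4): Pearson-weighted combination of the
    # other anomaly models' probabilities (normalization is an assumption here).
    names = list(anomaly_models)
    weights = np.array([pearson_with_dr[name] for name in names])
    probs = np.array([anomaly_models[name].predict_proba(image) for name in names])
    return float(np.dot(weights, probs) / np.abs(weights).sum())
```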
Developing a new system for advertisement analysis using gaze and depth analysis methods (Graduate School, 2024-08-15)
Baday, Fatih ; Beyca, Ömer Faruk ; 528211078 ; Big Data and Business Analytics

This thesis proposes an innovative methodology based on observers' perspectives to deeply understand the interaction with, and persuasive capacity of, advertisements. The core aim of this study is to develop a system capable of assessing interest in advertising content based on the direction of observers' gazes. Four fundamental steps have been followed to thoroughly examine interest in advertising content. Each step involves the technological methods and algorithms of a system that contributes to understanding the interactive power of advertisements, starting from the observers' viewpoints. The research begins with the identification of human figures and their eye positions in video frames using the DensePose algorithm. Following this step, the Gaze360 model, which utilizes deep learning techniques to predict individuals' viewing angles, is introduced. Developed by a group of researchers at MIT, this model builds on ResNet architectures to determine where individuals are looking. In the third stage, depth information obtained through the ManyDepth algorithm is used to ascertain the spatial positions of objects in images, and this information is employed to create a depth map. The integrated data are then used to determine which advertising contents are more engaging to viewers. The study emphasizes the process of carefully adjusting and testing the integration of the technological systems used, with the goal of providing a detailed examination of interaction dynamics within the advertising sector. As a result, it develops new methods that allow for a more accurate measurement of the impact of advertisement content on viewers. Instead of focusing on specific field tests, this thesis explores the theoretical foundations of the proposed methodology and offers a comprehensive framework for how user interaction in the advertising sector can be optimized more effectively. This approach aims to lay a foundation for future research and to contribute methodological and theoretical insights towards enhancing the effectiveness of advertising strategies, thereby aiding the development of more effective advertising techniques and maximizing the persuasive power of advertisements.
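As a rough illustration of how an eye position, a gaze direction, and a depth map can be combined geometrically to decide which advertisement a person is looking at: the wrappers around DensePose, Gaze360, and ManyDepth are not shown, and the camera intrinsics, 3D ad positions, and angular tolerance are assumed inputs, not values from the thesis:

```python
# Geometric sketch: which ad lies closest to the gaze ray from a viewer's eye?
import numpy as np

def backproject(pixel, depth, fx, fy, cx, cy):
    """Lift a pixel (u, v) with known depth to a 3D camera-space point (pinhole model)."""
    u, v = pixel
    z = depth[int(v), int(u)]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def looked_at_ad(eye_pixel, gaze_dir, depth, ad_centers_3d, intrinsics, max_angle_deg=10.0):
    """Return the index of the ad whose centre lies closest to the gaze ray, or None."""
    eye_3d = backproject(eye_pixel, depth, *intrinsics)      # eye position in camera space
    gaze_dir = gaze_dir / np.linalg.norm(gaze_dir)           # unit gaze vector (e.g. from Gaze360)
    best, best_angle = None, max_angle_deg
    for i, ad in enumerate(ad_centers_3d):
        to_ad = ad - eye_3d                                  # vector from the eye to the ad centre
        cosang = np.dot(gaze_dir, to_ad) / np.linalg.norm(to_ad)
        angle = np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))
        if angle < best_angle:                               # smallest angular deviation wins
            best, best_angle = i, angle
    return best
```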
Comperative evaluation of unsupervised fraud detection algorithms with feature extraction and scaling in purchasing domain (Graduate School, 2024-08-21)
Taşoğlu, Yiğit Can ; Ergün, Mehmet Ali ; 528211079 ; Big Data and Business Analytics

The main aim of this research is to evaluate and compare various unsupervised outlier detection methods that do not require labeled data, making them suitable for real-world purchasing data where labels are often unavailable. The thesis highlights the challenges of fraud detection in large datasets, particularly in industries such as finance and purchasing, where fraudulent activities can cause significant financial losses if not identified early. The motivation behind the research lies in the limitations of traditional, rule-based detection methods, which often fail to capture complex fraud patterns. Unsupervised algorithms, which detect anomalies based on their deviation from the general behavior of the dataset, offer a proactive approach to fraud detection by identifying previously unseen fraud patterns. This study applies various methods, including distance-based, machine-learning-based, and feature-based models, and focuses on enhancing these models through feature extraction and scaling techniques. The thesis evaluates several algorithms, such as Local Outlier Factor (LOF), DBSCAN, and Isolation Forest, using performance metrics like accuracy, precision, recall, and F1 score. LOF was identified as the most effective model, achieving the highest accuracy and demonstrating a robust ability to detect irregular patterns in the purchasing data. However, the effectiveness of all algorithms was significantly enhanced by data transformations, particularly scaling. Scaling ensures that features with differing magnitudes, such as quantities and prices, do not distort the results, allowing for more accurate anomaly detection. The importance of feature extraction is also emphasized, as it helps identify intricate patterns between data points. Extracted features, such as the frequency of purchase orders, vendor categories, and purchase amounts, provide deeper insights into potential fraud indicators. Additionally, the study recognizes that integrating multiple models can reduce the limitations inherent in individual algorithms, creating a more comprehensive fraud detection framework. By combining different unsupervised methods and leveraging feature extraction, the research offers a more adaptive and reliable approach to identifying fraudulent activities. In conclusion, this study shows that employing a combination of unsupervised outlier detection methods, along with appropriate data preprocessing techniques, significantly improves fraud detection in purchasing systems. These methods not only enhance accuracy but also help businesses reduce financial risks and improve operational efficiency, ensuring a more secure and effective fraud prevention strategy.
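A minimal sketch of the detectors compared above, applied to scaled, engineered purchasing features, could look like the following. The file name, feature columns, and hyperparameters such as `contamination` and `eps` are illustrative assumptions rather than the thesis settings:

```python
# Sketch of unsupervised outlier detection on scaled purchasing features (hypothetical data).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest
from sklearn.cluster import DBSCAN

df = pd.read_csv("purchase_orders.csv")           # hypothetical engineered feature table
features = ["order_frequency", "amount", "vendor_category_code"]
X = StandardScaler().fit_transform(df[features])  # scaling keeps quantities and prices comparable

lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)         # -1 = outlier
iso_labels = IsolationForest(contamination=0.01, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.8, min_samples=10).fit_predict(X)             # -1 = noise / outlier

flagged = pd.DataFrame({
    "lof": lof_labels == -1,
    "isolation_forest": iso_labels == -1,
    "dbscan": db_labels == -1,
})
print(flagged.mean())     # share of orders each method flags as anomalous
```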
New proposed methods for synthetic minority over-sampling technique (Graduate School, 2024-08-21)
Korul, Hakan ; Ergün, Mehmet Ali ; 528211075 ; Big Data and Business Analytics

Machine Learning (ML), a branch of artificial intelligence (AI), aims to mimic human learning processes using data and algorithms. Machine learning algorithms are trained on data, which can be labeled or unlabeled, with the aim of finding patterns by considering each data variable and its corresponding label. Algorithms trained on data predict the labels of future data based on patterns learned from previously observed variables. A machine learning algorithm has an error function that evaluates its performance on the dataset, and to improve performance, the algorithm searches for solutions with a lower error value than those obtained previously. The proliferation of ML algorithms and their use across various fields in the literature has led to the resolution of many challenging problems through machine learning. These problems can generally be categorized as supervised and unsupervised learning. Supervised learning, where data labels are known, can be further divided into regression and classification. Regression algorithms are used for problems such as predicting stock prices, predicting house prices, and forecasting weather conditions, while classification algorithms are employed for tasks like customer churn prediction, email spam detection, and image recognition. Problems where labels are not available in the dataset fall under unsupervised learning, which typically includes tasks like customer segmentation. Ensuring that models are trained properly and achieve high performance relies heavily on the quality of the data, which should ideally be clean, balanced, and representative of real-world conditions. The size of the dataset is also crucial, as larger datasets usually lead to better results. Alongside these data characteristics, having balanced classes is vital for classification algorithms. If the class distribution is imbalanced, the result is an imbalanced dataset, and training a model on balanced versus imbalanced classes can lead to noticeable differences in performance. For instance, in a problem with balanced classes where 4 out of 10 customers churn, meaning almost a 50% probability, even random guessing can make successful predictions. However, if only 1 out of 100 customers churn, random guessing becomes insufficient, posing an extremely challenging problem for machine learning models. To address this issue, various methods have been developed. In the example above, where 99 customers do not churn and 1 customer churns, the classes are typically referred to as the majority and minority classes. The available methods can essentially be classified into three categories: increasing the number of minority class samples to approach the number of majority class samples (oversampling), decreasing the number of majority class samples to approach the number of minority class samples (undersampling), or using a combination of both (hybrid sampling).

Directly comparing oversampling and undersampling methods can be quite challenging, as these methods may perform differently on different datasets. However, certain characteristics of the dataset can help expedite the selection of the appropriate method. For example, if the dataset is not very large, undersampling may not be suitable because it would eliminate majority class samples, resulting in a significantly smaller dataset; in such cases, oversampling methods that increase the overall dataset size should be preferred. Oversampling methods can be further divided into random oversampling and SMOTE. The most significant difference between the two lies in how synthetic data are generated. While random oversampling duplicates existing minority class samples, SMOTE generates synthetic data points close to existing minority class samples using various techniques. Since random oversampling does not introduce new examples to the algorithm and can lead to overfitting, SMOTE methods generally yield better performance improvements. In the data generation stage of the SMOTE method, the k nearest neighbors of each minority sample are first determined. To generate a synthetic data point, two examples are selected: the minority sample itself and a randomly chosen minority class example among its k nearest neighbors. The difference between the neighbor and the minority sample is then calculated, multiplied by a random number between 0 and 1, and added back to the minority sample. This ensures that the newly generated synthetic point lies between the minority sample and one of its randomly selected k nearest neighbors. Following the emergence of SMOTE in the literature, efforts have focused on developing the method from various perspectives; Borderline SMOTE and K-Means SMOTE are some of the newly proposed methods in this context. In Borderline SMOTE, a border between the minority and majority classes is defined, and synthetic data generation is focused only on the minority samples on this borderline. In K-Means SMOTE, the training set is divided into k clusters using the K-means technique, decisions are made on which clusters to use for oversampling, and the SMOTE steps are then applied within these clusters. In my research, I developed three new SMOTE methods. The first, Genetic SMOTE, focuses on feature alteration using the crossover logic of genetic algorithms between a sample and its neighbor: randomly selected features of the sample are replaced with the neighbor's values, while the remaining features are taken directly from the sample, resulting in new synthetic data of the same dimensionality. The second, Dual Borderline SMOTE, is inspired by the Borderline SMOTE method: instead of considering the ratio of minority samples within the total number of neighbors used for Borderline, which ranges between 0.5 and 1, I adjusted it to range between 0 and 1. The significant difference from Borderline SMOTE is that, while Borderline SMOTE does not consider whether the neighbors of a borderline minority sample are themselves on the border, in this method both the sample and its neighbor must be on the newly defined border.

The last method developed is Genetic Dual Borderline SMOTE, which combines these two methods. It applies the Dual Borderline SMOTE rules for selecting minority samples and their close minority neighbors, meaning that both the selected sample and its neighbor must lie on the defined borderline. When generating synthetic data from the two selected examples, instead of applying SMOTE as in Dual Borderline SMOTE, it performs feature alteration following the Genetic SMOTE steps. In the final section, to compare the performance of the three developed methods with the three existing methods, 8 datasets and 4 different machine learning algorithms were employed. For each dataset and algorithm, a total of 6 different SMOTE methods were evaluated. During performance comparison, various parameter combinations of the machine learning algorithms and SMOTE methods were tested for each dataset, and the parameters yielding the best results were selected for each dataset, model, and SMOTE method. The success of the SMOTE methods was compared using the parameters that provided the best F1 score. For performance measurement, the F1 score of the minority class was preferred over metrics like accuracy, which are not suitable for imbalanced datasets. To reduce randomness and ensure a more reliable evaluation, the F1 score was calculated with 5-fold cross-validation. A total of 32 F1 scores (8 datasets x 4 machine learning algorithms) were obtained for each SMOTE method, and the F1 scores were illustrated in figures either overall or per model. Additionally, the SMOTE methods were ranked from 1 to 6 for each dataset and machine learning algorithm, and the rankings were compared in the study. It was observed that the newly developed methods outperform the existing ones. In the model-based analyses, while there is not a significant difference in the performance of SMOTE methods for linear models, in ensemble methods in particular the Genetic Dual Borderline SMOTE method stands out noticeably from the others.
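Based on the textual description above, the classical SMOTE interpolation step and the crossover-style generation of Genetic SMOTE could be sketched as follows. This is a re-implementation from the description, not the author's code, and the per-feature 0.5 crossover probability is an assumption:

```python
# Sketch of classical SMOTE interpolation and a crossover-style ("Genetic SMOTE") generator.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(x, neighbor, rng):
    # Classical SMOTE: a random point on the segment between a minority sample and a neighbor.
    return x + rng.random() * (neighbor - x)

def genetic_smote_sample(x, neighbor, rng):
    # Crossover-style variant: each feature is copied from either the sample or the neighbor.
    mask = rng.random(x.shape[0]) < 0.5          # 0.5 per-feature probability is an assumption
    return np.where(mask, neighbor, x)

def oversample(X_min, n_new, k=5, method=smote_sample, seed=0):
    """Generate n_new synthetic minority samples from the minority feature matrix X_min."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1 because each sample is its own neighbor
    _, idx = nn.kneighbors(X_min)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                      # pick a minority sample
        j = idx[i][rng.integers(1, k + 1)]                # pick one of its k nearest minority neighbors
        new.append(method(X_min[i], X_min[j], rng))
    return np.vstack(new)
```

The borderline variants described above would additionally restrict which (sample, neighbor) pairs are eligible before calling the generator, which is where the thesis's Dual Borderline rule differs from standard Borderline SMOTE.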
Predicting customers with higher probability to purchase in telecom industry (Graduate School, 2025-01-22)
Yıldırım, Güzide Nur ; Ergün, Mehmet Ali ; 528211085 ; Big Data and Business Analytics

Understanding customer insights, predicting customers' next steps, and making accurate recommendations have become significant issues for customer service organizations today. With the widespread use of technologies such as Big Data Analytics and Machine Learning, the contribution of Artificial Intelligence to digital transformation in industry has significantly affected the way businesses work. The use of Artificial Intelligence technologies in business processes has increased the success of marketing strategies, business models, and sales. Companies that provide customer services have units responsible for CRM, also known as customer relationship management. The Big Data accumulated and analyzed by CRM units can be modeled with artificial intelligence algorithms for customer-centric use cases, and the outcome of such an approach guides the customer strategies built by companies. In this study, business-analytics-based approaches to identifying customers with a high probability of purchasing, trained on CRM data used in the telecom industry, are presented.