DNS big data processing for detecting customers' behaviour of ISP using an optimized Apache Spark cluster

Alkhanafseh, Yousef
Graduate School
During the past few decades, technology fields, especially the Internet of Things (IoT), have evolved remarkably, which in turn has contributed to a great proliferation of data sources. Unfortunately, at that time, the available data processing tools were insufficient, in terms of variety and advancement, to analyze such huge volumes of data in a reasonable time. They suffered from several problems such as slowness, lack of comprehensiveness, limited cluster size, and high expense. These problems constituted major obstacles to progress and achievement in the big data field. Therefore, data went unused for a while. However, once its enormous benefits, such as making smart decisions, saving time and cost, monitoring servers, improving performance, uncovering hidden correlations, and providing high-quality reports, were closely realized, big data processing started to become prevalent. When dealing with big data, the most famous question that can be asked is "how can big data analysis make the enterprise jobs and business better?". Currently, huge amounts of structured and unstructured data-sets, referred to as big data, are being processed by different types of companies such as telecommunications, software and hardware, marketplaces, social media, and so on. Current advanced services, hardware, and software have played an important role in promoting big data processing by making its analysis faster, easier, and less expensive. It is important to know the difference between big data and traditional data sources. The main difference between them can be clearly noticed in data size, types, frequency, capture speed, and the processing tools used.
Despite the current advanced technologies, processing exabytes (EB) or even yottabytes (YB) of data efficiently, which includes making optimal use of the system by fully utilizing its features, is still a challenge and needs an expert who has a good mathematical background, knowledge of statistics, and superior experience in this field. Based on that, this thesis aims to provide a comprehensive approach to setting up a system that consists of three different stages, namely collecting, processing, and visualizing a huge amount of DNS data, 1.3 TB daily, using an optimized YARN-based Apache Spark cluster. The processing was carried out on two clusters that differ in where they were established. The first one was established in the cloud using Amazon Web Services Elastic MapReduce (AWS EMR), and the other was established on local machines using Apache Ambari. Nevertheless, in this project, only the cloud cluster is discussed and reported in detail. The main goal of the cloud cluster is to determine the specifications of the machines needed for the local cluster. Moreover, it made the various Apache Spark configurations easier to understand by trying each of them with different values. Additionally, different structures of Python code, especially related to PySpark, were tried in order to identify the most efficient one. Initially, the thesis starts with an extensive introduction that takes into consideration different subjects such as big data concepts, properties, sources, importance, future, limitations, challenges, and processing tools.
Moreover, the architecture of the DNS servers used was thoroughly explained by stating their general purpose and working principle. Similarly, under the title of data collecting, the project's main big data, DNS, and the other data-sets used, which are Call Detail Record (CDR), Customer Relationship Management (CRM), Carrier-grade Network Address Translation (CGNAT), and IP-Blocks, were clarified by presenting a sample of each one in separate tables. All these data-sets are encrypted, and only the concerned authorities can understand their content. Then, an additional data-set that was captured from internet websites was introduced by presenting a sample of it. The web scraping method was discussed as well. There were more than one thousand URLs, which can be classified into about 31 categories including education, games, VPNs, services, banks, economy, etc. After that, several services that are utilized to process the data, such as Apache Spark, Yet Another Resource Negotiator (YARN), Hadoop Distributed File System (HDFS), ZooKeeper, and Hive, were briefly investigated by explaining their importance, working principle, architecture, and main configurations. Specifically, Apache Spark is the data processing engine in this project. HDFS and Hive, on the other hand, were used as general storage for processed data-sets and metadata, respectively. ZooKeeper is a service that is utilized to maintain centralized configuration information and provide distributed synchronization. Other services such as AWS EMR and AWS S3 were also used in this project. AWS EMR is a platform on which Apache Spark clusters can be built. AWS S3 is a cloud storage service that was temporarily used for saving processed data-sets. Next, based on different factors, the differences between the Apache Spark APIs, which are Resilient Distributed Dataset (RDD), DataFrame, and Dataset, were concisely illustrated.
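The classification of queried names into categories, mentioned above, can be illustrated with a minimal Python sketch. The mapping, the function name `categorize`, and the sample domains are hypothetical placeholders (reserved example domains), not the thesis's scraped data-set, and the suffix-matching rule is an assumption about how such a lookup might work.

```python
from collections import Counter

# Hypothetical lookup table; the thesis's real list holds over one
# thousand URLs in about 31 categories and is not reproduced here.
CATEGORY_BY_DOMAIN = {
    "example.com": "banks",
    "example.org": "education",
    "example.net": "games",
}


def categorize(query_name):
    """Map a DNS query name to a category by longest-suffix match."""
    labels = query_name.lower().rstrip(".").split(".")
    # Try the full name first, then drop leading labels one at a time.
    # A bare top-level label (e.g. "com") is never tried on its own.
    for start in range(len(labels) - 1):
        candidate = ".".join(labels[start:])
        if candidate in CATEGORY_BY_DOMAIN:
            return CATEGORY_BY_DOMAIN[candidate]
    return "uncategorized"


def count_categories(query_names):
    """Count how many queries fall into each category."""
    return Counter(categorize(name) for name in query_names)
```

Matching on suffixes lets subdomains such as `www.example.com` inherit the category of their parent domain without listing every host name separately.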
Subsequently, a procedure for optimizing a YARN-based Apache Spark cluster was proposed by explaining the mathematical equations used and giving a detailed example of how to start the Apache Spark object in an optimal way. Both Apache Spark and YARN configurations that are related to application properties, run-time environment and networking, shuffle behavior, compression and serialization, memory management, and execution behavior were elaborated in detail. Next, various data processing experiments were carried out using different cluster sizes, ranging from a small number of machines with small amounts of RAM and vCores to large clusters with a high number of machines and large amounts of RAM and vCores. These clusters were optimized based on the previously stated configurations, and the values shown in both the ResourceManager and Spark admin interfaces, namely the amount of RAM, number of vCores, number of containers, and parallel tasks, were exactly the same as the calculated ones, which in turn confirms the efficient use of the available resources. As a result, about 95% of the RAM and CPUs of the clusters were successfully utilized. In addition, the results of the experiments, which contain input data size, number of operations, execution time, and output data size, were reported. Based on these results, a local cluster with the same specifications as the most appropriate cluster obtained in the experiments was established. After that, the output DNS data was grouped based on a specific schema and saved in Parquet, a compressed format that reduces the size of the data approximately four times. Then, it was transferred to an optimized Elasticsearch cluster, which was established in order to run fast queries on the output data and visualize it using an interactive Kibana dashboard. The Elasticsearch cluster includes one master node and two slave nodes.
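The abstract does not reproduce the thesis's sizing equations, so the following Python sketch uses a widely cited rule of thumb for sizing executors on YARN instead. It should be read as an illustration under stated assumptions, not as the thesis's procedure. The function name `plan_executors`, the defaults (5 cores per executor, 1 vCore and 1 GB reserved per node, 10% memory overhead), and the sample node sizes are all assumptions; only the four `spark.*` keys are standard Spark properties.

```python
def plan_executors(nodes, vcores_per_node, ram_gb_per_node,
                   cores_per_executor=5, reserved_cores=1,
                   reserved_ram_gb=1, overhead_fraction=0.10):
    """Return Spark settings sized to identical worker nodes.

    Rule-of-thumb sketch (an assumption, not the thesis's equations):
    reserve some cores and RAM per node for the OS and daemons, pack
    fixed-size executors into the rest, and keep one executor slot
    free for the YARN ApplicationMaster.
    """
    executors_per_node = (vcores_per_node - reserved_cores) // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("not enough vCores per node for one executor")

    instances = nodes * executors_per_node - 1  # one slot for the AM
    if instances < 1:
        raise ValueError("cluster too small for an executor plus the AM")

    # Memory available to each YARN container, in MiB.
    container_mb = (ram_gb_per_node - reserved_ram_gb) * 1024 // executors_per_node
    # Heap plus overhead must fit in the container; overhead is a
    # fraction of the heap with a 384 MiB floor.
    heap_mb = int(container_mb / (1 + overhead_fraction))
    overhead_mb = container_mb - heap_mb
    if overhead_mb < 384:
        overhead_mb = 384
        heap_mb = container_mb - overhead_mb

    return {
        "spark.executor.instances": str(instances),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": f"{heap_mb}m",
        "spark.executor.memoryOverhead": f"{overhead_mb}m",
    }


if __name__ == "__main__":
    # Hypothetical cluster: 10 workers with 16 vCores and 64 GB each.
    plan = plan_executors(nodes=10, vcores_per_node=16, ram_gb_per_node=64)
    for key, value in plan.items():
        print(f"{key}={value}")
```

Values computed this way could be passed to a Spark session builder or to `spark-submit --conf`, and then compared with what the ResourceManager and Spark interfaces report, in the spirit of the check described above.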
The Elasticsearch indices were properly configured and split into small indices. Also, they were defined in a way that only uses the needed features, which in turn improves and tunes disk performance. The captured visualizations have played a major role in determining useful information such as the status of DNS servers, customer segmentation, distribution of DNS traffic across neighborhoods in Turkey, types of customers, most visited categories, most used URLs, and suitable places for advertising. Finally, an application based on time series forecasting was built. A sample of the output data was prepared for time series forecasting using the Facebook Prophet model, which was selected after trying several models such as autoregression (AR), Seasonal Autoregressive Integrated Moving-Average (SARIMA), and Vector Autoregression (VAR). However, only a comparison between VAR and Fbprophet is discussed in this project. The main target of this prediction is determining the density of the DNS servers used, giving information about missing data, and providing approximate information about the future of the servers. The models were evaluated by comparing the test data-set with the predicted one and calculating the mean absolute error. It was about 2.49% for Fbprophet. In short, some of this thesis's achievements can be summarized as providing solid knowledge about cloud computing systems and different big data processing tools, performing various experiments on different clusters with different sizes and resources, establishing a local cluster based on these experiments, transforming 1.3 TB of raw data daily into meaningful information, and building a system for processing new data continuously. Furthermore, this processed, informative DNS data is used in a wide range of fields such as congestion prediction for DNS servers, classifying customers, enhancing the content delivery network of some specific websites, and running successful market advertising campaigns.
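The evaluation step described above, comparing held-out test values with predictions, can be sketched in plain Python as follows. Because the reported error is a percentage, the sketch also includes a relative variant (mean absolute percentage error). Whether the thesis used exactly this formula is an assumption, and the function names and sample numbers are illustrative only.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and forecast values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average absolute error relative to each observed value, in percent.

    Assumes no observed value is zero.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    ratios = (abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * sum(ratios) / len(actual)


if __name__ == "__main__":
    # Made-up query counts for three time steps, not thesis data.
    observed = [100.0, 200.0, 400.0]
    forecast = [110.0, 190.0, 400.0]
    print(mean_absolute_error(observed, forecast))
    print(mean_absolute_percentage_error(observed, forecast))
```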
Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2022
Keywords
internet service providers