Unveiling the wireless network limitations in federated learning
Files
Date
2022-05-27
Authors
Eriş, Mümtaz Cem
Journal Title
Journal ISSN
Volume Title
Publisher
Graduate School
Abstract
The huge increase in edge devices with powerful processors around the world has inspired many researchers to apply decentralized machine learning techniques so that these edge devices can contribute to training deep neural networks. Among these decentralized machine learning schemes, federated learning has gained tremendous popularity, as it grants privacy to the edge devices while also reducing communication costs. This is because federated learning does not need to access or store raw data; instead, clients learn from their raw data locally and produce gradient updates, which are aggregated at the server. The raw data remains untouched at the clients, and only the trained gradient updates are shared with the parameter server. As a result, privacy and security concerns are largely mitigated, and exchanging ML models instead of raw data saves communication overhead. Federated learning has thus emerged from distributed and decentralized learning, yet it transforms training by aggregating the ML models trained locally by edge devices.

A typical federated learning scheme, as investigated in this thesis, includes a large number of clients, each calculating the gradient of the loss function with the stochastic gradient descent method, and an aggregator that collects these gradients in each communication round. In each round, only a randomly selected subset of clients participates with their calculated gradients. The gradient is estimated over a local batch, which is a fraction of the client's local raw data. The gradients collected by the server are averaged, and the averaged gradient is disseminated back to the clients. Convergence is expected after many communication rounds, as many clients are anticipated to contribute and thereby train the model at the server on their data.

Yet, the issues related to network limitations in the federated learning process are not covered in the literature. In such federated learning applications and simulations, the network is assumed to be stable, and the limitations that come with an unstable network are overlooked. These simulations are mostly written in Python, and the essential network settings are implicitly assumed. Quality of Service (QoS) parameters such as packet drop ratio and delay are not considered; however, they are key factors for federated learning convergence, since they can slow down or even prevent the convergence process. In fact, real-time federated learning applications have been proposed in the literature, such as cache-based popular content prediction, and these applications are sensitive to packet drops and delays caused by the network. Therefore, delay and packet drops in the network must be thoroughly examined in order to make such federated learning applications feasible. To this end, this study introduces an advanced federated learning simulation and shares its results. The simulation includes not only the clients and the server that produce gradient updates, but also a full network backbone that allows the QoS parameters of the federated learning process to be observed. To achieve this, a network consisting of the federated learning clients and server is simulated using the well-established NS3 (Network Simulator 3).
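To make the client/server scheme described earlier concrete, the following is a minimal, self-contained Python sketch of one federated averaging round with partial client participation. It is an illustration rather than the simulation code of the thesis; the quadratic loss, the data sizes, and all hyperparameter values are assumptions chosen only for readability.

import numpy as np

# Minimal sketch of one federated averaging round: a random subset of
# clients computes a local SGD gradient on a mini-batch of its own data,
# the server averages these gradients and updates the global model, which
# is then disseminated back to the clients. All values are illustrative.

rng = np.random.default_rng(0)

NUM_CLIENTS, DIM, BATCH = 100, 10, 32
PARTICIPATION = 0.1          # fraction of clients selected per round (assumed)
LR = 0.01                    # learning rate (assumed)

# Each client holds its own raw data locally; it is never sent to the server.
client_data = [(rng.normal(size=(200, DIM)), rng.normal(size=200))
               for _ in range(NUM_CLIENTS)]

def local_gradient(w, X, y):
    """Gradient of a squared-error loss on one local mini-batch."""
    idx = rng.choice(len(X), size=BATCH, replace=False)
    Xb, yb = X[idx], y[idx]
    residual = Xb @ w - yb
    return Xb.T @ residual / BATCH

w = np.zeros(DIM)            # global model kept at the parameter server
for rnd in range(50):
    selected = rng.choice(NUM_CLIENTS,
                          size=int(PARTICIPATION * NUM_CLIENTS),
                          replace=False)
    grads = [local_gradient(w, *client_data[c]) for c in selected]
    w -= LR * np.mean(grads, axis=0)   # server averages and updates
    # the updated model w is then broadcast back to the clients

The random selection of clients in each round mirrors the partial participation described above; in the thesis setup this exchange additionally has to traverse the simulated network, which is where the QoS effects enter.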
The network is designed as a dumbbell topology, with 100 clients on the left-hand side of the dumbbell and the server on the right-hand side. This makes the left router the bottleneck, so the background traffic in the network causes packet drops there. An additional node that generates background traffic is placed on the same side as the clients so that packet drops can be observed, and their intensity can be adjusted with a hyperparameter: the interarrival time of the packets generated by the background traffic. Poisson-distributed background traffic is produced at the traffic generator node according to this interarrival time between packets. By applying the ns3-ai framework, which enables NS3 and Python processes to communicate, the network and the federated learning process are run simultaneously so that observations on QoS can be made.

Since millions of devices are expected to be involved in a federated learning application, in which the speed of convergence is essential and not every client update necessarily improves convergence, UDP (User Datagram Protocol) is utilized as the transport layer protocol. The gradient updates are fragmented into UDP packets and sent from the clients to the server and from the server to the clients. Thus, whenever a UDP packet that carries a client update is dropped, the whole client update must be discarded. As a result, discarded client updates reduce the performance of federated learning and cause significant drawbacks for the application.

Initially, the experiment is validated by running numerous simulations with different seed values. Validation is carried out by testing the reproducibility of the same experiments, comparing the cross-entropy error, the accuracy of both the server and the clients, and the packet drop rates. Many simulation scenarios are designed for interarrival time values ranging from 250 milliseconds to 900 milliseconds. The replication method is used to evaluate the results: each scenario is run 10 times with different seeds, and the results are presented with 95% confidence intervals. Among these scenarios, three are selected and labeled as heavy, medium, and light traffic intensity, corresponding to interarrival times of 250, 400, and 900 milliseconds, respectively.

The results are presented in terms of maximum error rates, average success rates, and per-round test accuracies. The most erroneous batch detected in the aggregated gradient at the server is reported as the maximum error percentage after each communication round; it reflects the worst-performing model and demonstrates the unfavorable consequences of the background traffic on performance. With heavy traffic, the maximum error percentage goes up to 80% after round 90, whereas it stays between 10% and 20% with light traffic. This reveals the federated learning application's early vulnerability to background traffic. Under the assumption of a completely stable network, the average success percentage of client update delivery would be 100%. However, this is not realistic: the average success percentage decreases and fluctuates according to the traffic intensity. As the traffic gets more intense, fewer client updates are received by the parameter server for a successful aggregation. Finally, the test accuracy for traffic configurations of various intensities is presented.
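The effect of the fragmentation-and-discard rule described above can be illustrated with a short back-of-the-envelope Python sketch. This is not the NS3/ns3-ai simulation: in the actual setup, drops result from bottleneck-queue overflow driven by the Poisson background traffic, whereas here a fixed per-packet drop probability, along with the update and packet sizes, is assumed purely for illustration.

import math

# A client update is fragmented into several UDP packets, and losing any
# one of them forces the whole update to be discarded. The numbers below
# are assumptions, not measurements from the thesis.

UPDATE_BYTES = 400_000        # size of one serialized gradient update (assumed)
UDP_PAYLOAD = 1_400           # usable payload per UDP packet (assumed)
NUM_CLIENTS_PER_ROUND = 10    # clients selected in a communication round

packets_per_update = math.ceil(UPDATE_BYTES / UDP_PAYLOAD)

for drop_prob in (0.0005, 0.002, 0.01):        # light / medium / heavy loss
    p_update_intact = (1 - drop_prob) ** packets_per_update
    expected_received = NUM_CLIENTS_PER_ROUND * p_update_intact
    print(f"drop={drop_prob:.4f}  "
          f"P(update intact)={p_update_intact:.3f}  "
          f"expected updates/round={expected_received:.2f}")

Even a small per-packet loss rate compounds across the hundreds of fragments that make up a single update, which is why the average success percentage of client update delivery falls well below 100% under heavy background traffic.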
Packet drops caused by overflow of the bottleneck queue lead to a substantial decrease in test accuracy, which is crucial for any federated learning application. For at least 200 communication rounds, the decline in accuracy is clearly visible when the traffic is intense. More specifically, 90% accuracy is reached after more than 120 rounds for high-intensity traffic, while it is reached in around 60 rounds for light traffic. The intensity of the background traffic therefore becomes a crucial consideration for potential time-critical federated learning applications. Confidence intervals on the test accuracy are presented according to the traffic intensity. Convergence is achieved regardless of the traffic intensity; wide intervals are seen in the earlier rounds, and they become slightly wider as the intensity increases. In addition, the amount of traffic data, the number of packets produced by the background traffic generator node, the data delivery rate, and the monitored interarrival time are presented for each traffic intensity, i.e., each interarrival time.

In light of these results, an adaptive federated learning scheme is proposed to cope with heavy traffic. Using network metrics such as the upload rate and the transmission and queueing delays, the maximum number of clients that can fit into a communication round is calculated and set as the participation rate. This allows the server to receive more client updates, increasing the performance of federated learning under heavy background traffic.
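The abstract does not spell out the exact formula behind the adaptive participation rate, so the following Python sketch shows one plausible instantiation under stated assumptions: every selected client's update must traverse the shared bottleneck within a fixed round deadline, and the per-update time is the transmission delay (update size divided by upload rate) plus queueing and propagation delays. All names and numbers are illustrative assumptions, not the thesis implementation.

# Hypothetical sketch of the adaptive participation idea described above.

def max_clients_per_round(update_bits: float,
                          upload_rate_bps: float,
                          queueing_delay_s: float,
                          propagation_delay_s: float,
                          round_deadline_s: float) -> int:
    """Largest number of client updates that can traverse the shared
    bottleneck link before the round deadline expires (assumed model)."""
    per_update_time = (update_bits / upload_rate_bps       # transmission delay
                       + queueing_delay_s
                       + propagation_delay_s)
    return max(1, int(round_deadline_s // per_update_time))

# Example: 3.2 Mbit updates over a 10 Mbit/s bottleneck with 50 ms queueing
# and 10 ms propagation delay, and a 5 s round deadline (all values assumed).
n = max_clients_per_round(update_bits=3.2e6,
                          upload_rate_bps=10e6,
                          queueing_delay_s=0.05,
                          propagation_delay_s=0.01,
                          round_deadline_s=5.0)
print(f"participation rate set to {n} clients per round")

Capping the participation rate at a value the bottleneck can actually sustain is what lets the server receive more intact client updates per round when the background traffic is heavy.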
Description
Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2022
Keywords
wireless networks,
learning techniques