Obfuscated JavaScript detection using syntactically and lexically enhanced machine learning
Publisher
Graduate School
Abstract
Web-based attacks have been a critical security concern for the past few decades. Because JavaScript has been the most widely used programming language in web application development for years, JavaScript attacks have become increasingly popular among malicious actors. These attacks can have serious consequences, such as unauthorized access, theft of personal information, data exposure, financial damage, and service disruption. Attackers frequently apply obfuscation techniques to modify and obscure their malicious source code, making it harder to understand and helping it evade intrusion prevention and detection systems. Obfuscated JavaScript source code is therefore potentially harmful, and obfuscation detection is a critical task that security systems should support. Identifying obfuscated JavaScript source code is difficult because intruders employ numerous obfuscation techniques. This thesis provides a literature review and background information on JavaScript attacks, obfuscation, obfuscation techniques, obfuscation detection, machine learning, and natural language processing. Existing obfuscation detection methods, including static and dynamic analysis, are reviewed along with their advantages and limitations. Moreover, a novel machine learning model built on syntactic and lexical analysis features is proposed. The approach introduces two novel features that draw on natural language processing techniques and extends the model discussed in previous work. The first feature is the ratio of meaningful words from natural languages such as English to the total number of words in the script.
Because clean coding principles encourage descriptive, easy-to-follow names for variables and functions, non-obfuscated JavaScript source code is likely to contain more meaningful words from real-world languages. This feature can therefore help distinguish obfuscated from non-obfuscated JavaScript. The second feature is the ratio of n-gram words in the script. N-grams are contiguous sequences of n items (in this study, characters or tokens of the JavaScript code). Analysing the n-gram word ratio gives a quantitative measure of the structure and composition of the code, which can be used to compare obfuscated JavaScript source code against non-obfuscated code. Finally, the effectiveness of the proposed model is evaluated and compared with the model from previous work. For this purpose, a labelled data set of obfuscated and non-obfuscated real-world JavaScript source code is preprocessed and used. Both the proposed and the previous models are trained with various binary classification algorithms.
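The two features described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the thesis implementation: the function names, the regular expressions used to pull words out of identifiers, and the caller-supplied dictionary are all hypothetical. The abstract does not define the n-gram ratio precisely, so the sketch assumes one possible reading (distinct n-grams divided by total n-grams); the thesis may compute it differently.

```python
import re

# Assumption: an identifier is a run of letters, digits, "_" or "$" that
# does not start inside a number (so the "x1f" in "0x1f" is skipped).
IDENTIFIER_RE = re.compile(r"(?<![\w$])[A-Za-z_$][\w$]*")
# Assumption: words are the camelCase / snake_case pieces of an identifier.
WORD_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def extract_words(script):
    """Return lowercase word pieces of every identifier-like token."""
    words = []
    for identifier in IDENTIFIER_RE.findall(script):
        words.extend(w.lower() for w in WORD_RE.findall(identifier))
    return words


def meaningful_word_ratio(script, dictionary):
    """Share of extracted words that appear in a natural-language word set."""
    words = extract_words(script)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in dictionary)
    return hits / len(words)


def ngrams(items, n):
    """Contiguous length-n windows over a sequence of characters or tokens."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def distinct_ngram_ratio(items, n=2):
    """Hypothetical n-gram feature: distinct n-grams over total n-grams."""
    grams = ngrams(items, n)
    if not grams:
        return 0.0
    return len(set(grams)) / len(grams)


if __name__ == "__main__":
    # Tiny hypothetical dictionary; a real one would be a full word list.
    sample_dictionary = {"function", "get", "total", "price",
                         "items", "return", "length"}
    clean = "function getTotalPrice(items) { return items.length; }"
    obfuscated = "var _0x3f2a = _0x1b9c(0x1f);"
    print(meaningful_word_ratio(clean, sample_dictionary))
    print(meaningful_word_ratio(obfuscated, sample_dictionary))
    print(distinct_ngram_ratio(clean, n=2))
```

Values like these would then be placed in a feature vector alongside other syntactic and lexical features and passed to a binary classifier; which classifiers and which additional features the thesis uses is not stated in the abstract.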
Description
Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2023
Subject
computer security, computer science, control
