Obfuscated JavaScript detection using syntactically and lexically enhanced machine learning

dc.contributor.advisor: Sandıkkaya, Mehmet Tahir
dc.contributor.author: Kılıç, Eren
dc.contributor.authorID: 866235
dc.contributor.department: Department of Computer Engineering
dc.date.accessioned: 2024-12-16T07:55:13Z
dc.date.available: 2024-12-16T07:55:13Z
dc.date.issued: 2023
dc.description: Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2023
dc.description.abstract: Web-based attacks have been a critical security concern for the past few decades. Since JavaScript has been the most widely used programming language in web application development for years, JavaScript attacks have become increasingly popular among malicious actors. These attacks can have serious consequences, such as unauthorized access, theft of personal information, data exposure, financial damage, and service disruption. Attackers frequently apply obfuscation techniques to modify and obscure their malicious source code, both to make it harder to understand and to evade detection by intrusion prevention and detection systems. This makes obfuscated JavaScript source code potentially harmful and makes obfuscation detection a critical task that security systems should support. Identifying obfuscated JavaScript source code is difficult because intruders employ numerous obfuscation techniques. This thesis gives a literature review and background information on JavaScript attacks, obfuscation, obfuscation techniques, obfuscation detection, machine learning, and natural language processing. The existing obfuscation detection methods, including static and dynamic analysis, are reviewed with their advantages and limitations. Moreover, a novel machine learning model built on syntactic and lexical analysis features is proposed. This approach presents two novel features that benefit from natural language processing techniques and builds on the model discussed in previous work. The first feature is the ratio of meaningful words from natural languages such as English to the total number of words in the script.
Because clean coding principles encourage descriptive, easy-to-follow names for variables and functions, non-obfuscated JavaScript source code is likely to contain more meaningful words from real-world languages. Thus, this feature can help to distinguish obfuscated from non-obfuscated JavaScript. The second feature is the ratio of n-gram words in the script. N-grams are contiguous sequences of n items (in this study, characters or tokens of the JavaScript code). By analysing the ratio of n-gram words, the structure and composition of the code can be measured quantitatively to compare obfuscated JavaScript source code against non-obfuscated code. Finally, the effectiveness of the proposed model is evaluated and compared with the model from the previous work. For this purpose, a labelled data set of obfuscated and non-obfuscated real-world JavaScript source code is preprocessed and used. The proposed and previous models are trained with various binary classification algorithms.
dc.description.degree: M.Sc.
dc.identifier.uri: http://hdl.handle.net/11527/25804
dc.language.iso: en
dc.publisher: Graduate School
dc.sdg.type: Goal 9: Industry, Innovation and Infrastructure
dc.subject: computer security
dc.subject: computer science
dc.subject: control
dc.title: Obfuscated JavaScript detection using syntactically and lexically enhanced machine learning
dc.title.alternative: Perdelenmiş JavaScript kodlarının sözdizimsel ve anlamsal yönden iyileştirilmiş makina öğrenmesi ile tespiti
dc.type: Master Thesis
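
The two lexical features named in the abstract can be illustrated with a rough Python sketch. This is not the thesis implementation: the abstract does not give exact definitions, so everything here is an assumption made for illustration. That includes the tiny `SAMPLE_VOCABULARY` word list (a stand-in for a real dictionary), the identifier-splitting rules, the fact that language keywords are counted as ordinary words, and the reading of the "n-gram ratio" as distinct character n-grams divided by all character n-grams.

```python
import re

# Hypothetical stand-in for a natural-language dictionary; a real word list
# would be far larger. Language keywords are deliberately not included.
SAMPLE_VOCABULARY = frozenset(
    {"user", "name", "total", "price", "get", "count", "item", "list"}
)

# Identifier-like runs that do not start in the middle of a number or name.
IDENTIFIER_RE = re.compile(r"(?<![A-Za-z0-9_$])[A-Za-z_$][A-Za-z0-9_$]*")
# Word parts inside an identifier: "getUserName" -> get, User, Name.
WORD_PART_RE = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])")


def split_words(script):
    """Split every identifier-like token into lowercase word parts."""
    words = []
    for identifier in IDENTIFIER_RE.findall(script):
        words.extend(part.lower() for part in WORD_PART_RE.findall(identifier))
    return words


def meaningful_word_ratio(script, vocabulary=SAMPLE_VOCABULARY):
    """Share of word parts found in the vocabulary (0.0 for an empty script)."""
    words = split_words(script)
    if not words:
        return 0.0
    return sum(1 for word in words if word in vocabulary) / len(words)


def char_ngrams(text, n):
    """All contiguous character sequences of length n, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def distinct_ngram_ratio(text, n=2):
    """Assumed n-gram feature: distinct n-grams divided by all n-grams."""
    grams = char_ngrams(text, n)
    if not grams:
        return 0.0
    return len(set(grams)) / len(grams)


readable = "function getUserName(user) { return user.name; }"
obfuscated = "var _0x3f2a = _0x1b4c(0x1f);"
print(meaningful_word_ratio(readable), meaningful_word_ratio(obfuscated))
print(distinct_ngram_ratio(readable), distinct_ngram_ratio(obfuscated))
```

Under these assumptions, each script yields two numbers that could be appended to a feature vector and passed to any binary classifier. How far the values differ between obfuscated and non-obfuscated code depends heavily on the vocabulary and on the choice of n.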

Files

Original bundle

Name: 866235.pdf
Size: 1.23 MB
Format: Adobe Portable Document Format
