Comparative analysis of large language models in translating natural language into SQL and visualization code
Files
Date
2025
Authors
Uçar, Baykal Mehmet
Journal Title
Journal ISSN
Volume Title
Publisher
İTÜ Lisansüstü Eğitim Enstitüsü
Abstract
Facilitating access to data and democratizing database interaction for non-technical users are of great importance in today's information technology landscape. In this context, Large Language Models (LLMs), which can translate natural language queries into structured SQL statements and visualization code in Python, offer significant contributions to traditional data analysis processes. In this thesis, an AI-assisted system that performs natural language data querying and visualization end to end was designed, and the performance of different large language models on this system was analyzed comparatively. The study evaluated the OpenAI GPT-4o, Anthropic Claude 3.5 Sonnet, Meta LLaMA 3.1 70B, Google Gemini 1.5 Flash, and DeepSeek Chat models. Using Microsoft's open-source Semantic Kernel software development kit, a modular and extensible system architecture was developed in which natural language queries are converted into SQL and Python code. The system processes the user's natural language input together with a schema description to produce executable SQL and Python code, then runs this code and presents the results to the user. Model performance was measured both through human evaluation and through LLM-based automatic evaluation, in which the best-performing model served as the judge. According to the human evaluation, Claude 3.5 Sonnet achieved the highest accuracy, while GPT-4o and Gemini Flash also produced strong results, particularly in SQL generation. In the LLM-based evaluation, quality differences in Python visualization code became more pronounced. The LLaMA and DeepSeek models delivered competitive SQL outputs but received lower scores in Python code generation. By comprehensively comparing the text-to-code capabilities of different large language models, this thesis aims to guide model selection and system architecture design.
In addition, the Semantic Kernel-based approach provides a flexible infrastructure that enables rapid integration of new models, offering a solution that keeps pace with the continuously evolving LLM ecosystem.
Facilitating efficient access to structured data and democratizing interaction with databases for non-technical users have become essential objectives in the modern data-driven world. As organizations increasingly rely on real-time insights for critical decision-making, the traditional barriers posed by technical requirements—particularly SQL proficiency—create friction in the analytics pipeline. Typically, data access involves either training non-technical users in structured query languages or relying heavily on specialized personnel, both of which incur significant time and resource costs. These bottlenecks hinder organizational agility and reduce the overall responsiveness of business units. With the emergence of Large Language Models (LLMs), new paradigms are being established in human-computer interaction. LLMs, trained on vast corpora of natural and formal language, have demonstrated the ability to interpret user queries in everyday language and translate them into executable SQL code and visualization scripts. This advancement has sparked growing interest in the application of LLMs to tasks that were previously constrained by technical knowledge barriers, such as business intelligence, exploratory data analysis, and interactive reporting. This thesis presents a comprehensive investigation into the effectiveness of five prominent LLMs in the task of converting natural language queries into SQL and Python-based visualization code. The selected models—OpenAI GPT-4o, Anthropic Claude 3.5 Sonnet, Meta LLaMA 3.1 70B, Google Gemini 1.5 Flash, and DeepSeek Chat—represent a diverse range of development philosophies, architectural configurations, and access modalities. These models include both proprietary (closed-weight) systems and publicly available open-weight alternatives, allowing a holistic comparison across capabilities, usability, and performance dimensions. 
To support the evaluation, a robust and modular pipeline was developed using Microsoft's open-source Semantic Kernel SDK. The architecture is deliberately designed to be LLM-agnostic, allowing seamless integration and benchmarking of multiple models within a single orchestration environment. The system performs the following sequence: (i) receives a user query in natural language, (ii) combines it with database schema context, (iii) constructs prompt templates, (iv) generates SQL and Python code via selected LLMs, (v) parses and validates the outputs, and (vi) executes the code to produce visual analytics. Furthermore, automatic error handling and correction loops are incorporated to ensure stability and practical applicability in real-world usage scenarios. The evaluation phase is structured around two complementary methodologies. First, a human evaluation was conducted, where domain experts reviewed the outputs using a five-tier rubric measuring accuracy, readability, and ease of correction. Second, an LLM-based evaluation approach was employed, in which the top-performing model (Claude 3.5 Sonnet) was used as a consistent and scalable automated evaluator. This approach not only mitigates subjectivity in human scoring but also provides a reusable framework for future benchmarking efforts. The results of both evaluation strategies reveal notable insights into model strengths and weaknesses. Claude 3.5 Sonnet emerged as the most capable model across both SQL generation and Python visualization tasks, demonstrating high fidelity to user intent and producing well-structured, executable code with minimal errors. GPT-4o and Gemini 1.5 Flash also performed strongly in SQL generation tasks but struggled with Python visualizations, likely due to the additional structural and semantic complexity involved in code formatting and chart specification. 
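The six-step pipeline above can be sketched in miniature. The sketch below is illustrative only and makes several assumptions not taken from the thesis: a toy `sales` schema, a hypothetical `PROMPT_TEMPLATE`, and a stub `fake_llm` callable standing in for a real model client; validation is reduced to a SELECT-only check rather than the thesis's full error-correction loop.

```python
# Minimal, assumption-laden sketch of an LLM-agnostic text-to-SQL pipeline.
# Names (SCHEMA, PROMPT_TEMPLATE, fake_llm) are hypothetical illustrations.
import re
import sqlite3

SCHEMA = "CREATE TABLE sales(region TEXT, amount REAL);"
PROMPT_TEMPLATE = (
    "Given the schema:\n{schema}\n"
    "Write one SQL query answering: {question}\n"
    "Return only the SQL inside a ```sql fence."
)

def build_prompt(question: str) -> str:
    # Steps (ii)-(iii): combine the user query with schema context.
    return PROMPT_TEMPLATE.format(schema=SCHEMA, question=question)

def extract_sql(llm_output: str) -> str:
    # Step (v): parse the fenced SQL block and apply a basic safety check.
    match = re.search(r"```sql\s*(.*?)```", llm_output, re.DOTALL)
    if not match:
        raise ValueError("no SQL block found in model output")
    sql = match.group(1).strip()
    if not sql.lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return sql

def run_pipeline(question: str, llm) -> list:
    # Steps (i)-(vi) end to end; `llm` is any callable str -> str,
    # so different models can be swapped in (the LLM-agnostic design).
    sql = extract_sql(llm(build_prompt(question)))
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO sales VALUES (?, ?)",
                     [("EU", 10.0), ("US", 25.0)])
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows

# Stub standing in for GPT-4o / Claude 3.5 Sonnet / etc.
def fake_llm(prompt: str) -> str:
    return "```sql\nSELECT region, SUM(amount) FROM sales GROUP BY region\n```"

rows = run_pipeline("total sales per region", fake_llm)
```

Because the model is injected as a plain callable, benchmarking another LLM only means passing a different client function, which mirrors the single-orchestration-environment design described above.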
LLaMA 3.1 70B, despite being an open-weight model, showed remarkable competitiveness in SQL-related tasks, underscoring the increasing parity between open and closed LLMs. However, both LLaMA and DeepSeek Chat showed limitations in visualization code generation, indicating room for improvement in translating abstract natural language queries into detailed and syntactically correct multi-line Python scripts. In addition to performance metrics, the study also examined token-level cost implications for each model under standardized query workloads. The analysis revealed significant differences in API usage costs, with GPT-4o offering high accuracy at a premium price point, while models like Gemini Flash and LLaMA 3.1 provided more cost-effective solutions with marginal trade-offs in quality. These findings offer important considerations for organizations balancing accuracy, scalability, and budget constraints in selecting LLMs for production environments. This thesis contributes to the academic and applied fields of artificial intelligence, data engineering, and human-computer interaction in several meaningful ways:
- Benchmarking Framework: It introduces a structured and reproducible evaluation framework for natural language to SQL and visualization code transformation, which can be adopted or extended by future research efforts.
- Model Comparison: It offers a side-by-side performance comparison of state-of-the-art LLMs, revealing subtle differences in their reasoning capabilities, contextual understanding, and output structure.
- Reusable Infrastructure: The LLM-agnostic pipeline built on Semantic Kernel provides a scalable foundation that can accommodate new models as they emerge, enabling continuous innovation without re-engineering the system architecture.
Looking ahead, the study identifies several promising directions for future work.
One such area is the orchestration of hybrid pipelines, where multiple LLMs are assigned to different subtasks based on their relative strengths—e.g., delegating SQL generation to one model and visualization to another. This task-level specialization has the potential to yield improved overall accuracy and reduce error propagation. Another promising avenue lies in dynamic prompt engineering, where system-generated schema explanations and refined user queries are used to enhance model interpretability and precision. Furthermore, efforts to improve the explainability and transparency of LLM-generated code—such as justifying each line of code—may enhance trust and facilitate broader adoption in enterprise environments. In conclusion, this research provides an in-depth assessment of the current capabilities and limitations of LLMs in facilitating natural language access to structured data. By delivering both a practical tool and an empirical analysis, the thesis addresses a pressing need in the field of AI-assisted data analysis and contributes to the broader goal of democratizing data access across organizational boundaries.
Description
Thesis (M.Sc.) -- İstanbul Teknik Üniversitesi, Graduate School, 2025
Keywords
human-artificial intelligence interaction