URL Shield: Detecting Malicious URLs Using Machine Learning Techniques
Keywords:
Malicious URL Detection · Machine Learning · Gradient Boosting · Feature Engineering · Phishing Detection · Cybersecurity · Flask · URL Classification · Ensemble Learning · Random Forest · SVM · Neural Network · SQLite
Abstract
The proliferation of cyber threats through malicious URLs has become one of the most significant challenges in
cybersecurity, with phishing attacks alone causing over $10.3 billion in losses globally in 2022. This paper presents
URLShield, a comprehensive machine learning-based web application that classifies URLs as Legitimate or Malicious
using 28 engineered features extracted directly from URL strings, without loading the target page, querying DNS,
or using WHOIS data, and achieves sub-100 ms classification latency. The system implements a comparative analysis
of eight diverse machine learning algorithms: Logistic Regression, Decision Tree, Random Forest, K-Nearest
Neighbors (KNN), Support Vector Machine (SVM), Naive Bayes, Gradient Boosting, and Multi-Layer Perceptron
Neural Network, trained on a balanced synthetic dataset of 10,000 URLs (5,000 legitimate and 5,000 malicious) with
intentional 8% label noise to simulate real-world classification ambiguity. Gradient Boosting achieves the highest
accuracy (92.35%, with 93.10% recall) and serves as the production model for URLShield. The 28 features span four categories:
character count features (13), binary flag features (7), structural features (5), and ratio features (3), capturing URL
length, special characters, HTTPS usage, IP-based addresses, URL shorteners, suspicious TLDs, and phishing
keyword counts. The Flask web application provides user authentication with PBKDF2-SHA256 password hashing, real-time URL
prediction with confidence scores, prediction history stored in SQLite, a 12-chart exploratory data analysis (EDA) gallery,
an interactive Chart.js model comparison dashboard, role-based admin access, and Docker deployment. The system demonstrates
that feature-only URL analysis achieves over 92% accuracy on this synthetic dataset, competitive with content-based
approaches at a reported 100× lower latency and without the security risk of loading the target page.
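
As an illustration of string-only feature extraction, the sketch below computes a small subset of features from the four categories named in the abstract (counts, binary flags, structural, ratios). It is not the authors' implementation: the function name `extract_features`, the feature names, and the shortener, TLD, and keyword lists are all assumptions chosen for the example, and only 12 of the 28 features are approximated.

```python
import re
from urllib.parse import urlparse

# Hypothetical lookup lists; the abstract does not enumerate the real ones.
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}
SUSPICIOUS_TLDS = {"tk", "ml", "ga", "cf", "gq", "xyz", "top"}
PHISHING_KEYWORDS = ("login", "verify", "secure", "account", "update")

IPV4_RE = re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")


def extract_features(url: str) -> dict:
    """Compute lexical features from the URL string alone (no network access)."""
    # Add a scheme if missing so urlparse can identify the host.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = (parsed.hostname or "").lower()
    lowered = url.lower()
    length = len(url)
    digits = sum(c.isdigit() for c in url)
    is_ip = bool(IPV4_RE.match(host))

    return {
        # Character count features
        "url_length": length,
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": digits,
        # Binary flag features
        "uses_https": int(parsed.scheme == "https"),
        "has_ip_host": int(is_ip),
        "is_shortener": int(host in SHORTENERS),
        "suspicious_tld": int(not is_ip and host.rsplit(".", 1)[-1] in SUSPICIOUS_TLDS),
        # Structural features
        "num_subdomains": 0 if is_ip else max(host.count(".") - 1, 0),
        "path_depth": len([seg for seg in parsed.path.split("/") if seg]),
        "phishing_keyword_count": sum(lowered.count(k) for k in PHISHING_KEYWORDS),
        # Ratio features
        "digit_ratio": digits / length if length else 0.0,
    }


if __name__ == "__main__":
    for name, value in extract_features(
        "http://192.168.0.1/secure-login/verify?id=123"
    ).items():
        print(f"{name}: {value}")
```

Because every feature is derived from the string itself, extraction involves no page load, DNS query, or WHOIS lookup, which is consistent with the low-latency design described above. The resulting dictionary could be turned into a fixed-order vector and passed to any of the eight classifiers.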
License
Copyright (c) 2026 Authors

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.










