Anomaly detection in computer system logs using semi-supervised learning and natural language processing
Abstract
Detecting anomalies in computer system logs is crucial for maintaining reliable technological infrastructure. This study introduces an approach that combines semi-supervised learning with natural language processing (NLP) to analyze log files and identify potential system failures early. The methodology uses a specialized log parser based on semantic graphs together with context-independent embedding models for text vectorization, and it targets collective anomalies rather than point anomalies. Experiments were conducted on the public HDFS dataset and on a proprietary Vertica database dataset containing over 830 million log entries. The results show that the resulting solution, based on autoencoders with convolutional layers, can effectively detect system anomalies when paired with appropriate preprocessing. On the HDFS dataset the approach performed well, particularly with TF-IDF token weighting, reaching a Fault Detection Rate of 0.982 and a ROC AUC of 0.811. On the Vertica dataset it identified anomalous periods preceding system failures. The findings indicate that predictive maintenance approaches traditionally applied to technical equipment can be adapted to computer systems, enabling intervention before critical failures occur and potentially reducing the significant costs of system downtime.
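As a rough illustration of the TF-IDF token weighting mentioned in the abstract, the sketch below turns a window of log tokens into a single vector by taking a TF-IDF weighted average of per-token embeddings. It is a minimal sketch under stated assumptions, not the method of the paper: the function names `idf_table` and `window_vector`, the smoothed IDF formula, the toy token windows, and the two-dimensional embedding table are all hypothetical and chosen only to keep the example small.

```python
import math
from collections import Counter


def idf_table(windows):
    """Inverse document frequency per token; each window counts as one document.

    Assumption: idf = log(N / df) + 1, one common smoothed variant.
    """
    n = len(windows)
    df = Counter(tok for window in windows for tok in set(window))
    return {tok: math.log(n / count) + 1.0 for tok, count in df.items()}


def window_vector(window, embeddings, idf):
    """TF-IDF weighted average of the token embeddings in one log window."""
    tf = Counter(window)
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in tf.items():
        if tok not in embeddings:
            continue  # skip tokens without an embedding
        weight = (count / len(window)) * idf.get(tok, 1.0)
        weight_sum += weight
        for i, x in enumerate(embeddings[tok]):
            total[i] += weight * x
    return [x / weight_sum for x in total] if weight_sum else total


# Toy data (hypothetical): three windows of already-parsed log tokens.
windows = [
    ["receiving", "block", "receiving", "block"],
    ["receiving", "block", "exception", "block"],
    ["receiving", "block", "block", "block"],
]
# Hypothetical context-independent embeddings, one fixed vector per token.
embeddings = {
    "receiving": [1.0, 0.0],
    "block": [0.0, 1.0],
    "exception": [1.0, 1.0],
}

idf = idf_table(windows)
vectors = [window_vector(w, embeddings, idf) for w in windows]
```

In this toy setup the rare token `exception` receives a higher IDF than the ubiquitous `block`, so it contributes more to the vector of the window that contains it. Vectors of this kind could then be fed to a reconstruction model such as an autoencoder, with windows that reconstruct poorly flagged as candidate anomalies.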
Keywords
Edition
Proceedings of the Institute for System Programming, vol. 38, issue 3, part 2, 2026, pp. 133-148
ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).
DOI: 10.15514/ISPRAS-2026-38(3)-25
For citation