Anomaly detection in computer system logs using semi-supervised learning and natural language processing


Kiriachek V.A. (RUDN, Moscow, Russia)
Salpagarov S.I. (RUDN, Moscow, Russia)

Abstract

Detecting anomalies in computer system logs is crucial for maintaining reliable technological infrastructure. This study introduces an approach that combines semi-supervised learning with natural language processing to analyze log files and identify potential system failures early. The methodology employs a specialized log parser based on semantic graphs together with context-independent embedding models for text vectorization, and it focuses on collective rather than point anomalies. Experiments were conducted on the public HDFS dataset and on a proprietary Vertica database dataset containing over 830 million log entries. The results show that the resulting solution, based on autoencoders with convolutional layers, can effectively detect system anomalies when paired with appropriate preprocessing techniques. On the HDFS dataset the approach performed well, particularly with TF-IDF token weighting, which yielded a Fault Detection Rate of 0.982 and a ROC AUC of 0.811. Testing on the Vertica dataset additionally identified anomalous periods preceding system failures. The findings indicate that predictive maintenance approaches traditionally applied to technical equipment can be adapted to computer systems, enabling proactive intervention before critical failures occur and potentially reducing the significant costs of system downtime.
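The abstract does not give the exact TF-IDF formula used for token weighting, so the following is only a minimal sketch of the textbook variant (relative term frequency multiplied by the logarithm of inverse document frequency), applied to windows of log tokens. The function name `tfidf_weights` and the HDFS-like sample tokens are hypothetical and serve only as an illustration; they are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_weights(windows):
    """Return one {token: weight} dict per window of log tokens.

    Assumption: weight = (count / window length) * ln(N / df), the
    textbook TF-IDF form; the paper may use a different variant.
    """
    n = len(windows)
    # Document frequency: number of windows in which each token occurs.
    df = Counter(tok for w in windows for tok in set(w))
    weights = []
    for w in windows:
        tf = Counter(w)
        total = len(w)
        weights.append({
            tok: (count / total) * math.log(n / df[tok])
            for tok, count in tf.items()
        })
    return weights


# Hypothetical token windows, loosely modelled on HDFS-style messages.
windows = [
    ["receiving", "block", "src", "dest"],
    ["receiving", "block", "src", "dest"],
    ["exception", "block", "timeout"],
]
weights = tfidf_weights(windows)
```

In this sketch a token that occurs in every window, such as "block", receives zero weight, while a token confined to a single window, such as "exception", receives the largest weight. This is the property that makes such weighting useful for emphasizing unusual log content before the vectors are passed to a downstream model.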

Keywords

anomaly detection; log analysis; semi-supervised learning; natural language processing; predictive maintenance; TF-IDF vectorization.

Edition

Proceedings of the Institute for System Programming, vol. 38, issue 3, part 2, 2026, pp. 133-148

ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).

DOI: 10.15514/ISPRAS-2026-38(3)-25

For citation

Kiriachek V.A., Salpagarov S.I. Anomaly detection in computer system logs using semi-supervised learning and natural language processing. Proceedings of the Institute for System Programming, vol. 38, issue 3, part 2, 2026, pp. 133-148. DOI: 10.15514/ISPRAS-2026-38(3)-25.
