Software

Devops

Alibaba Cloud boosts failure prediction with logfile timestamps

Machine learning helps, but more data catches more faults - so Chinese champ has shared its data


Alibaba Cloud has revealed homebrew tech it used to improve server fault prediction and detection, which it claims detects problems ten percent more effectively than comparable tools.

The Chinese cloud champ's claims emerged last week in a paper [PDF] presented at the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

The document points out that reliability is a major selling point for public clouds, making failure prediction an important capability. Log files, the authors observe, contain plenty of info on "exceptions" to normal performance that indicate potential problems. The authors opine that existing log-based failure prediction tools lean heavily on machine learning and deep learning, while a more obvious indicator – the timestamp – isn't paid the attention it is due.

Here's the thinking, in a nutshell:

The time interval lengths between successive exceptions often reflect the urgency and severity of the anomalies. For instance, a server with 1,000 "machine check exceptions" in three days may not fail, but a server with 1,000 such exceptions in five minutes tends to fail. Therefore, effective failure prediction must adequately make use of the exception timestamp information.

Alibaba Cloud therefore created its own tool called Time-Aware Attention-Based Transformer (TAAT) to analyze timestamp info.
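The timing signal the paper describes can be illustrated with a toy heuristic. To be clear, this is not Alibaba's code: the function names, the 1,000-exception count, and the five-minute window are hypothetical values lifted from the paper's own example.

```python
from datetime import datetime, timedelta

def exception_intervals(timestamps):
    """Return the gaps (in seconds) between successive exception timestamps."""
    ts = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]

def looks_urgent(timestamps, count=1000, window=timedelta(minutes=5)):
    """Hypothetical heuristic: flag a server if `count` exceptions land
    within any span of `window` length. Per the paper's example, 1,000
    machine check exceptions in three days may be benign, but the same
    count in five minutes suggests imminent failure."""
    ts = sorted(timestamps)
    for i in range(len(ts) - count + 1):
        if ts[i + count - 1] - ts[i] <= window:
            return True
    return False
```

A simple threshold like this obviously isn't TAAT – the paper feeds interval information into a transformer's attention mechanism – but it shows why raw exception counts alone lose the urgency signal that timestamps carry.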

TAAT doesn't entirely ignore ML tools. Instead, it uses Bidirectional Encoder Representations from Transformers (BERT) – a language model developed by Google that represents text as vectors and has previously been used to predict server failures. The paper asserts, however, that BERT hasn't been tuned to make full use of log timestamps.

Alibaba's tool therefore relies on BERT for some failure analysis and compares that with TAAT's analysis of logfile timestamps. The paper contains a lot of math describing exactly how Alibaba analyzes log info, but the bottom line was apparently a ten percent improvement in fault predictions – and presumably slightly more reliable cloudy IaaS.

Alibaba's boffins think TAAT's output is also useful because it doesn't need expert analysis – meaning folks familiar with cloudy crashes aren't needed to help as often. It's already in production at Alibaba Cloud.

TAAT appears not to be available for download. But Alibaba Cloud has posted a colossal dataset comprising "∼2.7 billion syslogs from ∼300,000 servers in a four-month period of the real productional system of Alibaba Cloud" to help researchers consider how to develop log sampling strategies of their own to inform future failure prediction efforts.

The authors have also posted a video outlining TAAT's operation. ®
