Software

Devops

Alibaba Cloud boosts failure prediction with logfile timestamps

Machine learning helps, but more data catches more faults - so Chinese champ has shared its data


Alibaba Cloud has revealed homebrew tech it used to improve server fault prediction and detection, which it claims detects problems ten percent more effectively than comparable tools.

The Chinese cloud champ's claims emerged last week in a paper [PDF] presented at the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

The document points out that reliability is a major selling point for public clouds, making failure prediction an important capability. Log files, the authors observe, contain plenty of info on "exceptions" to normal performance that indicate potential problems. The authors opine that existing log-based failure prediction tools lean heavily on machine learning and deep learning, while a more obvious indicator – the timestamp – isn't paid the attention it is due.

Here's the thinking, in a nutshell:

The time interval lengths between successive exceptions often reflect the urgency and severity of the anomalies. For instance, a server with 1,000 "machine check exceptions" in three days may not fail, but a server with 1,000 such exceptions in five minutes tends to fail. Therefore, effective failure prediction must adequately make use of the exception timestamp information.

Alibaba Cloud therefore created its own tool called Time-Aware Attention-Based Transformer (TAAT) to analyze timestamp info.
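The timing signal the paper describes can be illustrated with a toy heuristic. To be clear, this is not Alibaba's code: the function names, the 1,000-exception count, and the five-minute window are hypothetical values lifted from the paper's own example.

```python
from datetime import datetime, timedelta

def exception_intervals(timestamps):
    """Return the gaps (in seconds) between successive exception timestamps."""
    ts = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]

def looks_urgent(timestamps, count=1000, window=timedelta(minutes=5)):
    """Hypothetical heuristic: flag a server if `count` exceptions land
    within any span of `window` length. Per the paper's example, 1,000
    machine check exceptions in three days may be benign, but the same
    count in five minutes suggests imminent failure."""
    ts = sorted(timestamps)
    for i in range(len(ts) - count + 1):
        if ts[i + count - 1] - ts[i] <= window:
            return True
    return False
```

A simple threshold like this obviously isn't TAAT – the paper feeds interval information into a transformer's attention mechanism – but it shows why raw exception counts alone lose the urgency signal that timestamps carry.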

TAAT doesn't entirely ignore ML tools. Instead, it uses Bidirectional Encoder Representations from Transformers (BERT) – a language model developed by Google that represents text as vectors and has previously been used to predict server failures. The paper asserts, however, that BERT hasn't been tuned to make full use of log timestamps.

Alibaba's tool therefore relies on BERT for some failure analysis and compares that with TAAT's analysis of logfile timestamps. The paper contains a lot of math describing exactly how Alibaba analyzes log info, but the bottom line was apparently a ten percent improvement in fault predictions – and presumably slightly more reliable cloudy IaaS.

Alibaba's boffins think TAAT's output is also useful because it doesn't need expert analysis – meaning folks familiar with cloudy crashes aren't needed to help as often. It's already in production at Alibaba Cloud.

TAAT appears not to be available for download. But Alibaba Cloud has posted a colossal dataset comprising "∼2.7 billion syslogs from ∼300,000 servers in a four-month period of the real productional system of Alibaba Cloud" to help researchers consider how to develop log sampling strategies of their own to inform future failure prediction efforts.

The authors have also posted a video outlining TAAT's operation. ®
