Tuesday, 3 July 2012

Linux DDoS on Power Systems?

Imagine receiving a notice from your monitoring systems that your facility is suddenly drawing a megawatt more electricity for no obvious reason... enough to make the average person freak out completely, but unfortunately that is exactly what happened this past weekend when an additional leap second was added...

Part of the notice we received from Hetzner - "During the night of 30.06.2012 to 01.07.2012 our internal monitoring systems registered an increase in the level of IT power usage by approximately one megawatt.

The reason for this huge surge is the additional switched leap second which can lead to permanent CPU load on Linux servers."

Apparently a bug in the Linux kernel's hrtimer code (including version 3.3) meant the system time was not handled correctly when the leap second was added. This in turn caused an infinite loop on many systems, which pushed CPU utilization up to 100% and affected power distribution systems in a huge way!

I can't help but wonder how many facilities would be able to handle the sudden impact of such a bug without going offline... The bug also seems to be the reason why Amazon EC2, Reddit, Mozilla, Gawker and more went down over the weekend...

Setting the date seems to fix the problem, with rebooting the affected server as the last resort...
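
For anyone who wants to script the "set the date" step, here is a minimal sketch in Python that simply writes the current time back to the system clock. It assumes a Unix system, Python 3.3 or newer, and root privileges; the function name `reset_system_clock` and its `get_time` / `set_time` parameters are my own, added only so the logic can be tried without actually touching the clock. I have not verified it against every affected kernel, so treat it as an illustration rather than a guaranteed fix.

```python
import time


def reset_system_clock(get_time=time.time, set_time=None):
    """Write the current wall-clock time back to the system clock.

    get_time and set_time are hypothetical hooks (not part of any
    library) so the function can be exercised without root.
    """
    if set_time is None:
        # time.clock_settime exists on Unix only and needs root privileges.
        def set_time(timestamp):
            time.clock_settime(time.CLOCK_REALTIME, timestamp)

    now = get_time()   # read the current time as seconds since the epoch
    set_time(now)      # set the clock to that same value
    return now


# Usage on an affected server (as root):
#     reset_system_clock()
```

If this makes no difference on a particular machine, rebooting remains the fallback.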

Well done to Hetzner for staying online and notifying customers of the problem.
