Salesforce is still working to fully restore services to clients after a serious outage that began late last night. Initially, their entire portfolio of customer relationship management services was inaccessible to users, but most services were restored within a few hours.
As a major player in the software-as-a-service industry, Salesforce is clearly focused on keeping their services up and running, but that doesn’t mean that there is never any unplanned downtime. For example, we blogged last year about both:
Whilst the level of downtime is most likely much lower than most of us can achieve when hosting systems on premises, we still need to recognise the risk and be prepared to respond as and when a disruption occurs.
Update 19th May…
At a briefing on Wednesday 19th May, Salesforce attempted to explain the outage purely as the result of errors by a single engineer. This instantly brings to mind BA’s massive data centre outage in May 2017, which the company blamed on mistakes by an individual.
This attribution of incidents to human error goes against decades of research (and common sense), and calls into question whether any real learning is taking place in these organisations.
All the evidence suggests that, if we are to prevent similar incidents in future, we must look beyond the immediate trigger event (which may indeed have been human in origin) to the underlying technological and cultural issues that allowed such an event to escalate in the way that it did.