Call Interrupted: How to Stop VoIP Outages in Their Tracks

One customer support call perfectly demonstrates how Corvil Analytics can be the difference between uptime and downtime. A multinational firm was running VoIP across several continents and using Corvil to monitor VoIP performance in its regional data centers. Its voice infrastructure used kit from multiple vendors and depended on interoperability between various combinations of software, servers and devices. After one supplier carried out an upgrade, problems began to appear, leading to a major outage and users complaining that they couldn’t make calls.

Before and after insights

The customer had support engineers from multiple vendors searching server logs for the cause, while the customer’s voice operations team checked their Corvil dashboards. The vendors formed a couple of theories but struggled to prove or disprove them. While looking at our dashboards, the voice operations team saw an anomaly around SIP messaging and contacted me to help them dig deeper.

Drilling down into the data collected on their Corvil appliances, we discovered that the problem centered on SIP SUBSCRIBE messages sent from one vendor’s devices to another vendor’s server. We could see a huge increase in the number of these messages, and analyzing them showed that they were looping around the infrastructure without reaching their destination, consuming vital server resources and preventing calls from being processed correctly.
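To make the idea of looping SUBSCRIBE messages concrete, here is a minimal sketch of one way such a loop could be spotted in captured SIP traffic. It is not how Corvil Analytics works internally. It assumes the messages are already available as raw text, and it uses a simple heuristic: the same SUBSCRIBE (same Call-ID and CSeq) seen many times with different Max-Forwards values is probably being forwarded in circles. The function names and the `min_sightings` threshold are illustrative assumptions, and compact header forms are ignored for brevity.

```python
from collections import defaultdict


def parse_sip_request(raw):
    """Return (method, headers) for a SIP request, or None for anything else."""
    lines = raw.replace("\r\n", "\n").split("\n")
    parts = lines[0].split()
    # A request line looks like "METHOD request-uri SIP/2.0";
    # responses start with "SIP/2.0" instead and are skipped.
    if len(parts) != 3 or parts[2] != "SIP/2.0":
        return None
    headers = defaultdict(list)
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header section
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()].append(value.strip())
    return parts[0].upper(), headers


def find_looping_subscribes(raw_messages, min_sightings=5):
    """Flag SUBSCRIBE requests seen repeatedly at different hop counts.

    Heuristic only (an assumption, not a vendor algorithm): a capture taken
    at several points in the network can legitimately see one request more
    than once, so min_sightings should be tuned to the environment.
    """
    sightings = defaultdict(list)
    for raw in raw_messages:
        parsed = parse_sip_request(raw)
        if parsed is None:
            continue
        method, headers = parsed
        if method != "SUBSCRIBE":
            continue
        call_id = headers.get("call-id", [""])[0]
        cseq = headers.get("cseq", [""])[0]
        max_forwards = headers.get("max-forwards", [""])[0]
        hops = int(max_forwards) if max_forwards.isdigit() else None
        sightings[(call_id, cseq)].append(hops)

    suspects = {}
    for key, hops in sightings.items():
        distinct = {h for h in hops if h is not None}
        # Many sightings with a changing Max-Forwards suggests the request
        # keeps being forwarded rather than delivered.
        if len(hops) >= min_sightings and len(distinct) > 1:
            suspects[key] = {
                "sightings": len(hops),
                "lowest_max_forwards": min(distinct),
            }
    return suspects
```

Feeding the function the SIP payloads extracted from a pcap would return a dictionary keyed by (Call-ID, CSeq) for each suspected loop, which is the kind of evidence that can be handed to a vendor alongside the capture itself.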

We could filter and pivot the data to see the messages in the days leading up to the event and compare them with what happened during the ongoing outage. This live and retrospective analysis was the key to understanding the issue.

We could see the servers being impacted and could provide pcaps to the vendors as proof and for further analysis.

I was lucky enough to be on the multi-vendor call, where I heard one of the vendors say how great the customer's analytics were, meaning the data provided by Corvil Analytics. Bringing visibility to complex environments is just one of our strengths. We don’t just identify issues that other tools miss; we provide the information needed to make sure they don’t happen again.

Long-term solution

In this case, the subscribe feature was disabled and the way the servers receive messages was improved. At the same time, the client added a new graph to their dashboard and set up a new alert that would tell them immediately if a similar problem occurred.
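As an illustration of the kind of alert described here, the sketch below derives a threshold from SUBSCRIBE counts seen in the days before an incident and flags any later interval that exceeds it. This is a generic baseline-plus-deviation approach, not Corvil's alerting configuration. The per-interval counts, the function names, and the `sigmas` and `min_floor` parameters are all assumptions made for the example.

```python
from statistics import mean, pstdev


def subscribe_alert_threshold(baseline_counts, sigmas=3.0, min_floor=10.0):
    """Derive an alert threshold from normal per-interval SUBSCRIBE counts.

    baseline_counts: counts per interval (for example, per minute) taken
    from a period known to be healthy. min_floor stops a very quiet
    baseline from producing a threshold that fires on trivial traffic.
    """
    if not baseline_counts:
        raise ValueError("baseline_counts must not be empty")
    threshold = mean(baseline_counts) + sigmas * pstdev(baseline_counts)
    return max(threshold, min_floor)


def intervals_over_threshold(current_counts, threshold):
    """Return the indexes of intervals whose count exceeds the threshold."""
    return [i for i, count in enumerate(current_counts) if count > threshold]


# Example with made-up numbers: a steady baseline, then a sudden surge.
baseline = [100, 110, 90, 105, 95]
threshold = subscribe_alert_threshold(baseline)
breaches = intervals_over_threshold([98, 120, 400, 950], threshold)
```

With these made-up numbers the third and fourth intervals breach the threshold, while the ordinary variation in the first two does not. In practice the baseline window and the sensitivity would need tuning against real traffic.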

This type of incident is far from unique. Not long after, the same client solved a second looping issue, this one caused by audio codec incompatibilities with one of their voice systems. The new Corvil alert they had set up caught the problem early; previously it would have gone unnoticed until it caused a major issue. Changes were quickly made to handle these calls, and a potential second outage was avoided.

As organizations invest in collaboration suites, multi-vendor services, and hybrid cloud infrastructure, they are finding themselves with increasingly complex communication platforms. The reality is that legacy voice service assurance tools need help to identify and solve technical issues. At Pico, that’s what we do.