Kurento WebRTC Connection Fails in ~30% of Cases


I have spent days hunting down a connection problem without any luck. I'm trying to implement a relatively simple one2one Call with Kurento.

Below you will find a debug log of Kurento of a case where the connection could be established and a case where the connection failed.

If you need any more logs (e.g. of the clients, the signalling server, tcpdumps, or trace logs of Kurento), just let me know and I will provide them!

Any help or new input is greatly appreciated!

Description of the Problem:

In about 30% of cases the WebRTC connection cannot be established. Unfortunately I can't find any kind of pattern for when the connection succeeds and when it fails; it seems completely random. I'm on the same network, using the same devices, the same TURN server, and the same signalling protocol, yet in roughly 30% of cases the connection cannot be established.

When I run the application locally, it works much more reliably: the connection can be established almost 100% of the time (or maybe even 100% of the time, I have tested so many times I've lost track). I set up the infrastructure locally with Docker and run the different containers (TURN, Kurento, signalling) in separate networks to mimic a production deployment.

We experience the same behavior in our development and production environments. In our development environment we have absolutely no firewalls in place, so that doesn't seem to be the problem.

What I have tried to find the cause of the Problem:

Mostly I have been comparing logs of cases that worked and cases that didn't work but I have failed to find any significant difference between them that could point me to the problem.

I have tested the WebRTC connection over the TURN server (with Firefox and the force_relay flag) and directly against Kurento; in both setups the connection fails roughly 30% of the time.
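
For completeness, relay-only behaviour can also be forced in code instead of via the browser flag; here is a minimal sketch (the TURN URL and credentials are placeholders, not our real setup):

// Sketch: force the browser to only gather and use relay (TURN) candidates.
// The TURN URL and credentials below are placeholders.
const pc = new RTCPeerConnection({
  iceServers: [{
    urls: 'turn:turn.example.com:3478',
    username: 'user',
    credential: 'secret'
  }],
  iceTransportPolicy: 'relay' // equivalent to forcing relay in the browser
});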

I have tried filtering all ICE candidates that are not Relay candidates.
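
A sketch of what that filtering looks like on the client (sendToSignalling is just a placeholder name, not our actual signalling call):

// Sketch: only forward locally gathered candidates of type 'relay' to the
// signalling server; everything else is dropped.
const pc = new RTCPeerConnection({ iceServers: [/* STUN/TURN entries */] });
pc.onicecandidate = (event) => {
  if (!event.candidate) return;                   // null marks the end of gathering
  if (event.candidate.type !== 'relay') return;   // drop host/srflx/prflx candidates
  sendToSignalling({ id: 'onIceCandidate', candidate: event.candidate }); // placeholder
};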

I have sniffed traffic between our signalling server (which also controls Kurento) and Kurento to look for any difference in the JSON-RPC messages exchanged, but they appear to be essentially the same.

I have tested our STUN and TURN server using this tool: https://webrtc.github.io/samples/src/content/peerconnection/trickle-ice/ and I get both server-reflexive and relay candidates that look correct.
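
The same check can also be scripted, in case that is useful for reproducing it; a minimal sketch of gathering candidates against a given ICE server configuration:

// Sketch: gather candidates for a given ICE server config and log their types,
// similar to what the trickle-ice sample page displays.
async function probeIceServers(iceServers) {
  const pc = new RTCPeerConnection({ iceServers });
  pc.createDataChannel('probe');                 // need at least one m-line to start gathering
  pc.onicecandidate = (e) => {
    if (e.candidate) console.log(e.candidate.type, e.candidate.candidate);
  };
  await pc.setLocalDescription(await pc.createOffer());
}
// Example: probeIceServers([{ urls: 'stun:stun.l.google.com:19302' }]);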

I have sniffed the traffic from the clients of a successful and an unsuccessful connection but could not spot any significant difference

I have simplified the Kurento media pipeline (no recording, no Hubs) but the behavior is the same

I have used different clients (Chrome, Firefox and a native iOS implementation) but the behavior is the same

Kurento debug logs of a case where the connection could be established:

https://gist.github.com/omnibrain/2bc7ad54f626d278d3c8bac29767ac4c

Kurento debug logs of a case where the connection could NOT be established:

https://gist.github.com/omnibrain/f7caee04a5c6d77ea22a9ccfa95dd825

There are 2 answers

Answer from J. Longman:

Looking at your traces, your working case selects candidate 10 and then candidate 7, while the non-working case only ever selects candidate 10.

kurento_logs_webrtc_working.txt

New candidate pair selected, local: 'candidate:10 1 UDP 335544831 10.10.36.134 50589 typ relay raddr 172.19.0.2 rport 9', remote: 'candidate:3993072200 1 UDP 41885695 10.10.36.134 53894 typ relay', stream_id: '1', component_id: 1
...
New candidate pair selected, local: 'candidate:7 1 UDP 1677722111 10.10.36.131 46842 typ srflx raddr 172.19.0.2 rport 46842', remote: 'candidate:266015763 1 UDP 2122260223 10.10.1.57 55125 typ host', stream_id: '1', component_id: 1

kurento_logs_webrtc_NOT_working.txt

new candidate pair selected, local: 'candidate:10 1 UDP 335544831 10.10.36.134 51280 typ relay raddr 172.19.0.2 rport 9', remote: 'candidate:3993072200 1 UDP 41885695 10.10.36.134 51287 typ relay', stream_id: '1', component_id: 1

My first thought was that you were re-using old candidates, but the ports have changed. Changing browsers might change the candidate numbers; I didn't expect them to be deterministic between runs, so I had to look twice.

There's one minor difference in the logs: in the non-working case, IceComponentStateChanged changes to connecting after candidate:266015763 appears rather than before. I don't know if that's significant.

General notes:

In the past we've had a couple of categories of problems:

  • Client side, we were losing ICE candidates: some of the candidates were sent before we were ready, so we needed to queue them (see the sketch after this list). IIRC there can also be candidates in the SDP offer (or answer, sorry, it's been a while), so your listeners need to be ready before you start.
  • We were sending OLD candidates: the wrong replay element was used in the signalling server to offer candidates to clients, and these were no longer valid.
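
A minimal sketch of the queueing idea from the first bullet (the names here are illustrative, not from your code):

// Sketch: buffer remote ICE candidates that arrive before the remote
// description is set, then flush them once it is.
const pendingCandidates = [];

function onRemoteCandidate(pc, candidate) {
  if (pc.remoteDescription && pc.remoteDescription.type) {
    pc.addIceCandidate(candidate).catch(console.error);
  } else {
    pendingCandidates.push(candidate);   // not ready yet, queue it
  }
}

async function onRemoteAnswer(pc, answer) {
  await pc.setRemoteDescription(answer);
  while (pendingCandidates.length) {
    await pc.addIceCandidate(pendingCandidates.shift());
  }
}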

I'd recommend using Chrome with chrome://webrtc-internals to help. ICE candidate problems are visible in webrtc-internals because you can see the state machine walking through its states. There were many more transitions in our working cases than in the broken ones.

Adding client-side listeners for these three events is helpful too:

this.peer.peerConnection.oniceconnectionstatechange = this.logloglog.bind(this, this.peer.peerConnection);
this.peer.peerConnection.onicegatheringstatechange = this.logloglog.bind(this, this.peer.peerConnection);
this.peer.peerConnection.onsignalingstatechange = this.logloglog.bind(this, this.peer.peerConnection);

This lets you see how the negotiation is going, though it's basically the same information as chrome://webrtc-internals.
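
The logloglog helper isn't shown above; something along these lines is all it needs to be (a guess at the shape, not the original code):

// Possible shape of the logloglog helper: dump all three state machines
// whenever any of them changes.
function logloglog(pc, event) {
  console.log(`[${event.type}]`,
    'iceConnectionState:', pc.iceConnectionState,
    'iceGatheringState:', pc.iceGatheringState,
    'signalingState:', pc.signalingState);
}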

Final note: this is what I used in the logging section of /etc/default/kurento-media-server:


# ICE debug logging: uncomment to enable in KMS and in the 3rd-party library 'libnice'
# - Note: This can get very verbose, log size will explode in the long term
#export GST_DEBUG="$GST_DEBUG,kmsiceniceagent:5,kmswebrtcsession:5,webrtcendpoint:4"
export G_MESSAGES_DEBUG="libnice,libnice-stun"
export NICE_DEBUG="$G_MESSAGES_DEBUG"

I don't remember whether these were better than what you used, but I'll throw them out there.

Answer from omnibrain (the original poster):

After days of debugging and almost going insane we finally found the cause of our problem:

We use the Swift Socket.IO client together with the Java Netty Socket.IO server implementation. The iOS client communicates with the server via long-polling. It turns out that the Netty Socket.IO server URL-decodes the long-polling payload, but the Swift Socket.IO client doesn't actually URL-encode it. As a result, every "+" sent from the Swift client was replaced by " " (a space) on the server. Why is this a problem? Because the client's SDP offer contains a ufrag, which can contain a plus sign! So whenever the SDP contained a '+', it was replaced with a space on the server, which caused the STUN pings to fail because the message integrity could not be validated.
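
To illustrate the failure mode (the ufrag value below is made up): the a=ice-ufrag attribute in the SDP may legally contain '+', and form-style URL decoding turns '+' into a space, so the ufrag the peers agreed on no longer matches what the STUN messages are authenticated with.

// Illustration only: what the mismatched decoding did to the SDP.
const sdpFragment = 'a=ice-ufrag:Xy+3kQ9a';     // made-up ufrag containing '+'

// The Swift client sent the payload as-is; the server applied
// application/x-www-form-urlencoded decoding, where '+' means space:
const decodedOnServer = sdpFragment.replace(/\+/g, ' ');
console.log(decodedOnServer);                   // 'a=ice-ufrag:Xy 3kQ9a' -> STUN message integrity fails

// One possible client-side workaround (not necessarily the fix we chose):
// percent-encode the payload so the server's decoding becomes a no-op.
const safePayload = encodeURIComponent(sdpFragment);

Presumably that also explains the seemingly random ~30% failure rate: only the sessions whose ufrag happened to contain a '+' were affected.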