Comment 15 for bug 1317811

Stéphan Kochen (stephank) wrote:

I can't comment on the driver implementation details, but I can give some further details about our experience.

The app in question was a second-screen app for the Dutch public broadcasting network for the Eurovision Song Contest. The app was live for the two semi-finals on Tuesday the 6th and Thursday the 8th, as well as the final on Saturday the 10th. Load was lowest on the Thursday, when the Netherlands did not perform, and highest on Saturday during the final. We ran c3.large instances for all shows.

We first noticed the issue during the first run on Tuesday.

Shortly before the second run on Thursday, we identified the high MTU setting as a possible cause and changed it to 1500 on half of the machines in our redundant setup. There was a clear difference in connection stability between the two halves.

For the third run on Saturday, we had all machines on the normal MTU of 1500, having adjusted our startup scripts to force the setting (see the sketch below). We had zero connection issues that night and clean kernel logs, even though that night saw the highest network load of all three.
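For reference, the startup-script override amounts to something like the following. This is a minimal sketch, assuming the primary interface is eth0; adjust the interface name to match the setup:

    # Force the MTU back to 1500 on the primary interface
    ip link set dev eth0 mtu 1500

    # Verify the setting took effect
    ip link show eth0

On Ubuntu, the same could presumably also be done with a post-up line in the relevant /etc/network/interfaces stanza, so the setting survives interface restarts.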

We also have several m1.small instances running 24/7, and these have clean kernel logs, but their network load is quite low. The MTU on these has always been left untouched and is the normal 1500, apparently by default.

In the instance type list, EC2 shows Compute Optimized instances as having Enhanced Networking. Even though we don't qualify for it, perhaps the networking setup is different for these instance types; a quick way to check which driver is actually in use is sketched below. https://aws.amazon.com/ec2/instance-types/
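Whether an instance is really using the enhanced networking driver can be checked from inside the guest. A quick check, assuming ethtool is installed:

    # Shows the kernel driver bound to the interface:
    # the Xen paravirtual NIC typically reports "vif",
    # while enhanced networking (SR-IOV) reports "ixgbevf"
    ethtool -i eth0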

As for a custom kernel, we'd have to look into deploying it, or into reproducing the issue on a smaller test setup. I'd prefer the latter, because we may be able to reproduce it between just two instances using stress tools, along the lines of the sketch below.
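A minimal sketch of such a reproduction attempt, assuming two instances and iperf3 as the stress tool (any comparable traffic generator should do):

    # On instance A, start a server:
    iperf3 -s

    # On instance B, with the suspect MTU of 9001 still set,
    # push sustained traffic over several parallel streams:
    iperf3 -c <instance-A-private-IP> -P 8 -t 600

    # Meanwhile, watch the kernel log on both instances for
    # the errors we saw during the shows:
    tail -f /var/log/kern.log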