In this release we have adapted our emergency rescue automation so that it better decides when to continue performing automated rescue steps, instead of concluding that the server is up again and leaving it be. When a node is down, we alert the on-call technician and also automatically try to detect various known issues and fix them programmatically, for example by detecting known bot patterns and deploying NGINX rules to fend off an attack, or by restarting certain services when processes get stuck. This system is separate from our on-call monitoring and alerting, and it uses a different, less strict heuristic to determine when to continue with these rescue attempts.

Today we deployed a change to this system that lets us better detect flapping nodes as ‘still down’ and in need of automated rescue (for example by restarting the services). Previously we would exit the rescue process early as soon as the app showed any sign of life. Now we look for a pattern that indicates that the app is up and has stayed up for a longer period, so that rapidly flapping web shops are not marked as fixed prematurely.
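To illustrate the idea, here is a minimal sketch of such a check in Python. It is not our actual implementation: the function names, the list of boolean health check results and the default streak of five checks are all assumptions made for this example.

```python
from typing import Sequence


def is_stably_up(checks: Sequence[bool], required_streak: int = 5) -> bool:
    """Return True only if the most recent `required_streak` checks all succeeded.

    `checks` is a chronological list of health check results (True = up).
    Both the name and the default streak length are hypothetical.
    """
    if required_streak <= 0:
        raise ValueError("required_streak must be positive")
    if len(checks) < required_streak:
        # Not enough history yet to call the node stable.
        return False
    return all(checks[-required_streak:])


def should_continue_rescue(checks: Sequence[bool], required_streak: int = 5) -> bool:
    """Keep rescuing until the node has stayed up for a full streak."""
    return not is_stably_up(checks, required_streak)


# A single sign of life no longer ends the rescue process:
flapping = [False, True, False, True, True]
# A node that came back and stayed up does:
recovered = [False, True, True, True, True, True]

print(should_continue_rescue(flapping))
print(should_continue_rescue(recovered))
```

With the old behaviour, the second check in the flapping example would already have ended the rescue process. Requiring an uninterrupted streak means a node has to prove that it stays up first.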

In case of downtime there is a balance in deciding when to act. If a node is down for a short time due to a heavy server-level process such as a periodic import, the services shouldn’t be restarted too early to ‘fix’ the server, because that would be counterproductive (the process would be interrupted, the caches would be empty, etc.). On the other hand, when a server is flapping between an up and a down state, our automation should see that as an issue and act accordingly, as prolonged flapping often does indicate a problem.
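One way to tell these two situations apart is to count how often the node changes state within an observation window. The Python sketch below is only an illustration under assumptions: the function names, the labels and the threshold of three transitions are hypothetical and do not describe our production heuristic.

```python
from typing import Sequence


def count_transitions(checks: Sequence[bool]) -> int:
    """Count how many times consecutive health checks differ (up <-> down)."""
    return sum(1 for prev, cur in zip(checks, checks[1:]) if prev != cur)


def classify_window(checks: Sequence[bool], flap_threshold: int = 3) -> str:
    """Label a window of health checks (True = up).

    Returns "up", "flapping" or "single_outage". The labels and the
    default threshold are assumptions made for this example.
    """
    if not checks:
        raise ValueError("checks must not be empty")
    if all(checks):
        return "up"
    if count_transitions(checks) >= flap_threshold:
        return "flapping"
    return "single_outage"


# One dip, e.g. during a heavy periodic import: two transitions.
print(classify_window([True, False, False, False, True]))
# Repeated up/down changes: four transitions.
print(classify_window([True, False, True, False, True]))
```

In a scheme like this, a "single_outage" window could be given some time to recover on its own, while a "flapping" window would trigger the rescue steps.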

In this release we have also changed our NGINX configuration to block certain requests to moxieplayer.swf. This is an older mitigation for known issues that were patched long ago in SUPEE-10752, but we figured it would be good to include it in our core NGINX configuration anyway. For more information see this Magento StackExchange comment or inspect the configuration in /etc/nginx/security.conf.

Before:

After:

That change will be deployed on all Hypernodes over the course of this week.