For all the technogeeks out there, here's what happened:
A Redis server (which acts as our search index) stopped. This sort of thing happens every now and again. It should have restarted automatically, but for some reason it didn’t.
We started it up again manually, and that all seemed to go fine.
The main symptom of the index disappearing was timed-out search requests, because the Redis driver holds requests whilst it attempts to reconnect under the surface.
Unfortunately (in this case) our apps/clients have a bunch of retry logic that kicks in when requests time out. This meant that everyone’s searches were being retried over and over again.
Once the index came back online the requests could go through, but there were so many of them that the Redis server was reduced to a crawl. This then caused the requests to time out, and be retried…
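As a rough illustration of that feedback loop, here is a toy Python model. The function name, the numbers, and the assumption that an overloaded server only manages a tenth of its normal throughput are all invented for the sketch; none of them are measurements from our systems.

```python
def simulate_backlog(backlog, capacity, arrivals, ticks):
    """Toy model of a retry storm (all numbers are hypothetical).

    Each tick, `arrivals` new searches join the backlog. Anything the
    server does not get to times out and is retried on the next tick,
    so it simply stays in the backlog.
    """
    history = []
    for _ in range(ticks):
        backlog += arrivals
        # Assumption: an overloaded server slows to a crawl and only
        # manages a tenth of its normal throughput.
        served = capacity if backlog <= capacity else capacity // 10
        backlog -= min(backlog, served)
        history.append(backlog)
    return history


# Healthy server, no backlog: it keeps up with new searches.
healthy = simulate_backlog(backlog=0, capacity=100, arrivals=50, ticks=3)

# Same server coming back online to a pile of retried searches.
swamped = simulate_backlog(backlog=1000, capacity=100, arrivals=50, ticks=3)
```

In this toy model the healthy run stays at zero, while the swamped run grows every tick even though the server is up, which is roughly the position we were in.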
To stop this endless cycle of doom, we put in some code to return a straight-up error to every search request without hitting the Redis server. This allowed the Redis server to sort itself out, and stopped the clients from continuing their retries.
After leaving that in place for 10 minutes we took it out and allowed searches to work properly.
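For the curious, the temporary fix amounted to a kill switch in front of the index. Below is a minimal sketch of the idea in Python, not our actual code: `SearchService`, `SearchUnavailable` and `kill_switch` are hypothetical names, and `query_index` stands in for whatever really talks to Redis.

```python
class SearchUnavailable(Exception):
    """Returned to the client immediately instead of querying the index."""


class SearchService:
    def __init__(self, query_index):
        # query_index: any callable that performs the real index lookup
        # (hypothetical stand-in for the Redis-backed search).
        self._query_index = query_index
        self.kill_switch = False

    def search(self, term):
        if self.kill_switch:
            # Fail fast without touching the index. The client sees an
            # error rather than a timeout, so its timeout-driven retry
            # logic is not triggered.
            raise SearchUnavailable("search is temporarily disabled")
        return self._query_index(term)
```

Flipping `kill_switch` to `True` sheds all search load at once; setting it back to `False` restores normal behaviour once the index has recovered.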
Posted May 18, 2017 - 12:34 BST
OK - no relapses
Posted May 18, 2017 - 12:28 BST
We think we've cracked it, and searches should now be working again. We'll keep an eye on it for a bit though just to make sure.
Posted May 18, 2017 - 12:13 BST
Search across our websites and apps is currently unavailable. We've identified the problem and are trying to get it fixed ASAP.