FullStory drinks from a fire hose of data. When you handle as many requests (billions a day!) as FullStory does, you need to be able to process them quickly. You also need to be resilient to spikes in traffic. There are myriad reasons why traffic and load may increase unexpectedly on your system. This traffic may be malicious, but very often it is completely valid. In either case, the end result could behave like a Denial of Service (DOS) attack: your application is flooded with traffic which may contain unusual data that it is not expecting. At FullStory, we’ve designed our system to mitigate these DOS-like traffic spikes using a scalable key value store along with a novel way of scheduling work to batch-process requests.
In our old system, scheduling the processing of a web page was triggered in part by timestamp data supplied by the client. Our data processors would combine whatever data had been accumulated in the key value store with any previously recorded data, save to a file, and delete the key-value entry. This approach ensured that results were timely, only required data contained in the request, and usually ensured we only performed expensive processing periodically for each page.
I say usually because this approach was susceptible to a condition that could cause heavy data processing to be performed unnecessarily since the scheduling of the data processor was dependent on data from the client. While trusting data coming from the client always comes with risks, it’s a critical part of our business. We came to rely on it for scheduling because repeatedly checking our database to determine whether data should be processed wasn’t really a viable option at FullStory scale.
Normally this reliance on the client-provided data worked out just fine, since we never return any meaningful data to the client. Occasionally though, a customer might fire up a load-test harness that had recorded heavy browser traffic, including requests to FullStory, and inadvertently send a flood of identical requests our way. Even this wasn’t usually a problem: we would record the same requests multiple times but not process them. However, on rare occasion, a request might have triggered downstream processing depending on the timestamp data that it contained. If the test harness sent many such requests, the system would dutifully queue them up all at once and clog downstream processing. While other safety systems would eventually step in and stop the incoming requests, it could take hours for the scheduled work to clear the system.
We analyzed one such event and saw that coupling expensive downstream processing to potentially unreliable client-side requests was a problem. We knew we had the capability to handle the load of incoming requests, but we still needed a way to schedule the processing in a timely manner without relying on the request itself, which can’t be trusted to not trigger a system-impacting queue backlog.
Getting on the Right Schedule
After considering a number of solutions, we hit upon our current approach: scan the contents of our key-value store to trigger processing and decouple the scheduling from the incoming request data. With the old approach, all of the duplicate requests would get recorded, and under the right circumstances ALL OF THEM would also be scheduled at the same time, causing massive load on the data processors. Under the new approach, we still capture all of those requests, but we only schedule the downstream work once during a full scan, regardless of the number of requests or their contents. If we want to decrease the latency of processing, we can scan the table faster. If the downstream systems can’t keep up, we can slow down the rate of scanning.
Migrating to the new system was a little tricky: we wrote to both the old system and the new system in parallel, but continued scheduling the work through the old mechanism. As testing was nearing completion, we were hit again, only this time mitigation wasn’t so simple as blocking a single IP. Requests were coming from all over and we couldn’t keep up. We threw the switch to use the new scanning scheduler, and haven’t looked back since.
Keeping your vulnerable systems isolated from untrustworthy requests is a critical part of designing for scale. There are certain conditions where this traffic is completely valid (like load testing) and your systems need to be able to handle them. By scheduling sequential reads, we were able to decouple our incoming requests from expensive processing. This protected us from a significant DOS attack and helped FullStory maintain service to our customers.