Cloud Tasks

6. Scaling

  • A high-TPS queue is one with 500 or more tasks created or dispatched per second (TPS). 
  • A high-TPS queue group is a contiguous set of queues, for example [queue0001, queue0002, …, queue0099], that together have at least 2000 tasks created or dispatched per second. 
  • The historical TPS of a queue or group of queues is viewable using the metrics api/request_count (filtered to “CreateTask” operations) and queue/task_attempt_count (task attempts). 
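For example, the historical CreateTask volume could be pulled with the Cloud Monitoring Python client roughly as sketched below; the project ID, one-hour window, and the queue_id resource label are assumptions for illustration, and queue/task_attempt_count can be queried the same way.

```python
# Sketch: read recent Cloud Tasks request counts from Cloud Monitoring.
# Assumes google-cloud-monitoring is installed; PROJECT_ID is a placeholder.
import time
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"
client = monitoring_v3.MetricServiceClient()

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

series = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'metric.type = "cloudtasks.googleapis.com/api/request_count"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for ts in series:
    # Per-queue sample counts; divide by the sampling period to estimate TPS.
    print(ts.resource.labels.get("queue_id"), [p.value.int64_value for p in ts.points])
```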
  • High-traffic queues and queue groups are prone to two broad classes of failure: queue overload and target overload.
  • Queue overload occurs when task creation and dispatch to an individual queue or queue group increases faster than the queue infrastructure is able to adapt.
  • Target overload occurs when the rate at which tasks are being dispatched causes traffic spikes in the downstream target infrastructure. 
  • In both cases, Google recommends following a 500/50/5 pattern: when scaling beyond 500 TPS, increase traffic by no more than 50% every 5 minutes. 
  • Google does not recommend a queue with more than 1000 TPS (creates plus dispatches) as it will produce higher delivery latency than normal.
  • Queues or queue groups can become overloaded any time traffic increases suddenly; an overloaded queue shows increased task creation latency, a higher task creation error rate, and a reduced dispatch rate.
  • To defend against this, establish controls in any situation where the create or dispatch rate of a queue or queue group can spike suddenly. 
  • Google recommends a maximum of 500 operations per second to a cold queue or queue group, then increasing traffic by 50% every 5 minutes. 
  • In theory, traffic can grow to roughly 740K operations per second after 90 minutes using this ramp-up schedule. 
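As a rough check of that arithmetic, the sketch below prints the cap implied by the 500/50/5 pattern at each 5-minute step; at the 90-minute mark it comes out to about 739K operations per second.

```python
# Sketch: traffic cap allowed by the 500/50/5 ramp-up pattern over 90 minutes.
START_TPS = 500      # initial rate against a cold queue or queue group
GROWTH = 1.5         # at most +50% per step
STEP_MINUTES = 5

tps = START_TPS
for step in range(19):                      # steps 0..18 cover 0..90 minutes
    print(f"t = {step * STEP_MINUTES:3d} min   cap ~ {tps:,.0f} ops/sec")
    tps *= GROWTH
# 500 * 1.5**18 is roughly 739,000 ops/sec at t = 90 min.
```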
  • If tasks are created by an App Engine app, leverage App Engine traffic splitting to smooth traffic increases. 
  • By splitting traffic between versions, the requests that need to be rate-managed can be ramped up over time, protecting queue health. 
  • When launching a release that significantly increases traffic to a queue or queue group, gradual rollout is, again, an important mechanism for smoothing the increases.
  • Gradually roll out instances such that the initial launch does not exceed 500 total operations to the new queues, increasing by no more than 50% every 5 minutes.
  • Newly created queues are especially vulnerable. 
  • Groups of queues, for example [queue0000, queue0001, …, queue0199], are just as sensitive as single queues during the initial rollout stages. 
  • For these queues, gradual rollout is an important strategy. 
  • Launch new or updated services, which create high-TPS queues or queue groups, in stages such that initial load is below 500 TPS and increases of 50% or less are staged 5 minutes or more apart.
  • When increasing the total capacity of a queue group, for example expanding [queue0000, …, queue0199] to [queue0000, …, queue0399], follow the 500/50/5 pattern. 
  • It is important to note that, for rollout procedures, new queue groups behave no differently than individual queues. 
  • Apply the 500/50/5 pattern to the new group as a whole, not just to individual queues within the group. 
  • For these queue group expansions, gradual rollout is again an important strategy.
  • If the source of traffic is App Engine, use traffic splitting. 
  • When migrating the service to add tasks to the increased number of queues, gradually roll out instances such that the initial launch does not exceed 500 total operations to the new queues, increasing by no more than 50% every 5 minutes.
  • An existing queue group may be expanded because tasks are expected to be added to the queue group faster than the group can dispatch them.
  • If the names of the new queues are spread out evenly among the existing queue names when sorted lexicographically, then traffic can be sent to those queues immediately, as long as no more than 50% of the interleaved queues are new and the traffic to each queue is less than 500 TPS. 
  • This method is an alternative to using traffic splitting and gradual rollout.
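A minimal sketch of that alternative, assuming hypothetical queue names: new queues are named so they sort between existing ones, and the expansion is checked so that new queues make up no more than 50% of the merged set.

```python
# Sketch: interleave new queue names lexicographically among existing ones.
# The naming scheme ("-b" suffixes) is illustrative, not a Cloud Tasks convention.
existing = [f"queue{i:04d}" for i in range(200)]        # queue0000 .. queue0199
new = [f"queue{i:04d}-b" for i in range(0, 200, 2)]     # each sorts just after queue{i}

assert len(new) <= len(existing), "keep new queues to at most 50% of the merged set"

combined = sorted(existing + new)
new_set = set(new)

# Confirm the new names are spread out rather than clustered at one end.
longest_run = run = 0
for name in combined:
    run = run + 1 if name in new_set else 0
    longest_run = max(longest_run, run)
print(f"{len(new)} new queues interleaved; longest run of consecutive new queues: {longest_run}")
```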
  • When a large number of tasks, for example millions or billions, need to be added, a double-injection pattern can be useful.
  • Instead of creating tasks from a single job, use an injector queue. 
  • Each task added to the injector queue fans out and adds 100 tasks to the desired queue or queue group.
  • The injector queue can be sped up over time, for example start at 5 TPS, then increase by 50% every 5 minutes.
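A hedged sketch of the worker behind one injector task, using the google-cloud-tasks Python client; the project, location, queue names, and worker URL are placeholders.

```python
# Sketch: double-injection pattern. One injector task fans out into ~100 worker tasks.
# PROJECT, LOCATION, TARGET_QUEUES, and WORKER_URL are placeholders.
import json
from google.cloud import tasks_v2

PROJECT, LOCATION = "my-project", "us-central1"
TARGET_QUEUES = [f"queue{i:04d}" for i in range(100)]   # the real work queues
WORKER_URL = "https://worker.example.com/process"

client = tasks_v2.CloudTasksClient()

def handle_injector_task(batch_start: int, fan_out: int = 100) -> None:
    """Runs when one injector-queue task executes; creates fan_out worker tasks."""
    for i in range(batch_start, batch_start + fan_out):
        queue = TARGET_QUEUES[i % len(TARGET_QUEUES)]   # spread load across the group
        parent = client.queue_path(PROJECT, LOCATION, queue)
        task = tasks_v2.Task(
            http_request=tasks_v2.HttpRequest(
                http_method=tasks_v2.HttpMethod.POST,
                url=WORKER_URL,
                body=json.dumps({"item_id": i}).encode(),
            )
        )
        client.create_task(parent=parent, task=task)
```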
  • When a new task is created, Cloud Tasks assigns the task a unique name by default. 
  • A name can be assigned to a task using the name parameter. 
  • The name parameter introduces significant performance overhead, resulting in increased latencies and potentially increased error rates for named tasks. 
  • These costs can be magnified significantly if tasks are named sequentially, such as with timestamps.
  • If assigning your own names, use a well-distributed prefix for task names, such as a hash of the contents.
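If names must be assigned, something like the following sketch keeps them well distributed by prefixing each name with a hash of the task contents; the project, queue, payload, target URL, and suffix are placeholders.

```python
# Sketch: prefix task names with a content hash instead of a timestamp or counter.
import hashlib
from google.cloud import tasks_v2

PROJECT, LOCATION, QUEUE = "my-project", "us-central1", "queue0001"   # placeholders
client = tasks_v2.CloudTasksClient()

def distributed_task_name(payload: bytes, suffix: str) -> str:
    prefix = hashlib.sha256(payload).hexdigest()[:16]   # well-distributed prefix
    return client.task_path(PROJECT, LOCATION, QUEUE, f"{prefix}-{suffix}")

payload = b'{"order_id": 12345}'
task = tasks_v2.Task(
    name=distributed_task_name(payload, "order-update"),
    http_request=tasks_v2.HttpRequest(
        http_method=tasks_v2.HttpMethod.POST,
        url="https://worker.example.com/process",        # placeholder target
        body=payload,
    ),
)
client.create_task(parent=client.queue_path(PROJECT, LOCATION, QUEUE), task=task)
```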
  • Cloud Tasks can overload other services that you use, such as App Engine and Datastore, and drive up network usage, if dispatches from a queue increase dramatically in a short period of time.
  • If a backlog of tasks has accumulated, unpausing queues can potentially overload these services. 
  • The recommended defense is the 500/50/5 pattern suggested for queue overload.
  • If a queue dispatches more than 500 TPS, increase the traffic triggered by the queue by no more than 50% every 5 minutes.
  • Use monitoring and logging metrics to proactively monitor traffic increases. 
  • Stackdriver alerts can be used to detect potentially dangerous situations.
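One possible shape for such an alert, sketched with the Cloud Monitoring Python client: notify when the CreateTask rate stays above 500/s for five minutes. The project ID, display names, and threshold are placeholders, and notification channels are omitted.

```python
# Sketch: alert when Cloud Tasks CreateTask traffic stays above ~500/s for 5 minutes.
# Project ID, display names, and the threshold are placeholders.
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"
client = monitoring_v3.AlertPolicyServiceClient()

policy = monitoring_v3.AlertPolicy(
    display_name="Cloud Tasks high CreateTask rate",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="CreateTask rate > 500/s for 5 min",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter='metric.type = "cloudtasks.googleapis.com/api/request_count"',
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=500,
                duration={"seconds": 300},
                aggregations=[
                    monitoring_v3.Aggregation(
                        alignment_period={"seconds": 60},
                        per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
                    )
                ],
            ),
        )
    ],
)
client.create_alert_policy(name=f"projects/{PROJECT_ID}", alert_policy=policy)
```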
  • Unpausing or resuming high-TPS queues
  • When a queue or series of queues is unpaused or re-enabled, those queues resume dispatching tasks.
  • If the queue has many tasks, the newly-enabled queue’s dispatch rate could increase dramatically from 0 TPS to the full capacity of the queue. 
  • To ramp up, stagger queue resumes or control the queue dispatch rates using the Cloud Tasks maxDispatchesPerSecond setting.
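A hedged sketch of that second approach: clamp maxDispatchesPerSecond before resuming the backlogged queue, then raise the cap by 50% every 5 minutes. The project, location, queue name, and the starting and target rates are placeholders.

```python
# Sketch: resume a backlogged queue behind a low dispatch cap, then raise it gradually.
# PROJECT, LOCATION, QUEUE, and the rate values are placeholders.
import time
from google.cloud import tasks_v2
from google.protobuf import field_mask_pb2

PROJECT, LOCATION, QUEUE = "my-project", "us-central1", "queue0001"
client = tasks_v2.CloudTasksClient()
queue_name = client.queue_path(PROJECT, LOCATION, QUEUE)

def set_dispatch_cap(rate: float) -> None:
    queue = tasks_v2.Queue(
        name=queue_name,
        rate_limits=tasks_v2.RateLimits(max_dispatches_per_second=rate),
    )
    mask = field_mask_pb2.FieldMask(paths=["rate_limits.max_dispatches_per_second"])
    client.update_queue(queue=queue, update_mask=mask)

TARGET_RATE = 500.0
rate = 10.0                      # start well below the queue's full capacity
set_dispatch_cap(rate)
client.resume_queue(name=queue_name)

while rate < TARGET_RATE:        # +50% every 5 minutes, in the spirit of 500/50/5
    time.sleep(300)
    rate = min(rate * 1.5, TARGET_RATE)
    set_dispatch_cap(rate)
```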
  • Bulk scheduled tasks
  • Large numbers of tasks that are scheduled to dispatch at the same time can also introduce a risk of target overload. 
  • To start a large number of tasks at once, consider using queue rate controls to increase the dispatch rate gradually or explicitly spinning up target capacity in advance.
  • Increased fan-out
  • When updating services that are executed through Cloud Tasks, increasing the number of remote calls can create production risks.
  • Use gradual rollout or traffic splitting to manage ramp up.
  • Retries
  • Code can retry on failure when making Cloud Tasks API calls. 
  • When a significant proportion of requests are failing with server-side errors, a high rate of retries can overload queues even more and cause them to recover more slowly. 
  • Google recommends capping the amount of outgoing traffic when a client detects that a significant proportion of requests are failing with server-side errors.
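One client-side way to apply that guidance, sketched below: track the recent server-error ratio and throttle task creation when it crosses a threshold. The window size, error threshold, and pause length are arbitrary placeholders.

```python
# Sketch: cap outgoing CreateTask traffic when server-side errors spike.
# Window size, error threshold, and pause length are placeholder values.
import time
from collections import deque
from google.api_core import exceptions
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
recent = deque(maxlen=200)      # sliding window; True marks a server-side error

def create_task_throttled(parent: str, task: tasks_v2.Task) -> None:
    # Back off instead of hammering an already overloaded queue.
    if len(recent) == recent.maxlen and sum(recent) / len(recent) > 0.2:
        time.sleep(5)
    try:
        client.create_task(parent=parent, task=task)
        recent.append(False)
    except exceptions.ServerError:
        recent.append(True)
        raise
```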