Pub/Sub Performance Testing Demystified: Lessons from the Trenches

Here are the top 4 things I'd share with someone setting up performance tests for their Google Cloud Pub/Sub-based app.

I have a special respect for performance testing. Of the several things I’ve learnt to do over the past two years, this has been by far the trickiest to grasp (I’m still very much in the process of doing so). I’ve begun to think of it as a little bit of an art, besides being a big science. 🎨🧪

But performance testing a Cloud Pub/Sub-based application adds a unique layer of challenges. With documentation relatively sparse, and a few of my findings coming only after much toil, I thought it best to share, through this article, the top 4 lessons I learnt.

Here’s a quick peek:

#1 Spend time defining what you’re measuring, before the how
Understanding cloud monitoring metrics and simulating production-like traffic.
#2 Know your JMeter plugins
Choose from the open-source options available.
#3 Batching—beware!
Understanding how and why batching can impact your app’s performance.
#4 Tuning your consumer(s) is key
Balancing polling threads and intervals, and adapting pod scaling for the Pub/Sub world.

Let’s get started!

💡
Watch out for callouts like these in each section, which summarize the key takeaways and provide actionable advice.

#1 Spend time defining what you’re measuring, before the how

This is a good guideline for any performance test in general. When working with Pub/Sub—especially if it’s your first time doing so—spending time on this becomes even more important.

In the (typically) synchronous, HTTP-based world most of us have started out with, thinking of performance is fairly simple. Make a request from a client, and wait for the response. Performance can be primarily based on response times. Identifying failed transactions is straightforward too. And, when these metrics degrade, you know that’s a red flag.

Sure, this is an oversimplification, but what I mean to highlight is how different things are in the cloud-based async messaging world. Here are a few important points to keep in mind:

  1. Google Cloud Monitoring becomes an essential prerequisite

    Since you can’t glean much from the publishing app any longer (in other words, from your testing tool, which drives the traffic), you’re left with two other options: (a) custom metrics published by your subscriber app (i.e., the app whose performance you’re testing), and (b) Pub/Sub metrics on Google Cloud Monitoring.

    While you’ll likely have to consider both in unison to gain the deepest insight, the Pub/Sub metrics can probably tell you almost everything you need to know from a performance perspective. They provide plenty of signals to paint a detailed picture. Understanding each of these metrics is the first prerequisite. The second is learning to view them in totality, to see the big picture. (A minimal sketch of querying one such metric programmatically follows this list.)

    (Figure: Google Cloud Monitoring dashboard samples, illustrating Cloud Pub/Sub metrics)

    💡
    If you haven’t spent much time on the monitoring side of things, playing around with the metrics on your cloud provider’s web console and creating various ad-hoc charts/dashboards can prove an insightful activity.
  2. Simulating production-like traffic can prove challenging—but is very much necessary

    Prod-like simulations can be hard to do, and how thorough you get often depends on how far you’re willing to go and how much time you have. Here is a quick checklist from a Pub/Sub POV:

    1. Publisher: ensure the publisher configurations on your testing tool, which generates the traffic, match the configurations on your publisher in production: batching settings (this is crucial; more in section #3), retry settings like the number of attempts and backoffs, etc.

    2. Traffic simulation: Generate traffic volumes that match peak production load.

      Simulate burst traffic if applicable, to test how batching and throttling handle sudden spikes.

    3. Consumer: again, ensure the consumer setup on your app (running in your test environment, which you’re hitting with the load) matches the deployment in production—to start with. This provides an accurate reflection of your app’s capabilities as it currently stands.

      It’s also the key aspect that may need some tweaking to achieve the best performance results. Once satisfied, you would then do the reverse, applying these findings and tuned settings to your app in production. But what could this “tuning” of the consumer look like? That requires some detailed explanation, so a whole section is dedicated to it: #4.
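On the monitoring point in item 1: if you’d rather pull numbers into your own scripts or reports than eyeball dashboards, here’s a minimal sketch using the Java Cloud Monitoring client. The project and subscription names are placeholders, and `num_undelivered_messages` is just one of many Pub/Sub metrics you could query this way; treat this as an illustrative starting point, not a polished utility.

```java
import com.google.cloud.monitoring.v3.MetricServiceClient;
import com.google.monitoring.v3.ListTimeSeriesRequest;
import com.google.monitoring.v3.ProjectName;
import com.google.monitoring.v3.TimeInterval;
import com.google.protobuf.util.Timestamps;

public class BacklogCheck {
  public static void main(String[] args) throws Exception {
    String project = "my-project"; // placeholder
    long now = System.currentTimeMillis();

    // Look at the last 10 minutes of data.
    TimeInterval interval = TimeInterval.newBuilder()
        .setStartTime(Timestamps.fromMillis(now - 10 * 60 * 1000))
        .setEndTime(Timestamps.fromMillis(now))
        .build();

    // Backlog size for one subscription; swap in other pubsub.googleapis.com
    // metric types (e.g. oldest_unacked_message_age) as needed.
    String filter =
        "metric.type=\"pubsub.googleapis.com/subscription/num_undelivered_messages\""
            + " AND resource.labels.subscription_id=\"my-subscription\"";

    try (MetricServiceClient client = MetricServiceClient.create()) {
      ListTimeSeriesRequest request = ListTimeSeriesRequest.newBuilder()
          .setName(ProjectName.of(project).toString())
          .setFilter(filter)
          .setInterval(interval)
          .setView(ListTimeSeriesRequest.TimeSeriesView.FULL)
          .build();

      // Print each data point as "epoch-seconds -> backlog size".
      client.listTimeSeries(request).iterateAll().forEach(series ->
          series.getPointsList().forEach(point ->
              System.out.println(point.getInterval().getEndTime().getSeconds()
                  + " -> " + point.getValue().getInt64Value())));
    }
  }
}
```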

#2 Know your JMeter plugins

JMeter is the trusty performance testing tool a large section of the developer community turns to. This also means it’s got great open-source plugin support overall, and it’s no different when it comes to Cloud Pub/Sub.

If, like me, you’re working with Google Cloud Pub/Sub, here is a plugin I would recommend straight away—the jmeter-pubsub-sampler.

I recommend it for two reasons: first, it’s quick and easy to set up and use. Second, it has a simple, easily navigable open-source codebase. This is important in case you need to make a few crucial changes to customize it to your needs—now, or perhaps in the future. For example, the app I needed to test expected a protobuf as the payload of the incoming Pub/Sub message, which required some tweaking of the plugin.
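For a rough idea of what that tweak involves, here’s a sketch of publishing a protobuf payload with the standard Java Pub/Sub client. `MyEvent` is a hypothetical protobuf-generated class standing in for whatever schema your app expects, and the project/topic names are placeholders; the plugin’s internals differ, but the essential step (serializing the proto into the message’s data field) looks like this:

```java
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class ProtoPublishSketch {
  public static void main(String[] args) throws Exception {
    TopicName topic = TopicName.of("my-project", "my-topic"); // placeholders

    // MyEvent is a hypothetical generated protobuf class; substitute your own.
    MyEvent event = MyEvent.newBuilder()
        .setId("order-123")
        .build();

    Publisher publisher = Publisher.newBuilder(topic).build();
    try {
      // Serialize the proto and set it as the raw message payload.
      PubsubMessage message = PubsubMessage.newBuilder()
          .setData(ByteString.copyFrom(event.toByteArray()))
          .build();
      publisher.publish(message).get(); // block until this publish completes
    } finally {
      publisher.shutdown();
    }
  }
}
```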

💡
Finding the right plugin that’s easy to use and has a navigable codebase (for any customizations you may want to make) can give you a big boost.

#3 Batching—beware!

As you likely know, batching in Cloud Pub/Sub is the practice of aggregating messages before each actual call to the Pub/Sub API—sort of like carpooling IRL. When leveraged effectively, it can help you achieve good throughput without driving up costs by much. For a detailed explanation, check out this article.

But syncing the publisher in your production app with the traffic driver in your tests (the JMeter plugin) is crucial. Remember: if you’ve chosen to set up batching on your publisher application in production, you need to replicate it accurately on your testing tool. And, more importantly, vice versa—if batching is not enabled on your publisher app, ensure it’s turned off in your test setup too. I say this because, in most client libraries for Pub/Sub, batching is enabled by default when creating a Publisher.
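Here’s a sketch of what this looks like with the Java client (assuming a gax version that still uses `org.threeten.bp.Duration`; newer releases use `java.time.Duration`): either disable batching outright, or pin it to explicit values that you then mirror on the JMeter side. The names and thresholds are placeholders, not recommendations.

```java
import com.google.api.gax.batching.BatchingSettings;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.pubsub.v1.TopicName;
import org.threeten.bp.Duration;

public class BatchingConfigSketch {
  public static Publisher buildPublisher() throws Exception {
    TopicName topic = TopicName.of("my-project", "my-topic"); // placeholders

    // Option A: turn batching off entirely, so each publish() is one API call.
    BatchingSettings noBatching = BatchingSettings.newBuilder()
        .setIsEnabled(false)
        .build();

    // Option B: explicit values; a batch is flushed when ANY threshold is hit.
    BatchingSettings explicitBatching = BatchingSettings.newBuilder()
        .setElementCountThreshold(100L)           // at most 100 messages per batch
        .setRequestByteThreshold(1024L * 1024)    // or ~1 MiB of payload
        .setDelayThreshold(Duration.ofMillis(10)) // or after 10 ms, whichever first
        .build();

    return Publisher.newBuilder(topic)
        .setBatchingSettings(explicitBatching) // or noBatching
        .build();
  }
}
```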

While the reasoning might be obvious, it’s rather easy to miss, as explained above. At least, I’m guilty of this. 🙈

💡
Ensure batching and flow control settings on the publisher in your testing tool perfectly match those on your publisher app in production.

(If you’re using the jmeter-pubsub-sampler I recommended earlier: I’ve contributed a change to the plugin so that it now lets you quickly turn batching off, or set custom values for the basic batching settings, directly through the configuration files. No need to fork and create your own version for this simple change anymore.)

#4 Tuning your consumer(s) is key

This section is a bit of a mixed bag—but the objective is to tune your consumer for maximum efficiency and performance, which is, after all, the purpose of this whole exercise. Here, I’ve put together a few very important callouts.

  1. Understanding the pull operation is important if you use a pull subscription. Balancing the polling period and max message count (i.e., how often your app pulls messages from the subscription, and the maximum number of messages per pull) is crucial for efficiency. One key consideration is that each pull operation incurs network overhead and resource costs. Polling too frequently can lead to excessive network requests and context switching, reducing overall efficiency. So while you may want to check for messages often, there is such a thing as too often.

  2. If you have sharp, transient traffic spikes, consider implementing flow control to smooth things out on the consumer (I spoke of flow control on the publisher end in #3; a consumer-side sketch follows this list). You can read more about implementing flow control on your consumer here.

  3. If you use autoscaling (e.g., Kubernetes HPA or GCE autoscaling), consider tuning it specifically for Pub/Sub workloads. Metrics like the number of undelivered messages and backlog size can be useful scaling indicators. That said, it’s recommended to base your scaling signal on more than one metric.
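To make item 2 concrete, here’s a minimal sketch of consumer-side flow control with the Java client’s streaming-pull `Subscriber`. (For the synchronous pull discussed in item 1, the analogous knob is `setMaxMessages` on a `PullRequest`.) The names and limits are placeholders to tune for your own workload:

```java
import com.google.api.gax.batching.FlowControlSettings;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;

public class FlowControlSketch {
  public static Subscriber buildSubscriber() {
    ProjectSubscriptionName subscription =
        ProjectSubscriptionName.of("my-project", "my-subscription"); // placeholders

    MessageReceiver receiver = (message, consumer) -> {
      // ... process the message ...
      consumer.ack();
    };

    // Cap how much the client holds in memory at once; anything beyond these
    // limits stays in the backlog, smoothing out transient spikes instead of
    // overwhelming your pods.
    FlowControlSettings flowControl = FlowControlSettings.newBuilder()
        .setMaxOutstandingElementCount(1_000L)             // max unacked messages held
        .setMaxOutstandingRequestBytes(100L * 1024 * 1024) // or ~100 MiB, whichever first
        .build();

    return Subscriber.newBuilder(subscription, receiver)
        .setFlowControlSettings(flowControl)
        .build();
  }
}
```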


That’s a lot of Pub/Sub and performance testing for one blog. See you in the next one. Adieu! 👋