In the past year, we developed a new product from scratch that handles more than 1 million Kafka messages per second. This significant change in scale led to many production issues, some of which took a long time to resolve. In most cases, we scaled up the Kafka cluster, but server-side changes are not always the best solution. We learned that the Kafka client plays a large part in the system's optimization and performance issues. In this session, I will talk about measuring and optimizing our performance on the Kafka client side.
What happened when our most important Kafka cluster suddenly went rogue, and our attempts to recover it made things even worse? In a word: disaster. When running our high-scale Kafka clusters, we always try to ensure that Kafka behaves well under heavy traffic and unexpected cluster failures. You can imagine the challenges we faced when our main Kafka cluster went crazy. This session is the story of how we learned the hard way to mitigate cluster failures by having the proper configurations in place.
Logging vs Tracing: Why Logs Aren’t Enough to Debug Your Microservices
Being a production engineer means being a problem solver. At Taboola, this is an even greater challenge, since our systems are highly distributed and operate at very high scale. As a result, even simple remote debugging is not an option. To find an issue fast, we use effective tools, some out-of-the-box (OOB) and some built in-house, to pinpoint the actual problem and filter out the symptoms.