As a relatively new company, we are grateful for great employees we can rely on. One of them is Vincent, a backend software development engineer at AccelByte. He has been doing a great job since he joined. He load tested the IAM service that the AccelByte team built and worked on several other cool projects, but his skills show most in the telemetry system for First Strike Games (FSG).
Vincent has been helping FSG build a telemetry system that needs to withstand at least 100k requests per second (RPS). He told the team how rewarding it was to work with FSG on the project: “I learned a lot of things from this project, from the checkpoint mechanism to how to overcome the S3 PUT request limit.”
Vincent told us what he learned about the checkpoint mechanism in a distributed system. The team had decided to use an AWS Kinesis stream to carry data from the ingestor to the processor, together with the SDK that AWS provides for it, the Kinesis Client Library (KCL). The team was already well along with Golang, but it turned out that the language was not supported at that time. To solve that, Vincent and the team decided to reimplement several of the SDK’s functions in Golang.
Yes, you read that right: he used the language developed by Google to build the telemetry system. In essence, he helped create a system that automatically records data from remote sources and transmits it to IT resources for monitoring and analysis. The telemetry system is divided into two parts: the ingestor, which sends data to the stream, and the processor, which consumes data from the stream and sends it somewhere else (possibly another stream).
“I wrote those two parts using Golang. I’d never imagined Golang would help me a lot in the telemetry project. Its clear and concise syntax helped me debug more easily compared to other languages like Java, for instance, where magic happens around every corner,” Vincent added.
Go’s concurrency support helped him reach the performance target on the processor side, where consumption has to be fast enough to keep up with the ingestor’s write rate. Vincent also mentioned that Go’s documentation is clear enough to make the language easy to learn. These strengths are why Vincent chose Golang for the telemetry system.
Vincent wrote both the ingestor and the processor, and he also designed the processor architecture himself; another team member and our CTO designed the architecture for the ingestor. The processor architecture differs from KCL in implementation: KCL is implemented to listen to all shards in the stream, which our CTO thought would make horizontal scaling difficult, whereas each instance of our processor listens to a single shard, so the processors can easily be scaled horizontally. In short, 130 processor instances for 130 shards.
Everybody faces obstacles in their job, and so does Vincent. One problem he hit while building the telemetry system was the SlowDownException on AWS S3 PUT requests, which is intended as protection against intentional and unintentional overconsumption of resources. The high request rate triggered this protection mechanism, and SlowDown errors occurred. Further research led Vincent to the root of the problem: hot spots. These hot spots need to be avoided in order to scale S3 to 100k RPS, and Vincent found that introducing randomness in key prefixes is the way to avoid them. Yet the errors didn’t just disappear. He then came up with the idea of sending many events in a single PUT request instead of one PUT request per event, which reduced the number of PUT requests made to S3.
“It works! The SlowDown errors finally left me alone, and I proved through the load test that our telemetry can handle up to 100k RPS!”
Update from Vincent:
The randomness can be achieved by combining a UUID and the date, e.g. [random four characters of UUID]/[ddMMYYYY]/[UUID].json.
The key randomness might be outdated, since Amazon now offers a built-in solution that no longer requires key randomness to achieve high performance. Reference: https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/