My name is Thoriq Satriya. I would like to share my recent experience building the Lobby service for an AAA online multiplayer, multi-platform game. The Lobby service is a microservice that lets players connect with each other so that they can play in multiplayer mode. Its architecture is quite complex, mainly because it is designed to handle a large number of concurrent users. To handle that level of concurrency, Lobby spawns a large number of goroutines. There are goroutines that handle players, send and receive messages to and from players, send and receive messages between Lobby services, and collect Lobby status for the telemetry used in monitoring. Most of these goroutines need to communicate with each other, and Lobby uses channels for that.
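To illustrate the idea, here is a minimal sketch of goroutines talking over a channel. It is not the actual Lobby code: the `message` type, the `collect` function, and the message text are hypothetical names made up for this example. Each "player" goroutine sends one message on a shared channel and a single receiver gathers them.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// message is a hypothetical envelope passed between goroutines.
type message struct {
	playerID int
	body     string
}

// collect starts one goroutine per player, each sending one message on a
// shared channel, and returns the received bodies sorted for stable output.
func collect(players int) []string {
	inbox := make(chan message)
	var wg sync.WaitGroup

	for id := 1; id <= players; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			inbox <- message{playerID: id, body: fmt.Sprintf("player %d joined", id)}
		}(id)
	}

	// Close the channel once every sender is done so the range loop ends.
	go func() {
		wg.Wait()
		close(inbox)
	}()

	var bodies []string
	for m := range inbox {
		bodies = append(bodies, m.body)
	}
	sort.Strings(bodies)
	return bodies
}

func main() {
	for _, b := range collect(3) {
		fmt.Println(b)
	}
}
```

The receiver never touches the senders' state directly; everything it knows arrives through the channel, which is the pattern the rest of this post relies on.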
The Lobby service uses WebSocket, a protocol that provides a full-duplex communication channel over a single TCP connection, to keep a persistent connection between client and server with lower overhead, which makes lightweight real-time data transfer possible. As a result, it can handle a really large number of concurrent users! Keeping the connection open is what makes the interaction between client and server efficient: messages can be passed back and forth without opening a new connection for each transfer. I spent quite some time working with WebSocket, and it was one of the most interesting parts of building the Lobby.
Besides writing the service, I did the load testing too. Since our goal is to serve as many people as possible concurrently, load testing is needed to verify the service’s performance under real-life load conditions. During load testing, the hardest problem I had to solve was a concurrency bug. I didn’t see it when I tested the Lobby with a small number of concurrent connections; it only showed up when the Lobby was tested against a large number of them. Imagine there are millions of goroutines running and then a panic happens. The error logs from all the goroutines are printed and debugging becomes very hard! I needed to turn off the logging system’s rate limiter so that no important part would be missing. After that, we ran the load test again, and when the error occurred, the log file was more than 1 GB. My team and I needed to filter those logs to find the actual error message.
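One common way to keep a single panic from flooding the logs is to recover inside each goroutine and report one concise error. The sketch below shows that general Go technique, not what we actually did in the Lobby; the `safeGo` helper and the goroutine names are hypothetical.

```go
package main

import "fmt"

// safeGo runs fn in a new goroutine. If fn panics, the panic is recovered and
// reported as an error on errs; otherwise nil is sent when fn returns.
func safeGo(name string, fn func(), errs chan<- error) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				errs <- fmt.Errorf("goroutine %s panicked: %v", name, r)
				return
			}
			errs <- nil
		}()
		fn()
	}()
}

func main() {
	errs := make(chan error, 2)

	safeGo("ok", func() {}, errs)
	safeGo("bad", func() { panic("boom") }, errs)

	// Exactly one result arrives per goroutine started above.
	for i := 0; i < 2; i++ {
		if err := <-errs; err != nil {
			fmt.Println(err)
		}
	}
}
```

With this shape, the failing goroutine produces a single line that names where the panic came from, which is much easier to find than a needle in a 1 GB log file.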
I also found a couple of exciting challenges during the test setup (tuning). The first was the OS, which needed the most tuning, so that it could accept a million simultaneous connections. An OS has settings that limit how many resources can be in use at a time. Proper OS tuning improves system performance by preventing error conditions that would degrade it. While tuning the OS for the Lobby deployment, I ran into the ‘open files’ limit, which caps the number of files that can be open at a time. It also caps the number of connections, because each connection socket is treated as a file. That was one tricky case for me to solve (I enjoyed solving it, though). Another challenge in the setup was performance degradation when the team moved the test environment from AWS C4 to C5 instances. My team and I could not fully utilize the resources of the C5 instance even though we used the same tuning configuration. So I assumed that there must be some other tuning configuration for C5 instances that we just needed to find.
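To make the ‘open files’ limit concrete, here is a small sketch that reads the limit from inside a Go process on a Unix-like OS and checks it against a target connection count. The `openFileLimit` and `enoughFor` helpers and the headroom of 1024 descriptors are assumptions for illustration, not values from our deployment.

```go
package main

import (
	"fmt"
	"syscall"
)

// openFileLimit returns the soft and hard RLIMIT_NOFILE values of this process.
func openFileLimit() (soft, hard uint64, err error) {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, 0, err
	}
	return uint64(rl.Cur), uint64(rl.Max), nil
}

// enoughFor reports whether a soft limit leaves room for the target number of
// connections plus some headroom for other descriptors (logs, Redis, etc.).
func enoughFor(target, soft, headroom uint64) bool {
	return soft >= target+headroom
}

func main() {
	soft, hard, err := openFileLimit()
	if err != nil {
		fmt.Println("cannot read limit:", err)
		return
	}
	fmt.Printf("open files: soft=%d hard=%d\n", soft, hard)

	// Hypothetical target: one million sockets plus 1024 spare descriptors.
	if !enoughFor(1_000_000, soft, 1024) {
		fmt.Println("limit too low for a million connections; raise it before load testing")
	}
}
```

Since every accepted socket consumes one descriptor, a process whose soft limit is lower than the target starts failing to accept connections once it reaches that limit, well before the target.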
I learned a lot doing this project, from how distributed systems work to how to handle concurrency. Here is some of the reasoning I learned: Lobby is a microservice that should be horizontally scalable; to make that possible, Lobby services need to be able to communicate with each other; to do so, we share data between Lobby services using Redis (as an in-memory database). To meet the goal of withstanding high concurrency, I wrote the service in Golang (this was my first experience coding in Golang!). I consider the language to be very good at handling concurrency: I can use channels to synchronize goroutines without explicit locks and condition variables!
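As a small illustration of synchronizing without locks, the sketch below lets one goroutine own a counter while many others update it only by sending on a channel. The names `countOnline`, `run`, and `summary` are hypothetical and the example is far simpler than the real Lobby.

```go
package main

import "fmt"

// countOnline is the only goroutine that touches total, so no mutex is needed.
// It sums every delta received until events is closed, then reports the total.
func countOnline(events <-chan int, result chan<- int) {
	total := 0
	for delta := range events {
		total += delta
	}
	result <- total
}

// run simulates the given number of players connecting concurrently and
// returns the count computed by the owning goroutine.
func run(players int) int {
	events := make(chan int)
	result := make(chan int)
	done := make(chan struct{})

	go countOnline(events, result)

	for i := 0; i < players; i++ {
		go func() {
			events <- 1        // a player connected
			done <- struct{}{} // signal that this sender is finished
		}()
	}

	// Wait for every sender before closing the events channel.
	for i := 0; i < players; i++ {
		<-done
	}
	close(events)
	return <-result
}

// summary formats the count for display.
func summary(online int) string {
	return fmt.Sprintf("%d players online", online)
}

func main() {
	fmt.Println(summary(run(1000)))
}
```

A thousand goroutines update the same counter here, yet there is no `sync.Mutex` anywhere: the channel serializes the updates, and closing it tells the owner when to publish the result.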
This experience taught me a lot and broadened my knowledge.
The AccelByte environment played a great role too. The support was great, and everyone is willing to help each other, even people from different teams!