Last month I joined KatherineOfSky's MMO event as a player. I noticed that after we reached a certain number of players, every few minutes a bunch of them got dropped. Luckily for you (but unluckily for me), I was one of the players who got disconnected every. single. time, even though I had a decent connection. So I took the matter personally and started looking into the problem. After 3 weeks of debugging, testing and fixing, the issue is finally fixed, but the journey there was not that easy.
Multiplayer issues are very hard to track down. Usually they only happen under very specific network conditions and in very specific game conditions (in this case, having more than 200 players). Even when you can reproduce the issue, it's impossible to debug properly, since placing a breakpoint stops the game, messes up the timers and usually times out the connection. But through some perseverance and thanks to an awesome tool called clumsy, I managed to figure out what was happening.
The short version is: because of a bug and an incomplete implementation of the latency state simulation, a client would sometimes end up in a situation where it sent a network packet of about 400 entity selection input actions in one tick (what we called 'the megapacket'). The server then not only has to correctly receive those input actions, but also has to send them to everyone else. That quickly becomes a problem when you have 200 clients: it saturates the server's upload, causes packet loss and triggers a cascade of re-requested packets. Delayed input actions then cause more clients to send megapackets, cascading even further. The lucky clients manage to recover; the others end up being dropped.
The issue was quite fundamental and took 2 weeks to fix. It's quite technical, so I'll explain it in juicy technical detail below. But what you need to know is that since Version 0.17.54, released yesterday, multiplayer will be more stable and latency hiding will be much less glitchy (less rubber banding and teleporting) when experiencing temporary connection problems. I also changed how latency hiding is handled in combat, hopefully making it look a bit smoother.
The basic way our multiplayer works is that all clients simulate the game state, and they only send and receive the player input (called Input Actions). The server's main responsibility is proxying Input Actions and making sure all clients execute the same actions in the same tick. More details can be found in FFF-149.
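To make the idea concrete, here is a minimal sketch of that model, not the game's actual code. The `Client` and `Server` classes, the `receive` and `close_tick` methods and the one-number "position" per player are all made up for illustration; the only point is that the server relays the same batch of inputs to everyone, and every client applies it in the same tick.

```python
from collections import defaultdict


class Client:
    """Hypothetical client: simulates the whole state from relayed inputs."""

    def __init__(self):
        self.positions = defaultdict(int)  # player name -> 1D position
        self.tick = 0

    def apply_tick(self, actions):
        # Every client applies the same actions, in the same order,
        # in the same tick, so the states stay identical.
        for player, delta in actions:
            self.positions[player] += delta
        self.tick += 1


class Server:
    """Hypothetical server: it only collects inputs and relays them."""

    def __init__(self, clients):
        self.clients = clients
        self.pending = []

    def receive(self, player, delta):
        self.pending.append((player, delta))

    def close_tick(self):
        # The server decides which actions belong to this tick,
        # then sends the identical batch to every client.
        batch, self.pending = self.pending, []
        for client in self.clients:
            client.apply_tick(batch)


clients = [Client(), Client(), Client()]
server = Server(clients)

server.receive("alice", 2)
server.receive("bob", -1)
server.close_tick()

server.receive("alice", 3)
server.close_tick()

states = [dict(c.positions) for c in clients]
```

After two ticks, every entry in `states` is the same dictionary. If one client ever ended up with a different value, that would be the equivalent of a desync.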
Since the server needs to arbitrate when actions are executed, a player action travels something like this: Player action -> Game client -> Network -> Server -> Network -> Game client. This means every player action is only executed once it has made a round trip through the network. That would make the game feel really laggy, which is why latency hiding was added to the game almost as soon as multiplayer was introduced. Latency hiding works by simulating the player's input locally, without considering the actions of other players and without considering the server's arbitration.
In Factorio we have the Game State, which is the full state of the map, players, entities, everything. It's simulated deterministically on all clients based on the actions received from the server. This is sacred, and if it's ever different from the server or any other client, a desync occurs.
On top of the Game State we have the Latency State. This contains a small subset of the main state. The Latency State is not sacred; it just represents what we think the Game State will look like in the future, based on the Input Actions the player performed.
To do that, we keep a copy of the Input Actions we make, in a latency queue.
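A rough sketch of that idea, under heavy assumptions: the state is a single number, every local input carries a sequence number, and the server echoes that number back when the action is executed. The `PredictingClient` class and its methods are hypothetical names for illustration, not Factorio's API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PredictingClient:
    game_x: int = 0  # confirmed Game State (here just one coordinate)
    latency_queue: deque = field(default_factory=deque)  # unconfirmed (seq, dx)
    next_seq: int = 0

    def local_input(self, dx):
        """Record a local input; a real client would also send it to the server."""
        seq = self.next_seq
        self.next_seq += 1
        self.latency_queue.append((seq, dx))
        return seq

    def on_server_action(self, seq, dx):
        """The server confirmed our action: apply it to the Game State."""
        self.game_x += dx
        # Drop everything the server has already executed from the queue.
        while self.latency_queue and self.latency_queue[0][0] <= seq:
            self.latency_queue.popleft()

    def latency_x(self):
        """Latency State: the Game State plus all inputs still in flight."""
        x = self.game_x
        for _, dx in self.latency_queue:
            x += dx
        return x


client = PredictingClient()
client.local_input(1)
client.local_input(2)
predicted_before = client.latency_x()   # what the player sees right away
confirmed_before = client.game_x        # nothing confirmed yet

client.on_server_action(0, 1)           # first input completed its round trip
predicted_after = client.latency_x()
confirmed_after = client.game_x
```

The prediction stays the same before and after the confirmation, while the confirmed state catches up one action at a time. That only holds as long as the queue is maintained correctly, which is exactly where things can go wrong.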
So in the end the process, on the client side, looks something like this:
Getting complicated? Hold on to your pants. To account for the unreliable nature of internet connections, we have two mechanisms:
Now it's time to explain how our entity selection works. One of the Input Action types we send is the entity selection change, which tells everyone which entity each player has their mouse over. As you can imagine, this is by far the most common Input Action sent by the clients, so it was optimized to use as little space as possible to save bandwidth. Instead of saving absolute, high precision map coordinates, each entity selection saves a low precision offset relative to the previous selection. This works well, since a selection is usually very close to the previous one. It also creates 2 important requirements: Input Actions can never be skipped, and they need to be executed in the correct order. These requirements are met for the Game State. But since the purpose of the Latency State is to "look good enough" for the player, these requirements were not met there. The Latency State didn't account for many edge cases related to tick skipping and changes in round trip latency.
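The two requirements fall straight out of the encoding. Below is a small sketch of relative-offset encoding with made-up coordinates. The `encode` and `decode` helpers are illustrative, and the sketch ignores the low precision part entirely; it only shows what happens when an offset is skipped or applied out of order.

```python
def encode(selections, start=(0, 0)):
    """Turn absolute positions into offsets from the previous position."""
    prev = start
    deltas = []
    for x, y in selections:
        deltas.append((x - prev[0], y - prev[1]))
        prev = (x, y)
    return deltas


def decode(deltas, start=(0, 0)):
    """Rebuild absolute positions by accumulating the offsets."""
    x, y = start
    positions = []
    for dx, dy in deltas:
        x += dx
        y += dy
        positions.append((x, y))
    return positions


selections = [(10, 10), (11, 10), (11, 12), (14, 12)]
deltas = encode(selections)

in_order = decode(deltas)                       # matches the original list
skipped = decode(deltas[:1] + deltas[2:])       # second offset never applied
swapped = decode([deltas[0], deltas[2], deltas[1], deltas[3]])  # wrong order
```

Here `in_order` is identical to `selections`. With one offset skipped, every later position is shifted and the final selection ends up at `(13, 12)` instead of `(14, 12)`. With two offsets swapped, the final position happens to be right, but the intermediate selection is `(10, 12)` instead of `(11, 10)`, so for a while the wrong spot is considered selected.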
So you can probably see where this is going. Finally the issue of the megapacket started to show. The final problem was that the entity selection logic relied on the Latency State to decide if it should send a selection changed action, but the Latency State was sometimes not holding correct information. So, the megapacket got generated something like this:
Ironically, the mechanism that was supposed to save some network bandwidth ended up creating massive network packets.
In the end this was solved by fixing all the edge cases of updating and maintaining the latency queue. While this took quite some time, it was probably worth doing a proper fix instead of some quick hacks.
Clusterio is a scenario/server system that adds communication of game data between different servers. For instance sending items between different servers, so you can have a 'mining server', which will mine all the iron ore you need, and send it to the 'smelter server' which will smelt it all and pass it further along.
Last year there was an event using the system which linked over 30 servers together to reach a combined goal of 60,000 Science per minute. MangledPork has a video of the start of last year's event on YouTube.
The great minds and communities behind the last event have come together again to host another Clusterio event: The Gridlock Cluster. The goal this time is to push the limits even further, explore and colonise new nodes as they are generated, and enjoy the challenge of building a mega-factory across multiple servers.
If you are interested in participating in this community event, all the details are listed in the Reddit post, and you can join the Gridlock Cluster Discord server.
As always, let us know what you think on our forum.