This explains the vulnerability mentioned previously: a sending application cannot be certain whether its messages have been received. This is not directly a security vulnerability. It is inherent in the TCP/TLS design, and it is normally only an issue with mobile devices.
The real-world sequence of events is this:
- The application writes (sends) X bytes to the other side of a TCP connection.
- The network is disconnected.
- The application gets a "Reset" or similar error from a read/write/close operation.
Details of the system I/O functions don't matter. In all these cases, the application is left uncertain whether the X bytes were received or not. The "Reset" and related errors are an indication that something went wrong. Previously acknowledged messages might still have been in buffers, in transit, etc. Some of them might have been lost. TCP itself cannot know, because it is possible that the network disconnect prevented a TCP acknowledgement from arriving.
The application must decide whether it cares. Possible situations range from:
- Everything made it just fine and was acknowledged. This is quite common in situations where there was a long delay between the last send and a close(). Odds are that the network disconnect is indirectly related to the close and occurred long after the TCP transmissions were completed.
- Everything in the TCP window could have been lost. This can happen when the network is lost during a large transmission. TCP has all this data buffered and is trying to get it through, using timeouts, retries, etc. If the network failure is only transient, it will succeed. In this example, however, a "reset" indicates that it cannot succeed and has failed. There might be a complete TCP window's worth of buffered data that was previously accepted but did not get transmitted.
There are also a variety of possible intermediate states that the application could determine.
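The "accepted but not delivered" case can be shown in a few lines. The following is only a minimal sketch over the loopback interface, in Python; the function name `send_without_reader` and the 1024-byte payload are made up for the illustration. It shows that a successful send means only that the local TCP stack took the bytes, not that the receiving application ever saw them.

```python
import socket


def send_without_reader(payload: bytes) -> int:
    """Send payload to a peer that never reads it; return bytes accepted locally."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    sender = socket.create_connection(listener.getsockname(), timeout=5)
    receiver, _ = listener.accept()
    try:
        # send() returns as soon as the kernel has copied the bytes into its
        # send buffer. It says nothing about the receiving application.
        accepted = sender.send(payload)
    finally:
        # The receiving application goes away without ever calling recv().
        receiver.close()
        sender.close()
        listener.close()
    return accepted


if __name__ == "__main__":
    n = send_without_reader(b"x" * 1024)
    print(n, "bytes accepted locally; the peer application read none of them")
```

From the sender's point of view the write succeeded, which is exactly why a later "reset" leaves it unable to say what the other side actually processed.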
This is where an application-level ack is sometimes proposed as a solution. It's initially appealing, but it's still imperfect: the ack could be the only part of the transmission that was lost. Application acks do make sense in a number of application contexts, but they don't automatically solve the underlying problem. More protocol design will be needed, with idempotency or transaction rollback dealing with parts of the problem.
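To make the lost-ack case concrete, here is a small in-memory simulation in Python. It is a sketch under stated assumptions, not a real protocol: the `Receiver` class, the `send_with_retry` function, and the `lost_acks` parameter (the set of attempts whose ack is pretended to be lost) are all hypothetical names invented for this example. The receiver is idempotent because it remembers message ids and applies each message only once.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    applied: dict = field(default_factory=dict)  # msg_id -> body, applied once
    deliveries: int = 0                          # every arrival, duplicates included

    def deliver(self, msg_id: str, body: str) -> str:
        self.deliveries += 1
        if msg_id not in self.applied:   # a duplicate is acked but not re-applied
            self.applied[msg_id] = body
        return msg_id                    # the ack echoes the message id


def send_with_retry(receiver, msg_id, body, lost_acks, max_attempts=3) -> int:
    """Resend until an ack arrives; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        ack = receiver.deliver(msg_id, body)
        if attempt in lost_acks:
            continue  # the message arrived, but its ack did not
        if ack == msg_id:
            return attempt
    raise TimeoutError(f"no ack for {msg_id} after {max_attempts} attempts")


if __name__ == "__main__":
    r = Receiver()
    attempts = send_with_retry(r, "m1", "hello", lost_acks={1})
    print(attempts, r.deliveries, r.applied)
```

The sender cannot distinguish "message lost" from "ack lost", so it resends in both cases. The duplicate is harmless only because the receiver deduplicates by id; the ack alone would not have been enough.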
This kind of timing occurs naturally with mobile devices. The form that I see most often is a real-world situation where the user finishes a task with a mobile device, disconnects it, and moves on to the next task. Sometimes the mobile device is still transmitting results over the network when this happens. The disconnect can be a cable disconnect or simply moving the device out of WiFi range.
Asking users to wait does not work. They may understand the issue and try, but in practice it's like asking people not to stub their toe or spill their coffee. They understand and try, but it still happens. It's more realistic to design a device that survives these occasional problems.
So, when the client 'decides' that something bad happened, it should abortively reset the TCP connection. If the network is still intact, this will notify the service. If it is not intact, no harm is done, and anything that is queued can be freed quickly.
Then as you indicate, just send whatever you think didn't make it, which might be everything.
Posted by: John | January 02, 2012 at 08:26 PM
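The abortive reset suggested in the comment above can be sketched with the standard SO_LINGER socket option. This is a minimal Python illustration, not a recommendation for any particular client; the helper names `abortive_close` and `peer_sees_reset` are made up here, and the `struct.pack("ii", ...)` layout assumes a platform where `struct linger` is two C ints (Linux, macOS, the BSDs). Windows uses a different layout.

```python
import socket
import struct


def abortive_close(sock: socket.socket) -> None:
    """Close sock with a TCP RST instead of the normal FIN shutdown."""
    # l_onoff=1, l_linger=0: close() discards any queued data and sends RST.
    # Assumption: struct linger is two C ints on this platform.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()


def peer_sees_reset() -> bool:
    """Loopback demo: does the service side observe the client's reset?"""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname(), timeout=5)
    service, _ = listener.accept()
    service.settimeout(5)
    try:
        abortive_close(client)
        try:
            service.recv(1)
        except ConnectionResetError:
            return True   # the service was told the connection is gone
        return False      # an orderly close would end up here instead
    finally:
        service.close()
        listener.close()


if __name__ == "__main__":
    print("service saw reset:", peer_sees_reset())
```

On loopback the network is always "intact", so this only demonstrates the first half of the comment: the service is notified. When the network is really gone, the RST is simply lost and the benefit is the local one, namely that queued data is dropped at once.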
In rsyslog, I "solved" the problem with a dedicated protocol named RELP, but it really "solves" only extreme cases, just as you describe. Of course, even with app-level acks, you never know whether just the ACK was lost or the message itself. RELP tries to circumvent this by keeping state on both sender and receiver and doing a check for outstanding transactions when a connection is re-established. However, this state may grow, and so there may come a time when it is necessary (or at least very useful) to discard some of that state. In any case, it is a very exotic problem, and I doubt it is really a problem that must be solved under all circumstances. Idempotent messages are actually much better at solving this problem (if a solution is actually needed for the use case).
Posted by: Rainer Gerhards | March 05, 2012 at 12:14 PM