Towards Automated Post-mortem Debugging of Distributed Systems

Thursday Feb 7th, 12-1AM @ BA5205

Speaker: Xu Zhao

Diagnosing failures in production environments is notoriously difficult and extremely time-consuming. Most existing debugging tools are intrusive: they incur non-negligible performance overhead and are therefore unsuitable for failure diagnosis in production. As a result, state-of-the-art production failure diagnosis still relies heavily on the logs generated by conventional printf-debugging.

However, manual debugging with raw logs is painful, especially when the logs are of low quality or high volume. This talk will focus on how to automate this process. First, I will introduce Log20, a tool that automatically places log printing statements to improve log quality. It addresses the problem of “where to log” by letting developers choose the right balance between the usefulness of the logs and their performance overhead. Second, I will present the Flow Reconstruction Principle, a principle that developers must follow in order to reconstruct the system execution flow from the logs. I will also show how this principle is applied in the log analysis tool Stitch to diagnose real-world failures.

Xu Zhao is a Ph.D. candidate in the Department of Electrical and Computer Engineering at the University of Toronto. His research interests are in building automatic tools to enhance software logging and diagnose software failures. Before coming to Toronto, Xu earned his Bachelor’s degree in Computer Science and Engineering from Tsinghua University in 2013. Xu was awarded the Facebook Fellowship in 2018.