Log20: Fully Automated Optimal Placement of Log Printing Statements under Specified Overhead Threshold

Wednesday October 25th, 12-1PM @ BA5205

Speaker: Xu Zhao

Title:
Log20: Fully Automated Optimal Placement of Log Printing Statements under Specified Overhead Threshold

Abstract:
Complex and unforeseen failures in distributed systems must be diagnosed and replicated in a development environment so that developers can understand the underlying problem and verify the resolution. System logs often form the only source of diagnostic information, and developers reconstruct a failure using manual guesswork. This is an unpredictable and time-consuming process which can lead to costly service outages while a failure is repaired.

This paper describes Pensieve, a tool capable of reconstructing near-minimal failure reproduction steps from log files and system bytecode, without human involvement. Unlike existing solutions that use symbolic execution to search for the entire path leading to the failure, Pensieve is based on the Partial Trace Observation, which states that programmers do not simulate the entire execution to understand the failure, but follow a combination of control and data dependencies to reconstruct a simplified trace that only contains events that are likely to be relevant to the failure. Pensieve follows a set of carefully designed rules to infer a chain of causally dependent events leading to the failure symptom while aggressively skipping unrelated code paths to avoid the path-explosion overheads of symbolic execution models.

Bio:
Xu Zhao is a 3rd-year Ph.D. student at the University of Toronto, under the supervision of Prof. Ding Yuan.
His research interest lies in the area of performance of distributed systems and failure diagnosis.

His current work focuses on automated placement of logging statements and non-intrusive performance profiling for distributed systems.

The Game of Twenty Questions: Do You Know Where to Log?

Thursday May 4th, 12-1PM @ BA5205

Speaker: Xu Zhao

Title:
The Game of Twenty Questions: Do You Know Where to Log?

Abstract:
A production system’s printed logs are often the only source of runtime information available for postmortem debugging, performance profiling, security auditing, and user behavior analytics. Therefore, the quality of this data is critically important. Recent work has attempted to enhance log quality by recording additional variable values. However, logging statement placement, i.e., where to place a logging statement, is the most challenging and fundamental problem for improving log quality, and it has not been adequately addressed so far. This position paper proposes automating the placement of logging statements by measuring how much uncertainty about the software's execution can be eliminated. Guided by ideas from information theory, the authors describe a simple approach that automates logging statement placement. Preliminary results suggest that the algorithm can effectively cover, and further improve, existing logging statements placed by developers. It can compute an optimal log placement that disambiguates the entire function call path with a slowdown of only 0.218%.
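
To make the information-theoretic idea concrete, here is a minimal sketch and not the authors' algorithm. It assumes a toy model in which a program has a few known execution paths with known probabilities, and it measures the uncertainty left about which path ran once the printed logs are observed. The function name `remaining_entropy`, the block names, and the probabilities are all hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def remaining_entropy(paths, logged_blocks):
    """Return H(path | log output) in bits for a toy program model.

    paths: dict mapping a path (tuple of block names) to its probability.
    logged_blocks: set of block names assumed to contain a log statement.
    """
    # Paths that print the same sequence of log lines cannot be told apart,
    # so group path probabilities by the log output each path would produce.
    groups = defaultdict(list)
    for path, prob in paths.items():
        output = tuple(block for block in path if block in logged_blocks)
        groups[output].append(prob)

    # Sum the entropy inside each group, weighted by the group's probability.
    entropy = 0.0
    for probs in groups.values():
        total = sum(probs)
        for p in probs:
            if p > 0:
                entropy -= p * math.log2(p / total)
    return entropy


# Hypothetical program with three possible execution paths.
paths = {
    ("entry", "a", "exit"): 0.5,
    ("entry", "b", "exit"): 0.25,
    ("entry", "c", "exit"): 0.25,
}

no_logs = remaining_entropy(paths, set())
one_log = remaining_entropy(paths, {"a"})
two_logs = remaining_entropy(paths, {"a", "b"})
print(no_logs, one_log, two_logs)
```

In this toy model, a log statement in block `a` removes more uncertainty than one in `b` or `c` alone, and logging two of the three branches is enough to identify every path. A placement algorithm in this spirit would weigh that reduction against the runtime cost of each added statement.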

Bio:
Xu Zhao is a 2nd year PhD student at the University of Toronto, under the supervision of Prof. Ding Yuan. His research interests lie in the area of performance of distributed systems and failure diagnosis. His current work focuses on automated placement of logging statements and non-intrusive performance profiling for distributed systems.

Non-intrusive Performance Profiling of Entire Software Stacks based on the Flow Reconstruction Principle

Thursday October 27th, 12-1PM @ BA5205

Speaker: Xu Zhao

Title:
Non-intrusive Performance Profiling of Entire Software Stacks based on the Flow Reconstruction Principle

Abstract:
Understanding the performance behavior of distributed server stacks at scale is non-trivial. The servicing of just a single request can trigger numerous sub-requests across heterogeneous software components; and many similar requests are serviced concurrently and in parallel. When a user experiences poor performance, it is extremely difficult to identify the root cause, as well as the software components and machines that are the culprits. This work describes Stitch, a non-intrusive tool capable of profiling the performance of an entire distributed software stack solely using the unstructured logs output by heterogeneous software components. Stitch is substantially different from all prior related tools in that it is capable of constructing a system model of an entire software stack without building any domain knowledge into Stitch. Instead, it automatically reconstructs the extensive domain knowledge of the programmers who wrote the code; it does this by relying on the Flow Reconstruction Principle which states that programmers log events such that one can reliably reconstruct the execution flow a posteriori.

Bio:
Xu is a second year PhD student under Prof. Ding Yuan. His research focuses on performance failure debugging and log analysis in distributed systems.