Log20: Fully Automated Optimal Placement of Log Printing Statements under Specified Overhead Threshold

Wednesday October 25th, 12-1PM @ BA5205

Speaker: Xu Zhao

Title:
Log20: Fully Automated Optimal Placement of Log Printing Statements under Specified Overhead Threshold

Bio:
Xu Zhao is a 3rd-year Ph.D. student at the University of Toronto, under the supervision of Prof. Ding Yuan.
His research interests are the performance of distributed systems and failure diagnosis.

His current work focuses on automated placement of logging statements and non-intrusive performance profiling for distributed systems.

Pensieve: Non-Intrusive Failure Reproduction for Distributed Systems using the Event Chaining Approach

Monday October 23rd, 12-1PM @ BA5205

Speaker: Yongle Zhang

Title:
Pensieve: Non-Intrusive Failure Reproduction for Distributed Systems using the Event Chaining Approach

Abstract:
Complex and unforeseen failures in distributed systems must be diagnosed and replicated in a development environment so that developers can understand the underlying problem and verify the resolution. System logs often form the only source of diagnostic information, and developers reconstruct a failure using manual guesswork. This is an unpredictable and time-consuming process which can lead to costly service outages while a failure is repaired.

This paper describes Pensieve, a tool capable of reconstructing near-minimal failure reproduction steps from log files and system bytecode, without human involvement. Unlike existing solutions that use symbolic execution to search for the entire path leading to the failure, Pensieve is based on the Partial Trace Observation, which states that programmers do not simulate the entire execution to understand the failure, but follow a combination of control and data dependencies to reconstruct a simplified trace that only contains events that are likely to be relevant to the failure. Pensieve follows a set of carefully designed rules to infer a chain of causally dependent events leading to the failure symptom while aggressively skipping unrelated code paths to avoid the path-explosion overheads of symbolic execution models.

Bio:
Yongle Zhang is a Ph.D. student at the University of Toronto, working on computer systems and reliability with Prof. Ding Yuan. His current research focuses on system profiling and failure diagnosis.