[02/49] analyzer: internal documentation

Message ID 1573867416-55618-3-git-send-email-dmalcolm@redhat.com
State Superseded
Headers show
  • RFC: Add a static analysis framework to GCC
Related show

Commit Message

David Malcolm Nov. 16, 2019, 1:22 a.m.
	* Makefile.in (TEXI_GCCINT_FILES): Add analyzer.texi.
	* doc/analyzer.texi: New file.
	* doc/gccint.texi ("Static Analyzer") New menu item.
	(analyzer.texi): Include it.
 gcc/Makefile.in       |   2 +-
 gcc/doc/analyzer.texi | 470 ++++++++++++++++++++++++++++++++++++++++++++++++++
 gcc/doc/gccint.texi   |   2 +
 3 files changed, 473 insertions(+), 1 deletion(-)
 create mode 100644 gcc/doc/analyzer.texi



diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index c82858f..53d74e2 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -3182,7 +3182,7 @@  TEXI_GCCINT_FILES = gccint.texi gcc-common.texi gcc-vers.texi		\
 	 gnu.texi gpl_v3.texi fdl.texi contrib.texi languages.texi	\
 	 sourcebuild.texi gty.texi libgcc.texi cfg.texi tree-ssa.texi	\
 	 loop.texi generic.texi gimple.texi plugins.texi optinfo.texi   \
-	 match-and-simplify.texi ux.texi poly-int.texi
+	 match-and-simplify.texi analyzer.texi ux.texi poly-int.texi
 TEXI_GCCINSTALL_FILES = install.texi install-old.texi fdl.texi		\
 	 gcc-common.texi gcc-vers.texi
diff --git a/gcc/doc/analyzer.texi b/gcc/doc/analyzer.texi
new file mode 100644
index 0000000..8e7b26f
--- /dev/null
+++ b/gcc/doc/analyzer.texi
@@ -0,0 +1,470 @@ 
+@c Copyright (C) 2019 Free Software Foundation, Inc.
+@c This is part of the GCC manual.
+@c For copying conditions, see the file gcc.texi.
+@c Contributed by David Malcolm <dmalcolm@redhat.com>.
+@node Static Analyzer
+@chapter Static Analyzer
+@cindex analyzer
+@cindex static analysis
+@cindex static analyzer
+* Analyzer Internals::       Analyzer Internals
+* Debugging the Analyzer::   Useful debugging tips
+@end menu
+@node Analyzer Internals
+@section Analyzer Internals
+@cindex analyzer, internals
+@cindex static analyzer, internals
+@subsection Overview
+The analyzer implementation works on the gimple-SSA representation.
+(I chose this in the hopes of making it easy to work with LTO to
+do whole-program analysis).
+The implementation is read-only: it doesn't attempt to change anything,
+just emit warnings.
+First, we build a @code{supergraph} which combines the callgraph and all
+of the CFGs into a single directed graph, with both interprocedural and
+intraprocedural edges.  The nodes and edges in the supergraph are called
+``supernodes'' and ``superedges'', and often referred to in code as
+@code{snodes} and @code{sedges}.  Basic blocks in the CFGs are split at
+interprocedural calls, so there can be more than one supernode per
+basic block.  Most statements will be in just one supernode, but a call
+statement can appear in two supernodes: at the end of one for the call,
+and again at the start of another for the return.
+The supergraph can be seen using @option{-fdump-analyzer-supergraph}.
+We then build an @code{analysis_plan} which walks the callgraph to
+determine which calls might be suitable for being summarized (rather
+than fully explored) and thus in what order to explore the functions.
+Next is the heart of the analyzer: we use a worklist to explore state
+within the supergraph, building an "exploded graph".
+Nodes in the exploded graph correspond to <point,@w{ }state> pairs, as in
+     "Precise Interprocedural Dataflow Analysis via Graph Reachability"
+     (Thomas Reps, Susan Horwitz and Mooly Sagiv).
+We reuse nodes for <point, state> pairs we've already seen, and avoid
+tracking state too closely, so that (hopefully) we rapidly converge
+on a final exploded graph, and terminate the analysis.  We also bail
+out if the number of exploded <end-of-basic-block, state> nodes gets
+larger than a particular multiple of the total number of basic blocks
+(to ensure termination in the face of pathological state-explosion
+cases, or bugs).  We also stop exploring a point once we hit a limit
+of states for that point.
+We can identify problems directly when processing a <point,@w{ }state>
+instance.  For example, if we're finding the successors of
+   <point: before-stmt: "free (ptr);",
+    state: @{"ptr": freed@}>
+@end smallexample
+then we can detect a double-free of "ptr".  We can then emit a path
+to reach the problem by finding the simplest route through the graph.
+Program points in the analysis are much more fine-grained than in the
+CFG and supergraph, with points (and thus potentially exploded nodes)
+for various events, including before individual statements.
+By default the exploded graph merges multiple consecutive statements
+in a supernode into one exploded edge to minimize the size of the
+exploded graph.  This can be suppressed via
+The fine-grained approach seems to make things simpler and more debuggable
+that other approaches I tried, in that each point is responsible for one
+Program points in the analysis also have a "call string" identifying the
+stack of callsites below them, so that paths in the exploded graph
+correspond to interprocedurally valid paths: we always return to the
+correct call site, propagating state information accordingly.
+We avoid infinite recursion by stopping the analysis if a callsite
+appears more than @code{analyzer-max-recursion-depth} in a callstring
+(defaulting to 2).
+@subsection Graphs
+Nodes and edges in the exploded graph are called ``exploded nodes'' and
+``exploded edges'' and often referred to in the code as
+@code{enodes} and @code{eedges} (especially when distinguishing them
+from the @code{snodes} and @code{sedges} in the supergraph).
+Each graph numbers its nodes, giving unique identifiers - supernodes
+are referred to throughout dumps in the form @samp{SN': @var{index}} and
+exploded nodes in the form @samp{EN: @var{index}} (e.g. @samp{SN: 2} and
+The supergraph can be seen using @option{-fdump-analyzer-supergraph-graph}.
+The exploded graph can be seen using @option{-fdump-analyzer-exploded-graph}
+and other dump options.  Exploded nodes are color-coded in the .dot output
+based on state-machine states to make it easier to see state changes at
+a glance.
+@subsection State Tracking
+There's a tension between:
+@itemize @bullet
+precision of analysis in the straight-line case, vs
+exponential blow-up in the face of control flow.
+@end itemize
+For example, in general, given this CFG:
+      A
+     / \
+    B   C
+     \ /
+      D
+     / \
+    E   F
+     \ /
+      G
+@end smallexample
+we want to avoid differences in state-tracking in B and C from
+leading to blow-up.  If we don't prevent state blowup, we end up
+with exponential growth of the exploded graph like this:
+           1:A
+          /   \
+         /     \
+        /       \
+      2:B       3:C
+       |         |
+      4:D       5:D        (2 exploded nodes for D)
+     /   \     /   \
+   6:E   7:F 8:E   9:F
+    |     |   |     |
+   10:G 11:G 12:G  13:G    (4 exploded nodes for G)
+@end smallexample
+Similar issues arise with loops.
+To prevent this, we follow various approaches:
+@enumerate a
+state pruning: which tries to discard state that won't be relevant
+later on withing the function.
+This can be disabled via @option{-fno-analyzer-state-purge}.
+state merging.  We can try to find the commonality between two
+program_state instances to make a third, simpler program_state.
+We have two strategies here:
+  @enumerate
+  @item
+     the worklist keeps new nodes for the same program_point together,
+     and tries to merge them before processing, and thus before they have
+     successors.  Hence, in the above, the two nodes for D (4 and 5) reach
+     the front of the worklist together, and we create a node for D with
+     the merger of the incoming states.
+  @item
+     try merging with the state of existing enodes for the program_point
+     (which may have already been explored).  There will be duplication,
+     but only one set of duplication; subsequent duplicates are more likely
+     to hit the cache.  In particular, (hopefully) all merger chains are
+     finite, and so we guarantee termination.
+     This is intended to help with loops: we ought to explore the first
+     iteration, and then have a "subsequent iterations" exploration,
+     which uses a state merged from that of the first, to be more abstract.
+  @end enumerate
+We avoid merging pairs of states that have state-machine differences,
+as these are the kinds of differences that are likely to be most
+interesting.  So, for example, given:
+      if (condition)
+        ptr = malloc (size);
+      else
+        ptr = local_buf;
+      .... do things with 'ptr'
+      if (condition)
+        free (ptr);
+      ...etc
+@end smallexample
+then we end up with an exploded graph that looks like this:
+                   if (condition)
+                     / T      \ F
+            ---------          ----------
+           /                             \
+      ptr = malloc (size)             ptr = local_buf
+          |                               |
+      copy of                         copy of
+        "do things with 'ptr'"          "do things with 'ptr'"
+      with ptr: heap-allocated        with ptr: stack-allocated
+          |                               |
+      if (condition)                  if (condition)
+          | known to be T                 | known to be F
+      free (ptr);                         |
+           \                             /
+            -----------------------------
+                         | ('ptr' is pruned, so states can be merged)
+                        etc
+@end smallexample
+where some duplication has occurred, but only for the places where the
+the different paths are worth exploringly separately.
+Merging can be disabled via @option{-fno-analyzer-state-merge}.
+@end enumerate
+@subsection Region Model
+Part of the state stored at a @code{exploded_node} is a @code{region_model}.
+This is an implementation of the region-based ternary model described in
+"A Memory Model for Static Analysis of C Programs"}
+(Zhongxing Xu, Ted Kremenek, and Jian Zhang).
+A @code{region_model} encapsulates a representation of the state of
+memory, with a tree of @code{region} instances, along with their associated
+values.  The representation is graph-like because values can be pointers
+to regions.  It also stores a constraint_manager, capturing relationships
+between the values.
+Because each node in the @code{exploded_graph} has a @code{region_model},
+and each of the latter is graph-like, the @code{exploded_graph} is in some
+ways a graph of graphs.
+Here's an example of printing a @code{region_model}, showing the ASCII-art
+used to visualize the region hierarchy (colorized when printing to stderr):
+(gdb) call debug (*this)
+r0: @{kind: 'root', parent: null, sval: null@}
+|-stack: r1: @{kind: 'stack', parent: r0, sval: sv1@}
+|  |: sval: sv1: @{poisoned: uninit@}
+|  |-frame for 'test': r2: @{kind: 'frame', parent: r1, sval: null, map: @{'ptr_3': r3@}, function: 'test', depth: 0@}
+|  |  `-'ptr_3': r3: @{kind: 'map', parent: r2, sval: sv3, type: 'void *', map: @{@}@}
+|  |    |: sval: sv3: @{type: 'void *', unknown@}
+|  |    |: type: 'void *'
+|  `-frame for 'calls_malloc': r4: @{kind: 'frame', parent: r1, sval: null, map: @{'result_3': r7, '_4': r8, '<anonymous>': r5@}, function: 'calls_malloc', depth: 1@}
+|    |-'<anonymous>': r5: @{kind: 'map', parent: r4, sval: sv4, type: 'void *', map: @{@}@}
+|    |  |: sval: sv4: @{type: 'void *', &r6@}
+|    |  |: type: 'void *'
+|    |-'result_3': r7: @{kind: 'map', parent: r4, sval: sv4, type: 'void *', map: @{@}@}
+|    |  |: sval: sv4: @{type: 'void *', &r6@}
+|    |  |: type: 'void *'
+|    `-'_4': r8: @{kind: 'map', parent: r4, sval: sv4, type: 'void *', map: @{@}@}
+|      |: sval: sv4: @{type: 'void *', &r6@}
+|      |: type: 'void *'
+`-heap: r9: @{kind: 'heap', parent: r0, sval: sv2@}
+  |: sval: sv2: @{poisoned: uninit@}
+  `-r6: @{kind: 'symbolic', parent: r9, sval: null, map: @{@}@}
+  sv0: @{type: 'size_t', '1024'@}
+  sv1: @{poisoned: uninit@}
+  sv2: @{poisoned: uninit@}
+  sv3: @{type: 'void *', unknown@}
+  sv4: @{type: 'void *', &r6@}
+constraint manager:
+  equiv classes:
+    ec0: @{sv0 == '1024'@}
+    ec1: @{sv4@}
+  constraints:
+@end smallexample
+This is the state at the point of returning from @code{calls_malloc} back
+to @code{test} in the following:
+void *
+calls_malloc (void)
+  void *result = malloc (1024);
+  return result;
+void test (void)
+  void *ptr = calls_malloc ();
+  /* etc.  */
+@end smallexample
+The ``root'' region (``r0'') has a ``stack'' child (``r1''), with two
+children: a frame for @code{test} (``r2''), and a frame for
+@code{calls_malloc} (``r4'').  These frame regions have child regions for
+storing their local variables.  For example, the return region
+and that of various other regions within the ``calls_malloc'' frame all have
+value ``sv4'', a pointer to a heap-allocated region ``r6''.  Within the parent
+frame, @code{ptr_3} has value ``sv3'', an unknown @code{void *}.
+@subsection Analyzer Paths
+We need to explain to the user what the problem is, and to persuade them
+that there really is a problem.  Hence having a @code{diagnostic_path}
+isn't just an incidental detail of the analyzer; it's required.
+Paths ought to be:
+@itemize @bullet
+@end itemize
+Without state-merging, all paths in the exploded graph are feasible
+(in terms of constraints being satisified).
+With state-merging, paths in the exploded graph can be infeasible.
+We collate warnings and only emit them for the simplest path
+e.g. for a bug in a utility function, with lots of routes to calling it,
+we only emit the simplest path (which could be intraprocedural, if
+it can be reproduced without a caller).  We apply a check that
+each duplicate warning's shortest path is feasible, rejecting any
+warnings for which the shortest path is infeasible (which could lead to
+false negatives).
+We use the shortest feasible path through the exploded_graph (a list of
+edges) to build a @code{diagnostic_path} (a list of events for the
+diagnostic subsystem).
+@subsection Limitations
+@itemize @bullet
+Only for C so far
+The implementation of call summaries is currently very simplistic.
+Lack of function pointer analysis
+The region model code creates lots of little mutable objects at each
+@code{region_model} (and thus per @code{exploded_node}) rather than
+sharing immutable objects and having the mutable state in the
+@code{program_state} or @code{region_model}.  The latter approach might be
+more efficient, and might avoid dealing with IDs rather than pointers
+(which requires us to impose an ordering to get meaningful equality).
+The region model code doesn't yet support @code{memcpy}.  At the
+gimple-ssa level these have been optimized to statements like this:
+_10 = MEM <long unsigned int> [(char * @{ref-all@})&c]
+MEM <long unsigned int> [(char * @{ref-all@})&d] = _10;
+@end smallexample
+Perhaps they could be supported via a new @code{compound_svalue} type.
+There are various other limitations in the region model (grep for TODO/xfail
+in the testsuite).
+The constraint_manager's implementation of transitivity is currently too
+expensive to enable by default and so must be manually enabled via
+The checkers are currently hardcoded and don't allow for user extensibility
+(e.g. adding allocate/release pairs).
+Although the analyzer's test suite has a proof-of-concept test case for
+LTO, it will crash on anything non-trivial.  There are various
+lang-specific things in the analyzer that assume C rather than LTO.
+For example, casts are currently implemented via @code{convert}, and this
+leads to an ICE on LTO.  Similarly, SSA names are printed to the user in
+``raw'' form, rather than printing the underlying variable name.
+@end itemize
+Some ideas for other checkers
+@itemize @bullet
+File-descriptor-based APIs
+Linux kernel internal APIs
+Signal handling
+@end itemize
+@node Debugging the Analyzer
+@section Debugging the Analyzer
+@cindex analyzer, debugging
+@cindex static analyzer, debugging
+@subsection Builtins for Debugging the Analyzer
+  __analyzer_break ();
+@end smallexample
+to the source being analyzed to trigger a breakpoint in the analyzer when
+that source is reached.  By putting a series of these in the source, it's
+much easier to effectively step through the program state as it's analyzed.
+__analyzer_dump ();
+@end smallexample
+will dump the copious information about the analyzer's state each time it
+reaches the call in its traversal of the source.
+__analyzer_dump_path ();
+@end smallexample
+will emit a placeholder ``note'' diagnostic with a path to that call site,
+if the analyzer finds a feasible path to it.
+The builtin @code{__analyzer_dump_exploded_nodes} will dump information
+after analysis on all of the exploded nodes at that program point:
+  __analyzer_dump_exploded_nodes (0);
+@end smallexample
+will dump just the number of nodes, and their IDs.
+  __analyzer_dump_exploded_nodes (1);
+@end smallexample
+will also dump all of the states within those nodes.
+@code{region_model::get_value_by_name} can be used when inserting custom
+debugging code (e.g. adding it @code{region_model::validate} to detect the
+point at which a named variable acquires an unexpected state).
+@subsection Other Debugging Techniques
+One approach when tracking down where a particular bogus state is
+introduced into the @code{exploded_graph} is to add custom code to
+For example, this custom code breaks with an assertion failure when a
+variable called @code{ptr} acquires a value that's unknown:
+    /* Find a variable matching "ptr".  */
+    svalue_id sid = get_value_by_name ("ptr");
+    if (!sid.null_p ())
+      @{
+	svalue *sval = get_svalue (sid);
+	gcc_assert (sval->get_kind () != SK_UNKNOWN);
+      @}
+@end smallexample
+making it easier to investigate further in a debugger when this occurs.
diff --git a/gcc/doc/gccint.texi b/gcc/doc/gccint.texi
index 99ec005..9dd935d 100644
--- a/gcc/doc/gccint.texi
+++ b/gcc/doc/gccint.texi
@@ -125,6 +125,7 @@  Additional tutorial information is linked to from
 * LTO::             Using Link-Time Optimization.
 * Match and Simplify:: How to write expression simplification patterns for GIMPLE and GENERIC
+* Static Analyzer:: Working with the static analyzer.
 * User Experience Guidelines:: Guidelines for implementing diagnostics and options.
 * Funding::         How to help assure funding for free software.
 * GNU Project::     The GNU Project and GNU/Linux.
@@ -163,6 +164,7 @@  Additional tutorial information is linked to from
 @include plugins.texi
 @include lto.texi
 @include match-and-simplify.texi
+@include analyzer.texi
 @include ux.texi
 @include funding.texi