AutoFix: Automated Fixing of Programs

Overview

In program debugging, finding a failure is only the first step: the fault must still be located and corrected. Can the whole process be automated? The AutoFix project targets the automatic generation and validation of fixes for software faults. Since a program fault is a discrepancy between the implementation and the specification, a fix may involve changes to the implementation, the specification, or both. Accordingly, AutoFix uses dynamic program analysis to understand the cause of a fault, synthesize candidate fixes to either the implementation or the contracts, and validate the synthesized fixes. In our experimental evaluations, AutoFix automatically proposed high-quality fixes to real-world program faults.

AutoFix is a joint project between the Chair of Software Engineering of ETH Zurich and the Software Engineering Chair of Saarland University.

Fixes to the implementation

Figure 1.a shows a faulty routine in class TWO_WAY_SORTED_SET. The routine move_item takes an argument v, which must be a non-Void element stored in the set, and moves it to the position immediately to the left of the internal cursor (attribute index). To this end, the routine saves the cursor position into a local variable idx, moves index to where v occurs in the set, removes this occurrence of v, restores index to its initial position, and inserts v to the left of the cursor. The implementation shown fails with a violation of go_i_th(idx)'s precondition whenever the internal cursor initially points to the last valid cursor position count + 1 (see Figure 2): the removal of one element from the set invalidates the position stored in idx, which becomes count + 2 with respect to the value of count after removal. A closer look reveals that the routine is incorrect even when this precondition violation does not occur: if v initially appears to the left of the internal cursor, the sequence of operations moves v to an incorrect "off-by-one" position. Figure 1.b shows a fix suggested by AutoFix; it corrects the routine even in the cases that do not trigger the precondition violation in go_i_th(idx).

    move_item (v: G)
            -- Move `v' to the left of cursor.
        require
            v /= Void
            has (v)
        local
            idx: INTEGER
            found: BOOLEAN
        do
            idx := index
            from start until found or after loop
                found := (v = item)
                if not found then forth end
            end
            check found and not after end
            remove

            go_i_th (idx)
            put_left (v)

        end

  go_i_th (i: INTEGER) require 0 <= i <= count + 1
  put_left (v: G) require not before
  before: BOOLEAN do Result := (index = 0) end

    move_item (v: G)
            -- Move `v' to the left of cursor.
        require
            v /= Void
            has (v)
        local
            idx: INTEGER
            found: BOOLEAN
        do
            idx := index
            from start until found or after loop
                found := (v = item)
                if not found then forth end
            end
            check found and not after end
            remove
            if index < idx then idx := idx - 1 end
            go_i_th (idx)
            put_left (v)
        end

  go_i_th (i: INTEGER) require 0 <= i <= count + 1
  put_left (v: G) require not before
  before: BOOLEAN do Result := (index = 0) end
(a) Faulty routine
(b) Fixed routine
Figure 1. A real fault from TWO_WAY_SORTED_SET.move_item

Figure 2. The effect of calling remove in move_item when index = count + 1 and after hold initially: the following invocation of go_i_th(idx) triggers a precondition violation.

Figure 3 depicts the steps AutoFix takes to propose fixes to the implementation. First, AutoFix runs AutoTest to automatically generate passing and failing test cases for the input Eiffel classes. It then extracts a set of expressions from the text of the classes' routines and computes the expression dependence score edep between expressions, measuring their syntactic similarity, and the control dependence score cdep between program locations, measuring how close they are on the control-flow graph. It also computes the dynamic score dyn between expressions, measuring how much more often an expression is mentioned in failing rather than passing test cases, and combines the three scores (edep, cdep, dyn) into the score fixme, which determines a global ranking of expressions. Finally, AutoFix enumerates possible fixing actions for the expressions with the highest fixme scores and generates candidate fixes by injecting the fix actions into the faulty routine. The candidate fixes that pass the entire regression test suite are considered valid.
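As an illustration, the ranking step can be sketched in Python as follows. The way the three scores are combined into fixme below (a weighted sum) and the weights themselves are simplified assumptions made for illustration, not the exact formulas used by AutoFix.

```python
# Illustrative sketch of ranking "suspicious" expressions.  The real
# tool computes edep, cdep, and dyn from program analysis and test
# executions; here they are supplied as plain numbers, and the way
# they are combined into fixme is a simplified assumption.

def dyn_score(fail_hits, pass_hits, n_fail, n_pass):
    """How much more often an expression appears in failing runs."""
    fail_rate = fail_hits / n_fail if n_fail else 0.0
    pass_rate = pass_hits / n_pass if n_pass else 0.0
    return fail_rate - pass_rate  # in [-1, 1]

def fixme(edep, cdep, dyn, w=(1.0, 1.0, 2.0)):
    """Combine the three scores; the weights are illustrative only."""
    return w[0] * edep + w[1] * cdep + w[2] * dyn

def rank(candidates):
    """candidates: list of (expr, edep, cdep, dyn) tuples."""
    return sorted(candidates,
                  key=lambda c: fixme(c[1], c[2], c[3]),
                  reverse=True)

candidates = [
    ("idx",   0.6, 0.9, dyn_score(8, 1, 8, 10)),
    ("index", 0.5, 0.7, dyn_score(4, 5, 8, 10)),
    ("count", 0.2, 0.3, dyn_score(1, 9, 8, 10)),
]
best = rank(candidates)[0][0]
print(best)  # the expression most implicated in the failure
```

Fixing actions are then enumerated only for the few top-ranked expressions, which keeps the space of candidate fixes small enough to validate each one against the test suite.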


Figure 3: How AutoFix works.

We applied AutoFix to over 200 faults in four code bases, including both open-source software developed by professionals and student projects of various quality. AutoFix successfully fixed 86 (42%) of the faults; inspection shows that 51 of these fixes are genuine corrections, of a quality comparable to what competent programmers would write. More information about the experimental evaluation of AutoFix can be found in our TSE 2014 publication.

Fixes to the specification

While most debugging techniques focus on patching implementations, there are numerous bugs whose most appropriate correction consists in fixing the specification to prevent invalid executions, such as by defining the correct input domain of a function. We present here a fully automatic technique that fixes bugs by proposing changes to contracts (simple executable specification elements such as pre- and postconditions). The technique relies on dynamic analysis to understand the source of buggy behavior, to infer changes to the contracts that remedy the bugs, and to validate the changes against general usage. We have implemented the technique in a tool called AutoFix, which works on programs written in Eiffel, and evaluated it on 44 bugs found in standard data-structure libraries. Manual analysis by human programmers found that AutoFix suggested repairs that are directly deployable for 25% of the faults; in most cases, these contract repairs were preferred over fixes for the same bugs that change the implementation.

Let us briefly demonstrate how AutoFix proposes fixes to contracts using an example. The example targets a bug in routine (method) duplicate of class CIRCULAR, which is the standard Eiffel library implementation of circular array-based lists. To understand the bug, Figure 4 illustrates a few details of CIRCULAR's API. Lists are numbered from index 1 to index count (an attribute denoting the list length) and include an internal cursor that may point to any element of the list. Routine duplicate takes a single integer argument n, denoting the number of elements to be copied; called on a list object list, it returns a new instance of CIRCULAR with at most n elements copied from list, starting from the position pointed to by cursor. Since we are dealing with circular lists, the copy wraps over to the first element. For example, calling duplicate (3) on the list in Figure 5 returns a fresh list with elements <C, D, A> in this order.
1    class CIRCULAR [G]
2   
3         make (m: INTEGER)
4             require m >= 1
5             do ... end
6   
7         duplicate (n: INTEGER): CIRCULAR [G]
8             do
9                 create Result.make (count)
10                 ...
11          end
12
13         count: INTEGER -- Length of list



Figure 4. Some implementation details of CIRCULAR.
Figure 5. A circular list of class CIRCULAR: the internal cursor points to the element C at index 3.
The implementation of duplicate is straightforward: it creates a fresh CIRCULAR object Result (line 9 in Figure 4); it iteratively copies n elements from the current list into Result; and it finally returns the list attached to Result. The call to the creation procedure (constructor) make on line 9 allocates space for a list with count elements; this is certainly sufficient, since Result cannot contain more elements than the list that is duplicated. However, CIRCULAR’s creation procedure make includes a precondition (line 4 in Figure 4) that only allows allocating lists with space for at least one element (require m ≥ 1). This sets off a bug when duplicate is called on an empty list: count is 0, and hence the call on line 9 triggers a violation of make’s precondition. Testing tools such as AutoTest detect this bug automatically by providing a concrete test case that exposes the discrepancy between implementation and specification.
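The behavior just described can be reproduced with a minimal model of the class. The Python sketch below is only an illustration: the class is a toy stand-in for CIRCULAR, with make's precondition checked explicitly the way Eiffel's runtime contract checking would.

```python
# Minimal Python model of CIRCULAR, for illustration only.  make's
# precondition is checked with an assertion, mimicking Eiffel's
# runtime contract checking.

class Circular:
    def __init__(self, m):            # creation procedure `make'
        assert m >= 1, "precondition m >= 1 violated"
        self.items = []               # room for up to m elements
        self.cursor = 1               # 1-based position of the cursor

    @property
    def count(self):
        return len(self.items)

    def duplicate(self, n):
        result = Circular(self.count)        # line 9 in Figure 4
        for i in range(min(n, self.count)):  # the copy wraps around
            pos = (self.cursor - 1 + i) % self.count
            result.items.append(self.items[pos])
        return result

lst = Circular(4)
lst.items = ["A", "B", "C", "D"]
lst.cursor = 3
print(lst.duplicate(3).items)   # ['C', 'D', 'A'], as in Figure 5

empty = Circular(1)
empty.items = []                # an empty list: count = 0
try:
    empty.duplicate(3)          # line 9 violates make's precondition
except AssertionError as e:
    print("bug:", e)
```

On the non-empty list the routine behaves as intended; on the empty list the internal call Circular(0) fails exactly as described above.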

-- (a) Patching the implementation:
make (m: INTEGER)
   require m ≥ 1

duplicate (n: INTEGER): CIRCULAR [G]
   do
      if count > 0 then
         create Result.make (count)
      else
         create Result.make (1)
      end

      ...

-- (b) Strengthening the specification:
make (m: INTEGER)
   require m ≥ 1

duplicate (n: INTEGER): CIRCULAR [G]
   require count > 0
   do
      create Result.make (count)
      ...

-- (c) Weakening the specification:
make (m: INTEGER)
   require m ≥ 0

duplicate (n: INTEGER): CIRCULAR [G]
   do
      create Result.make (count)
      ...

Figure 6. Three different fixes for the bug of Figure 4.

How should we fix this bug? Figure 6 shows three different possible repairs, all of which we can generate completely automatically. An obvious choice is patching duplicate's implementation as shown in Figure 6a: if count is 0 when duplicate is invoked, allocate Result with space for one element; this satisfies make's precondition in all cases. Our AutoFix tool targets fixes of implementations and in fact suggests the patch in Figure 6a. This fix is acceptable, since it makes duplicate run correctly, but it is not entirely satisfactory: CIRCULAR's implementation looks perfectly adequate, whereas the ultimate source of the failure seems to be an incorrect or inadequate specification. A straightforward fix is then adding a precondition to duplicate that only allows calling it on non-empty lists. Figure 6b shows such a fix, which strengthens duplicate's precondition, thereby invalidating the test case exposing the bug. The strengthening fix has the advantage of being textually simpler than the implementation fix, and hence probably also easier for programmers to understand. However, both fixes in Figures 6a and 6b are partial, in that they remove the source of faulty behavior in duplicate but do not prevent similar faults, deriving from calling make with m = 0, from happening elsewhere. A more critical issue with the specification-strengthening fix in Figure 6b is that it may break clients of CIRCULAR that rely on the previous, weaker precondition. There are cases, such as computing the maximum of an empty list, where strengthening produces the most appropriate fixes; in the running example, however, strengthening is arguably not the optimal strategy.

A look at make's implementation (not shown in Figure 4) would reveal that the creation procedure's precondition m ≥ 1 is unnecessarily restrictive, since the routine body works as expected also when executed with m = 0. This suggests a fix that weakens make's precondition as shown in Figure 6c. This is arguably the most appropriate correction to the bug of duplicate: it is very simple, it fixes the specific bug as well as similar ones originating in creating an empty list, and it does not invalidate any clients of CIRCULAR's API. AutoFix generates both specification fixes in Figures 6b and 6c but ranks the weakening fix higher than the strengthening one. More generally, AutoFix outputs specification-strengthening fixes only when they do not introduce bugs in available tests, and it always prefers the least restrictive fixes among those that are applicable.
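The preference for weaker contracts can be sketched as follows. Comparing contract strength by counting how many sampled inputs each precondition accepts is a simplification introduced here for illustration, not AutoFix's actual criterion.

```python
# Illustrative sketch of ranking candidate contract fixes by strength:
# a precondition that accepts more inputs is weaker and, as explained
# above, preferred.  The sampling-based comparison is an assumption
# made for illustration only.

def permissiveness(pre, samples):
    """Number of sampled inputs the precondition accepts."""
    return sum(1 for x in samples if pre(x))

def rank_by_strength(candidates, samples):
    """Weaker (more permissive) contracts first."""
    return sorted(candidates,
                  key=lambda c: permissiveness(c[1], samples),
                  reverse=True)

# The original precondition of `make' and the weakened one of Figure 6c:
original = ("require m >= 1", lambda m: m >= 1)
weakened = ("require m >= 0", lambda m: m >= 0)

samples = range(-2, 5)
ranked = rank_by_strength([original, weakened], samples)
print(ranked[0][0])   # the weakened precondition is ranked first
```

With these samples, m >= 0 accepts one more input than m >= 1, so the weakening is ranked first, mirroring AutoFix's preference for the least restrictive applicable fix.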

AutoFix works completely automatically: its only input is an Eiffel program annotated with simple contracts (pre- and postconditions and class invariants), which constitute its specification; its final output is a list of fix suggestions for the bugs found in the input program. Figure 7 gives an overview of the components of the AutoFix technique. AutoFix is based on dynamic analysis: it characterizes correct and incorrect behavior by means of passing and failing test cases. To provide full automation, we use the random testing framework AutoTest to generate the tests used by AutoFix. The core of the fix generation algorithm applies two complementary strategies: weaken (i.e., relax) a violated contract if it is needlessly restrictive, or strengthen an existing contract to rule out failure-inducing inputs. AutoFix produces candidate fixes using both strategies, possibly in combination. To determine whether the weaker or stronger contracts remove all faulty behavior in the program, AutoFix runs candidate fixes through a validation phase based on all available tests. To avoid overfitting, some tests are generated initially but used only in the validation phase (and not to generate fixes). If multiple fixes for the same fault survive the validation phase, AutoFix outputs them to the user ordered according to the strength of their new contracts: weaker contracts are more widely applicable and hence are ranked higher than more restrictive, stronger ones.
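The validation phase can be sketched as follows. The representation of programs, fixes, and tests below is a toy assumption chosen for illustration; AutoFix actually runs compiled Eiffel programs against AutoTest-generated test cases.

```python
# Illustrative sketch of the validation phase: each candidate fix is
# re-run against all available tests, including held-out tests that
# were not used for fix generation (to avoid overfitting).  The data
# structures here are toy assumptions, not AutoFix's actual API.

def validate(candidates, generation_tests, held_out_tests):
    """Keep only the candidates that pass every available test."""
    valid = []
    all_tests = generation_tests + held_out_tests
    for name, fixed_program in candidates:
        if all(test(fixed_program) for test in all_tests):
            valid.append(name)
    return valid

# A toy "program" is a function; a test checks one input/output pair.
buggy   = lambda m: 1 // m             # fails for m == 0
patched = lambda m: 1 // m if m else 0

def make_test(inp, expected):
    def test(program):
        try:
            return program(inp) == expected
        except ZeroDivisionError:
            return False
    return test

gen_tests      = [make_test(1, 1), make_test(2, 0)]
held_out_tests = [make_test(0, 0)]    # the failure-inducing input

print(validate([("buggy", buggy), ("patched", patched)],
               gen_tests, held_out_tests))  # ['patched']
```

Only the patched candidate survives: the buggy one still fails on the held-out, failure-inducing input, which is exactly what the validation phase is meant to detect.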

Figure 7. An overview of how AutoFix works.

We performed a preliminary evaluation of AutoFix by applying it to 44 bugs in production software. The overall goal of the evaluation is to corroborate the expectation that, for bugs whose "most appropriate" correction is fixing the specification, AutoFix can produce repair suggestions of good quality. A more detailed evaluation, taking into account aspects such as robustness and readability of the produced fixes, is left to future work.

All the experiments ran on a Windows 7 machine with a 2.6 GHz Intel 4-core CPU and 16 GB of memory. We selected 10 of the most widely used data-structure classes of the EiffelBase (rev. 92914) and Gobo (rev. 91005) libraries, the two major Eiffel standard libraries, and ran AutoTest for one hour on each of the 10 classes. This automatic testing session found 44 unique faults consisting of pre- or postcondition violations. We ran AutoFix on each of these faults individually, using half of the test cases generated for each fault in the one-hour session to generate fixes and the other half in the validation phase. AutoFix produced valid fixes for 42 faults and proper fixes (i.e., fixes that correct the fault without introducing new faults) for 11. The average figures per fault are 106.4 minutes of testing time and 7.2 minutes of fixing time (minimum: 4.1 minutes, maximum: 30 minutes, median: 6.2 minutes). The testing time dominates because AutoTest operates randomly and thus generates many test cases that are never used for fixing (such as passing tests of routines without faults).

User's manual

The user's manual of AutoFix is available here.

Downloads

AutoFix is part of the Eve integrated verification environment, the research branch of EiffelStudio.

Funding

The following subjects have contributed to funding the AutoFix project:

Publications

Contact