Revised version of paper Dauvergne/Hascoet ICCS 2006

6 Feb 2006

      Please find attached:

   -- The revised version of our ICCS 2006 paper
     as a tar.gz file with the latex source,
     2 postscript figures and a .bib bibliography

   -- The concatenation of the three referees reports,
     with our replies inserted inside.

The paper suggests a formal approach to characterize
checkpointing sets vs taping. Both the application to 
programs that do not exhibit a special looping structure 
as well as the characterization of minimal solutions are new.
It is a very usefull foray into using data flow for 
determining checkpointing sets and should definitly be 
accepted.  
The issues mentioned below should improve readability 
for the general audience. 

major comments: 

A:
I realize notation regarding the result of 
the source transformation of program subsections in 
terms of forward and reverse is difficult 
because it does not lend itself easily to checkpointing schemes. 
E.g. on pg.3 Sect.3 states: 
 $\overline{C}\Doteq\overrightarrow{C};\overleftarrow{C}$
and in this notation there is no mentioning of taking 
the checkpoint, running forward and later restoring the checkpoint
although this certainly belongs to the source transformation of 
code section $C$ which is presumably what is denoted by $\overline{C}$
if one goes by the explanation of $P$ and $\overline{P}$ on page 2. 
    Then on the top of page 4 there is a different notation that does
includes the store/restore etc.
...
...
Right. I added notation [] to denote checkpointing.
\overline{C;D} is thus
\overrightarrow{C};\overrightarrow{D};\overleftarrow{D};\overleftarrow{C}
whereas \overline{[C];D} is the reverse diff program with C checkpointed
with the store, restore, etc.
However,  on page 4 I am missing 
an explanation of the notation "\hbox{\it Req} \vdash"
which is repeatedly used as something being executed while 
"Req" itself is a set. The notation should be clarified.
...
...
Req is not a piece of code. It is a context in which
the code generation is done. When using rewrite rules,
people use \vdash to mean this.
Yes the rewrite rules here are ambiguous.
I put a box around to distinguish the rewrite terms
plus one line of explanation.
B: 
I don't see a clear motivation for the backward snapshot 
as opposed to the tape. In the paper just  states it is a "hybrid". 
The need for the regular checkpoints is clear.
...
...
Backward snapshot does not save memory space,
but may reduce memory traffic (e.g only one
push/pop pair instead of 2. Added this on page 4.
Sect 4. mentions that there can be a trade-off between 
storing/restoring things in big blocks vs. 
doing it one at a time. This would presumably be the implied 
motivation for the backward snapshots. 
It would be nice if the conclusions section could address this by 
showing more detail about profile data for taping push/pops 
vs. storing/restoring snapshots.
...
...
Our first measurements on memory traffic are too recent,
and not conclusive yet.
C:
I missed any mentioning of checkpoint placement as an issue. 
The space constraints don't allow to go into much detail. 
However, it should be pointed out that (aside from special 
cases such as time stepping loops) the segmentation of the code 
into the subsections U,C,D  
which in this paper is predetermined can have a substantial 
impact on the checkpoint size.
...
...
I took your words to rephrase the end of the conclusion,
which meant that.
minor comments: 

pg 2:  If there isn't a special purpose for calling the three parts 
  upstream/center/downstream, could they be renamed A/B/C instead?
...
...
I sort of like using U for Upstream, C for Checkpoint, D for Downstream ?
pg 2: Can you cite the TBR paper at the first mentioning of TBR       
  analysis .
...
...
Done
pg 2: change "... finds the Req set of these required values." to 
  "...finds the set of these required values denoted by Req."
...
...
ok
May be there is a chance to also actually 
  say what precisely "out()" and "use()" mean, in particular 
  if it is 'local use' only of cumulative with the 'use' of 
  a segments 'callees'. To the uninitiated 
  reader it is probably not clear enough from the context.
...
...
added a few words on bottom of page 4
pg 2: the sentence 
  "In other words ... must be built in such a way that:"
   with the following formula should either be removed or better 
   explained. The assertion "must be built" wrt. the formula
   becomes clear only after reading on and going back later.
...
...
Good. We gain 3 lines!
pg 3: I think the sentence 
  "This also uses memory space, although less than the tape for ..."
  should really relate to the *placement* of the checkpoints, i.e. 
  the segmentation of the program. That is crucial and should be 
  mentioned to better motivate above sentence and the initial
  trade-off between taping and checkpointing space.
...
...
agreed.
pg 5: in the condition for "\it Snp"  (3rd last line), 
   it is not immediately apparent to me where the 
    \cup (\hbox{\it Req} \setminus \hbox{\it Sbk}
   comes from. I suspect this is just my mistake but 
   without that last bit it appears 
   to be directly derived from eqn. (3). With 
   it the intersection on the right hand side may be larger 
   than without which makes me wonder 
   about the "necessary and sufficient"? Since I have 
   a similar problem with the Req_D eqn. I am just 
   guessing I am missing some step but you may want to 
   consider adding this.
...
...
This is because Snp also appears in
equation (4). Equation (3) alone does imply that
Snp >= (out(C) \cup (out(Db) \ ReqD)) \cap use(Cb)
in addition (4) implies that
Snp >= (out(C) \cup (out(Db) \ ReqD)) \cap (Req \ Sbk)
same happens for ReqD.
We do think the conditions are necessary and sufficient.
pg 6: In the solution of the set equations - what is the 
   optimality criterion? The text mentions "minimal" and 
   "optimal" solutions. Naively looking at the equations in (6) 
   it seems Opt_2^- is the only part occurring twice so any 
   partition that leaves Opt_2^- empty seems to be better 
   in terms of the total size of  Sbk, Snp, Req_D, Req_C ?
   So I suspect total memory footprint isn't the criterion.
...
...
Right. I added that these sets are minimal in the sense that any
semantics-preserving choice of the 4 sets is
larger or equal to one of these minimal quadruplets.
pg 8: Please clarify "... choices of the other checkpoints around." 
   What is around? It could be that once the notation issue raised 
   under the major issue "A" is addressed this may well become clear.
...
...
Rephrased.
--------------------------------------------------------------

[Overview]
More explanation on notation is needed.
...
...
added explanations on the \overline notation
and the rewrite rule on page 4
However, we will also have to spare half a page,
so sometimes we have to rely on commonly
used notation.
[Comments]
page 2, line -8 (8 from the bottom)
  The term "out" is not defined here.
...
...
removed and explained on next occurrence page 4
page 3, line 11
  The meaning of the semicolon in the notation
    $\overline{D}\Doteq\overrightarrow{D};\overleftarrow{D}$
  should be explained.
...
...
done. This use of semicolon for code sequence is widespread.
page 3, Figure 2
  The term "Sbk" is not defined here.
...
...
Right, but it is defined upon its first occurrence in the text
page 4, line 3
  The notation "$Req\vdash \overline{C;D}$" was hard to be understood
  because PUSH(Sbk) is undefined here.

page 5, line 15 ( eq.(3)) 
  The reason of disappearance of "PUSH(Sbk)" in (1) should be explained.
...
...
done. A PUSH does not overwrite any program variable.
--------------------------------------------

1. Introduction

You write: ... it (reverse mode AD) suffers from a high memory consumption...
  That is only true for the store all or store minimal strategy
  implemented in Tapenade, ADIFOR 3, and OpenAD.
  If every intermediate result is recomputed then reverse mode AD needs little
  memory but suffers from computational costs!
  You might want to read and cite other papers about AD.

You write: This memory consumption is inherent to the nature of
           gradient computation.
  That is obviously wrong.

You write: to solve the adjoint equations ... is also known to consume
           memory space.
  What you mean by this. Almost every computation requires memory
  (it doesn't consume it like eating bred).
  AD generated code also solves the adjoint equation!

The first two paragraphs in the introduction need to be rewritten totally.
...
...
Rewritten as:
The methods to compute gradients can be classified in two categories.
In the first category, methods use CPU more intensively because
several operations are duplicated. This can be through
repeated tangent directional derivatives, or through reverse
Automatic Differentation using the ``Recompute-All'' strategy.
This is not the context of this paper. In the second category,
methods spare duplicated operations through increased memory use.
This encompasses hand-coded resolution of the ``adjoint equations''
and reverse Automatic Differentation using the ``Store-All'' strategy,
which is the context of this work.
2. Reverse Automatic Differentiation

You write: AD is a program transformation technique.
           ... an AD tool creates a new program ...
  no, see tools based on operator overloading
...
...
ok. Text modified.
TBR = to be recorded not to be restored
...
...
Right. Fixed.
What is the definition of 'Req'?
...
...
Explained in section 2. Rephrased.
What is the difference between Req and use(U)?
...
...
In the paper we define Req as use(\overleftarrow{U}),
which is smaller than use(U) as said in section 2 and
detailed in refs [2,6]
What is the definition of OUT?
I guess what you name OUT is actually termed KILL in data flow analysis.
...
...
Defined now page 4. KILL usually denotes scalar and array
variables completely overwritten. OUT is for variables even
only partly overwritten.
3. The equation of checkpointing snapshots

You write: This allows the tape consumed ... to be freed ...
  That is also true without checkpointing.
  The tape consumed by C is relevant here,
  it is needed after adjoint of D has finished.
...
...
Both what you write and what we wrote is true,
and both lead to the conclusion stated by
the sentence that follows.
What is Sbk?
...
...
Defined in detail on its first occurrence in the text, page 4.
True it appears on figure 2 before, but it is just a
support for the discussion on page 4.
You write: This also uses memory space, although less than the tape for C.
  That is not always true.
...
...
Ok, but one can almost always write pathological examples
for any such assumption. In real life, the snapshot is really
smaller. It's an interesting question to detect those "C"'s
for which snapshots are larger than the tape, but
it's not in the scope of this paper.
4 Discussion and Experimental results

Table 1: What run-time is shown, the function or the adjoint?
One needs to see the run-time of function, eager adjoint, and lazy adjoint.
What is the memory usage and run-time without checkpointing?
...
...
added run-time of function.
Run-times with eager and lazy do not differ significantly.
Most examples would simply run out of memory if no
checkpointing was made.
5 Conclusions

I am missing an discussion of the computational cost of checkpointing,
i.e. the additional flops and the cost of writing and reading the tape.
The central question is what is the best mixture of recomputations and
taping, not memory usage alone.
...
...
Certainly a very general and important question.
This paper does not claim to answer this question.
But it gives an element to understand it better.

Laurent Hascoet

tags

participants (1)