Revised version of paper Dauvergne/Hascoet ICCS 2006
Please find attached: -- The revised version of our ICCS 2006 paper as a tar.gz file with the latex source, 2 postscript figures and a .bib bibliography -- The concatenation of the three referees reports, with our replies inserted inside. The paper suggests a formal approach to characterize checkpointing sets vs taping. Both the application to programs that do not exhibit a special looping structure as well as the characterization of minimal solutions are new. It is a very usefull foray into using data flow for determining checkpointing sets and should definitly be accepted. The issues mentioned below should improve readability for the general audience. major comments: A: I realize notation regarding the result of the source transformation of program subsections in terms of forward and reverse is difficult because it does not lend itself easily to checkpointing schemes. E.g. on pg.3 Sect.3 states: $\overline{C}\Doteq\overrightarrow{C};\overleftarrow{C}$ and in this notation there is no mentioning of taking the checkpoint, running forward and later restoring the checkpoint although this certainly belongs to the source transformation of code section $C$ which is presumably what is denoted by $\overline{C}$ if one goes by the explanation of $P$ and $\overline{P}$ on page 2. Then on the top of page 4 there is a different notation that does includes the store/restore etc.
Right. I added notation [] to denote checkpointing. \overline{C;D} is thus \overrightarrow{C};\overrightarrow{D};\overleftarrow{D};\overleftarrow{C} whereas \overline{[C];D} is the reverse diff program with C checkpointed with the store, restore, etc.
However, on page 4 I am missing an explanation of the notation "\hbox{\it Req} \vdash" which is repeatedly used as something being executed while "Req" itself is a set. The notation should be clarified.
Req is not a piece of code. It is a context in which the code generation is done. When using rewrite rules, people use \vdash to mean this. Yes the rewrite rules here are ambiguous. I put a box around to distinguish the rewrite terms plus one line of explanation.
B: I don't see a clear motivation for the backward snapshot as opposed to the tape. In the paper just states it is a "hybrid". The need for the regular checkpoints is clear.
Backward snapshot does not save memory space, but may reduce memory traffic (e.g only one push/pop pair instead of 2. Added this on page 4.
Sect 4. mentions that there can be a trade-off between storing/restoring things in big blocks vs. doing it one at a time. This would presumably be the implied motivation for the backward snapshots. It would be nice if the conclusions section could address this by showing more detail about profile data for taping push/pops vs. storing/restoring snapshots.
Our first measurements on memory traffic are too recent, and not conclusive yet.
C: I missed any mentioning of checkpoint placement as an issue. The space constraints don't allow to go into much detail. However, it should be pointed out that (aside from special cases such as time stepping loops) the segmentation of the code into the subsections U,C,D which in this paper is predetermined can have a substantial impact on the checkpoint size.
I took your words to rephrase the end of the conclusion, which meant that.
minor comments: pg 2: If there isn't a special purpose for calling the three parts upstream/center/downstream, could they be renamed A/B/C instead?
I sort of like using U for Upstream, C for Checkpoint, D for Downstream ?
pg 2: Can you cite the TBR paper at the first mentioning of TBR analysis .
Done
pg 2: change "... finds the Req set of these required values." to "...finds the set of these required values denoted by Req."
ok
May be there is a chance to also actually say what precisely "out()" and "use()" mean, in particular if it is 'local use' only of cumulative with the 'use' of a segments 'callees'. To the uninitiated reader it is probably not clear enough from the context.
added a few words on bottom of page 4
pg 2: the sentence "In other words ... must be built in such a way that:" with the following formula should either be removed or better explained. The assertion "must be built" wrt. the formula becomes clear only after reading on and going back later.
Good. We gain 3 lines!
pg 3: I think the sentence "This also uses memory space, although less than the tape for ..." should really relate to the *placement* of the checkpoints, i.e. the segmentation of the program. That is crucial and should be mentioned to better motivate above sentence and the initial trade-off between taping and checkpointing space.
agreed.
pg 5: in the condition for "\it Snp" (3rd last line), it is not immediately apparent to me where the \cup (\hbox{\it Req} \setminus \hbox{\it Sbk} comes from. I suspect this is just my mistake but without that last bit it appears to be directly derived from eqn. (3). With it the intersection on the right hand side may be larger than without which makes me wonder about the "necessary and sufficient"? Since I have a similar problem with the Req_D eqn. I am just guessing I am missing some step but you may want to consider adding this.
This is because Snp also appears in equation (4). Equation (3) alone does imply that Snp >= (out(C) \cup (out(Db) \ ReqD)) \cap use(Cb) in addition (4) implies that Snp >= (out(C) \cup (out(Db) \ ReqD)) \cap (Req \ Sbk) same happens for ReqD. We do think the conditions are necessary and sufficient.
pg 6: In the solution of the set equations - what is the optimality criterion? The text mentions "minimal" and "optimal" solutions. Naively looking at the equations in (6) it seems Opt_2^- is the only part occurring twice so any partition that leaves Opt_2^- empty seems to be better in terms of the total size of Sbk, Snp, Req_D, Req_C ? So I suspect total memory footprint isn't the criterion.
Right. I added that these sets are minimal in the sense that any semantics-preserving choice of the 4 sets is larger or equal to one of these minimal quadruplets.
pg 8: Please clarify "... choices of the other checkpoints around." What is around? It could be that once the notation issue raised under the major issue "A" is addressed this may well become clear.
Rephrased.
-------------------------------------------------------------- [Overview] More explanation on notation is needed.
added explanations on the \overline notation and the rewrite rule on page 4 However, we will also have to spare half a page, so sometimes we have to rely on commonly used notation.
[Comments] page 2, line -8 (8 from the bottom) The term "out" is not defined here.
removed and explained on next occurrence page 4
page 3, line 11 The meaning of the semicolon in the notation $\overline{D}\Doteq\overrightarrow{D};\overleftarrow{D}$ should be explained.
done. This use of semicolon for code sequence is widespread.
page 3, Figure 2 The term "Sbk" is not defined here.
Right, but it is defined upon its first occurrence in the text
page 4, line 3 The notation "$Req\vdash \overline{C;D}$" was hard to be understood because PUSH(Sbk) is undefined here. page 5, line 15 ( eq.(3)) The reason of disappearance of "PUSH(Sbk)" in (1) should be explained.
done. A PUSH does not overwrite any program variable.
-------------------------------------------- 1. Introduction You write: ... it (reverse mode AD) suffers from a high memory consumption... That is only true for the store all or store minimal strategy implemented in Tapenade, ADIFOR 3, and OpenAD. If every intermediate result is recomputed then reverse mode AD needs little memory but suffers from computational costs! You might want to read and cite other papers about AD. You write: This memory consumption is inherent to the nature of gradient computation. That is obviously wrong. You write: to solve the adjoint equations ... is also known to consume memory space. What you mean by this. Almost every computation requires memory (it doesn't consume it like eating bred). AD generated code also solves the adjoint equation! The first two paragraphs in the introduction need to be rewritten totally.
Rewritten as: The methods to compute gradients can be classified in two categories. In the first category, methods use CPU more intensively because several operations are duplicated. This can be through repeated tangent directional derivatives, or through reverse Automatic Differentation using the ``Recompute-All'' strategy. This is not the context of this paper. In the second category, methods spare duplicated operations through increased memory use. This encompasses hand-coded resolution of the ``adjoint equations'' and reverse Automatic Differentation using the ``Store-All'' strategy, which is the context of this work.
2. Reverse Automatic Differentiation You write: AD is a program transformation technique. ... an AD tool creates a new program ... no, see tools based on operator overloading
ok. Text modified.
TBR = to be recorded not to be restored
Right. Fixed.
What is the definition of 'Req'?
Explained in section 2. Rephrased.
What is the difference between Req and use(U)?
In the paper we define Req as use(\overleftarrow{U}), which is smaller than use(U) as said in section 2 and detailed in refs [2,6]
What is the definition of OUT? I guess what you name OUT is actually termed KILL in data flow analysis.
Defined now page 4. KILL usually denotes scalar and array variables completely overwritten. OUT is for variables even only partly overwritten.
3. The equation of checkpointing snapshots You write: This allows the tape consumed ... to be freed ... That is also true without checkpointing. The tape consumed by C is relevant here, it is needed after adjoint of D has finished.
Both what you write and what we wrote is true, and both lead to the conclusion stated by the sentence that follows.
What is Sbk?
Defined in detail on its first occurrence in the text, page 4. True it appears on figure 2 before, but it is just a support for the discussion on page 4.
You write: This also uses memory space, although less than the tape for C. That is not always true.
Ok, but one can almost always write pathological examples for any such assumption. In real life, the snapshot is really smaller. It's an interesting question to detect those "C"'s for which snapshots are larger than the tape, but it's not in the scope of this paper.
4 Discussion and Experimental results Table 1: What run-time is shown, the function or the adjoint? One needs to see the run-time of function, eager adjoint, and lazy adjoint. What is the memory usage and run-time without checkpointing?
added run-time of function. Run-times with eager and lazy do not differ significantly. Most examples would simply run out of memory if no checkpointing was made.
5 Conclusions I am missing an discussion of the computational cost of checkpointing, i.e. the additional flops and the cost of writing and reading the tape. The central question is what is the best mixture of recomputations and taping, not memory usage alone.
Certainly a very general and important question. This paper does not claim to answer this question. But it gives an element to understand it better.
participants (1)
-
Laurent Hascoet