Please find attached:
-- The revised version of our ICCS 2006 paper
as a tar.gz file with the LaTeX source,
2 PostScript figures, and a .bib bibliography
-- The concatenation of the three referees' reports,
with our replies inserted inline.
The paper suggests a formal approach to characterize
checkpointing sets vs taping. Both the application to
programs that do not exhibit a special looping structure
as well as the characterization of minimal solutions are new.
It is a very useful foray into using data flow for
determining checkpointing sets and should definitely be
accepted.
The issues mentioned below should improve readability
for the general audience.
major comments:
A:
I realize the notation for the result of the
source transformation of program subsections in
terms of forward and reverse sweeps is difficult
because it does not lend itself easily to checkpointing schemes.
E.g. on pg.3 Sect.3 states:
$\overline{C}\Doteq\overrightarrow{C};\overleftarrow{C}$
and in this notation there is no mention of taking
the checkpoint, running forward, and later restoring the checkpoint,
although this certainly belongs to the source transformation of
code section $C$ which is presumably what is denoted by $\overline{C}$
if one goes by the explanation of $P$ and $\overline{P}$ on page 2.
Then at the top of page 4 there is a different notation that does
include the store/restore etc.
>> Right. I added notation [] to denote checkpointing.
>> \overline{C;D} is thus
>> \overrightarrow{C};\overrightarrow{D};\overleftarrow{D};\overleftarrow{C}
>> whereas \overline{[C];D} is the reverse diff program with C checkpointed
>> with the store, restore, etc.
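As a concrete illustration of the two notations in the reply above, here is a minimal Python sketch (our own naming, not code from the paper): the plain reverse \overline{C;D} tapes both segments, while the checkpointed \overline{[C];D} brackets C with a snapshot store/restore and reruns C when its turn to be differentiated comes.

```python
# Illustrative sketch: the sequence of primitive steps executed by the
# two reverse-differentiation schedules. "fwd_taped" stands for the
# forward sweep with taping (\overrightarrow{.}), "bwd" for the backward
# sweep (\overleftarrow{.}), "fwd_plain" for a run without taping.

def reverse_no_checkpoint():
    # \overline{C;D} = fwd(C); fwd(D); bwd(D); bwd(C)
    return ["fwd_taped(C)", "fwd_taped(D)", "bwd(D)", "bwd(C)"]

def reverse_with_checkpoint_C():
    # \overline{[C];D}: store the snapshot, run C without taping,
    # differentiate D, restore the snapshot, then differentiate C alone.
    return (["PUSH(Snp)", "fwd_plain(C)"]
            + ["fwd_taped(D)", "bwd(D)"]
            + ["POP(Snp)"]
            + ["fwd_taped(C)", "bwd(C)"])
```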
However, on page 4 I am missing
an explanation of the notation "\hbox{\it Req} \vdash"
which is repeatedly used as something being executed while
"Req" itself is a set. The notation should be clarified.
>> Req is not a piece of code. It is a context in which
>> the code generation is done. When using rewrite rules,
>> people use \vdash to mean this.
>> Yes the rewrite rules here are ambiguous.
>> I put a box around them to distinguish the rewrite terms,
>> plus one line of explanation.
B:
I don't see a clear motivation for the backward snapshot
as opposed to the tape. The paper just states it is a "hybrid".
The need for the regular checkpoints is clear.
>> The backward snapshot does not save memory space,
>> but may reduce memory traffic (e.g. only one
>> push/pop pair instead of two). Added this on page 4.
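To make the reply's traffic argument concrete, a toy sketch of our reading of it (naming and scenario are ours, purely illustrative): a value needed both by the regular snapshot and by the tape costs two push/pop pairs when stored separately, but only one when stored in the backward snapshot Sbk.

```python
# Hypothetical memory-traffic count for one value v (our own reading of
# the reply above, not the paper's code).

def traffic_separate():
    # v stored both in the regular snapshot and on the tape: 2 pairs
    return ["PUSH(snp,v)", "PUSH(tape,v)", "POP(tape,v)", "POP(snp,v)"]

def traffic_backward_snapshot():
    # v stored once in the backward snapshot Sbk: 1 pair
    return ["PUSH(sbk,v)", "POP(sbk,v)"]
```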
Sect 4. mentions that there can be a trade-off between
storing/restoring things in big blocks vs.
doing it one at a time. This would presumably be the implied
motivation for the backward snapshots.
It would be nice if the conclusions section could address this by
showing more detail about profile data for taping push/pops
vs. storing/restoring snapshots.
>> Our first measurements on memory traffic are too recent,
>> and not conclusive yet.
C:
I missed any mention of checkpoint placement as an issue.
The space constraints don't allow going into much detail.
However, it should be pointed out that (aside from special
cases such as time stepping loops) the segmentation of the code
into the subsections U,C,D
which in this paper is predetermined can have a substantial
impact on the checkpoint size.
>> I took your words to rephrase the end of the conclusion,
>> which meant that.
minor comments:
pg 2: If there isn't a special purpose for calling the three parts
upstream/center/downstream, could they be renamed A/B/C instead?
>> I sort of like using U for Upstream, C for Checkpoint, D for Downstream.
pg 2: Can you cite the TBR paper at the first mentioning of TBR
analysis .
>> Done
pg 2: change "... finds the Req set of these required values." to
"...finds the set of these required values denoted by Req."
>> ok
Maybe there is a chance to also actually
say what precisely "out()" and "use()" mean, in particular
whether it is 'local use' only or cumulative with the 'use' of
a segment's 'callees'. To the uninitiated
reader it is probably not clear enough from the context.
>> added a few words on bottom of page 4
pg 2: the sentence
"In other words ... must be built in such a way that:"
with the following formula should either be removed or better
explained. The assertion "must be built" wrt. the formula
becomes clear only after reading on and going back later.
>> Good. We gain 3 lines!
pg 3: I think the sentence
"This also uses memory space, although less than the tape for ..."
should really relate to the *placement* of the checkpoints, i.e.
the segmentation of the program. That is crucial and should be
mentioned to better motivate above sentence and the initial
trade-off between taping and checkpointing space.
>> agreed.
pg 5: in the condition for "\it Snp" (3rd last line),
it is not immediately apparent to me where the
\cup (\hbox{\it Req} \setminus \hbox{\it Sbk})
comes from. I suspect this is just my mistake but
without that last bit it appears
to be directly derived from eqn. (3). With
it the intersection on the right hand side may be larger
than without which makes me wonder
about the "necessary and sufficient"? Since I have
a similar problem with the Req_D eqn. I am just
guessing I am missing some step but you may want to
consider adding this.
>> This is because Snp also appears in
>> equation (4). Equation (3) alone only implies that
>> Snp >= (out(C) \cup (out(Db) \ ReqD)) \cap use(Cb)
>> in addition (4) implies that
>> Snp >= (out(C) \cup (out(Db) \ ReqD)) \cap (Req \ Sbk)
>> same happens for ReqD.
>> We do think the conditions are necessary and sufficient.
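Spelling out the reply in display form (our paraphrase, writing $\overline{C}$, $\overline{D}$, $\mathit{Req}_D$ for the reply's Cb, Db, ReqD):

```latex
% Lower bound on Snp implied by equation (3) alone:
\mathit{Snp} \supseteq
  \bigl(\mathit{out}(C)\cup(\mathit{out}(\overline{D})\setminus\mathit{Req}_D)\bigr)
  \cap \mathit{use}(\overline{C})
% Lower bound on Snp additionally implied by equation (4):
\mathit{Snp} \supseteq
  \bigl(\mathit{out}(C)\cup(\mathit{out}(\overline{D})\setminus\mathit{Req}_D)\bigr)
  \cap (\mathit{Req}\setminus\mathit{Sbk})
```

Since intersection distributes over union, the two bounds combine into a single condition whose right-hand factor is $\mathit{use}(\overline{C})\cup(\mathit{Req}\setminus\mathit{Sbk})$, which is exactly the extra $\cup\,(\mathit{Req}\setminus\mathit{Sbk})$ term the referee asked about.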
pg 6: In the solution of the set equations - what is the
optimality criterion? The text mentions "minimal" and
"optimal" solutions. Naively looking at the equations in (6)
it seems Opt_2^- is the only part occurring twice so any
partition that leaves Opt_2^- empty seems to be better
in terms of the total size of Sbk, Snp, Req_D, Req_C ?
So I suspect total memory footprint isn't the criterion.
>> Right. I added that these sets are minimal in the sense that any
>> semantics-preserving choice of the 4 sets is
>> larger or equal to one of these minimal quadruplets.
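The minimality notion in the reply can be sketched as a componentwise (Pareto-style) comparison of quadruplets; the following Python snippet is our own formulation, with made-up candidate sets purely for illustration.

```python
# A quadruplet (Sbk, Snp, Req_D, Req_C) is minimal when no other valid
# quadruplet is componentwise contained in it; every semantics-preserving
# choice is then componentwise >= some minimal quadruplet.

def dominated_by(q, other):
    # 'other' is componentwise a subset of q, and strictly smaller somewhere
    return all(o <= s for o, s in zip(other, q)) and other != q

def minimal_quadruplets(valid):
    return [q for q in valid if not any(dominated_by(q, o) for o in valid)]

fs = frozenset
valid = [
    (fs({1}), fs(), fs({2}), fs()),      # minimal
    (fs({1, 3}), fs(), fs({2}), fs()),   # needlessly enlarges Sbk
    (fs({1}), fs({4}), fs({2}), fs()),   # needlessly enlarges Snp
]
```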
pg 8: Please clarify "... choices of the other checkpoints around."
What is around? It could be that once the notation issue raised
under the major issue "A" is addressed this may well become clear.
>> Rephrased.
--------------------------------------------------------------
[Overview]
More explanation on notation is needed.
>> added explanations on the \overline notation
>> and the rewrite rule on page 4
>> However, we will also have to spare half a page,
>> so sometimes we have to rely on commonly
>> used notation.
[Comments]
page 2, line -8 (8 from the bottom)
The term "out" is not defined here.
>> removed and explained on next occurrence page 4
page 3, line 11
The meaning of the semicolon in the notation
$\overline{D}\Doteq\overrightarrow{D};\overleftarrow{D}$
should be explained.
>> done. This use of semicolon for code sequence is widespread.
page 3, Figure 2
The term "Sbk" is not defined here.
>> Right, but it is defined upon its first occurrence in the text
page 4, line 3
The notation "$Req\vdash \overline{C;D}$" was hard to understand
because PUSH(Sbk) is undefined here.
page 5, line 15 ( eq.(3))
The reason for the disappearance of "PUSH(Sbk)" from (1) should be explained.
>> done. A PUSH does not overwrite any program variable.
--------------------------------------------
1. Introduction
You write: ... it (reverse mode AD) suffers from a high memory consumption...
That is only true for the store-all or store-minimal strategy
implemented in Tapenade, ADIFOR 3, and OpenAD.
If every intermediate result is recomputed then reverse mode AD needs little
memory but suffers from computational costs!
You might want to read and cite other papers about AD.
You write: This memory consumption is inherent to the nature of
gradient computation.
That is obviously wrong.
You write: to solve the adjoint equations ... is also known to consume
memory space.
What do you mean by this? Almost every computation requires memory
(it doesn't consume it like eating bread).
AD generated code also solves the adjoint equation!
The first two paragraphs in the introduction need to be rewritten totally.
>> Rewritten as:
>> The methods to compute gradients can be classified in two categories.
>> In the first category, methods use CPU more intensively because
>> several operations are duplicated. This can be through
>> repeated tangent directional derivatives, or through reverse
>> Automatic Differentiation using the ``Recompute-All'' strategy.
>> This is not the context of this paper. In the second category,
>> methods spare duplicated operations through increased memory use.
>> This encompasses hand-coded resolution of the ``adjoint equations''
>> and reverse Automatic Differentiation using the ``Store-All'' strategy,
>> which is the context of this work.
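The CPU/memory trade-off named in the rewritten paragraph can be illustrated on a toy chain of operations; the sketch below is ours (assumed function names, not the paper's code), contrasting a Store-All reverse sweep against a Recompute-All one.

```python
# Toy reverse-mode AD of y = f(f(f(x))) with f(x) = x*x, so y = x^8.

def f(x):      return x * x          # stand-in elementary operation
def df(x, yb): return 2 * x * yb     # its adjoint step

def grad_store_all(x0, n=3):
    # Store-All: tape every intermediate on the way forward ...
    tape, x = [], x0
    for _ in range(n):
        tape.append(x)
        x = f(x)
    # ... then pop them during a single backward sweep (memory-hungry).
    xb = 1.0
    for x in reversed(tape):
        xb = df(x, xb)
    return xb

def grad_recompute_all(x0, n=3):
    # Recompute-All: no tape; rerun the forward sweep from x0 each time
    # an intermediate is needed (CPU-hungry, little memory).
    xb = 1.0
    for i in reversed(range(n)):
        x = x0
        for _ in range(i):           # recompute the i-th intermediate
            x = f(x)
        xb = df(x, xb)
    return xb
```

Both return d(x^8)/dx; checkpointing, the paper's topic, interpolates between these two extremes.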
2. Reverse Automatic Differentiation
You write: AD is a program transformation technique.
... an AD tool creates a new program ...
no, see tools based on operator overloading
>> ok. Text modified.
TBR = to be recorded not to be restored
>> Right. Fixed.
What is the definition of 'Req'?
>> Explained in section 2. Rephrased.
What is the difference between Req and use(U)?
>> In the paper we define Req as use(\overleftarrow{U}),
>> which is smaller than use(U) as said in section 2 and
>> detailed in refs [2,6]
What is the definition of OUT?
I guess what you name OUT is actually termed KILL in data flow analysis.
>> Defined now page 4. KILL usually denotes scalar and array
>> variables completely overwritten. OUT is for variables even
>> only partly overwritten.
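The OUT-versus-KILL distinction drawn in the reply can be shown on a tiny example (our own formulation; the representation of a code segment as write records is purely illustrative).

```python
# KILL keeps only variables completely overwritten by a segment, while
# out() also keeps variables that are only partly overwritten.

def analyze(segment):
    # segment: list of (variable, set_of_indices_written, total_size)
    out_set, kill_set = set(), set()
    for var, written, size in segment:
        out_set.add(var)              # any write puts var into out()
        if len(written) == size:      # full overwrite -> also into KILL
            kill_set.add(var)
    return out_set, kill_set

# a(1:10) fully overwritten; b(1:10) written only at b(3)
out_set, kill_set = analyze([("a", set(range(10)), 10),
                             ("b", {3}, 10)])
```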
3. The equation of checkpointing snapshots
You write: This allows the tape consumed ... to be freed ...
That is also true without checkpointing.
The tape consumed by C is relevant here,
it is needed after adjoint of D has finished.
>> Both what you write and what we wrote is true,
>> and both lead to the conclusion stated by
>> the sentence that follows.
What is Sbk?
>> Defined in detail on its first occurrence in the text, page 4.
>> True it appears on figure 2 before, but it is just a
>> support for the discussion on page 4.
You write: This also uses memory space, although less than the tape for C.
That is not always true.
>> Ok, but one can almost always write pathological examples
>> for any such assumption. In real life, the snapshot is really
>> smaller. It's an interesting question to detect those "C"'s
>> for which snapshots are larger than the tape, but
>> it's not in the scope of this paper.
4 Discussion and Experimental results
Table 1: What run-time is shown, the function or the adjoint?
One needs to see the run-time of function, eager adjoint, and lazy adjoint.
What is the memory usage and run-time without checkpointing?
>> added run-time of function.
>> Run-times with eager and lazy do not differ significantly.
>> Most examples would simply run out of memory if no
>> checkpointing was made.
5 Conclusions
I am missing a discussion of the computational cost of checkpointing,
i.e. the additional flops and the cost of writing and reading the tape.
The central question is what is the best mixture of recomputations and
taping, not memory usage alone.
>> Certainly a very general and important question.
>> This paper does not claim to answer this question.
>> But it gives an element to understand it better.