I'm having some trouble understanding the purpose of causal convolutions. Suppose I'm doing time-series classification using a convolutional network with 1 layer and kernel=2, stride=1, and no dilation. Isn't a causal convolution the same thing as shifting my output back by 1?
For larger networks, it would be a little more involved: you'd have to take the parameters of all the layers into account to get the resulting receptive field and do a proper output shift. Still, it seems to me that if there is some leak, you could always account for it by shifting the output back.
For example, if at time step $t_2$ a non-causal CNN sees $x_0, x_1, x_2, x_3, x_4$, then you'd use the target associated with $t_4$, i.e. $y_4$.
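To make the shift arithmetic above concrete, here is a minimal sketch in plain Python. The helper names `receptive_field` and `lookahead` are hypothetical, and the sketch assumes stride-1 layers, odd kernels for the centred (non-causal) case, and the convention that `dilation=1` means no dilation. The two-layer, kernel-3 network is only an illustration chosen to reproduce the $x_0 \dots x_4$ example.

```python
def receptive_field(layers):
    """Number of input steps seen by a stack of stride-1 conv layers.

    layers: list of (kernel_size, dilation) pairs; dilation=1 means no dilation.
    """
    return 1 + sum(d * (k - 1) for k, d in layers)


def lookahead(layers):
    """Future steps seen by a stack of centred (non-causal) layers with odd kernels."""
    return sum(d * (k - 1) // 2 for k, d in layers)


# Hypothetical network: two centred layers with kernel 3 and no dilation.
layers = [(3, 1), (3, 1)]
rf = receptive_field(layers)   # total number of inputs seen per output
shift = lookahead(layers)      # how many of those lie in the future

t = 2
first_seen = t - (rf - 1 - shift)
inputs_seen = list(range(first_seen, t + shift + 1))  # indices of x seen at t
target_index = t + shift       # target to pair with the output at t

print(inputs_seen, target_index)
```

Under these assumptions the output at $t_2$ sees $x_0$ through $x_4$ and would be paired with $y_4$, which is the shift described above.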
Edit: I've seen diagrams for causal CNNs where all the arrows are right-aligned. I get that it's meant to illustrate that $y_t$ aligns to $x_t$, but couldn't you just as easily draw them like this:
 
                        
The point of causal convolutions is to avoid seeing 'future' data. This is important in real-time sequential analysis because we won't have access to new information before it happens; during training, however, we typically do, because the whole training sequence is available. Therefore, causal convolutions begin at `t-k//2` and end at `t` (where `t` = current timestep and `k` = kernel size), rather than a typical convolution, which starts at `t-k//2` and ends at `t+k//2`. This can be imagined as a one-sided kernel: instead of having the target pixel/sample in the centre of the kernel, it's now the rightmost (going from L-R) part of the kernel.

Using your example, if the top orange dot in the following picture is `t_n`, then `t_n` has a receptive field stemming from `t_n-4` to `t_n` due to it having a kernel size of 2 and 4 layers.

Compare that to a non-causal convolution (ignore the dilated convolution on the right), where the receptive field stems from `t_n-3` to `t_n+3` due to it being a double-sided kernel.
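The one-sided versus double-sided distinction can also be checked numerically. Below is a minimal sketch in plain Python; `conv1d` is a hypothetical helper, not a library function. It assumes stride 1, zero padding, and that `w` holds exactly the taps that are applied, so with `causal=True` a kernel of length `len(w)` covers `t-(len(w)-1)` to `t`, while with `causal=False` it is centred on `t`. The weights and inputs are made-up values.

```python
def conv1d(x, w, causal):
    """Stride-1 1D convolution (cross-correlation) with zero padding.

    Returns one output per input step. With causal=True all padding goes
    on the left, so output[t] depends only on x[t-(k-1)] ... x[t].
    With causal=False the kernel is centred on t.
    """
    k = len(w)
    left = k - 1 if causal else k // 2
    right = k - 1 - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return [sum(w[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


w = [0.5, 0.25, 0.25]            # made-up kernel, size 3
x = [1, 2, 3, 4, 5, 6]
x_changed = x[:-1] + [100]       # same history, different final sample

causal_a = conv1d(x, w, causal=True)
causal_b = conv1d(x_changed, w, causal=True)
centred_a = conv1d(x, w, causal=False)
centred_b = conv1d(x_changed, w, causal=False)

# Outputs before the last step are unaffected in the causal case ...
print(causal_a[:-1] == causal_b[:-1])
# ... but the centred output at the second-to-last step has seen the change.
print(centred_a[-2] == centred_b[-2])
```

Changing only the final sample leaves every earlier causal output untouched, whereas the centred convolution's output one step earlier already changes, which is the 'future' leak that the causal kernel is meant to prevent.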