If we want to stack an LSTM on top of convolutional layers, we can simply do so, but we need to
reshape the output of the convolutions to match the LSTM's expected input shape. The code below shows an implementation in Keras:
    from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Reshape, LSTM
    from tensorflow.keras.models import Model

    T = 128              # sequence length (time steps)
    D = 64               # dimensionality of each sample
    N_FILTERS = 32       # number of convolutional filters
    LSTM_T = T // 8      # time steps after pooling
    LSTM_D = D // 2      # feature dimension after pooling
    LSTM_STATE = 128     # LSTM state size
    POOL = (8, 2)        # pooling window (time, features)

    i = Input(shape=(T, D, 1))                          # (None, 128, 64, 1)
    x = Conv2D(N_FILTERS, (3, 3), padding='same')(i)    # (None, 128, 64, 32)
    x = MaxPooling2D(pool_size=POOL)(x)                 # (None, 16, 32, 32)
    x = Reshape((LSTM_T, LSTM_D * N_FILTERS))(x)        # (None, 16, 1024)
    x = LSTM(LSTM_STATE, return_sequences=True)(x)      # (None, 16, 128)
    model = Model(i, x)
    model.summary()
In this example we want to learn a convolutional LSTM on sequences of length 128 with 64-dimensional samples. The first layer is a convolutional layer with 32 filters, so the outputs are 32
sequences, one per filter. We then pool the sequences with an (8, 2) window, which shrinks each of the 32 sequences to size (128 / 8 = 16, 64 / 2 = 32). Next we have to merge the feature dimension and the filter responses into a single dimension of size 32 * 32 = 1024, because the LSTM requires a rank-2 tensor (or rank 3 with the batch dimension) in which the first dimension is the time step and the second is the frame at that step. Finally, we add the LSTM layer on top of the reshaped sequence.