11 Using the generator to train a merged model
A merged model combines the output of two (or more) Sequential models. Commonly, one branch (let’s say the left) would be a convnet trained on image data while the right branch would take some other non-image feature data, such as the output of an LSTM model (see video question example) or simply a “traditional” design matrix of numerical features. A key point relevant to our discussion is that the design matrix for the right branch will have as many rows as there are images in the training, validation and testing data, so that the output of the left model can be concatenated column-wise with the right-side design matrix.
The idea of using a generator for training and validating a merged model has been posed and explored in Keras issues 3386 and 3466 but only for design matrices that can fit into memory. Given our use case of millions of images, if the design matrix consists of thousands of features, then it is unlikely that it will fit into main memory and so we expand on our generator above to address this case.
Let’s elaborate by example: in the simple Sequential model discussed initially in this post, the generator returns a tuple of input matrix X and target vector Y for each mini-batch. Together, X and Y represent a random sample of the main DataFrame, whose length represents all the samples in one epoch. Each row of X contains the pixel values of the image read from the file path found in the DataFrame and each cell of Y contains the target label for that image. In a merged model, we want an additional row of numerical feature data to be associated with each image. For each mini-batch, then, the generator will return the same X and Y as before and also a second matrix X2 with the same number of rows (the number of columns represents the number of features of this second matrix). All of the mini-batch matrices X2 for the epoch, when stacked on top of each other, are just the total feature matrix for the right-side model that we will call
bc (to honor
Returning (reluctantly) to our dogs and cats example, one could imagine a set of 12 numbers that represents the monthly weight (normalized, say, to the birth weight) for the first year of growth for each cat or dog in the associated image (not that such data is available for the Kaggle data, I’m just making this up…). The merged model, then, would take into account the (presumably) different growth rates of dogs and cats and might improve upon the images-only model.
X is input into the same convolutional network as before. The output of the last convnet layer is concatenated column-wise with X2, and the resulting matrix is input into a fully connected layer, whose output is compared with Y in the overall loss function.
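To make the column-wise concatenation concrete, here is a minimal NumPy sketch of the shapes involved. It is not the Keras model from the accompanying script: the convnet output is replaced by random numbers, and the names conv_out, W and b, the width of 64 and the sigmoid output layer are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size = 4
n_conv_features = 64    # hypothetical width of the flattened last convnet layer
n_extra_features = 12   # e.g. the 12 monthly weights from the example above

# Stand-in for the convnet output of one mini-batch (one row per image).
conv_out = rng.random((batch_size, n_conv_features))
# Right-branch features for the same images, in the same row order.
X2 = rng.random((batch_size, n_extra_features))

# Column-wise concatenation: rows stay aligned per image, column counts add up.
merged = np.concatenate([conv_out, X2], axis=1)

# Hypothetical fully connected layer with a sigmoid output (dog vs. cat).
W = rng.standard_normal((n_conv_features + n_extra_features, 1)) * 0.01
b = np.zeros(1)
pred = 1.0 / (1.0 + np.exp(-(merged @ W + b)))

print(merged.shape, pred.shape)
```

The point to take away is that the two branches only line up if row k of X2 describes the same image as row k of X, which is why the rest of this section is so concerned with ordering.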
If the secondary matrix bc can fit into memory, then it can be passed directly to the generator, and the index idx of the entire shuffled DataFrame from the top of the epoch can be used to rearrange bc into the same order. Chunks of bc for each mini-batch are then found using slice indexes such as j, which are incremented by the size of the mini-batch during each iteration of the mini-batch loop.
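A rough sketch of that in-memory case might look like the following. This is an illustration under stated assumptions, not the generator from akmtdfgen.py: the function name merged_batches, the load_images callback, the start index i and the fake data are all hypothetical, and the DataFrame index is represented by a plain NumPy permutation.

```python
import numpy as np

def merged_batches(n_samples, bc, batch_size, load_images, labels, rng):
    """Yield ([X, X2], Y) mini-batches for one epoch (illustrative sketch)."""
    idx = rng.permutation(n_samples)   # shuffled index at the top of the epoch
    bc_shuffled = bc[idx]              # rearrange bc once to match the shuffle
    i, j = 0, batch_size               # i is an assumed name for the start index
    while i < n_samples:
        batch_idx = idx[i:j]
        X = load_images(batch_idx)     # hypothetical: read pixels for these rows
        X2 = bc_shuffled[i:j]          # matching chunk of the feature matrix
        Y = labels[batch_idx]
        yield [X, X2], Y
        i += batch_size                # both indexes advance by the batch size
        j += batch_size

# Tiny usage example with fake data: the "image" of row r is just the number r.
n, k = 10, 3
bc = np.arange(n * k, dtype=float).reshape(n, k)
labels = np.arange(n) % 2
fake_loader = lambda rows: rows.reshape(-1, 1).astype(float)
batches = list(merged_batches(n, bc, 4, fake_loader, labels,
                              np.random.default_rng(0)))
print(len(batches))
```

Because the fake image encodes its own row number, it is easy to check that every row of X2 and every label in Y belongs to the image in the same position of X.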
11.1 Feature array is too large for memory
However, our secondary feature data matrix bc is too large to fit into memory: consider a feature matrix for millions of images with thousands of columns. When the main DataFrame is shuffled at the beginning of each epoch and each mini-batch is selected as a chunk, the appropriate rows of bc must be selected in the same order as the shuffled mini-batch. In such a case, we use the splendid efficiency of bcolz.
Let’s say the index column of the shuffled mini-batch of the DataFrame is called idx and contains the values [13, 5, 1, 7]. Using the example above, our mini-batch then consists of the four DataFrame rows with those index values, in that order.
Those four image files are read and their pixels returned as rows of X. The index idx is then used to select the matching rows of the bcolz array on disk.
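The lookup can be sketched as follows. Note the swap: so that the example runs without bcolz installed, it uses numpy.memmap as a stand-in for the on-disk bcolz array, which means there is no compression here. The file name bc.dat, the matrix size and the values are made up for illustration. As far as I know, a bcolz carray accepts a list of integer indexes in the same way, but treat that as an assumption and check it against your installed version.

```python
import os
import tempfile
import numpy as np

n_samples, n_features = 20, 5

# Write a feature matrix to disk (stand-in for the on-disk bcolz array bc).
path = os.path.join(tempfile.mkdtemp(), "bc.dat")
on_disk = np.memmap(path, dtype="float32", mode="w+",
                    shape=(n_samples, n_features))
on_disk[:] = np.arange(n_samples * n_features,
                       dtype="float32").reshape(n_samples, n_features)
on_disk.flush()
del on_disk

# Reopen read-only: rows are fetched from disk only when they are indexed.
bc = np.memmap(path, dtype="float32", mode="r",
               shape=(n_samples, n_features))

# Index values of the shuffled mini-batch, in mini-batch order.
idx = [13, 5, 1, 7]

# Fancy indexing returns the rows in exactly the order given by idx,
# so X2 lines up row for row with the images in X.
X2 = np.asarray(bc[idx])
print(X2.shape)
```

Only the four requested rows are materialized in memory for the mini-batch; the full matrix stays on disk.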
Because bcolz arrays stored on disk are compressed and very fast to read, this is an efficient way to store and use sparse matrices without using a special sparse matrix data structure. There may be a not-insignificant performance hit, since the bcolz file on disk must be accessed during each mini-batch, but this is the trade-off made for being able to process huge amounts of feature data with limited machine memory. Well, as Billy Bragg once told me, you have to learn to take the crunchy with the smooth, I suppose.
We implemented a merged model with random X2 data for the (sigh, sadly) dogs and cats in the script classifier_from_little_data_script_dfgen_merged.py. It will not improve your model accuracy (really, it's just random feature data we created), but it will demonstrate how to train a merged model with millions of images and with feature data that will not fit into memory. Be sure to run (and understand) the generator testing code first by running the other file, `python akmtdfgen.py`.
Also, see the code for some potential pain involving multithreading if you have an older version of bcolz and/or its underlying blosc compression library. Whether or not you use conda, the following will make sure you have the latest: `pip install --upgrade bcolz`. The docstring of akmtdfgen.py contains the output of bcolz.print_versions() used for this post.
That’s it! Now you have the tools to train your models on as much data as you can store on disk. You will be time (and budget) constrained but no longer machine memory constrained. See (and run!) the accompanying code to really understand how this all works. Keras is a moving target of goodness and we must stay on our toes.