I recently wanted to use Keras, a deep learning framework, to solve an image classification problem and ran into an issue. Keras' built-in image loading functions assume that the training data is organized in a single folder with a subfolder for each class of images. This structure is then replicated for the validation data, unless Keras' automatic validation split is used. In my case the data was spread out over several folders (an artifact of how the data was sourced), and it would have been impractical to copy it, as it was already taking up a significant part of the total disk space on the development system.

The solution to this is to use Keras generators. There are two kinds of generators in Keras: a simple Python generator using yield, or a class inheriting from keras.utils.Sequence. The latter is the more flexible one and is what this post focuses on.
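For contrast, here is a minimal sketch of the first kind: a plain Python generator that loops forever and yields one batch at a time (images, labels and batch_size are placeholder names, not part of the code in this post):

    def simple_batch_generator(images, labels, batch_size=32):
        # loop forever; Keras stops after steps_per_epoch batches per epoch
        while True:
            for i in range(0, len(images), batch_size):
                yield images[i:i + batch_size], labels[i:i + batch_size]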

My initial attempt did work, but it was rather messy to use, and when I needed to extend it to split the data into three parts (test, validation and training), doing so in the original design would have been very messy. So I took a step back and decided that I wanted the following operations:

  • create an empty generator
  • add a directory with files to the generator
    (this could be extended to add data from other sources or directory structures)
  • shuffle the data
  • split the generator into new generators using a list of split points (real numbers between 0 and 1)
  • a way to get the class names of the generator
  • a way to get the filenames of images yielded by the generator

Of these, the key operations are the splitting and the mapping of generated images to filenames. The splitting is important as it lets us control how many sets we split our data into and how large they are, allowing for training, validation and test sets or more. The mapping of images back to filenames is important as it allows us to use the generators for prediction, as well as to generate lists of the images the network gets wrong for manual analysis of the network's behaviour.
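As a sketch of that last use case, assuming a trained model and a test_set generator with auto shuffle turned off (so the epoch order stays fixed while we iterate) and one-hot encoded labels:

    import numpy as np

    # run the whole generator through the model, batch by batch
    predictions, truth = [], []
    for i in range(len(test_set)):
        batch_images, batch_labels = test_set[i]
        predictions.append(model.predict(batch_images))
        truth.append(batch_labels)
    predictions = np.concatenate(predictions)
    truth = np.concatenate(truth)

    # indices (in the current epoch) of the images the network got wrong
    wrong = [i for i in range(len(truth))
             if np.argmax(predictions[i]) != np.argmax(truth[i])]

    # map the indices back to filenames for manual inspection
    for name in test_set.get_filenames(wrong):
        print(name)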

In addition to these, some further operations were included later as the need for them became apparent:

  • A function to set constructor properties after the fact, such as verbosity
  • A function to preload the images into a cache
  • Controls for the batch size used
  • Controls for restricting the maximum number of images per class each epoch

While not central to the functioning of the generator, these features proved necessary in practical use.
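For example, a typical tuning call might look like this (the values are made up; since set returns self, the calls chain):

    # limit class imbalance, set the batch size and preload with progress output
    training_set.set(batch_size=32, max_per_class_and_epoch=500, verbose=True).preload()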


To create a generator based on keras.utils.Sequence, we are required to provide a few methods to get it to work:

    class SplitSetImageGenerator(keras.utils.Sequence):
        def __getitem__(self, index):
            # gets the batch for the supplied index
            # return a tuple (numpy array of images, numpy array of labels) or None at epoch end
        def __len__(self):
            # gets the number of batches
            # return the number of batches in this epoch (do not change in the middle of an epoch)
        def on_epoch_end(self):
            # performs auto shuffle if enabled
            # do what we need to do between epochs
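As a rough sketch of what minimal implementations of these three methods could look like (this assumes the generator holds numpy arrays of images and labels plus a batch size, none of which exist yet at this point in the real class):

    import numpy as np
    import keras

    class MinimalSequence(keras.utils.Sequence):
        def __init__(self, images, labels, batch_size=32, auto_shuffle=True):
            self.images, self.labels = images, labels
            self.batch_size, self.auto_shuffle = batch_size, auto_shuffle

        def __len__(self):
            # number of batches, rounding up so the last partial batch is included
            return int(np.ceil(len(self.images) / self.batch_size))

        def __getitem__(self, index):
            i = index * self.batch_size
            return (self.images[i:i + self.batch_size],
                    self.labels[i:i + self.batch_size])

        def on_epoch_end(self):
            # reshuffle between epochs if enabled
            if self.auto_shuffle:
                p = np.random.permutation(len(self.images))
                self.images, self.labels = self.images[p], self.labels[p]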

Adding our own methods to that skeleton, we arrive at:

    class SplitSetImageGenerator(keras.utils.Sequence):
        def __init__(self):
            # do initialization
        def set(self, **attributes):
            # set some config property, e.g. batch_size, verbose or max_per_class_and_epoch
        def add_dir(self, image_dir_reader, *paths):
            # add the directories in paths to this generator as image sources
            # image_dir_reader should be a function returning a tuple of lists:
            #   names        - filenames of images
            #   classes      - class of each image as a number
            #   classnames   - names of all the classes in the directory
            #   classindices - companion list to classnames mapping each name to its number
        def shuffle(self):
            # shuffle the contents without losing filename associations
        def preload(self):
            # load all images, which will cache them if caching is configured
        def split(self, *splitpoints):
            # splits the generator at the provided fractions of all images;
            # duplicate fractions generate empty child generators and
            # decreasing fractions are disallowed
        def get_filenames(self, indices):
            # returns the filenames of the images corresponding to the indices in the current epoch
        def __getitem__(self, index):
            # gets the batch for the supplied index
            # return a tuple (numpy array of images, numpy array of labels) or None at epoch end
        def __len__(self):
            # gets the number of batches
            # return the number of batches in this epoch (do not change in the middle of an epoch)
        def on_epoch_end(self):
            # performs auto shuffle if enabled
            # do what we need to do between epochs
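The split operation deserves a closer look. Here is a sketch of the underlying index arithmetic, separate from the actual implementation (which also has to carry filenames, classes and configuration over to the child generators):

    def split_indices(count, *splitpoints):
        # turn fractions such as 0.2, 0.4 into the index ranges
        # [0, 20%), [20%, 40%) and [40%, 100%) of the data
        bounds = [0] + [int(round(p * count)) for p in splitpoints] + [count]
        return list(zip(bounds, bounds[1:]))

    # split_indices(1000, 0.2, 0.4) -> [(0, 200), (200, 400), (400, 1000)]

Note that a duplicated fraction yields an empty range, which is why duplicate split points produce empty child generators.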

With these methods in place we can start to write useful code. If we adopt the convention that all methods except split, get_filenames and the methods from keras.utils.Sequence return self, we can now do:

    training, validation = SplitSetImageGenerator().add_dir(*paths).shuffle().preload().split(0.8)
    model.fit_generator(training, validation_data=validation, epochs=10)

Once we have this in place we will not add any more external methods. We will, however, define some useful properties on the generator that a user can access. The primary ones, and the most useful to access, are (a short usage sketch follows the list):

  • filenames - a list of all filenames known to the generator
  • classes - a corresponding list of class numbers for each filename
  • classnames - a list where class names can be looked up from class numbers
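As a quick sketch of how these properties fit together (assuming a populated generator named training_set):

    from collections import Counter

    # class distribution of the generator's contents
    counts = Counter(training_set.classnames[c] for c in training_set.classes)
    print(counts)

    # filename and class name of the first image
    print(training_set.filenames[0],
          training_set.classnames[training_set.classes[0]])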

Some further properties we will define, mostly to configure the behaviour of the generator (via __init__ or the set method), are:

  • batch_size - the number of images returned on each call of __getitem__
  • verbose - to spam or not to spam stdout
  • max_per_class_and_epoch - a limit on how many images of each class to return
  • auto_shuffle - if the generator should be shuffled between epochs
  • scale - a number to scale all pixel values in an image by
  • image_load_function - a function that can load an image into a numpy array
  • image_cache - a cache object that can be passed to the image load function

I think most of these are rather obvious; the one I want to comment on is max_per_class_and_epoch. I added it after I ran into problems with training: it turned out I had many more examples of one class than of the others, so training got stuck in a local maximum where the network always predicted that class. This option solved the problem by ensuring that in each epoch the generator produces the same number of images of each class, as long as its value is set lower than the number of images in the smallest class in the training set.
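A sketch of the idea behind that balancing, as pure index bookkeeping (the real generator also has to redo this every epoch so that different images get picked each time):

    import numpy as np

    def balanced_epoch_indices(classes, max_per_class):
        # pick at most max_per_class indices of each class for one epoch
        classes = np.asarray(classes)
        chosen = []
        for c in np.unique(classes):
            members = np.flatnonzero(classes == c)
            np.random.shuffle(members)
            chosen.extend(members[:max_per_class])
        np.random.shuffle(chosen)
        return chosen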

I will not go through the implementation in detail; if you are interested, you can look at the source yourself. I will, however, show some examples of how to use the code.


To use the generator some steps are needed, and others are probably recommended. The following example shows how to read images from a folder in the same manner as Keras' built-in image data generator, and then split that dataset in a consistent way. I will be using the EuroSAT dataset, available at https://github.com/phelber/eurosat, in this example.

    # build the data generators
    test_validation_train_split = [0.2, 0.4]
    test_set, validation_set, training_set = [
        dataset.set(verbose=False)
        for dataset in SplitSetImageGenerator(image_load_function=read_image,
                                              scale=1.0/255)
            .add_dir(image_data_generator_dir_reader, 'data/EuroSat/jpg/')
            .shuffle()
            .split(*test_validation_train_split)]

    # preload images to speed up training
    for s in [validation_set, training_set]:
        s.set(verbose=True).preload().set(verbose=False).shuffle()

As can be seen from the code, we start by creating the image generator, passing it an image load function (defined below) and a scale factor (here used to scale pixel values into the range 0-1). We then add a directory of data to the generator by passing a reader function (also defined below) as well as a path to a directory of images. At this point we have a generator capable of being used for training and so on.

In the next step we shuffle the generator to avoid the risk that all the images of some class end up in the same part of the data when we split into test, training and validation sets. We follow the shuffle by splitting the data, placing the range 0%-20% in the first set, 20%-40% in the next and 40%-100% in the last. We then disable verbosity for all sets and store them as test, validation and training sets.

The final step is preloading the images in the validation and training set to avoid slowdowns caused by disk access during training.

To make this work we need to define the functions for reading a directory and for reading the individual image files. We do that using the following code.

    read_image_cache = {}

    def read_image(path, rescale=None):
        key = "{},{}".format(path, rescale)
        if key in read_image_cache:
            return read_image_cache[key]
        else:
            img = image.load_img(path)
            data = image.img_to_array(img)
            if rescale is not None:
                data = data * rescale
            read_image_cache[key] = data
            return data

    # function to return filenames and classes of images
    # also returns a list of class names and a list of class indices corresponding to the class names
    def image_data_generator_dir_reader(path):
        sys.stdout = sys.stderr  # redirect problematic output
        # here we use the keras ImageDataGenerator to get a list of filenames and classes
        ig = image.ImageDataGenerator()
        gen = ig.flow_from_directory(path)
        sys.stdout = sys.__stdout__  # restore stdout
        names = [os.path.normpath(path + '/' + n.replace('\\', '/')).replace('\\', '/')
                 for n in gen.filenames]
        return (names, gen.classes, *zip(*gen.class_indices.items()))

The first of these functions reads a single image using Keras' load_img function, applies any supplied rescaling, and caches the result.

The second function uses the Keras ImageDataGenerator to get filenames and their classes from a directory. If the data is stored in some organisation other than the one handled by Keras' ImageDataGenerator, we only need to supply a function of this type that can read that format to add_dir; the rest of the code can then be used unchanged, without reorganizing data on disk. Also, as was the original motivation, we are not restricted to one call to add_dir but can add many directories if we have several datasets we want to combine.
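As a sketch of what such a function could look like for a different layout, here is a hypothetical reader for a directory where labels live in a labels.csv file (one filename,classname pair per line); the file format is made up, but the returned tuple matches what add_dir expects:

    import csv
    import os

    def csv_dir_reader(path):
        names, rawclasses = [], []
        with open(os.path.join(path, 'labels.csv')) as f:
            for filename, classname in csv.reader(f):
                names.append(os.path.join(path, filename))
                rawclasses.append(classname)
        # build the classnames/classindices companion lists
        classnames = sorted(set(rawclasses))
        lookup = {name: i for i, name in enumerate(classnames)}
        classes = [lookup[c] for c in rawclasses]
        return (names, classes, classnames, list(range(len(classnames))))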

Having read the data, we can define a simple model and train a network using the following code (full source here: example.py).

    ################### MODEL DEFINITION ###################
    # this is not an optimized model, just a simple example
    # for good results this model needs some thought
    model = Sequential()
    model.add(Conv2D(60, 5, input_shape=training_set.shape))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(20, 5, input_shape=training_set.shape))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(120))
    model.add(Dense(60))
    model.add(Dense(training_set.num_classes))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(lr=0.0001, decay=0.001),
                  metrics=['accuracy'])

    ################### TRAINING ###################
    history = model.fit_generator(
        training_set,
        validation_data=validation_set,
        epochs=120
    )

Running this produces output as follows; as we can see, even with this basic model we still reach validation accuracies of 75%.

    Epoch 1/120
    125/125 [===========] - 39s 311ms/step - loss: 2.3054 - acc: 0.1133 - val_loss: 2.2371 - val_acc: 0.1220
    Epoch 2/120
    125/125 [===========] - 39s 312ms/step - loss: 2.0607 - acc: 0.1675 - val_loss: 1.8251 - val_acc: 0.2598
    Epoch 3/120
    125/125 [===========] - 39s 311ms/step - loss: 1.6701 - acc: 0.3575 - val_loss: 1.5512 - val_acc: 0.4070
    Epoch 4/120
    125/125 [===========] - 39s 310ms/step - loss: 1.4910 - acc: 0.4300 - val_loss: 1.4137 - val_acc: 0.4667
    Epoch 5/120
    125/125 [===========] - 39s 312ms/step - loss: 1.4091 - acc: 0.4610 - val_loss: 1.3520 - val_acc: 0.5046
    Epoch 6/120
    125/125 [===========] - 40s 318ms/step - loss: 1.3175 - acc: 0.5067 - val_loss: 1.3070 - val_acc: 0.4922
    Epoch 7/120
    125/125 [===========] - 39s 314ms/step - loss: 1.3077 - acc: 0.5090 - val_loss: 1.3010 - val_acc: 0.4761
    Epoch 8/120
    125/125 [===========] - 39s 315ms/step - loss: 1.2480 - acc: 0.5327 - val_loss: 1.2288 - val_acc: 0.5443
    Epoch 9/120
    125/125 [===========] - 39s 311ms/step - loss: 1.2073 - acc: 0.5588 - val_loss: 1.2555 - val_acc: 0.5157
    Epoch 10/120
    125/125 [===========] - 39s 310ms/step - loss: 1.2273 - acc: 0.5618 - val_loss: 1.1627 - val_acc: 0.5794
    ...
    Epoch 110/120
    125/125 [===========] - 31s 249ms/step - loss: 0.7036 - acc: 0.7505 - val_loss: 0.7085 - val_acc: 0.7424
    Epoch 111/120
    125/125 [===========] - 31s 248ms/step - loss: 0.7144 - acc: 0.7535 - val_loss: 0.7177 - val_acc: 0.7413
    Epoch 112/120
    125/125 [===========] - 31s 249ms/step - loss: 0.7088 - acc: 0.7630 - val_loss: 0.7053 - val_acc: 0.7535
    Epoch 113/120
    125/125 [===========] - 31s 249ms/step - loss: 0.6910 - acc: 0.7620 - val_loss: 0.6994 - val_acc: 0.7513
    Epoch 114/120
    125/125 [===========] - 31s 249ms/step - loss: 0.7053 - acc: 0.7518 - val_loss: 0.6969 - val_acc: 0.7531
    Epoch 115/120
    125/125 [===========] - 31s 249ms/step - loss: 0.6863 - acc: 0.7655 - val_loss: 0.6980 - val_acc: 0.7544
    Epoch 116/120
    125/125 [===========] - 31s 248ms/step - loss: 0.6859 - acc: 0.7600 - val_loss: 0.7182 - val_acc: 0.7433
    Epoch 117/120
    125/125 [===========] - 31s 249ms/step - loss: 0.7222 - acc: 0.7460 - val_loss: 0.6948 - val_acc: 0.7528
    Epoch 118/120
    125/125 [===========] - 31s 248ms/step - loss: 0.7032 - acc: 0.7602 - val_loss: 0.7140 - val_acc: 0.7444
    Epoch 119/120
    125/125 [===========] - 31s 248ms/step - loss: 0.6917 - acc: 0.7615 - val_loss: 0.6946 - val_acc: 0.7496
    Epoch 120/120
    125/125 [===========] - 31s 247ms/step - loss: 0.6862 - acc: 0.7562 - val_loss: 0.6945 - val_acc: 0.7502
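If you want a closer look at how the training progressed, one option is to plot the returned history object; here is a minimal sketch using matplotlib (the 'acc' and 'val_acc' keys match the metric names in the log above; newer Keras versions call them 'accuracy' and 'val_accuracy'):

    import matplotlib.pyplot as plt

    plt.plot(history.history['acc'], label='training accuracy')
    plt.plot(history.history['val_acc'], label='validation accuracy')
    plt.xlabel('epoch')
    plt.ylabel('accuracy')
    plt.legend()
    plt.show()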

That's all for this post. I hope to write more about machine learning in the future; if I do, you should be able to find the posts using the tags on this one.

All source code for this post:

generator: generators.py
example: example.py

Feel free to use this code for any and all purposes; consider it in the public domain, or, if that is not workable for you, use it under the terms of the MIT License.