I recently wanted to use Keras, a deep learning framework, to solve an image classification problem and ran into an issue. Keras' built-in image loading functions assume that the training data is organized in a single folder with a subfolder for each class of images. This structure is then replicated for the validation data, unless Keras' automatic validation split is used. In my case the data was spread out over several folders (an artifact of how the data was sourced), and it would have been impractical to copy it, as it was already taking up a significant part of the total disk space on the development system.

The solution to this is to use Keras generators. There are two kinds of generators in Keras: a simple Python generator using yield, or a class inheriting from keras.utils.Sequence. The latter is the more flexible one and is what this post focuses on.
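For contrast, here is a minimal sketch of the first kind: a plain Python generator that loops forever and yields one batch at a time (images, labels and batch_size are placeholder names, not part of the code in this post):

    def simple_batch_generator(images, labels, batch_size=32):
        # loop forever; Keras stops after steps_per_epoch batches per epoch
        while True:
            for i in range(0, len(images), batch_size):
                yield images[i:i + batch_size], labels[i:i + batch_size]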

My initial attempt did work, but it was rather messy to use, and when I needed to extend it to split the data into three parts (test, validation and training), doing so in the original design would have been very messy. So I took a step back and decided that I wanted the following operations:

  • create an empty generator
  • add a directory with files to the generator
    (this could be extended to add data from other sources or directory structures)
  • shuffle the data
  • split the generator into new generators using a list of split points (real numbers between 0 and 1)
  • a way to get the class names of the generator
  • a way to get the filenames of images yielded by the generator

Of these, the key operations are the splitting and the mapping of generated images to filenames. The splitting is important as it lets us control how many sets we split our data into and how large they are, allowing for training, validation and test sets or more. The mapping of images back to filenames is important as it allows us to use the generators for prediction, as well as to generate lists of the images the network gets wrong for manual analysis of the network's behaviour.
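As a sketch of that last use case, assuming a trained model and a test_set generator with auto shuffle turned off (so the epoch order stays fixed while we iterate) and one-hot encoded labels:

    import numpy as np

    # run the whole generator through the model, batch by batch
    predictions, truth = [], []
    for i in range(len(test_set)):
        batch_images, batch_labels = test_set[i]
        predictions.append(model.predict(batch_images))
        truth.append(batch_labels)
    predictions = np.concatenate(predictions)
    truth = np.concatenate(truth)

    # indices (in the current epoch) of the images the network got wrong
    wrong = [i for i in range(len(truth))
             if np.argmax(predictions[i]) != np.argmax(truth[i])]

    # map the indices back to filenames for manual inspection
    for name in test_set.get_filenames(wrong):
        print(name)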

In addition to these, some further operations were included later as the need for them became apparent:

  • A function to set constructor properties after the fact, such as verbosity
  • A function to preload the images into a cache
  • Controls for the batch size used
  • Controls for restricting the maximum number of images per class each epoch

While not central to the functioning of the generator, these features proved necessary in practical use.
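For example, a typical tuning call might look like this (the values are made up; since set returns self, the calls chain):

    # limit class imbalance, set the batch size and preload with progress output
    training_set.set(batch_size=32, max_per_class_and_epoch=500, verbose=True).preload()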


To create a generator based on keras.utils.Sequence, we are required to provide a few methods to get it to work:

    class SplitSetImageGenerator(keras.utils.Sequence):
        def __getitem__(self, index):
            # gets the batch for the supplied index
            # return a tuple (numpy array of images, numpy array of labels) or None at epoch end
        def __len__(self):
            # gets the number of batches
            # return the number of batches in this epoch (do not change in the middle of an epoch)
        def on_epoch_end(self):
            # performs auto shuffle if enabled
            # do what we need to do between epochs
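As a rough sketch of what minimal implementations of these three methods could look like (this assumes the generator holds numpy arrays of images and labels plus a batch size, none of which exist yet at this point in the real class):

    import numpy as np
    import keras

    class MinimalSequence(keras.utils.Sequence):
        def __init__(self, images, labels, batch_size=32, auto_shuffle=True):
            self.images, self.labels = images, labels
            self.batch_size, self.auto_shuffle = batch_size, auto_shuffle

        def __len__(self):
            # number of batches, rounding up so the last partial batch is included
            return int(np.ceil(len(self.images) / self.batch_size))

        def __getitem__(self, index):
            i = index * self.batch_size
            return (self.images[i:i + self.batch_size],
                    self.labels[i:i + self.batch_size])

        def on_epoch_end(self):
            # reshuffle between epochs if enabled
            if self.auto_shuffle:
                p = np.random.permutation(len(self.images))
                self.images, self.labels = self.images[p], self.labels[p]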

Adding our own methods to that skeleton, we arrive at:

    class SplitSetImageGenerator(keras.utils.Sequence):
        def __init__(self):
            # do initialization
        def set(self, **attributes):
            # set some config property, e.g. batch_size, verbose or max_per_class_and_epoch
        def add_dir(self, image_dir_reader, *paths):
            # add the directories in paths to this generator as image sources
            # image_dir_reader should be a function returning a tuple of lists:
            #   names        - filenames of images
            #   classes      - class of each image as a number
            #   classnames   - names of all the classes in the directory
            #   classindices - companion list to classnames mapping each name to its number
        def shuffle(self):
            # shuffle the contents without losing filename associations
        def preload(self):
            # load all images, which will cache them if caching is configured
        def split(self, *splitpoints):
            # splits the generator at the provided fractions of all images;
            # duplicate fractions generate empty child generators and
            # decreasing fractions are disallowed
        def get_filenames(self, indices):
            # returns the filenames of the images corresponding to the indices in the current epoch
        def __getitem__(self, index):
            # gets the batch for the supplied index
            # return a tuple (numpy array of images, numpy array of labels) or None at epoch end
        def __len__(self):
            # gets the number of batches
            # return the number of batches in this epoch (do not change in the middle of an epoch)
        def on_epoch_end(self):
            # performs auto shuffle if enabled
            # do what we need to do between epochs
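The split operation deserves a closer look. Here is a sketch of the underlying index arithmetic, separate from the actual implementation (which also has to carry filenames, classes and configuration over to the child generators):

    def split_indices(count, *splitpoints):
        # turn fractions such as 0.2, 0.4 into the index ranges
        # [0, 20%), [20%, 40%) and [40%, 100%) of the data
        bounds = [0] + [int(round(p * count)) for p in splitpoints] + [count]
        return list(zip(bounds, bounds[1:]))

    # split_indices(1000, 0.2, 0.4) -> [(0, 200), (200, 400), (400, 1000)]

Note that a duplicated fraction yields an empty range, which is why duplicate split points produce empty child generators.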

With these methods in place we can start to write useful code. If we adopt the convention that all methods except split, get_filenames and the methods from keras.utils.Sequence return self, we can now do:

    training, validation = SplitSetImageGenerator().add_dir(*paths).shuffle().preload().split(0.8)
    model.fit_generator(training, validation_data=validation, epochs=10)

Once we have this in place we will not add any more external methods. We will, however, define some useful properties on the generator that a user can access. The primary ones, and the most useful to access, are (a short usage sketch follows the list):

  • filenames - a list of all filenames known to the generator
  • classes - a corresponding list of class numbers for each filename
  • classnames - a list where class names can be looked up from class numbers
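As a quick sketch of how these properties fit together (assuming a populated generator named training_set):

    from collections import Counter

    # class distribution of the generator's contents
    counts = Counter(training_set.classnames[c] for c in training_set.classes)
    print(counts)

    # filename and class name of the first image
    print(training_set.filenames[0],
          training_set.classnames[training_set.classes[0]])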

Some further properties we will define, mostly to configure the behaviour of the generator (via __init__ or the set method), are:

  • batch_size - the number of images returned on each call of __getitem__
  • verbose - to spam or not to spam stdout
  • max_per_class_and_epoch - a limit on how many images of each class to return
  • auto_shuffle - if the generator should be shuffled between epochs
  • scale - a number to scale all pixel values in an image by
  • image_load_function - a function that can load an image into a numpy array
  • image_cache - a cache object that can be passed to the image load function

I think most of these are rather obvious; the one I want to comment on is max_per_class_and_epoch. I added it after I ran into problems with training: it turned out I had many more examples of one class than of the others, so training got stuck in a local maximum where the network always predicted that class. This option solved the problem by ensuring that in each epoch the generator produces the same number of images of each class, as long as its value is set lower than the number of images in the smallest class in the training set.
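A sketch of the idea behind that balancing, as pure index bookkeeping (the real generator also has to redo this every epoch so that different images get picked each time):

    import numpy as np

    def balanced_epoch_indices(classes, max_per_class):
        # pick at most max_per_class indices of each class for one epoch
        classes = np.asarray(classes)
        chosen = []
        for c in np.unique(classes):
            members = np.flatnonzero(classes == c)
            np.random.shuffle(members)
            chosen.extend(members[:max_per_class])
        np.random.shuffle(chosen)
        return chosen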

I will not go through the implementation in detail; if you are interested, you can look at the source yourself. I will, however, show some examples of how to use the code.


To use the generator some steps are needed, and others are probably recommended. The following example shows how to read images from a folder in the same manner as Keras' built-in image data generator, and then split that dataset in a consistent way. I will be using the EuroSAT dataset, available at https://github.com/phelber/eurosat, in this example.

    # build the data generators
    test_validation_train_split = [0.2, 0.4]
    test_set, validation_set, training_set = [
        dataset.set(verbose=False)
        for dataset in SplitSetImageGenerator(image_load_function=read_image,
                                              scale=1.0/255)
            .add_dir(image_data_generator_dir_reader, 'data/EuroSat/jpg/')
            .shuffle()
            .split(*test_validation_train_split)]

    # preload images to speed up training
    for s in [validation_set, training_set]:
        s.set(verbose=True).preload().set(verbose=False).shuffle()

As can be seen from the code, we start by creating the image generator, passing it an image load function (defined below) and a scale factor (here used to scale pixel values into the range 0-1). We then add a directory of data to the generator by passing a reader function (also defined below) as well as a path to a directory of images. At this point we have a generator capable of being used for training and so on.

In the next step we shuffle the generator to avoid the risk that all the images of some class end up in the same part of the data when we split into test, training and validation sets. We follow the shuffle by splitting the data, placing the range 0%-20% in the first set, 20%-40% in the next and 40%-100% in the last. We then disable verbosity for all sets and store them as test, validation and training sets.

The final step is preloading the images in the validation and training set to avoid slowdowns caused by disk access during training.

To make this work we need to define the functions for reading a directory and for reading the individual image files. We do that using the following code.

    read_image_cache = {}

    def read_image(path, rescale=None):
        key = "{},{}".format(path, rescale)
        if key in read_image_cache:
            return read_image_cache[key]
        else:
            img = image.load_img(path)
            data = image.img_to_array(img)
            if rescale is not None:
                data = data * rescale
            read_image_cache[key] = data
            return data

    # function to return filenames and classes of images
    # also returns a list of class names and a list of class indices corresponding to the class names
    def image_data_generator_dir_reader(path):
        sys.stdout = sys.stderr  # redirect problematic output
        # here we use the keras ImageDataGenerator to get a list of filenames and classes
        ig = image.ImageDataGenerator()
        gen = ig.flow_from_directory(path)
        sys.stdout = sys.__stdout__  # restore stdout
        names = [os.path.normpath(path + '/' + n.replace('\\', '/')).replace('\\', '/')
                 for n in gen.filenames]
        return (names, gen.classes, *zip(*gen.class_indices.items()))

The first of these functions reads a single image using Keras' load_img function, applies any supplied rescaling, and caches the result.

The second function uses the Keras ImageDataGenerator to get filenames and their classes from a directory. If the data is stored in some organisation other than the one handled by Keras' ImageDataGenerator, we only need to supply a function of this type that can read that format to add_dir; the rest of the code can then be used unchanged, without reorganizing data on disk. Also, as was the original motivation, we are not restricted to one call to add_dir but can add many directories if we have several datasets we want to combine.
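As a sketch of what such a function could look like for a different layout, here is a hypothetical reader for a directory where labels live in a labels.csv file (one filename,classname pair per line); the file format is made up, but the returned tuple matches what add_dir expects:

    import csv
    import os

    def csv_dir_reader(path):
        names, rawclasses = [], []
        with open(os.path.join(path, 'labels.csv')) as f:
            for filename, classname in csv.reader(f):
                names.append(os.path.join(path, filename))
                rawclasses.append(classname)
        # build the classnames/classindices companion lists
        classnames = sorted(set(rawclasses))
        lookup = {name: i for i, name in enumerate(classnames)}
        classes = [lookup[c] for c in rawclasses]
        return (names, classes, classnames, list(range(len(classnames))))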

Having read the data, we can define a simple model and train a network using the following code (full source here: example.py).

    ################### MODEL DEFINITION ###################
    # this is not an optimized model, just a simple example
    # for good results this model needs some thought
    model = Sequential()
    model.add(Conv2D(60, 5, input_shape=training_set.shape))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(20, 5, input_shape=training_set.shape))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(120))
    model.add(Dense(60))
    model.add(Dense(training_set.num_classes))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(lr=0.0001, decay=0.001),
                  metrics=['accuracy'])

    ################### TRAINING ###################
    history = model.fit_generator(
        training_set,
        validation_data=validation_set,
        epochs=120
    )

Running this produces output as follows; as we can see, even with this basic model we still reach validation accuracies of 75%.

    Epoch 1/120
    125/125 [===========] - 39s 311ms/step - loss: 2.3054 - acc: 0.1133 - val_loss: 2.2371 - val_acc: 0.1220
    Epoch 2/120
    125/125 [===========] - 39s 312ms/step - loss: 2.0607 - acc: 0.1675 - val_loss: 1.8251 - val_acc: 0.2598
    Epoch 3/120
    125/125 [===========] - 39s 311ms/step - loss: 1.6701 - acc: 0.3575 - val_loss: 1.5512 - val_acc: 0.4070
    Epoch 4/120
    125/125 [===========] - 39s 310ms/step - loss: 1.4910 - acc: 0.4300 - val_loss: 1.4137 - val_acc: 0.4667
    Epoch 5/120
    125/125 [===========] - 39s 312ms/step - loss: 1.4091 - acc: 0.4610 - val_loss: 1.3520 - val_acc: 0.5046
    Epoch 6/120
    125/125 [===========] - 40s 318ms/step - loss: 1.3175 - acc: 0.5067 - val_loss: 1.3070 - val_acc: 0.4922
    Epoch 7/120
    125/125 [===========] - 39s 314ms/step - loss: 1.3077 - acc: 0.5090 - val_loss: 1.3010 - val_acc: 0.4761
    Epoch 8/120
    125/125 [===========] - 39s 315ms/step - loss: 1.2480 - acc: 0.5327 - val_loss: 1.2288 - val_acc: 0.5443
    Epoch 9/120
    125/125 [===========] - 39s 311ms/step - loss: 1.2073 - acc: 0.5588 - val_loss: 1.2555 - val_acc: 0.5157
    Epoch 10/120
    125/125 [===========] - 39s 310ms/step - loss: 1.2273 - acc: 0.5618 - val_loss: 1.1627 - val_acc: 0.5794
    ...
    Epoch 110/120
    125/125 [===========] - 31s 249ms/step - loss: 0.7036 - acc: 0.7505 - val_loss: 0.7085 - val_acc: 0.7424
    Epoch 111/120
    125/125 [===========] - 31s 248ms/step - loss: 0.7144 - acc: 0.7535 - val_loss: 0.7177 - val_acc: 0.7413
    Epoch 112/120
    125/125 [===========] - 31s 249ms/step - loss: 0.7088 - acc: 0.7630 - val_loss: 0.7053 - val_acc: 0.7535
    Epoch 113/120
    125/125 [===========] - 31s 249ms/step - loss: 0.6910 - acc: 0.7620 - val_loss: 0.6994 - val_acc: 0.7513
    Epoch 114/120
    125/125 [===========] - 31s 249ms/step - loss: 0.7053 - acc: 0.7518 - val_loss: 0.6969 - val_acc: 0.7531
    Epoch 115/120
    125/125 [===========] - 31s 249ms/step - loss: 0.6863 - acc: 0.7655 - val_loss: 0.6980 - val_acc: 0.7544
    Epoch 116/120
    125/125 [===========] - 31s 248ms/step - loss: 0.6859 - acc: 0.7600 - val_loss: 0.7182 - val_acc: 0.7433
    Epoch 117/120
    125/125 [===========] - 31s 249ms/step - loss: 0.7222 - acc: 0.7460 - val_loss: 0.6948 - val_acc: 0.7528
    Epoch 118/120
    125/125 [===========] - 31s 248ms/step - loss: 0.7032 - acc: 0.7602 - val_loss: 0.7140 - val_acc: 0.7444
    Epoch 119/120
    125/125 [===========] - 31s 248ms/step - loss: 0.6917 - acc: 0.7615 - val_loss: 0.6946 - val_acc: 0.7496
    Epoch 120/120
    125/125 [===========] - 31s 247ms/step - loss: 0.6862 - acc: 0.7562 - val_loss: 0.6945 - val_acc: 0.7502
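If you want a closer look at how the training progressed, one option is to plot the returned history object; here is a minimal sketch using matplotlib (the 'acc' and 'val_acc' keys match the metric names in the log above; newer Keras versions call them 'accuracy' and 'val_accuracy'):

    import matplotlib.pyplot as plt

    plt.plot(history.history['acc'], label='training accuracy')
    plt.plot(history.history['val_acc'], label='validation accuracy')
    plt.xlabel('epoch')
    plt.ylabel('accuracy')
    plt.legend()
    plt.show()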

That's all for this post. I hope to write more about machine learning in the future; if I do, you should be able to find the posts using the tags on this one.

All source code for this post:

generator: generators.py
example: example.py

Feel free to use this code for any and all purposes; consider it in the public domain, or, if that is not workable for you, use it under the terms of the MIT License.