Developing a Deep Learning Caption Generation Model in Python from Scratch

一个普普通通简简单单平平凡凡的神 · Juejin (掘金) · 2017-12-12 08:07

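We will clean the text in the following ways to reduce the size of the vocabulary we need to work with:

Convert all words to lowercase.
Remove all punctuation.
Remove all words that are one character or less in length (e.g. "a").
Remove all words that contain digits.

The clean_descriptions() function is defined below. Given the dictionary that maps image identifiers to descriptions, it steps through each description and cleans the text in place.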

import string

def clean_descriptions(descriptions):
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for key, desc_list in descriptions.items():
		for i in range(len(desc_list)):
			desc = desc_list[i]
			# tokenize
			desc = desc.split()
			# convert to lower case
			desc = [word.lower() for word in desc]
			# remove punctuation from each token
			desc = [w.translate(table) for w in desc]
			# remove hanging 's' and 'a'
			desc = [word for word in desc if len(word)>1]
			# remove tokens with numbers in them
			desc = [word for word in desc if word.isalpha()]
			# store as string
			desc_list[i] =  ' '.join(desc)

# clean descriptions
clean_descriptions(descriptions)
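
To make the effect of these rules concrete, the sketch below applies the same four steps to a single caption string. The helper name clean_text and the sample sentences are illustrative assumptions and are not part of the original listing.

```python
import string

# Hypothetical helper: applies the same cleaning rules to one caption string.
def clean_text(text):
    # translation table that deletes every punctuation character
    table = str.maketrans('', '', string.punctuation)
    words = text.split()
    words = [w.lower() for w in words]
    words = [w.translate(table) for w in words]
    # drop empty strings and single-character tokens such as 'a'
    words = [w for w in words if len(w) > 1]
    # drop tokens that contain anything other than letters
    words = [w for w in words if w.isalpha()]
    return ' '.join(words)

sample = 'A child in a pink dress is climbing up 2 stairs .'
cleaned = clean_text(sample)
print(cleaned)
```

Note that the punctuation step runs before the length filter, so a token such as "." first becomes an empty string and is then dropped by the length check.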

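Once the text is cleaned, we can summarize the size of the vocabulary.

Ideally, we want a vocabulary that is both expressive and as small as possible. A smaller vocabulary results in a smaller model that trains faster.

For reference, we can transform the clean descriptions into a set and print its size to get an idea of the size of our dataset vocabulary.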

# convert the loaded descriptions into a vocabulary of words
def to_vocabulary(descriptions):
	# build a set of all words used in the descriptions
	all_desc = set()
	for key in descriptions.keys():
		for d in descriptions[key]:
			all_desc.update(d.split())
	return all_desc

# summarize vocabulary
vocabulary = to_vocabulary(descriptions)
print('Vocabulary Size: %d' % len(vocabulary))
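
As a small illustration of the same idea, the sketch below builds a vocabulary from a toy dictionary. The identifiers and captions are made up for the example; only the set-based approach mirrors the function above.

```python
# Toy stand-in for the cleaned descriptions dictionary (made-up data).
toy_descriptions = {
    'img1': ['dog runs on grass', 'brown dog runs'],
    'img2': ['child plays on grass'],
}

# Collect every distinct word across all descriptions.
toy_vocab = set()
for desc_list in toy_descriptions.values():
    for desc in desc_list:
        toy_vocab.update(desc.split())

print(sorted(toy_vocab))
print('Vocabulary Size: %d' % len(toy_vocab))
```

Because a set stores each word once, repeated words such as "dog" and "grass" count a single time.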

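Finally, we save the dictionary of image identifiers and descriptions to a new file named descriptions.txt, with one image identifier and one description per line.

Below we define the save_descriptions() function. Given a dictionary that maps identifiers to descriptions and a filename, it saves the mapping to file.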

# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
	lines = list()
	for key, desc_list in descriptions.items():
		for desc in desc_list:
			lines.append(key + ' ' + desc)
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

# save descriptions
save_descriptions(descriptions, 'descriptions.txt')
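
The following sketch shows the resulting file layout on toy data by writing to a temporary directory and reading the lines back. The data and the temporary path are assumptions for illustration; the use of a with-statement is a stylistic variation on the explicit open() and close() calls above.

```python
import os
import tempfile

# Made-up cleaned descriptions for illustration.
toy_descriptions = {
    'img1': ['dog runs on grass', 'brown dog runs'],
    'img2': ['child plays on grass'],
}

# Same layout as the article: "<identifier> <description>", one pair per line.
lines = [key + ' ' + desc
         for key, desc_list in toy_descriptions.items()
         for desc in desc_list]

path = os.path.join(tempfile.mkdtemp(), 'descriptions.txt')
# The with-statement closes the file even if the write fails.
with open(path, 'w') as f:
    f.write('\n'.join(lines))

# Read the file back to inspect what was stored.
with open(path, 'r') as f:
    saved = f.read().split('\n')
print(saved)
```

An image with several descriptions appears on several lines, each repeating the identifier.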

Putting this all together, the complete listing is shown below:

import string

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# extract descriptions for images
def load_descriptions(doc):
	mapping = dict()
	# process lines
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		if len(line) < 2:
			continue
		# take the first token as the image id, the rest as the description
		image_id, image_desc = tokens[0], tokens[1:]
		# remove filename from image id
		image_id = image_id.split('.')[0]
		# convert description tokens back to string
		image_desc = ' '.join(image_desc)
		# create the list if needed
		if image_id not in mapping:
			mapping[image_id] = list()
		# store description
		mapping[image_id].append(image_desc)
	return mapping

def clean_descriptions(descriptions):
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for key, desc_list in descriptions.items():
		for i in range(len(desc_list)):
			desc = desc_list[i]
			# tokenize
			desc = desc.split()
			# convert to lower case
			desc = [word.lower() for word in desc]
			# remove punctuation from each token
			desc = [w.translate(table) for w in desc]
			# remove hanging 's' and 'a'
			desc = [word for word in desc if len(word)>1]
			# remove tokens with numbers in them
			desc = [word for word in desc if word.isalpha()]
			# store as string
			desc_list[i] =  ' '.join(desc)

# convert the loaded descriptions into a vocabulary of words
def to_vocabulary(descriptions):
	# build a set of all words used in the descriptions
	all_desc = set()
	for key in descriptions.keys():
		for d in descriptions[key]:
			all_desc.update(d.split())
	return all_desc

# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
	lines = list()
	for key, desc_list in descriptions.items():
		for desc in desc_list:
			lines.append(key + ' ' + desc)
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)
# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))
# clean descriptions
clean_descriptions(descriptions)
# summarize vocabulary
vocabulary = to_vocabulary(descriptions)
print('Vocabulary Size: %d' % len(vocabulary))
# save to file
save_descriptions(descriptions, 'descriptions.txt')

Running the example first prints the number of loaded photo descriptions (8,092) and the size of the clean vocabulary (8,763 words).

Loaded: 8092
Vocabulary Size: 8763

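Finally, the clean descriptions are written to descriptions.txt.

Taking a look at the file, we can see that the descriptions are ready for modeling. The order of the descriptions in your file may vary.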

2252123185_487f21e336 bunch on people are seated in stadium
2252123185_487f21e336 crowded stadium is full of people watching an event
2252123185_487f21e336 crowd of people fill up packed stadium
2252123185_487f21e336 crowd sitting in an indoor stadium
2252123185_487f21e336 stadium full of people watch game
...

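Develop the Deep Learning Model

In this section, we will define the deep learning model and fit it on the training dataset. The section is divided into the following parts:

1. Loading data.

2. Defining the model.

3. Fitting the model.

4. Complete example.

Loading Data

First, we must load the prepared photo and text data so that we can use it to fit the model.

We will train the model on all of the photos and descriptions in the training dataset. While training, we will monitor the performance of the model on the development dataset and use that performance to decide when to save the model to file.

The train and development datasets have been predefined in the Flickr_8k.trainImages.txt and Flickr_8k.devImages.txt files respectively, and both contain lists of photo file names. From these file names, we can extract the photo identifiers and use them to filter the photos and descriptions for each set.

The load_set() function below loads a pre-defined set of identifiers given the filename of the train or development set.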

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load a pre-defined list of photo identifiers
def load_set(filename):
	doc = load_doc(filename)
	dataset = list()
	# process line by line
	for line in doc.split('\n'):
		# skip empty lines
		if len(line) < 1:
			continue
		# get the image identifier
		identifier = line.split('.')[0]
		dataset.append(identifier)
	return set(dataset)
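
The identifier extraction can be seen in isolation in the sketch below, which works on an in-memory string instead of a file. The file names are made up for the example.

```python
# Made-up contents of a split file: one photo file name per line.
doc = 'photo_001.jpg\nphoto_002.jpg\n\nphoto_001.jpg\n'

identifiers = set()
for line in doc.split('\n'):
    # Skip blank lines, such as the one left by a trailing newline.
    if len(line) < 1:
        continue
    # Keep the part before the first dot as the identifier.
    identifiers.add(line.split('.')[0])

print(sorted(identifiers))
```

Returning a set removes duplicate entries and makes the later membership checks (image_id in dataset) fast.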

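Now, we can load the photos and descriptions using the pre-defined set of train or development identifiers.

Below is the load_clean_descriptions() function, which loads the cleaned text descriptions from descriptions.txt for a given set of identifiers and returns a dictionary that maps each identifier to a list of text descriptions.

The model we will develop generates a caption for a given photo one word at a time. The sequence of previously generated words is provided as input. Therefore, we need a "first word" to kick off the generation process and a "last word" to signal the end of the caption.

We will use the strings startseq and endseq for this purpose. These tokens are added to the descriptions as they are loaded. It is important to do this before we encode the text so that the tokens are also encoded correctly.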

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
	# load document
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# skip images not in the set
		if image_id in dataset:
			# create list
			if image_id not in descriptions:
				descriptions[image_id] = list()
			# wrap description in tokens
			desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
			# store
			descriptions[image_id].append(desc)
	return descriptions
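
To illustrate why the start and end tokens matter, the sketch below wraps one caption and lists the (input words, next word) pairs that one-word-at-a-time generation implies. The caption and variable names are illustrative assumptions; how the training sequences are actually built is not shown in this section.

```python
# Made-up cleaned caption for illustration.
caption = 'little girl running in field'
wrapped = 'startseq ' + caption + ' endseq'
words = wrapped.split()

# Each prefix of the caption is paired with the word that follows it.
pairs = []
for i in range(1, len(words)):
    pairs.append((words[:i], words[i]))

for context, target in pairs:
    print(' '.join(context), '->', target)
```

The first pair starts from startseq alone, and the last pair has endseq as its target, which is how the model can learn when to stop.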

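Next, we can load the photo features for a given dataset.

The load_photo_features() function defined below loads the entire set of photo features from file, then returns the subset of interest for a given set of photo identifiers.

This is not very efficient, but it gets us up and running quickly.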

from pickle import load

# load photo features
def load_photo_features(filename, dataset):
	# load all features
	all_features = load(open(filename, 'rb'))
	# filter features
	features = {k: all_features[k] for k in dataset}
	return features
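
The sketch below exercises the same load-then-filter pattern on a tiny pickled dictionary. The feature values, identifiers, and temporary path are made up; the only assumption carried over from the article is that the pickle file holds a dictionary keyed by photo identifier.

```python
import os
import pickle
import tempfile

# Made-up features: identifier -> short feature list (real features are much larger).
all_features = {'img1': [0.1, 0.2], 'img2': [0.3, 0.4], 'img3': [0.5, 0.6]}

# Write the dictionary to a temporary pickle file.
path = os.path.join(tempfile.mkdtemp(), 'features.pkl')
with open(path, 'wb') as f:
    pickle.dump(all_features, f)

# Load everything, then keep only the identifiers in the requested set.
with open(path, 'rb') as f:
    loaded = pickle.load(f)

dataset = {'img1', 'img3'}
features = {k: loaded[k] for k in dataset}
print(sorted(features))
```

With this pattern, an identifier that is in the set but missing from the pickle file raises a KeyError, which surfaces a mismatch between the split files and the extracted features early.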

We can pause here and test everything developed so far.

The complete code example is listed below:

from pickle import load

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load a pre-defined list of photo identifiers
def load_set(filename):
	doc = load_doc(filename)
	dataset = list()
	# process line by line
	for line in doc.split('\n'):
		# skip empty lines
		if len(line) < 1:
			continue
		# get the image identifier
		identifier = line.split('.')[0]
		dataset.append(identifier)
	return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
	# load document
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# skip images not in the set
		if image_id in dataset:
			# create list
			if image_id not in descriptions:
				descriptions[image_id] = list()
			# wrap description in tokens
			desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
			# store
			descriptions[image_id].append(desc)
	return descriptions

# load photo features
def load_photo_features(filename, dataset):
	# load all features
	all_features = load(open(filename, 'rb'))
	# filter features
	features = {k: all_features[k] for k in dataset}
	return features

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))

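Running this example first loads the 6,000 photo identifiers in the training dataset. These identifiers are then used to filter and load the cleaned description text and the pre-computed photo features.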

Dataset: 6000
Descriptions: train=6000
Photos: train=6000

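The description text needs to be encoded as numbers before it can be fed to the model as input or compared with the model's predictions.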
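
As a rough picture of what such an encoding looks like, the sketch below maps each distinct word to an integer using a plain dictionary. This is only an illustration under assumed toy captions; the encoder the article goes on to use is not shown in this section and may differ.

```python
# Made-up wrapped captions for illustration.
captions = ['startseq dog runs endseq', 'startseq brown dog endseq']

# Assign each distinct word an integer in order of first appearance,
# starting at 1 (0 is often reserved for padding).
word_to_index = {}
for caption in captions:
    for word in caption.split():
        if word not in word_to_index:
            word_to_index[word] = len(word_to_index) + 1

# Replace every word with its integer.
encoded = [[word_to_index[w] for w in c.split()] for c in captions]
print(word_to_index)
print(encoded)
```

Because startseq and endseq are ordinary words in the captions, they receive integers like any other word, which is why they must be added before encoding.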