正文
我们将用以下方式清洗文本,以减少需要处理的词汇量:
-
所有单词全部转换成小写。
-
移除所有标点符号。
-
移除所有少于或等于一个字符的单词(如 a)。
-
移除所有带数字的单词。
下面定义了 clean_descriptions() 函数:给出描述的图像标识符词典,遍历每个描述,清洗文本。
import string
def clean_descriptions(descriptions):
# prepare translation table for removing punctuation
table = str.maketrans('', '', string.punctuation)
for key, desc_list in descriptions.items():
for i in range(len(desc_list)):
desc = desc_list[i]
# tokenize
desc = desc.split()
# convert to lower case
desc = [word.lower() for word in desc]
# remove punctuation from each token
desc = [w.translate(table) for w in desc]
# remove hanging 's' and 'a'
desc = [word for word in desc if len(word)>1]
# remove tokens with numbers in them
desc = [word for word in desc if word.isalpha()]
# store as string
desc_list[i] = ' '.join(desc)
# clean descriptions
clean_descriptions(descriptions)
清洗后,我们可以总结词汇量。
理想情况下,我们希望使用尽可能少的词汇而得到强大的表达性。词汇越少则模型越小、训练速度越快。
对于推断,我们可以将干净的描述转换成一个集,将它的规模打印出来,这样就可以了解我们的数据集词汇量的大小了。
# convert the loaded descriptions into a vocabulary of words
def to_vocabulary(descriptions):
# build a list of all description strings
all_desc = set()
for key in descriptions.keys():
[all_desc.update(d.split()) for d in descriptions[key]]
return all_desc
# summarize vocabulary
vocabulary = to_vocabulary(descriptions)
print('Vocabulary Size: %d' % len(vocabulary))
最后,我们保存图像标识符词典和描述至一个新文本 descriptions.txt,该文件中每行只有一个图像和一个描述。
下面我们定义了 save_doc() 函数,即给出一个包含标识符和描述之间映射的词典和文件名,将该映射保存至文件中。
# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
lines = list()
for key, desc_list in descriptions.items():
for desc in desc_list:
lines.append(key + ' ' + desc)
data = '\n'.join(lines)
file = open(filename, 'w')
file.write(data)
file.close()
# save descriptions
save_doc(descriptions, 'descriptions.txt')
汇总起来,完整的函数定义如下所示:
import string
# load doc into memory
def load_doc(filename):
# open the file as read only
file = open(filename, 'r')
# read all text
text = file.read()
# close the file
file.close()
return text
# extract descriptions for images
def load_descriptions(doc):
mapping = dict()
# process lines
for line in doc.split('\n'):
# split line by white space
tokens = line.split()
if len(line) < 2:
continue
# take the first token as the image id, the rest as the description
image_id, image_desc = tokens[0], tokens[1:]
# remove filename from image id
image_id = image_id.split('.')[0]
# convert description tokens back to string
image_desc = ' '.join(image_desc)
# create the list if needed
if image_id not in mapping:
mapping[image_id] = list()
# store description
mapping[image_id].append(image_desc)
return mapping
def clean_descriptions(descriptions):
# prepare translation table for removing punctuation
table = str.maketrans('', '', string.punctuation)
for key, desc_list in descriptions.items():
for i in range(len(desc_list)):
desc = desc_list[i]
# tokenize
desc = desc.split()
# convert to lower case
desc = [word.lower() for word in desc]
# remove punctuation from each token
desc = [w.translate(table) for w in desc]
# remove hanging 's' and 'a'
desc = [word for word in desc if len(word)>1]
# remove tokens with numbers in them
desc = [word for word in desc if word.isalpha()]
# store as string
desc_list[i] = ' '.join(desc)
# convert the loaded descriptions into a vocabulary of words
def to_vocabulary(descriptions):
# build a list of all description strings
all_desc = set()
for key in descriptions.keys():
[all_desc.update(d.split()) for d in descriptions[key]]
return all_desc
# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
lines = list()
for key, desc_list in descriptions.items():
for desc in desc_list:
lines.append(key + ' ' + desc)
data = '\n'.join(lines)
file = open(filename, 'w')
file.write(data)
file.close()
filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)
# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))
# clean descriptions
clean_descriptions(descriptions)
# summarize vocabulary
vocabulary = to_vocabulary(descriptions)
print('Vocabulary Size: %d' % len(vocabulary))
# save to file
save_descriptions(descriptions, 'descriptions.txt')
运行示例首先打印出加载图像描述的数量(8092)和干净词汇量的规模(8763 个单词)。
Loaded: 8,092
Vocabulary Size: 8,763
最后,把干净的描述写入 descriptions.txt。
查看文件,我们能够看到该描述可用于建模。文件中描述的顺序可能会发生改变。
2252123185_487f21e336 bunch on people are seated in stadium
2252123185_487f21e336 crowded stadium is full of people watching an event
2252123185_487f21e336 crowd of people fill up packed stadium
2252123185_487f21e336 crowd sitting in an indoor stadium
2252123185_487f21e336 stadium full of people watch game
...
开发深度学习模型
本节我们将定义深度学习模型,在训练数据集上进行拟合。本节分为以下几部分:
1. 加载数据。
2. 定义模型。
3. 拟合模型。
4. 完成示例。
加载数据
首先,我们必须加载准备好的图像和文本数据来拟合模型。
我们将在训练数据集中的所有图像和描述上训练数据。训练过程中,我们计划在开发数据集上监控模型性能,使用该性能确定什么时候保存模型至文件。
训练和开发数据集已经预制好,并分别保存在 Flickr_8k.trainImages.txt 和 Flickr_8k.devImages.txt 文件中,二者均包含图像文件名列表。从这些文件名中,我们可以提取图像标识符,并使用它们为每个集过滤图像和描述。
如下所示,load_set() 函数将根据训练或开发集文件名加载一个预定义标识符集。
# load doc into memory
def load_doc(filename):
# open the file as read only
file = open(filename, 'r')
# read all text
text = file.read()
# close the file
file.close()
return text
# load a pre-defined list of photo identifiers
def load_set(filename):
doc = load_doc(filename)
dataset = list()
# process line by line
for line in doc.split('\n'):
# skip empty lines
if len(line) < 1:
continue
# get the image identifier
identifier = line.split('.')[0]
dataset.append(identifier)
return set(dataset)
现在,我们可以使用预定义训练或开发标识符集加载图像和描述了。
下面是 load_clean_descriptions() 函数,该函数从给定标识符集的 descriptions.txt 中加载干净的文本描述,并向文本描述列表返回标识符词典。
我们将要开发的模型能够生成给定图像的字幕,一次生成一个单词。先前生成的单词序列作为输入。因此,我们需要一个 first word 来开启生成步骤和一个 last word 来表示字幕生成结束。
我们将使用字符串 startseq 和 endseq 完成该目的。这些标记被添加至加载描述,像它们本身就是加载出的那样。在对文本进行编码之前进行该操作非常重要,这样这些标记才能得到正确编码。
# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
# load document
doc = load_doc(filename)
descriptions = dict()
for line in doc.split('\n'):
# split line by white space
tokens = line.split()
# split id from description
image_id, image_desc = tokens[0], tokens[1:]
# skip images not in the set
if image_id in dataset:
# create list
if image_id not in descriptions:
descriptions[image_id] = list()
# wrap description in tokens
desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
# store
descriptions[image_id].append(desc)
return descriptions
接下来,我们可以为给定数据集加载图像特征。
下面定义了 load_photo_features() 函数,该函数加载了整个图像描述集,然后返回给定图像标识符集你感兴趣的子集。
这不是很高效,但是,这可以帮助我们启动,快速运行。
# load photo features
def load_photo_features(filename, dataset):
# load all features
all_features = load(open(filename, 'rb'))
# filter features
features = {k: all_features[k] for k in dataset}
return features
我们可以在这里暂停一下,测试目前开发的所有内容。
完整的代码示例如下:
# load doc into memory
def load_doc(filename):
# open the file as read only
file = open(filename, 'r')
# read all text
text = file.read()
# close the file
file.close()
return text
# load a pre-defined list of photo identifiers
def load_set(filename):
doc = load_doc(filename)
dataset = list()
# process line by line
for line in doc.split('\n'):
# skip empty lines
if len(line) < 1:
continue
# get the image identifier
identifier = line.split('.')[0]
dataset.append(identifier)
return set(dataset)
# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
# load document
doc = load_doc(filename)
descriptions = dict()
for line in doc.split('\n'):
# split line by white space
tokens = line.split()
# split id from description
image_id, image_desc = tokens[0], tokens[1:]
# skip images not in the set
if image_id in dataset:
# create list
if image_id not in descriptions:
descriptions[image_id] = list()
# wrap description in tokens
desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
# store
descriptions[image_id].append(desc)
return descriptions
# load photo features
def load_photo_features(filename, dataset):
# load all features
all_features = load(open(filename, 'rb'))
# filter features
features = {k: all_features[k] for k in dataset}
return features
# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))
运行该示例首先在测试数据集中加载 6000 张图像标识符。这些特征之后将用于加载干净描述文本和预计算的图像特征。
Dataset: 6,000
Descriptions: train=6,000
Photos: train=6,000
描述文本在作为输入馈送至模型或与模型预测进行对比之前需要先编码成数值。