def sentence2sequence(sentence):
    """
    Turns an input sentence into an (n, d) matrix,
    where n is the number of tokens in the sentence
    and d is the number of dimensions each word vector has.

    TensorFlow doesn't need to be used here, as simply
    turning the sentence into a sequence based off our
    mapping does not need the computational power that
    TensorFlow provides. Normal Python suffices for this task.
    """
    tokens = sentence.strip('"(),-').lower().split(" ")
    rows = []
    words = []
    # Greedy search for tokens: match the longest prefix of each token
    # that exists in the GloVe vocabulary, then continue with the remainder.
    for token in tokens:
        i = len(token)
        while len(token) > 0:
            word = token[:i]
            if word in glove_wordmap:
                rows.append(glove_wordmap[word])
                words.append(word)
                token = token[i:]
                i = len(token)
                continue
            else:
                i = i - 1
            if i == 0:
                # The word is out of vocabulary (OOV).
                # https:
                rows.append(fill_unk(token))
                words.append(token)
                break
    return np.array(rows), words
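As a quick sanity check (the example sentence below is purely illustrative, and assumes glove_wordmap and fill_unk are already defined as above), we can run the function on a single sentence:

test_rows, test_words = sentence2sequence("Mary moved to the bathroom.")
print(test_words)       # the sub-tokens actually matched against the GloVe vocabulary
print(test_rows.shape)  # (number of matched sub-tokens, d) for the d-dimensional vectors in use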
Now we can package up the data each question needs: the word vectors for the context, the question, and the answer. In bAbI, the context is defined as a series of numbered sentences. The contextualize function takes care of this. Questions and answers live on the same line, separated by a tab character, so within a line we can use the tab as the marker that distinguishes questions from answers. When the numbering resets, subsequent questions refer to a new context (note that there are usually multiple questions per context). The answers also carry one more piece of information that we keep but do not use: the number of the sentence that supports the answer. In our system, the network will learn on its own which sentences to use to answer the question.
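For reference, a bAbI task file looks roughly like this (an illustrative excerpt; the numbering restarts at 1 with each new context, and question lines carry a tab-separated answer plus the number of the supporting sentence, with <TAB> standing in for a literal tab character):

1 Mary moved to the bathroom.
2 John went to the hallway.
3 Where is Mary?<TAB>bathroom<TAB>1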
def contextualize(set_file):
    """
    Read in the dataset of questions and build question+answer -> context sets.
    Output is a list of data points, each of which is a 7-element tuple containing:
        The sentences in the context in vectorized form.
        The sentences in the context as a list of string tokens.
        The question in vectorized form.
        The question as a list of string tokens.
        The answer in vectorized form.
        The answer as a list of string tokens.
        A list of numbers for supporting statements, which is currently unused.
    """
    data = []
    context = []
    with open(set_file, "r", encoding="utf8") as train:
        for line in train:
            # Split the line number from the sentence it refers to.
            l, ine = tuple(line.split(" ", 1))
            if l == "1":
                # New contexts always start with 1,
                # so this is a signal to reset the context.
                context = []
            if "\t" in ine:
                # Tabs are the separator between questions and answers,
                # and are not present in context statements.
                question, answer, support = tuple(ine.split("\t"))
                data.append((tuple(zip(*context)) +
                             sentence2sequence(question) +
                             sentence2sequence(answer) +
                             ([int(s) for s in support.split()],)))
                # Multiple questions may refer to the same context, so we don't reset it.
            else:
                # Context sentence.
                context.append(sentence2sequence(ine[:-1]))
    return data
train_data = contextualize(train_set_post_file)
test_data = contextualize(test_set_post_file)
final_train_data = []

def finalize(data):
    """
    Prepares data generated by contextualize() for use in the network.
    """
    final_data = []
    for cqas in data:
        contextvs, contextws, qvs, qws, avs, aws, spt = cqas

        lengths = itertools.accumulate(len(cvec) for cvec in contextvs)
        context_vec = np.concatenate(contextvs)
        context_words = sum(contextws, [])

        # Location markers for sentence boundaries: cumulative token counts,
        # one entry per context sentence.
        sentence_ends = np.array(list(lengths))
        final_data.append((context_vec, sentence_ends, qvs, spt, context_words, cqas, avs, aws))
    return np.array(final_data)

final_train_data = finalize(train_data)
final_test_data = finalize(test_data)
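As an optional check (purely illustrative, unpacking the tuple in the same order finalize() builds it), we can peek at the first packaged training example:

context_vec, sentence_ends, qvs, spt, context_words, cqas, avs, aws = final_train_data[0]
print(context_vec.shape)   # (total number of context tokens, d)
print(sentence_ends)       # cumulative token counts, one entry per context sentence
print(" ".join(context_words))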
Defining the hyperparameters
At this point, the training and test data we need are fully prepared. The next task is to build the neural network that will make sense of that data. Let's start by clearing TensorFlow's default computation graph, so that we can run the network again after changing things.
tf.reset_default_graph()
This is the start of the network, so let's define all the constants we'll need here. We call them "hyperparameters" because they define the structure of the network and how it is trained.
# Hyperparameters

# The number of dimensions used to store data passed between recurrent layers in the network.
recurrent_cell_size = 128

# The number of dimensions in our word vectorizations.
D = 50

# How quickly the network learns. Too high, and we may run into numeric instability
# or other issues.
learning_rate = 0.005

# Dropout probabilities. For a description of dropout and what these probabilities are,
# see Entailment with TensorFlow.
input_p, output_p = 0.5, 0.5

# How many questions we train on at a time.
batch_size = 128

# Number of passes in episodic memory. We'll get to this later.
passes = 4

# Feed-forward layer sizes: the number of dimensions used to store data passed from feed-forward layers.
ff_hidden_size = 256

# The strength of our regularization. Increase to encourage sparsity in episodic memory,
# at the cost of slower training. Don't make this larger than learning_rate.
weight_decay = 0.00000001

# The total number of questions the network trains on over a full training run.
# Some questions are counted multiple times.
training_iterations_count = 400000

# How many iterations of training occur before each validation check.
display_step = 100
Network architecture
With these hyperparameters in place, let's define the network architecture. The architecture can be divided roughly into four modules, which are described in detail in the paper "Ask Me Anything: Dynamic Memory Networks for Natural Language Processing."
The network defines a recurrent layer whose memory is dynamically determined by other information in the text, hence the name Dynamic Memory Network (DMN). The DMN is loosely based on an understanding of how a human tries to answer a reading-comprehension question: first, the reader takes in the context and forms facts in memory; then, using those remembered facts, they read the question and look back at the context, searching specifically for an answer relevant to the question and comparing the question against each fact.
Sometimes one fact leads us to another. In the bAbI dataset, the network might want to find the location of a football. It might search the sentences for anything related to the football and discover that John was the last person to touch it; it would then search for sentences about John and find that John had been in both the bedroom and the hallway. Once it realizes that John was last in the hallway, it can answer the question with confidence: the football is in the hallway.
Figure 3. The four modules of the network, combined to answer a bAbI question. In each episode, new facts are attended to that help find the answer. Kumar notes that the network incorrectly assigns some weight to sentence 2, which is understandable, since John was there at one point even though he did not have the football at the time. Source: Ankit Kumar et al., used with permission.
Input module
The input module is the first of the four modules our dynamic memory network uses to arrive at an answer. It consists of a pass over the context with a gated recurrent unit (GRU; tf.contrib.rnn.GRUCell in TensorFlow) that gathers pieces of evidence. Each piece of evidence, or fact, corresponds to a single sentence in the context and is represented by the output at that timestep. This requires a bit of non-TensorFlow preprocessing to locate the ends of sentences, so that information can be passed to TensorFlow for use in the later modules.
We'll handle that extra processing later, at training time. We use TensorFlow's gather_nd to process the data and select the corresponding outputs. gather_nd is an extremely useful tool, and I recommend reading through its API documentation carefully to learn how it works.
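To see what gather_nd does before it shows up in the real code, here is a small, self-contained toy example (the tensors and values are made up purely for illustration, and it uses its own graph so it doesn't interfere with the network defined below):

import numpy as np
import tensorflow as tf

with tf.Graph().as_default():
    # A fake batch of RNN outputs: 2 sequences, 3 time steps, 4 hidden units each.
    toy_outputs = tf.constant(np.arange(24).reshape(2, 3, 4), dtype=tf.float32)
    # Each innermost pair is a (batch index, time index); gather_nd pulls out the
    # hidden vector at that position. This is exactly how the sentence-ending
    # outputs are selected from the input module below.
    toy_indices = tf.constant([[[0, 1], [0, 2]],
                               [[1, 0], [1, 2]]])
    picked = tf.gather_nd(toy_outputs, toy_indices)  # shape [2, 2, 4]
    with tf.Session() as sess:
        print(sess.run(picked))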
# Input Module
# Context: A [batch_size, maximum_context_length, word_vectorization_dimensions] tensor
# that contains all the context information.
context = tf.placeholder(tf.float32, [None, None, D], "context")
context_placeholder = context # I use context as a variable name later on
# input_sentence_endings: A [batch_size, maximum_sentence_count, 2] tensor that
# contains the locations of the ends of sentences.
input_sentence_endings = tf.placeholder(tf.int32, [None, None, 2], "sentence")
# recurrent_cell_size: the number of hidden units in recurrent layers.
input_gru = tf.contrib.rnn.GRUCell(recurrent_cell_size)
# input_p: The probability of maintaining a specific hidden input unit.
# Likewise, output_p is the probability of maintaining a specific hidden output unit.
gru_drop = tf.contrib.rnn.DropoutWrapper(input_gru, input_p, output_p)
# dynamic_rnn also returns the final internal state. We don’t need that, and can
# ignore the corresponding output (_).
input_module_outputs, _ = tf.nn.dynamic_rnn(gru_drop, context, dtype=tf.float32, scope="input_module")
# cs: the facts gathered from the context.
cs = tf.gather_nd(input_module_outputs, input_sentence_endings)
# to use every word as a fact, useful for tasks with one-sentence contexts
s = input_module_outputs
Question module
The question module is the second module, and arguably the simplest. It consists of another GRU pass, this time over the text of the question. Instead of gathering pieces of evidence, we simply pass along the final state, since the questions in this dataset are guaranteed to be a single sentence.
# Question Module
# query: A [batch_size, maximum_question_length, word_vectorization_dimensions] tensor