Welcome to mirror list, hosted at ThFree Co, Russian Federation.

github.com/moses-smt/vowpal_wabbit.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
path: root/python
diff options
context:
space:
mode:
authorHal Daume III <me@hal3.name>2014-09-27 23:31:27 +0400
committerHal Daume III <me@hal3.name>2014-09-27 23:31:27 +0400
commit4d18a908b8db6ef91763713f2034f15bb9caed6a (patch)
treed967b530ba5f6cd69660284d05b077915793e1f9 /python
parent573ffd4c1eece2941d406caf702b7ca239f7b37b (diff)
updated search library, added notebook
Diffstat (limited to 'python')
-rw-r--r--python/Learning_to_Search.ipynb342
-rw-r--r--python/test_search.py4
2 files changed, 344 insertions, 2 deletions
diff --git a/python/Learning_to_Search.ipynb b/python/Learning_to_Search.ipynb
new file mode 100644
index 00000000..c7599a2e
--- /dev/null
+++ b/python/Learning_to_Search.ipynb
@@ -0,0 +1,342 @@
+{
+ "metadata": {
+ "name": "",
+ "signature": "sha256:29be923ae5d4092090e9acc10b49c66bfddd0ca93b89a6f0f6faf84c15c227f5"
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This tutorial walks you through writing learning to search code using the VW python interface. Once you've completed this, you can graduate to the C++ version, which will be faster for the computer but more painful for you.\n",
+ "\n",
+ "The \"learning to search\" paradigm solves problems that look like the following. You have a sequence of decisions to make. At the end of making these decisions, the world tells you how bad your decisions were. You want to condition later decisions on earlier decisions. But thankfully, at \"training time,\" you have access to an *oracle* that will tell you the right answer."
+ ]
+ },
+ {
+ "cell_type": "heading",
+ "level": 1,
+ "metadata": {},
+ "source": [
+ "A basic part of speech tagger"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's start with a simple example: sequence labeling for Part of Speech tagging. The goal is to take a sequence of words (\"the monster ate a big sandwich\") and label them with their parts of speech (in this case: Det Noun Verb Det Adj Noun).\n",
+ "\n",
+ "We will choose to solve this problem with left-to-right search. I.e., we'll label the first word, then the second then the third and so on.\n",
+ "\n",
+ "For any vw project in python, we have to start by importing the pyvw library:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "import pyvw"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [],
+ "prompt_number": 24
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now, let's define our data first. We'll do this first by defining the labels (one annoying thing is that labels in vw have to be integers):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "DET = 1\n",
+ "NOUN = 2\n",
+ "VERB = 3\n",
+ "ADJ = 4\n",
+ "my_dataset = [ [(DET , 'the'),\n",
+ " (NOUN, 'monster'),\n",
+ " (VERB, 'ate'),\n",
+ " (DET , 'a'),\n",
+ " (ADJ , 'big'),\n",
+ " (NOUN, 'sandwich')],\n",
+ " [(DET , 'the'),\n",
+ " (NOUN, 'sandwich'),\n",
+ " (VERB, 'was'),\n",
+ " (ADJ , 'tasty')],\n",
+ " [(NOUN, 'it'),\n",
+ " (VERB, 'ate'),\n",
+ " (NOUN, 'it'),\n",
+ " (ADJ , 'all')] ]\n",
+ "print my_dataset[2]"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "output_type": "stream",
+ "stream": "stdout",
+ "text": [
+ "[(2, 'it'), (3, 'ate'), (2, 'it'), (4, 'all')]\n"
+ ]
+ }
+ ],
+ "prompt_number": 25
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Here we have an example of a (correctly) tagged sentence.\n",
+ "\n",
+ "Now, we need to write the structured prediction code. To do this, we have to write a new class that derives from the `pyvw.SearchTask` class.\n",
+ "\n",
+ "This class *must* have two functions: `__init__` and `_run`.\n",
+ "\n",
+ "The initialization function takes three arguments (plus `self`): a vw object (`vw`), a search object (`sch`), and the number of actions (`num_actions`) that this object has been initialized with. Within the initialization function, we must first initialize the parent class, and then we can set whatever options we want via `sch.set_options(...)`. Of course we can also do whatever additional initialization we like.\n",
+ "\n",
+ "The `_run` function executes the sequence of decisions on a given input. The input will be of whatever type our data is (so, in the above example, it will be a list of (label,word) pairs).\n",
+ "\n",
+ "Here is a basic implementation of sequence labeling:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "class SequenceLabeler(pyvw.SearchTask):\n",
+ " def __init__(self, vw, sch, num_actions):\n",
+ " pyvw.SearchTask.__init__(self, vw, sch, num_actions)\n",
+ " sch.set_options( sch.AUTO_HAMMING_LOSS | sch.AUTO_CONDITION_FEATURES )\n",
+ "\n",
+ " def _run(self, sentence):\n",
+ " output = []\n",
+ " for n in range(len(sentence)):\n",
+ " pos,word = sentence[n]\n",
+ " with self.vw.example({'w': [word]}) as ex:\n",
+ " pred = self.sch.predict(examples=ex, my_tag=n+1, oracle=pos, condition=(n,'p'))\n",
+ " output.append(pred)\n",
+ " return output"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [],
+ "prompt_number": 26
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's unpack this a bit.\n",
+ "\n",
+ "The `__init__` function is simple. It first calls the parent initializer and then sets some options. The options is sets are two things designed to make the programmer's life easier. The first is `AUTO_HAMMING_LOSS`. Remember earlier we said that when the sequence of decision is made, you have to say how bad it was? This says that we want this to be computed automatically by comparing the individual decisions to the oracle decisions, and defining the loss to be the sum of incorrect decisions.\n",
+ "\n",
+ "The second is `AUTO_CONDITION_FEATURES`. This is a bit subtler. Later in the `_run` function, we will say that the label of the `n`th word depends on the label of the `n-1`th word. In order to get the underlying classifier to *pay attention* to that conditioning, we need to add features. We could do that manually (we'll do this later) or we can ask vw to do it automatically for us. For simplicity, we choose the latter.\n",
+ "\n",
+ "The `_run` function takes a sentence (list of pos/word pairs) as input. We loop over each word position `n` in the sentence and extract the `pos,word` pair. We then construct a VW example that consists of a single feature (the `word`) in the 'w' namespace. Given that example `ex`, we make a search-based prediction by calling `self.sch.predict(...)`. This is a fairly complicated function that takes a number of arguments. Here, we are calling it with the following:\n",
+ "\n",
+ " - `examples=ex`: This tells the predictor what features to use.\n",
+ " - `my_tag=n+1`: In general, we want to condition the prediction of the `n`th label on the `n-1`th label. In order to do this, we have to give each prediction a \"name\" so that we can refer back to it in the future. This name needs to be an integer `>= 1`. So we'll call the first word `1`, the second word `2`, and so on. It has to be `n+1` and not `n` because of the 1-based requirement.\n",
+ " - `oracle=pos`: As mentioned before, on training data, we need to tell the system what the \"true\" (or \"best\") decision is at this point in time. Here, it is the given part of speech label.\n",
+ " - `condition=(n,'p')`: This says that this prediction depends on the output of whichever-prediction-was-called-`n`, and that the \"nature\" of that condition is called 'p' (for \"predecessor\" in this case, though this is entirely up to you)\n",
+ "\n",
+ "Now, we're ready to train the model. We do this in three steps. First, we initialize a vw object, telling it that we have a `--search` task with 4 labels, second that the specific type of `--search_task` is `hook` (you will always use the `hook` task) and finally that we want it to be quiet and use a larger `ring_size` (you can ignore the `ring_size` for now)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "vw = pyvw.vw(\"--search 4 --quiet --search_task hook --ring_size 1024\")"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [],
+ "prompt_number": 27
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next, we need to initialize the search task. We use the `vw.init_search_task` function to do this:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "sequenceLabeler = vw.init_search_task(SequenceLabeler)"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [],
+ "prompt_number": 28
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Finally, we can train on the dataset we defined earlier, using `sequenceLabeler.learn` (the `.learn` function is inherited from the `pyvw.SearchTask` class). The `.learn` function takes any iterator over data. Whatever type of data it iterates over is what it will feed to your `_run` function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "sequenceLabeler.learn(my_dataset.__iter__)"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [],
+ "prompt_number": 29
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Of course, we want to see if it's learned anything. So let's create a single test example:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "test_example = [ (0,w) for w in \"the sandwich ate a monster\".split() ]\n",
+ "print test_example"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "output_type": "stream",
+ "stream": "stdout",
+ "text": [
+ "[(0, 'the'), (0, 'sandwich'), (0, 'ate'), (0, 'a'), (0, 'monster')]\n"
+ ]
+ }
+ ],
+ "prompt_number": 30
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We've used `0` as the labels so you can be sure that vw isn't cheating and it's actually making predictions:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "out = sequenceLabeler.predict(test_example)\n",
+ "print out"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "output_type": "stream",
+ "stream": "stdout",
+ "text": [
+ "[1, 2, 3, 1, 2]\n"
+ ]
+ }
+ ],
+ "prompt_number": 31
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If we look back at our POS tag definitions, this is DET NOUN VERB DET NOUN, which is indeed correct!"
+ ]
+ },
+ {
+ "cell_type": "heading",
+ "level": 1,
+ "metadata": {},
+ "source": [
+ "Removing the AUTO features"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In the above example we used both AUTO_HAMMING_LOSS and AUTO_CONDITION_FEATURES. To make more explicit what these are doing, let's rewrite our `SequenceLabeler` class without them! Here's a version that gets rid of both simultaneously. It is only modestly more complex:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "class SequenceLabeler2(pyvw.SearchTask):\n",
+ " def __init__(self, vw, sch, num_actions):\n",
+ " pyvw.SearchTask.__init__(self, vw, sch, num_actions)\n",
+ "\n",
+ " def _run(self, sentence):\n",
+ " output = []\n",
+ " loss = 0.\n",
+ " for n in range(len(sentence)):\n",
+ " pos,word = sentence[n]\n",
+ " prevPred = output[n-1] if n > 0 else '<s>'\n",
+ " with self.vw.example({'w': [word], 'p': [prevPred]}) as ex:\n",
+ " pred = self.sch.predict(examples=ex, my_tag=n+1, oracle=pos, condition=(n,'p'))\n",
+ " output.append(pred)\n",
+ " if pred != pos:\n",
+ " loss += 1.\n",
+ " self.sch.loss(loss)\n",
+ " return output\n",
+ " \n",
+ "sequenceLabeler2 = vw.init_search_task(SequenceLabeler2)\n",
+ "sequenceLabeler2.learn(my_dataset.__iter__)\n",
+ "print sequenceLabeler2.predict( [(0,w) for w in \"the sandwich ate a monster\".split()] )"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "output_type": "stream",
+ "stream": "stdout",
+ "text": [
+ "[1, 2, 3, 1, 2]\n"
+ ]
+ }
+ ],
+ "prompt_number": 32
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If executed correctly, this should have printed `[1, 2, 3, 1, 2]`.\n",
+ "\n",
+ "There are essentially two things that changed here. In order to get rid of AUTO_HAMMING_LOSS, we had to keep track of how many errors the predictor had made. This is done by checking whether `pred != pos` inside the inner loop, and then at the end calling `self.sch.loss(loss)` to tell the search procedure how well we did.\n",
+ "\n",
+ "In order to get rid of AUTO_CONDITION_FEATURES, we need to explicitly add the previous prediction as features to the example we are predicting with. Here, we've done this by extracting the previous prediction (`prevPred`) and explicitly adding it as a feature (in the 'p' namespace) during the example construction.\n",
+ "\n",
+ "**Important Note:** even though we're not using AUTO_CONDITION_FEATURES, we *still* must tell the search process that this prediction depends on the previous prediction. We need to do this because the learning algorithm automatically memoizes certain computations, and so it needs to know that, when memoizing, to remember that this prediction *might have been different* if a previous decision were different."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [],
+ "language": "python",
+ "metadata": {},
+ "outputs": [],
+ "prompt_number": 32
+ }
+ ],
+ "metadata": {}
+ }
+ ]
+} \ No newline at end of file
diff --git a/python/test_search.py b/python/test_search.py
index 05ca352b..89134460 100644
--- a/python/test_search.py
+++ b/python/test_search.py
@@ -20,10 +20,10 @@ class SequenceLabeler(pyvw.SearchTask):
def _run(self, sentence): # it's called _run to remind you that you shouldn't call it directly!
output = []
for n in range(len(sentence)):
- tag,word = sentence[n]
+ pos,word = sentence[n]
# use "with...as..." to guarantee that the example is finished properly
with self.vw.example({'w': [word]}) as ex:
- pred = self.sch.predict(ex, n+1, tag, (n,'p'))
+ pred = self.sch.predict(examples=ex, my_tag=n+1, oracle=pos, condition=(n,'p'))
output.append(pred)
return output