Open IE with OLLIE
Max Lotstein
Information Extraction, Winter 2013

Outline
First, we will talk about the status quo systems, their problems, and how OLLIE addresses these problems by thinking about relations differently. Then we will talk about how OLLIE does this and go into detail about its architecture. Then we compare the systems' ability to extract information, and their speed. Finally, we'll talk about some of OLLIE's problems and how it could be improved.

Outline

ReVerb
relation: V | V P | V W* P, between arg1 and arg2
Earlier today, and in last week's session, we have seen open IE systems that exploit what we know about relations to extract information. These systems differ from each other in many ways, but one important way is how much processing is done to the sentence in order to extract information. Strategies that spend very little time, and are very fast, are called shallow, whereas those that spend a lot of time, and are typically slow, are called deep. One example of a shallow strategy is what ReVerb does. ReVerb begins by trying to identify the relation and then looks for arguments on either side. It does so, importantly, without a dependency parse, using only POS tags and an NP chunker; for the details, see ReVerb's syntactic constraint and extraction algorithm. (A small sketch after example #3 below shows what such a verb-centered constraint looks like as a regular expression over POS tags.) It's worth noting that the authors of ReVerb were aware that this was a simplification and even enumerated the sorts of relations for which this approach won't work.

WOE, TextRunner*
Example dependency path: nsubj, prep_in (generalized to prep_*), connecting arg1 and arg2.
*NOTE: WOEpos uses this information to train a classifier, as does TextRunner. Only WOEparse uses it during extraction.
Meanwhile, WOE and TextRunner both make use of information in dependency parses to identify paths that connect arguments to relations. For example, in TextRunner, the paths between all pairs of noun phrases are considered. I should note that neither WOEpos nor TextRunner uses deep processing during extraction; they use it only to train classifiers. It's also interesting to note that WOEparse generalizes very aggressively from the patterns it observes, by simply dropping the POS tags. We'll come back to this later.

Many Systems, One Product: Tuples
(arg1, relation, arg2)
The output of IE systems is the same regardless of strategy, and, just like the strategies, the tuple (sometimes called the triple) reflects simplifying assumptions about the way relations are encoded in text.

Extractions
Four assumptions:
1. Relations have verbs
2. Relation/argument order
3. Contiguity
4. Non-nested
Unfortunately, the set of all relations includes many for which these assumptions do not hold. So, what are these assumptions? #1: relations have verbs. #2: the sentence order is argument-relation-argument. #3: the elements of the triple are contiguous and distinct. #4: relations are not nested. Let's see some examples of how these assumptions fail to hold.

#1: Relations w/o Verbs
"Microsoft co-founder Bill Gates …"
(Bill Gates, be co-founder of, Microsoft)
So, relations needn't use verbs, even though obviously many do. Here is an example of one that doesn't. Any system that only looks for the subjects and objects of verbs will miss this relation.

#2: Relation/Argument Order
After winning the election, Obama celebrated.
(Obama, win, the election)
The order of the relation relative to the arguments needn't always be argument-relation-argument. Here, the first argument follows both the relation and the second argument.

#3: Non-Contiguous Elements
There are plenty of taxis available at Bali airport.
(taxis, be available at, Bali airport)
A challenging problem for shallow systems that focus on verbs is extractions with non-contiguous relations. Here is one example.
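To make the shallow, verb-centered constraint from the ReVerb slide concrete, here is the small sketch promised there: a relation pattern in the spirit of ReVerb's V | V P | V W* P constraint, written as a regular expression over POS tags. This is only an illustration of the idea, not ReVerb's actual implementation, and the tag classes are my own simplifications.

```python
import re

# A sketch of a ReVerb-style relation constraint, V | V P | V W* P, expressed as a
# regular expression over space-separated Penn Treebank POS tags. Illustrative only.
VERB = r"VB[DGNPZ]?"                       # verb
WORD = r"(?:NN\w*|JJ\w*|RB\w*|PRP\$?|DT)"  # nouns, adjectives, adverbs, pronouns, determiners
PREP = r"(?:IN|TO|RP)"                     # prepositions and particles
RELATION = re.compile(rf"{VERB}(?:\s+(?:{WORD}\s+)*{PREP})?")

def has_verb_relation(pos_tags):
    """True if the tag sequence contains a verb-anchored relation phrase."""
    return RELATION.search(" ".join(pos_tags)) is not None

# A verb-mediated relation matches:
print(has_verb_relation(["NNP", "VBD", "DT", "NN"]))   # "Obama won the election" -> True
# ...but a noun-mediated relation like "Microsoft co-founder Bill Gates" has no verb to anchor on:
print(has_verb_relation(["NNP", "NN", "NNP", "NNP"]))  # -> False
```

This is exactly why examples #1 through #3 above are hard for shallow, verb-centered systems: there is no verb, or no contiguous verb phrase, for such a pattern to latch onto.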
#4: Nested Relations
Early astronomers believed that the earth is the center of the universe.
((the Earth, be the center of, the universe) AttributedTo believe, Early astronomers)
If he makes this shot, Tiger Woods will win the championship.
((Tiger Woods, will win, the championship) ClausalModifier if, he makes this shot)
Beyond observing that relations can have more than two arguments, the authors of OLLIE also show that relations can modify other relations, recursively. We see this in two different ways. The first is attribution. Here we see an example in which the truth conditions have nothing to do with the earth and have solely to do with the early astronomers' beliefs. The representation is modified by nesting one relation inside another, with a tag denoting attribution and some additional information. Note that we could do this many times, nesting relations within one another, as in "John said that Mary said that …". The second sort of nested relation occurs when there is a logical connection between one relation and another; for example, we might contrast two elements. In the example here, the truth of the sentence in fact depends on this condition (if he makes the shot), but that needn't be the case; other relationships exist, for instance one relation could cause the other. Of course, if the other thing is, itself, an entire relation, we will lose some structure in this form of representation. The solution is very similar to attribution: nest the relation within a layer that captures the new information.

OLLIE uses deep syntactic analysis to extract these new relations, and uses a new form of representation when appropriate.

Information Density and Emphasis
Bill Gates is the co-founder of Microsoft. Bill Gates is a billionaire. Bill Gates owns a dog named Bucks.
vs.
Microsoft co-founder, Bill Gates, who is a billionaire, owns a dog named Bucks.
Beginning with R2A2, these systems have started to rediscover facts about language and communication from our attempts to extract information; for instance, the tendencies for arguments to occur in certain ways.

Outline

Architecture
There are two components to OLLIE: an offline learning process and an online extraction component. The learning component creates pattern templates, which are patterns defined over features of a dependency parse. These patterns are then applied to extract information. The general idea behind the learning component is to find lots of examples of sentences that assert a certain tuple, so that we can be sure to capture all the possible ways that information is encoded.

Seed Tuples
(arg1, relation, arg2)_1
(arg1, relation, arg2)_2
(arg1, relation, arg2)_3
…
(arg1, relation, arg2)_110k
The training process begins with 110,000 tuples extracted by ReVerb from a large Web corpus.

Seed Tuple, Example
(Obama, win, the election)

Training Data
(arg1, relation, arg2)_n → 4M tuple-sentence pairs (about 40 sentences per tuple)
For each tuple in the 110,000, the corpus is searched for sentences that contain the arguments and the relation. The idea is that each of these sentences will assert the relation, and that by having a large number of them, the algorithm will be able to generalize better. Of course, not every sentence that contains the words in the arguments and the relation will actually assert the relation, so a constraint is imposed on the dependency path between the terms: it must be of length 4 or less. In a random sample of 100 of these pairs, 90 met the bootstrapping hypothesis, compared with 64 without the dependency constraint. The result is 4M sentence-tuple pairs, or about 40 sentences per tuple.
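To illustrate the bootstrapping step just described, here is a minimal sketch of the sentence filter. This is my own reconstruction, not the OLLIE code: it only checks that the seed tuple's content words appear in the sentence, and the dependency-path-length constraint is left as a placeholder since it would need a parser.

```python
# A reconstruction (not the OLLIE code) of the bootstrapping filter described above:
# keep a (seed tuple, sentence) pair only if the sentence contains all of the tuple's
# content words. In OLLIE the tuple's words must additionally be connected by a
# dependency path of length <= 4, which would require a parser and is only marked by
# a comment here. Real matching would also need to handle inflection ("won" vs. "win").
def content_words(phrase):
    stopwords = {"the", "a", "an", "of", "to", "in", "on", "be"}
    return {w.lower() for w in phrase.split() if w.lower() not in stopwords}

def matches_seed(seed, sentence):
    arg1, rel, arg2 = seed
    needed = content_words(arg1) | content_words(rel) | content_words(arg2)
    sentence_words = {w.lower().strip(".,;") for w in sentence.split()}
    if not needed <= sentence_words:
        return False
    # Placeholder: also require a dependency path of length <= 4 between the matched words.
    return True

seed = ("Obama", "win", "the election")
print(matches_seed(seed, "Many pundits expect Obama to win the election."))  # True
print(matches_seed(seed, "Obama celebrated after the election."))            # False
```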
Bootstrap, Example
Many pundits expect Obama to win the election.
(Obama, win, the election)

Creating an Open Pattern
- Extract the path
- Annotate the relation node with word and POS
- Normalize the copula
{Obama} ↑nsubj↑ {win:postag=VB} ↓dobj↓ {the election}
The picture on the right here is a dependency parse tree. Recall that in such a tree, the nodes are words with POS tags and the edges, which are directed, reflect asymmetric relations between words in the sentence; for instance, between a noun that performs an action and the action that it performs there is an nsubj edge. The first thing we do is extract the path connecting the argument words and the relation words; here, the path is highlighted. Then we annotate the relation node with the word and its POS, resulting in the line of text at the bottom. If there are any instances of the verb 'to be', in any form, they are converted to 'be'. The arrows here indicate edges in the dependency parse tree. Notice that this pattern still has English words in it; thus, it is lexically constrained. Clearly, if we are to generalize to words not in our training set, we'll need to generalize by de-lexicalization, that is, by removing lexical constraints. The question is: when can we get rid of them?

Slot Node
A node on the dependency path that isn't part of the extraction.
In order to discuss how we answer this question, we need a new concept called the slot node. A slot node is a node on the dependency path that isn't part of the extraction. So, here is a dependency parse of the same sentence we just saw, but I've added a slot node between the first argument and the relation and highlighted it in red.

Can We De-Lexicalize?
If all of the following hold:
- NO slot node on the path
- The relation node lies between the arguments
- The preposition in the pattern matches the preposition in the tuple
- The path has no nn or amod edges
then the result is a purely syntactic pattern; otherwise it is a lexical/semantic pattern.
The procedure for determining whether or not we can de-lexicalize, that is, remove the lexical constraints, says: if all of the above are true, then yes, we can, and the result is a syntactic pattern. If not, then leave the lexicalizations in for now; we'll come back to them in a second. Contrast this set of constraints with the lack of constraints in WOEparse's generalization of patterns. amod is an adjectival modifier; for example, in "Sam eats red meat", "red" modifies "meat". nn is a noun compound modifier; for example, "long-term contract", where the hyphen makes clear that this is a compound modifier and not a "term contract" that is long.

Purely Syntactic Patterns
Aggressively generalize:
- Relations: remove lexical constraints
- Prepositions: convert to {prep_*}
Consider the sentences "Michael appeared on Oprah…" and "… when Alexander the Great advanced to Babylon." Both have the pattern:
{arg1} ↑nsubj↑ {rel:postag=VBD} ↓{prep_*}↓ {arg2}
Note that the patterns do not have any words in them, just POS tags and dependency parse edges.
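As a toy illustration of what applying such a purely syntactic pattern amounts to, here is a small sketch. It is my own illustration, not OLLIE's matcher, and represents the dependency parse simply as labeled edges.

```python
# A toy sketch (mine, not OLLIE's matcher) of applying the purely syntactic pattern
#     {arg1} ↑nsubj↑ {rel:postag=VBD} ↓prep_*↓ {arg2}
# to a dependency parse represented as (head, label, dependent) edges.
def match_pattern(edges, postags):
    outgoing = {}
    for head, label, dependent in edges:
        outgoing.setdefault(head, []).append((label, dependent))

    extractions = []
    for rel, deps in outgoing.items():
        if postags.get(rel) != "VBD":          # relation node must be a past-tense verb
            continue
        arg1s = [d for label, d in deps if label == "nsubj"]
        arg2s = [d for label, d in deps if label.startswith("prep_")]
        extractions.extend((a1, rel, a2) for a1 in arg1s for a2 in arg2s)
    return extractions

# "Michael appeared on Oprah"
edges = [("appeared", "nsubj", "Michael"), ("appeared", "prep_on", "Oprah")]
postags = {"Michael": "NNP", "appeared": "VBD", "Oprah": "NNP"}
print(match_pattern(edges, postags))   # [('Michael', 'appeared', 'Oprah')]
```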
Lexical/Semantic Patterns, Example
"Microsoft co-founder Bill Gates…"
(Bill Gates, is co-founder of, Microsoft)
"Chicago Symphony Orchestra"*
(Orchestra, is symphony of, Chicago)*
Can we still generalize to unseen words?
So, we have patterns with big long lists of lexical constraints. Can we shorten these lists by generalizing over them, and in so doing handle more words than just the ones in our training set? The answer is sometimes. Here we see the pitfall of hasty generalization: the pattern that works for Bill Gates doesn't work so well for Orchestra. So, under what conditions can we do better than having a big long list?

Lexical/Semantic Patterns
- People = WordNet's People class
- Location = WordNet's Location class
- L = list of lexical items
- I_people = L ∩ People; I_location = L ∩ Location
- If |I_people| (or |I_location|) > (3/4)·|L|: use I_people (or I_location) and drop L. Else: keep L.
The answer is 'sometimes'. There are many sophisticated methods for measuring semantic similarity that could have been used, but the authors here instead use WordNet to identify instances of two popular classes: People and Locations. If the list of lexical items has more than 75% overlap with the members of one of these WordNet classes, then we can discard the list in favor of the class. The classes are much larger than the lexical lists; Person has 6,200 elements. The crudeness of this approach is obvious if you consider that a set of lexical constraints might be composed of multiple classes, of which we could then take the union.

Some Open Pattern Templates
Patterns 1-3 are examples of purely syntactic patterns; patterns 4 and 5 are examples of lexical/semantic patterns. #4 is an example where the WordNet class Person was used to generalize; #5 is an example with a lexical list for which there was no generalized form.

Open Pattern Template Statistics
Pattern type       Average rank   Frequency
Lexical/Semantic   344            515
Purely Syntactic   186            114
Grand Total                       629

Lexical Constraint Statistics
- Length: the number of lexical items in the regex
- POS tags and their frequency (cardinal numbers, JJ, noun groups, verb groups)
CD = cardinal number, JJ = adjective, NNs = types of noun groups, Vs = types of verb groups. CD can likely be generalized; the only things in the list were "one" and "two". A small number of constraints might mean that we need a larger training set.

Extraction
There are only two components in the extraction phase: pattern matching and context analysis.

Pattern Matching
- Apply the pattern template
- Expand on relevant edges, e.g. "election" to "the election" (det)
- Use the word order from the sentence to make the tuple
The edge expansion is trying to answer the question of what the units of information in a sentence are. Perhaps the units are clauses; unfortunately, a dependency parse doesn't show clauses.

Context Analysis
Attribution:
- Marked by ccomp edges, e.g. "He says that you like to swim" (says, like)
- Communication/cognition verbs, e.g. 'believe'
Clausal modifier: when a dependent clause modifies the main extraction
- Marked by advcl, e.g. "The accident occurred as night fell" (occurred, fell)
- If, when, although, because …
In context analysis, we are trying to detect sentences with attributions and clausal modification. This was the example with the early astronomers who believed that the earth was the center of the universe. The authors of OLLIE identify certain features that seem to capture almost all of the sentences with attribution and clausal modification, but these features also fire on other sorts of sentences. So, for both attribution and clausal modification, we need a feature and a filter to remove false positives. For attribution, we look to see whether there is a ccomp (clausal complement) edge in the parse tree; for example, in "He says that you like to swim" there is a ccomp edge between "says" and "like". For clausal modification we look for advcl edges; in "The accident happened as the night was falling" there is an advcl edge between "happened" and "falling". An adverbial clause modifier of a VP or S is a clause modifying the verb (a temporal clause, consequence, conditional clause, purpose clause, etc.).
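Here is a rough sketch of what that context-analysis step could look like. It is my reconstruction for illustration, not OLLIE's code; the list of communication/cognition verbs and the marker handling are assumptions.

```python
# A rough reconstruction (not OLLIE's code) of context analysis: if the extraction's
# relation verb hangs off a ccomp edge whose governor is a communication/cognition
# verb, wrap the tuple in an AttributedTo layer; if it hangs off an advcl edge, wrap
# it in a ClausalModifier layer. The verb list and marker argument are illustrative.
COMMUNICATION_COGNITION = {"say", "believe", "think", "expect", "claim"}

def add_context(extraction, rel_head, edges, lemmas, marker=None):
    """extraction: (arg1, rel_phrase, arg2); rel_head: the relation's head word."""
    for governor, label, dependent in edges:
        if dependent != rel_head:
            continue
        if label == "ccomp" and lemmas.get(governor) in COMMUNICATION_COGNITION:
            return {"tuple": extraction, "AttributedTo": governor}
        if label == "advcl":
            return {"tuple": extraction, "ClausalModifier": marker}
    return {"tuple": extraction}

# "Early astronomers believed that the earth is the center of the universe."
edges = [("believed", "ccomp", "is")]
lemmas = {"believed": "believe"}
print(add_context(("the earth", "is the center of", "the universe"),
                  rel_head="is", edges=edges, lemmas=lemmas))
# {'tuple': ('the earth', 'is the center of', 'the universe'), 'AttributedTo': 'believed'}
```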
Demonstration Time

Outline

Speed: Conflicting Reports
System       Sent/sec   Reported in
OLLIE        89         OLLIE
ReVerb       104        ReVerb
TextRunner   662        TextRunner
TextRunner   79         ReVerb
TextRunner   2727       WOE
WOEparse     3          ReVerb
WOEparse     88         WOE
WOEpos       79         ReVerb
WOEpos       2727       WOE
Every system we have seen so far has reported a measure of its speed, often in sentences per second during the online phase. This assumes that there is a representative set of sentences, but there is, as far as I know, no benchmark set of sentences for IE. Speed is important whenever we want to talk about performing at Internet scale, as an algorithm that never finishes isn't useful. Because OLLIE uses a dependency parse, it will be significantly slower than a system that only does POS tagging and NP chunking.

A Less Precise Consensus
Fast to slow: TextRunner/ReVerb > WOEpos > WOEparse/OLLIE
One thing to note is that OLLIE is parallelizable. I ran OLLIE on my machine and was unable to get to 89 sentences/sec.

There is a very high correlation between parse time and extraction time.

Precision vs. Yield
- 300 sentences, selected randomly from news, Wikipedia, and a biology textbook
- Extractions hand-labeled by multiple humans
The authors of OLLIE compare it to WOE and ReVerb; R2A2 is not included in the comparison. For each extraction from each system, two human raters determined whether the extraction was stated in or implied by the sentence. Importantly, this is not the same as a precision-recall curve. As you can see from the graph, OLLIE has a much higher overall area than either of the other two systems, though ReVerb's first few extractions are of significantly higher precision.

Comparing Comparisons
- Precision/yield: requires a natural order (confidence values)
- Precision/recall: requires the full set of correct extractions (allows false-negative detection)
Rather than precision over recall, the authors use precision over yield, which assumes that the output of each system is ordered, that is, that the tuples are ranked in some way relative to each other. In order to order the output of each system, they use the confidence values that the systems produce. I haven't mentioned anything about OLLIE's confidence function because the paper doesn't include much and it's not crucial to understanding the importance of this graph. I will say that in my tests, many extractions had the same confidence value, and it's not clear to me how one would break ties in order to determine yield. (A small sketch after the noun-mediated comparison below shows how such a precision-yield curve is computed from confidence-ranked output.)

Relative Quality
OLLIE vs. ReVerb:
- OLLIE, not ReVerb: relation outside the arguments; noun-mediated relations
- ReVerb, not OLLIE: parser errors
OLLIE vs. WOEparse:
- OLLIE, not WOEparse: nouns in relation phrases, e.g. 'plays a role in'; ill-formed relations
Even though OLLIE's authors didn't compile the full set of correct extractions, they did examine some of the extractions that OLLIE got that the other systems didn't.

Noun-Mediated Relations
"Obama, the president of the US"
"Obama, the US president"
"US President Obama"
>> "Obama is the president of the US"
The authors search for four different noun-mediated relations and compare the number of extractions by OLLIE and ReVerb, and it's clear that OLLIE is capable of finding many noun-mediated relations that ReVerb cannot find. Furthermore, the instances of the relation "Obama is the US president" in which the relation is noun-mediated far outnumber the instances in which it is verb-mediated.
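As promised above, here is a minimal sketch of how a precision-yield curve is computed from confidence-ranked extractions. This is my illustration of the evaluation idea, not the authors' evaluation code; the correctness labels would come from the human raters.

```python
# A minimal sketch of computing a precision-yield curve from confidence-ranked
# extractions (an illustration of the evaluation described above, not the authors'
# code). Each extraction is a (confidence, is_correct) pair; is_correct is the
# human raters' judgment.
def precision_yield_curve(extractions):
    ranked = sorted(extractions, key=lambda pair: pair[0], reverse=True)
    curve, correct = [], 0
    for yield_count, (confidence, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        curve.append((yield_count, correct / yield_count))   # (yield, precision)
    return curve

demo = [(0.95, True), (0.90, True), (0.70, False), (0.60, True), (0.20, False)]
for y, p in precision_yield_curve(demo):
    print(f"yield={y}  precision={p:.2f}")
```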
Are Semantic Restrictions Important?
Yes.
The authors of OLLIE briefly explore the effects of its various pieces by testing versions with and without certain features. The first graph compares OLLIE to OLLIE with lexical restrictions but no generalization (LEX), and to OLLIE without semantic or lexical restrictions (SYN), i.e. with aggressive generalization of all rules. The patterns used are only those that would be affected by these changes, i.e. not the purely syntactic patterns. It's clear that these patterns are improved by using the restrictions. The reason precision is so low compared with the tests against the other systems is that these are more difficult relations to extract.

Is Context Analysis Important?
Yes.
This graph compares OLLIE to a version without attribution/clausal-modification analysis. Clearly, the context analysis improves the results.

OLLIE vs. SRL
- SRL performs well
- Bad at grammatical complexity
- OLLIE deals with co-reference better
- Noun-mediated relations are harder, rarer
- Union is higher than both: everybody wins!
The authors compare OLLIE to a state-of-the-art semantic role labeling system, and this is worthwhile for a few reasons. Firstly, if SRL systems are better at IE than IE systems, despite being designed with a different objective in mind, then perhaps we don't need IE at all; we can just use SRL. Secondly, their methodology here is more rigorous in some ways: they fully annotate a very small set of sentences for every possible extraction. The table here shows recall for both systems, and a number of things are notable. Co-reference is matching mentions in a sentence that refer to the same entity, as in "Max tried sushi and he didn't really like it": "he" and "Max" are the same entity.

Sources of Error
Source of error                            %
Parser error                               32
Aggressive generalization                  18
Incorrect application of lexical pattern   12
Missed context                             13
Limitations of binary representation       12
Any system that uses a dependency parser will fail when the parser fails. The generalization in OLLIE was meant to improve upon WOEparse's, which was even more aggressive; apparently, there need to be more conditions than the four used here. The limits of binary representation are a problem that all IE systems face; interestingly, it's not a problem faced by SRL systems, which have a much more nuanced representation. Incorrect application of a lexical pattern: the lexical lists contain POS tags, but sometimes the POS tags are wrong. We'll see an example of that shortly.

Illuminating Errors
The rotation of the planet causes it to take the shape of an oblate spheroid; that is, it is flattened at the poles and bulges at the equator. [1]
(it, is flattened at, the poles and bulges)*
Saturn is the only planet of the Solar System that is less dense than water--about 30% less. [2]
(Saturn, is the only planet of, the Solar System)*
I shot the man with the gun.
(I, shot the man with, the gun)*
(I, shot, the man)*
[1] http://en.wikipedia.org/wiki/Saturn
[2] http://simple.wikipedia.org/wiki/Saturn
So, I did my own tests by taking two Wikipedia articles on the same topic, one from normal English and the other from Simple English. For those that haven't explored the Simple English Wikipedia, its sentences are less complex; I'll not go into the differences between normal and Simple English here. This first example shows a couple of interesting errors. Firstly, there is a coreference error: "it" refers to "the planet".
Secondly, the part of speech of 'bulges' is incorrectly interpreted as a noun when it's being used here as a verb. For the non-native speakers: the conjunction 'and' here coordinates two predicates sharing the subject 'it', namely "it is flattened at the poles" and "it bulges at the equator". Conjunctions, and secondary arguments in general, are not dealt with by OLLIE. This error is interesting in that it is a consequence of the parser that I used. OLLIE is distributed with MaltParser, a particularly fast parser that was used during training; it does not include the parser that was used during testing, which is Stanford's parser. Stanford's parser recognizes that 'bulges' is a verb in this sentence. The second example demonstrates that the word 'that' has many functions, only one of which is as a clausal modifier; interestingly, in spoken English, these have different pronunciations. Some better analysis will be needed to distinguish between the two cases. The final example demonstrates OLLIE's problem with arguments: both of these extractions struggle because they are looking for a single second argument.

Two Observations About Language
Phrase-driven grammar: in English, words can act in groups.
- I like ice cream. Do you (like ice cream)?
- I like ice cream and hate bananas.
- I said I would hit Fred, and hit Fred I did.
Dependency grammar: words also depend on other words; verbs have agents, objects, etc.
- I (subj) throw (verb) the (det) ball (obj)
Neither approach is perfect.

Outline
First, we talked about the status quo systems, their problems, and how OLLIE addresses these problems by thinking about relations differently. Then we talked about how OLLIE does this and went into detail about its architecture. Then we compared the systems' ability to extract information. Now we'll talk about some of OLLIE's problems and how it could be improved.

Conclusions, Methodology
- How big must a sample be in order to be representative? The bootstrapping hypothesis was checked on only 100 pairs; the SRL comparison used 50 sentences.
- 'Gold standard' annotation (which supports recall) would give a potentially more reliable inter-system comparison ("Hey look, our system is better! What are the odds!") and better false-negative detection. Ahem … grad students are cheap.

Conclusion, Theoretical
- Generalization techniques: syntactic is a bit too aggressive, lexical/semantic a bit too tame. Many other options exist; see Angeli and Manning 2013.
- OLLIE lives and dies by its parser, which is responsible for a significant share of the errors and accounts for a significant share of the running time.
- Relations are still assumed to be binary, but many are n-ary or have optional arguments; see KrakeN and ClausIE.
- Contextual relations are limited and flawed. What really are relations, anyway?

Our Work Isn't Done
- Words are not bags of characters: opposites, synonyms, entailment, classes …
- Sentences are not bags of words: syntactic structure, semantic frames …
- Are documents bags of sentences? Coreference disambiguation.

References
Schmitz, Michael, et al. "Open language learning for information extraction." Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012.
OLLIE Readme. Accessed on 12-3-2013. URL https://github.com/knowitall/ollie
Etzioni, Oren, et al. "Open information extraction: The second generation." Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Volume One. AAAI Press, 2011.
Fader, Anthony, Stephen Soderland, and Oren Etzioni. "Identifying relations for open information extraction." Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011.
Banko, Michele, et al. "Open Information Extraction from the Web." IJCAI. Vol. 7. 2007.
Wu, Fei, and Daniel S. Weld. "Open information extraction using Wikipedia." Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2010.
De Marneffe, Marie-Catherine, and Christopher D. Manning. "Stanford typed dependencies manual." URL http://nlp.stanford.edu/software/dependencies_manual.pdf (2008).
"Dependency Grammar." Advanced Natural Language Processing, University of Edinburgh. URL http://www.inf.ed.ac.uk/teaching/courses/anlp/slides/anlp15.pdf
Angeli, Gabor, and Christopher D. Manning. "Philosophers are mortal: Inferring the truth of unseen facts." CoNLL-2013 (2013): 133.
Akbik, Alan, and Alexander Löser. "Kraken: N-ary facts in open information extraction." Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction. Association for Computational Linguistics, 2012.
Del Corro, Luciano, and Rainer Gemulla. "ClausIE: clause-based open information extraction." Proceedings of the 22nd International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2013.
Wikipedia, "That", http://en.wikipedia.org/wiki/That (as of Dec. 4, 2013).

Confidence Function
Top positive features:
- nn edges in pattern: 0.91
- rel contains verb: 0.48
- openparse confidence: 0.43
Top negative features:
- "if" right before arg1: -1.22
- vacuous extraction: -0.64
- semantic constraints in pattern: -0.43
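To show how weights like these could turn into a single confidence score, here is a small sketch assuming a simple logistic model over the listed features. The feature names and weights come from the slide; the bias term and the logistic form are my assumptions, not a description of OLLIE's actual confidence function.

```python
import math

# A sketch of combining the feature weights above into a confidence score, assuming
# a logistic model. Weights are taken from the slide; the bias and the model form
# are illustrative assumptions, not OLLIE's actual implementation.
WEIGHTS = {
    "nn_edges_in_pattern": 0.91,
    "rel_contains_verb": 0.48,
    "openparse_confidence": 0.43,
    "if_right_before_arg1": -1.22,
    "vacuous_extraction": -0.64,
    "semantic_constraints_in_pattern": -0.43,
}

def confidence(features, bias=0.0):
    """features: dict of feature name -> value (binary or real-valued)."""
    score = bias + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))   # squash to (0, 1)

print(confidence({"rel_contains_verb": 1.0, "openparse_confidence": 0.8}))   # about 0.70
print(confidence({"if_right_before_arg1": 1.0, "vacuous_extraction": 1.0}))  # about 0.13
```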