Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2019–2027, October 25-29, 2014, Doha, Qatar. © 2014 Association for Computational Linguistics

Classifying Idiomatic and Literal Expressions Using Topic Models and Intensity of Emotions

Jing Peng & Anna Feldman
Computer Science/Linguistics, Montclair State University, Montclair, New Jersey, USA
{pengj,feldmana}@mail.montclair.edu

Ekaterina Vylomova
Computer Science, Bauman State Technical University, Moscow, Russia
evylomova@gmail.com

Abstract

We describe an algorithm for automatic classification of idiomatic and literal expressions. Our starting point is that words in a given text segment, such as a paragraph, that are high-ranking representatives of a common topic of discussion are less likely to be a part of an idiomatic expression. Our additional hypothesis is that contexts in which idioms occur, typically, are more affective and, therefore, we incorporate a simple analysis of the intensity of the emotions expressed by the contexts. We investigate the bag of words topic representation of one to three paragraphs containing an expression that should be classified as idiomatic or literal (a target phrase). We extract topics from paragraphs containing idioms and from paragraphs containing literals using an unsupervised clustering method, Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Since idiomatic expressions exhibit the property of non-compositionality, we assume that they usually present different semantics than the words used in the local topic. We treat idioms as semantic outliers, and the identification of a semantic shift as outlier detection. Thus, this topic representation allows us to differentiate idioms from literals using local semantic contexts. Our results are encouraging.

1 Introduction

The definition of what is literal and figurative is still an object of debate. Ariel (2002) demonstrates that literal and non-literal meanings cannot always be distinguished from each other. Literal meaning is originally assumed to be conventional, compositional, relatively context independent, and truth conditional. The problem is that the boundary is not clear-cut: some figurative expressions are compositional (metaphors and many idioms); others are conventional (most of the idioms). Idioms present great challenges for many Natural Language Processing (NLP) applications. They can violate selection restrictions (Sporleder and Li, 2009), as in "push one's luck" under the assumption that only concrete things can normally be pushed. Idioms can disobey typical subcategorization constraints (e.g., "in line" without a determiner before "line"), or change the default assignments of semantic roles to syntactic categories (e.g., in "X breaks something with Y", Y typically is an instrument, but for the idiom "break the ice" it is more likely to fill a patient role, as in "How to break the ice with a stranger"). In addition, many potentially idiomatic expressions can be used either literally or figuratively, depending on the context. This presents a great challenge for machine translation. For example, a machine translation system must translate "held fire" differently in "Now, now, hold your fire until I've had a chance to explain. Hold your fire, Bill. You're too quick to complain." and "The sergeant told the soldiers to hold their fire. Please hold your fire until I get out of the way."
In fact, we tested the last two examples using the Google Translate engine, and we did not obtain proper translations of either example into Russian, Hebrew, Spanish, or Chinese. Most current translation systems rely on large repositories of idioms. Unfortunately, these systems are not capable of telling apart literal from figurative usage of the same expression in context. Despite the common perception that phrases that can be idioms are mainly used in their idiomatic sense, Fazly et al. (2009)'s analysis of 60 idioms has shown that close to half of these also have a clear literal meaning; and of those with a literal meaning, on average around 40% of their usages are literal.

In this paper we describe an algorithm for automatic classification of idiomatic and literal expressions. Our starting point is that words in a given text segment, such as a paragraph, that are high-ranking representatives of a common topic of discussion are less likely to be a part of an idiomatic expression. Our additional hypothesis is that contexts in which idioms occur, typically, are more affective and, therefore, we incorporate a simple analysis of the intensity of the emotions expressed by the contexts. We investigate the bag of words topic representation of one to three paragraphs containing an expression that should be classified as idiomatic or literal (a target phrase). We extract topics from paragraphs containing idioms and from paragraphs containing literals using an unsupervised clustering method, Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Since idiomatic expressions exhibit the property of non-compositionality, we assume that they usually present different semantics than the words used in the local topic. We treat idioms as semantic outliers, and the identification of semantic shift as outlier detection. Thus, this topic representation allows us to differentiate idioms from literals using the local semantics.

The paper is organized as follows. Section 2 briefly describes previous approaches to idiom recognition or classification. In Section 3 we describe our approach in detail, including the hypothesis, the topic space representation, and the proposed algorithm. After describing the preprocessing procedure in Section 4, we turn to the actual experiments in Sections 5 and 6. We then compare our approach to other approaches (Section 7) and discuss the results (Section 8).

2 Previous Work

Previous approaches to idiom detection can be classified into two groups: 1) type-based extraction, i.e., detecting idioms at the type level; 2) token-based detection, i.e., detecting idioms in context. Type-based extraction is based on the idea that idiomatic expressions exhibit certain linguistic properties that can distinguish them from literal expressions. Sag et al. (2002) and Fazly et al. (2009), among many others, discuss various properties of idioms. Some examples of such properties include 1) lexical fixedness: e.g., neither "shoot the wind" nor "hit the breeze" are valid variations of the idiom "shoot the breeze"; 2) syntactic fixedness: e.g., "The guy kicked the bucket" is potentially idiomatic, whereas "The bucket was kicked" is not idiomatic anymore; and, of course, 3) non-compositionality. Thus, some approaches look at the tendency for words to occur in one particular order, or a fixed pattern.
Hearst (1992) identifies lexico-syntactic patterns that occur frequently, are recognizable with little or no precoded knowledge, and indicate the lexical relation of interest. Widdows and Dorow (2005) use Hearst's concept of lexico-syntactic patterns to extract idioms that consist of fixed patterns between two nouns. Basically, their technique works by finding patterns such as "thrills and spills", whose reversals (such as "spills and thrills") are never encountered.

While many idioms do have these properties, many idioms fall on the continuum from being compositional, to being partly unanalyzable, to being completely non-compositional (Cook et al. (2007)). Fazly et al. (2009) and Li and Sporleder (2010), among others, notice that type-based approaches do not work on expressions that can be interpreted idiomatically or literally depending on the context, and thus an approach that considers tokens in context is more appropriate for the task of idiom recognition.

A number of token-based approaches have been discussed in the literature, both supervised (Katz and Giesbrecht (2006)), weakly supervised (Birke and Sarkar (2006)), and unsupervised (Sporleder and Li (2009); Fazly et al. (2009)). Fazly et al. (2009) develop statistical measures for each linguistic property of idiomatic expressions and use them both in a type-based classification task and in a token identification task, in which they distinguish idiomatic and literal usages of potentially idiomatic expressions in context. Sporleder and Li (2009) present a graph-based model for representing the lexical cohesion of a discourse. Nodes represent tokens in the discourse, which are connected by edges whose value is determined by a semantic relatedness function. They experiment with two different approaches to semantic relatedness: 1) dependency vectors, as described in Padó and Lapata (2007); 2) Normalized Google Distance (Cilibrasi and Vitányi (2007)). Sporleder and Li (2009) show that this method works better for larger contexts (greater than five paragraphs). Li and Sporleder (2010) assume that literal and figurative data are generated by two different Gaussians, literal and non-literal, and the detection is done by comparing which Gaussian model has a higher probability to generate a specific instance. The approach assumes that the target expressions are already known and the goal is to determine whether this expression is literal or figurative in a particular context. The important insight of this method is that figurative language in general exhibits fewer semantic cohesive ties with the context than literal language.

Feldman and Peng (2013) describe several approaches to automatic idiom identification. One of them is idiom recognition as outlier detection. They apply principal component analysis for outlier detection, an approach that does not rely on costly annotated training data, is not limited to a specific type of syntactic construction, and is generally language independent. The quantitative analysis provided in their work shows that the outlier detection algorithm performs better and seems promising. The qualitative analysis also shows that their algorithm has to incorporate several important properties of idioms: (1) idioms are relatively non-compositional compared to literal expressions or other types of collocations; (2) idioms violate local cohesive ties and, as a result, are semantically distant from the local topics; (3) while not all semantic outliers are idioms, non-compositional semantic outliers are likely to be idiomatic; (4) idiomaticity is not a binary property.
Idioms fall on the continuum from being compositional, to being partly unanalyzable, to being completely non-compositional.

The approach described below takes Feldman and Peng (2013)'s original idea and tries to address (2) directly and (1) indirectly. Our approach is also somewhat similar to Li and Sporleder (2010) because it also relies on a list of potentially idiomatic expressions.

3 Our Hypothesis

Similarly to Feldman and Peng (2013), our starting point is that idioms are semantic outliers that violate cohesive structure, especially in local contexts. However, our task is framed as supervised classification, and we rely on data annotated for idiomatic and literal expressions. We hypothesize that words in a given text segment, such as a paragraph, that are high-ranking representatives of a common topic of discussion are less likely to be a part of an idiomatic expression in the document.

3.1 Topic Space Representation

Instead of the simple bag of words representation of a target document (a segment of three paragraphs that contains a target phrase), we investigate the bag of words topic representation for target documents. That is, we extract topics from paragraphs containing idioms and from paragraphs containing literals using an unsupervised clustering method, Latent Dirichlet Allocation (LDA) (Blei et al., 2003). The idea is that if the LDA model is able to capture the semantics of a target document, an idiomatic phrase will be a "semantic" outlier of the themes. Thus, this topic representation will allow us to differentiate idioms from literals using the semantics of the local context.

Let d = {w_1, ..., w_N}^t be a segment (document) containing a target phrase, where N denotes the number of terms in a given corpus and t represents transpose. We first compute a set of m topics from d. We denote this set by

    T(d) = {t_1, ..., t_m}, where t_i = (w_1, ..., w_k)^t.

Here w_j represents a word from a vocabulary of W words. Thus, we have two representations for d: (1) d, represented by its original terms, and (2) d̂, represented by its topic terms. The two corresponding term by document matrices will be denoted by M_D and M_D̂, respectively, where D denotes a set of documents. That is, M_D represents the original "text" term by document matrix, while M_D̂ represents the "topic" term by document matrix.

Figure 1 shows the potential benefit of the topic space representation. In the figure, text segments containing the target phrase "blow whistle" are projected on a two-dimensional subspace. The left panel shows the projection in the "text" space, represented by the term by document matrix M_D. The middle panel shows the projection in the topic space, represented by M_D̂. The topic space representation seems to provide a better separation.

We note that when learning topics from a small data sample, learned topics can be less coherent and interpretable, and thus less useful. To address this issue, regularized LDA has been proposed in the literature (Newman et al., 2011). A key feature is to favor words that exhibit short-range dependencies for a given topic. We can achieve a similar effect by placing restrictions on the vocabulary. For example, when extracting topics from segments containing idioms, we may restrict the vocabulary to contain words from these segments only.
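The paper does not tie this step to a particular LDA implementation. As a rough illustration only, the sketch below assumes scikit-learn and two invented idiom paragraphs; it builds the vocabulary from those segments alone and keeps the top k words of each of m topics, with m and k chosen arbitrarily:

    # Illustrative sketch of restricted-vocabulary topic extraction (Section 3.1).
    # The paragraphs and parameter values are invented, not taken from the paper.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    idiom_paragraphs = [
        "the referee refused to blow the whistle on the foul despite the protests",
        "an employee may blow the whistle on fraud and face retaliation at work",
    ]

    # Fitting the vectorizer on these segments only restricts the vocabulary to them.
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(idiom_paragraphs)

    m, k = 3, 5                       # m topics, top-k terms per topic (arbitrary here)
    lda = LatentDirichletAllocation(n_components=m, random_state=0)
    lda.fit(counts)

    # T(d): each topic t_i is represented by its k highest-weighted words.
    vocab = vectorizer.get_feature_names_out()
    topics = [[vocab[j] for j in comp.argsort()[::-1][:k]] for comp in lda.components_]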
The middle and right panels in Figure 1 illustrate a case in point. The middle panel shows a projection onto the topic space that is computed with a restricted vocabulary, while the right panel shows a projection when we place no restriction on the vocabulary; that is, the vocabulary includes terms from documents that contain both idioms and literals.

[Figure 1: 2D projection of text segments containing "blow whistle." Left panel: original text space. Middle panel: topic space with restricted vocabulary. Right panel: topic space with enlarged vocabulary.]

Note that by computing M_D̂, the topic term by document matrix, from the training data, we have created a vocabulary, or a set of "features" (i.e., topic terms), that is used to directly describe a query or test segment. The main advantage is that topics are more accurate when computed by LDA from a large collection of idiomatic or literal contexts. Thus, these topics capture more accurately the semantic contexts in which the target idiomatic and literal expressions typically occur. If a target query appears in a similar semantic context, the topics will be able to describe this query as well. On the other hand, one might similarly apply LDA to a given query to extract query topics, and create the query vector from the query topics. The main disadvantage is that LDA may not be able to extract topic terms that match well with those in the training corpus when applied to the query in isolation.

3.2 Algorithm

The main steps of the proposed algorithm, called TopSpace, are shown below.

Input: D = {d_1, ..., d_k, d_{k+1}, ..., d_n}: training documents of k idioms and n - k literals. Q = {q_1, ..., q_l}: l query documents.

1. Let DicI be the vocabulary determined solely from the idioms {d_1, ..., d_k}. Similarly, let DicL be the vocabulary obtained from the literals {d_{k+1}, ..., d_n}.
2. For a document d_i in {d_1, ..., d_k}, apply LDA to extract a set of m topics T(d_i) = {t_1, ..., t_m} using DicI. For d_i in {d_{k+1}, ..., d_n}, DicL is used.
3. Let D̂ = {d̂_1, ..., d̂_k, d̂_{k+1}, ..., d̂_n} be the resulting topic representation of D.
4. Compute the term by document matrix M_D̂ from D̂, and let DicT and gw be the resulting dictionary and global weight (idf), respectively.
5. Compute the term by document matrix M_Q from Q, using DicT and gw from the previous step.

Output: M_D̂ and M_Q.

To summarize, after splitting our corpus (see Section 4) into paragraphs and preprocessing it, we extract topics from paragraphs containing idioms and from paragraphs containing literals. We then compute a term by document matrix, where terms are topic terms and documents are topics extracted from the paragraphs. Our test data are represented as a term-by-document matrix as well (see the details in Section 5). A rough sketch of how steps 3-5 can be realized is given below.
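The paper specifies an idf global weight for steps 4 and 5 but not a particular toolkit. The sketch below is a rough analogue that assumes scikit-learn, with invented topic representations standing in for the output of step 2; TfidfVectorizer plays the role of DicT and gw, since its dictionary and idf weights are learned once from the topic representation D̂ and then reused unchanged for the query documents Q:

    # Illustrative sketch of TopSpace steps 3-5; the data and names are invented.
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Step 3: topic representation of the training segments (hand-written
    # stand-ins for what step 2 would produce with LDA).
    topic_reps = [
        [["whistle", "blow", "fraud", "official"], ["referee", "match", "foul"]],
        [["police", "report", "corruption"], ["employee", "company", "manager"]],
    ]
    topic_docs = [" ".join(w for topic in seg for w in topic) for seg in topic_reps]

    query_docs = ["he finally decided to blow the whistle on the whole scheme"]  # Q

    # Steps 4-5: DicT and the global idf weight (gw) come from the training topics
    # and are then applied, unchanged, to the queries. Note that scikit-learn
    # returns document-by-term matrices, i.e. the transpose of the paper's layout.
    tfidf = TfidfVectorizer()
    M_D_hat = tfidf.fit_transform(topic_docs)   # fixes DicT and the idf weights
    M_Q = tfidf.transform(query_docs)           # queries described with DicT and gw only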
3.3 Fisher Linear Discriminant Analysis

Once M_D̂ and M_Q are obtained, a classification rule can be applied to predict idioms vs. literals. The approach we are taking in this work for classifying idioms vs. literals is based on Fisher's discriminant analysis (FDA) (Fukunaga, 1990). FDA often significantly simplifies tasks such as regression and classification by computing low-dimensional subspaces having statistically uncorrelated or discriminant variables. In language analysis, statistically uncorrelated or discriminant variables are extracted and utilized for description, detection, and classification. Woods et al. (1986), for example, use statistically uncorrelated variables for language test scores. A group of subjects is scored on a battery of language tests, where the subtests measure different abilities such as vocabulary, grammar, or reading comprehension. Horvath (1985) analyzes speech samples of Sydney speakers to determine the relative occurrence of five different variants of each of five vowel sounds. Using this data, the speakers cluster according to such factors as gender, age, ethnicity, and socio-economic class. A similar approach has been discussed in Peng et al. (2010).

FDA is a class of methods used in machine learning to find the linear combination of features that best separates two classes of events. FDA is closely related to principal component analysis (PCA), which finds a linear combination of features that best explains the data. Discriminant analysis explicitly exploits class information in the data, while PCA does not.

Idiom classification based on discriminant analysis has several advantages. First, as has been mentioned, it does not make any assumption regarding data distributions. Many statistical detection methods assume a Gaussian distribution of normal data, which is far from reality. Second, by using a few discriminants to describe data, discriminant analysis provides a compact representation of the data, resulting in increased computational efficiency and real-time performance.

In FDA, within-class, between-class, and mixture scatter matrices are used to formulate the criteria of class separability. Consider a J-class problem, where m_0 is the mean vector of all data, and m_j is the mean vector of the j-th class data. A within-class scatter matrix characterizes the scatter of samples around their respective class mean vector, and it is expressed by

    S_w = \sum_{j=1}^{J} p_j \sum_{i=1}^{l_j} (x_{ji} - m_j)(x_{ji} - m_j)^t,    (1)

where l_j is the size of the data in the j-th class, p_j (with \sum_j p_j = 1) represents the proportion of the j-th class contribution, and t denotes the transpose operator. A between-class scatter matrix characterizes the scatter of the class means around the mixture mean m_0. It is expressed by

    S_b = \sum_{j=1}^{J} p_j (m_j - m_0)(m_j - m_0)^t.    (2)

The mixture scatter matrix is the covariance matrix of all samples, regardless of their class assignment, and it is given by

    S_m = \sum_{i=1}^{l} (x_i - m_0)(x_i - m_0)^t = S_w + S_b.    (3)

The Fisher criterion is used to find a projection matrix W \in R^{q \times d} that maximizes

    J(W) = \frac{|W^t S_b W|}{|W^t S_w W|}.    (4)

In order to determine the matrix W that maximizes J(W), one can solve the generalized eigenvalue problem S_b w_i = \lambda_i S_w w_i. The eigenvectors corresponding to the largest eigenvalues form the columns of W. For a two-class problem, this can be written in a simpler form: S_w w = m, where m = m_1 - m_2 and m_1 and m_2 are the means of the two classes.
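As a concrete illustration of the two-class case, the sketch below (assuming NumPy) weights the class scatters by their proportions as in Equation (1), solves S_w w = m_1 - m_2 for the discriminant direction, and labels a query by thresholding its projection at the midpoint of the projected class means. The midpoint rule, the small ridge term, and the random data are illustrative assumptions, not the paper's exact procedure:

    # Illustrative sketch of the two-class Fisher computation (Section 3.3).
    import numpy as np

    def fisher_direction(X1, X2):
        """Solve S_w w = m1 - m2, with rows of X1/X2 as class-1/class-2 samples."""
        m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
        p1 = len(X1) / (len(X1) + len(X2))          # class proportions p_j of eq. (1)
        p2 = 1.0 - p1
        Sw = p1 * (X1 - m1).T @ (X1 - m1) + p2 * (X2 - m2).T @ (X2 - m2)
        Sw += 1e-6 * np.eye(Sw.shape[0])            # small ridge keeps S_w invertible
        return np.linalg.solve(Sw, m1 - m2), m1, m2

    def classify(q, w, m1, m2):
        """Label a query vector by its projection relative to the projected class means."""
        midpoint = w @ (m1 + m2) / 2.0
        return "idiom" if q @ w > midpoint else "literal"

    # Usage with random stand-ins for document vectors in the topic-term space.
    rng = np.random.default_rng(0)
    X_idiom, X_literal = rng.random((20, 50)), rng.random((30, 50)) + 0.1
    w, m1, m2 = fisher_direction(X_idiom, X_literal)
    print(classify(rng.random(50), w, m1, m2))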
4 Data preprocessing

4.1 Verb-noun constructions

For our experiments we use the British National Corpus (BNC, Burnard (2000)) and a list of verb-noun constructions (VNCs) extracted from BNC by Fazly et al.