Natural Language Processing
[Based on Stanford NLP course]
Preface
Language Technology
Mostly solved:
Spam detection
Part-of-speech (POS) tagging
Named Entity Recognition (NER)
Making good progress:
Sentiment analysis
Co-reference resolution
Word sense disambiguation (WSD)
Parsing
Machine translation (MT)
Information Extraction (IE)
Still really hard:
Question answering (QA)
Paraphrase
Summarization
Dialog
Why is NLP difficult?
Non-standard language
Segmentation issues
Idioms
Neologisms
World knowledge
Tricky entity names
Basic skills
Regular Expressions
Tokenization
Word Normalization and Stemming
Classifiers:
Decision Tree
Logistic Regression
SVM
Neural Nets
Edit Distance
Used for:
Spell correction
Computational Biology
Basic operations:
Insertion
Deletion
Substitution
Algorithms:
Levenshtein
Backtrace
Needleman-Wunsch
Smith-Waterman
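A minimal sketch of the Levenshtein algorithm above, as a standard dynamic program with insertion and deletion cost 1; the substitution cost is left as a parameter (the course's variant often charges 2 for substitution).

```python
def levenshtein(source, target, sub_cost=1):
    """Minimum edit distance via dynamic programming.

    D[i][j] = cost of editing source[:i] into target[:j] using
    insertion (cost 1), deletion (cost 1), and substitution
    (cost `sub_cost`).
    """
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # delete all of source[:i]
    for j in range(1, m + 1):
        D[0][j] = j                      # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            D[i][j] = min(
                D[i - 1][j] + 1,                              # deletion
                D[i][j - 1] + 1,                              # insertion
                D[i - 1][j - 1] + (0 if same else sub_cost),  # copy / substitution
            )
    return D[n][m]

print(levenshtein("intention", "execution"))      # 5 with substitution cost 1
print(levenshtein("intention", "execution", 2))   # 8 with substitution cost 2
```

A backtrace over the same table (recording which of the three moves was taken) recovers the alignment itself.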
Language Model
Probabilistic Language Models
Machine translation
Spell correction
Speech Recognition
Summarization
Question answering
Markov Assumption
$$ P(\omega_1 \omega_2 \dots \omega_n) \approx \prod_i P(\omega_i | \omega_{i-k} \dots \omega_{i-1}) $$
Unigram Model
$$ P(\omega_1 \omega_2 \dots \omega_n) \approx \prod_i P(\omega_i) $$
Bigram Model
$$ P(\omega_i | \omega_1 \omega_2 \dots \omega_{i-1}) \approx P(\omega_i | \omega_{i-1}) $$
Add-k Smoothing
$$ P_{Add-k}(\omega_i|\omega_{i-1})=\tfrac{c(\omega_{i-1},\omega_i)+k}{c(\omega_{i-1})+kV} $$
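To make the add-k estimate concrete, a minimal sketch over a tiny illustrative corpus (the sentences and the value of k are made up for the example):

```python
from collections import Counter

corpus = ["<s> I am Sam </s>".split(),
          "<s> Sam I am </s>".split(),
          "<s> I do not like green eggs and ham </s>".split()]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))

V = len(unigrams)  # vocabulary size

def p_add_k(word, prev, k=0.5):
    """P_add-k(word | prev) = (c(prev, word) + k) / (c(prev) + k*V)."""
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * V)

print(p_add_k("am", "I"))   # smoothed P(am | I)
print(p_add_k("ham", "I"))  # an unseen bigram still gets nonzero mass
```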
Unigram prior smoothing
$$ P_{UnigramPrior}(\omega_i|\omega_{i-1})=\tfrac{c(\omega_{i-1},\omega_i)+m\,P(\omega_i)}{c(\omega_{i-1})+m} $$
Smoothing Algorithm
Good-Turing
Kneser-Ney
Witten-Bell
Spelling Correction
Tasks:
Spelling error detection
Spelling error correction
Autocorrect
Suggest a correction
Suggestion lists
Real-word spelling errors:
For each word $w$, generate a candidate set
Choose the best candidate
Find the correct word $w$
Candidate generation (see the sketch after this list):
Words with similar spelling
Words with similar pronunciation
Factors that could influence p(misspelling | word):
Source letter
Target letter
Surrounding letters
The position in the word
Nearby keys on the keyboard
Homology on the keyboard
Pronunciation
Likely morpheme transformations
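A minimal sketch of generating similar-spelling candidates by enumerating every string within one edit of the input, in the style of Norvig's well-known spell corrector; the example word and the small dictionary used for filtering are illustrative.

```python
import string

ALPHABET = string.ascii_lowercase

def edits1(word):
    """All strings one edit away from `word` (similar-spelling candidates)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutes = [L + c + R[1:] for L, R in splits if R for c in ALPHABET]
    inserts = [L + c + R for L, R in splits for c in ALPHABET]
    return set(deletes + transposes + substitutes + inserts)

# In a full corrector the candidates would be intersected with a dictionary
# and scored by a channel model p(misspelling | word) times a language model.
candidates = edits1("acress")
dictionary = {"across", "actress", "access", "acres", "caress", "cress"}
print(len(candidates))
print(candidates & dictionary)
```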
Text Classification
Used for:
Assigning subject categories, topics, or genres
Spam detection
Authorship identification
Age/gender identification
Language identification
Sentiment analysis…
Methods: Supervised Machine Learning
Naive Bayes
Logistic Regression
Support Vector Machines
k-Nearest Neighbors
Naive Bayes
$$ C_{MAP}=\arg\max_{c\in C}P(x_1,x_2,\dots,x_n|c)P(c) $$
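A minimal sketch of multinomial Naive Bayes with the Laplace (add-1) smoothing mentioned below, over bag-of-words counts; the tiny spam/ham training set is made up.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, class). Returns log-priors, add-1 smoothed
    per-class word log-likelihoods, and the vocabulary."""
    class_counts = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, c in docs:
        word_counts[c].update(tokens)
        vocab.update(tokens)
    V = len(vocab)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        log_lik[c] = {w: math.log((word_counts[c][w] + 1) / (total + V))
                      for w in vocab}
    return log_prior, log_lik, vocab

def classify(tokens, log_prior, log_lik, vocab):
    """C_MAP = argmax_c [ log P(c) + sum_i log P(x_i | c) ]; unseen words are skipped."""
    scores = {c: log_prior[c] + sum(log_lik[c][w] for w in tokens if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

train = [("buy cheap pills now".split(), "spam"),
         ("cheap viagra buy now".split(), "spam"),
         ("meeting schedule for tomorrow".split(), "ham"),
         ("project meeting notes attached".split(), "ham")]
model = train_nb(train)
print(classify("cheap pills".split(), *model))       # -> spam
print(classify("meeting tomorrow".split(), *model))  # -> ham
```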
Laplace (add-1) Smoothing
Used for Spam Filtering
Training data:
No training data: manually written rules
Very little data:
Use Naive Bayes
Get more labeled data
Try semi-supervised training methods
A reasonable amount of data:
All clever classifiers:
SVM
Regularized Logistic Regression
User-interpretable decision trees
A huge amount of data:
At a cost:
SVM (train time)
kNN (test time)
Regularized Logistic Regression can be somewhat better
So Naive Bayes can come back into its own
Tweaking performance:
Domain-specific
Collapse terms
Upweighting some words
F Measure
Precision: % of selected items that are correct
Recall: % of correct items that are selected
|              | correct | not correct |
|--------------|---------|-------------|
| selected     | tp      | fp          |
| not selected | fn      | tn          |
$$ F=\tfrac{1}{\alpha \tfrac{1}{P} +(1-\alpha)\tfrac{1}{R}}=\tfrac{(\beta^2+1)PR}{\beta^2 P+R} $$
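A small worked sketch of precision, recall, and the F measure from the contingency counts above; the tp/fp/fn numbers are made up.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """Precision, recall, and F_beta = (beta^2 + 1) P R / (beta^2 P + R)."""
    precision = tp / (tp + fp)   # fraction of selected items that are correct
    recall = tp / (tp + fn)      # fraction of correct items that are selected
    f = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

# Illustrative counts: 40 true positives, 10 false positives, 20 false negatives
print(f_measure(tp=40, fp=10, fn=20))  # P = 0.8, R ~ 0.667, F1 ~ 0.727
```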
Sentiment Analysis
Typology of Affective States
Emotion
Mood
Interpersonal stances
Attitudes (Sentiment Analysis)
Personality traits
Baseline Algorithm
Tokenization
Feature Extraction
Classification
Naive Bayes
MaxEnt (better)
SVM (better)
Issues
HTML, XML and other markup
Capitalization
Phone numbers, dates
Emoticons
Sentiment Lexicons
Semi-supervised learning of lexicons
Use a small amount of information to bootstrap a lexicon
Adjectives conjoined by “and” have the same polarity
Adjectives conjoined by “but” do not
Process:
Label a seed set of adjectives
Expand the seed set to conjoined adjectives
A supervised classifier assigns “polarity similarity” to each word pair
Clustering for partitioning
Turney Algorithm
Extract a phrasal lexicon from reviews
Learn the polarity of each phrase
Rate a review by the average polarity of its phrases
Advantages:
Domain-specific
More robust
Assuming classes have equal frequencies:
If not balanced: need to use F-scores
Severe imbalance can also degrade classifier performance
Solutions:
Resampling in training
Cost-sensitive learning
Penalize SVM more for misclassification of the rare class
Features:
Negation is important
Using all words works well for some tasks (NB)
Finding subsets of words may help in other tasks
Features
Joint and Discriminative
Joint (generative) models place probabilities over both the observed data and the hidden stuff: P(c,d)
N-gram models
Naive Bayes classifiers
Hidden Markov models
Probabilistic context-free grammars
IBM machine translation alignment models
Discriminative (conditional) models take the data as given and put a probability over hidden structure given the data: P(c|d)
Logistic regression
Conditional log-linear / Maximum Entropy models
Conditional Random Fields
SVMs
Perceptron
Features
Feature Expectations:
Empirical count
Model expectation
Feature-Based Models:
Text Categorization
Word-Sense Disambiguation
POS Tagging
Maximum Entropy
$$ \log P(C|D,\lambda)=\sum_{(c,d)\in (C,D)}\log P(c|d,\lambda)=\sum_{(c,d)\in(C,D)}\log \tfrac{\exp \sum_{i} \lambda_i f_i(c,d)}{\sum_{c'} \exp\sum_i \lambda_i f_i(c',d)} $$
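A minimal sketch of the conditional probability inside this log-likelihood: sum the weighted features for each class, exponentiate, and normalize (a softmax). The feature functions and weights below are illustrative, not from the course.

```python
import math

def maxent_prob(d, classes, features, weights):
    """P(c | d) = exp(sum_i lambda_i f_i(c, d)) / sum_c' exp(sum_i lambda_i f_i(c', d))."""
    scores = {c: sum(w * f(c, d) for f, w in zip(features, weights)) for c in classes}
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}

# Illustrative binary features for classifying a token d as LOCATION / PERSON / OTHER
features = [
    lambda c, d: 1.0 if c == "LOCATION" and d.endswith("burg") else 0.0,
    lambda c, d: 1.0 if c == "PERSON" and d[0].isupper() else 0.0,
    lambda c, d: 1.0 if c == "LOCATION" and d[0].isupper() else 0.0,
]
weights = [1.8, 0.6, 0.3]  # the lambda_i, normally learned by maximizing the log-likelihood

print(maxent_prob("Hamburg", ["LOCATION", "PERSON", "OTHER"], features, weights))
```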
Find the optimal parameters
Gradient descent (GD), stochastic gradient descent (SGD)
Iterative proportional fitting methods: Generalized Iterative Scaling (GIS) and Improved Iterative Scaling (IIS)
Conjugate gradient (CG), perhaps with preconditioning
Quasi-Newton methods: limited-memory variable metric (LMVM) methods, in particular L-BFGS
Feature Overlap
Maxent models handle overlapping features well
Unlike NB, there is no double counting
Feature Interaction
Maxent models handle overlapping features well, but do not automatically model feature interactions
If you want interaction terms, you have to add them
A disjunctive feature would also have done it
Smoothing:
Issues of scale
Lots of features
Lots of sparsity
Optimization problems
Methods:
Early stopping
Priors (MAP)
Regularization
Virtual Data
Count Cutoffs
Named Entity Recognition (NER)
The uses:
Named entities can be indexed, linked off, etc.
Sentiment can be attributed to companies or products
A lot of IE relations are associations between named entities
For question answering, answers are often named entities
Training:
Collect a set of representative training documents
Label each token for its entity class or other
Design feature extractors appropriate to the text and classes
Train a sequence classifier to predict the labels from the data
Inference
Greedy
Fast, no extra memory requirements
Very easy to implement
With rich features including observations to the right, it may perform quite well
Greedy, we may commit errors we cannot recover from
Beam
Fast, beam sizes of 3-5 are almost as good as exact inference in many cases
Easy to implement
Inexact: the globally best sequence can fall off the beam
Viterbi (see the sketch after this list)
Exact: the globally best sequence is returned
Harder to implement long-distance state-state interactions
CRFs
Training is slower, but CRFs avoid causal-competition biases
In practice they usually work much the same as MEMMs
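A minimal sketch of Viterbi decoding for sequence labeling, assuming a local log-score function score(prev_tag, tag, i) is already available (in an MEMM it would come from the trained classifier); the toy scorer and tag set below are made up.

```python
def viterbi(n_positions, tags, score):
    """Exact decoding: best[i][t] = max over prev of best[i-1][prev] + score(prev, t, i).

    `score(prev_tag, tag, i)` returns a (log-)score for assigning `tag` at
    position i given the previous tag (prev_tag is None at position 0).
    Returns the globally best tag sequence via a backtrace.
    """
    best = [{t: (score(None, t, 0), None) for t in tags}]
    for i in range(1, n_positions):
        col = {}
        for t in tags:
            prev, s = max(((p, best[i - 1][p][0] + score(p, t, i)) for p in tags),
                          key=lambda x: x[1])
            col[t] = (s, prev)
        best.append(col)
    tag = max(tags, key=lambda t: best[-1][t][0])   # best final tag
    path = [tag]
    for i in range(n_positions - 1, 0, -1):         # follow backpointers
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))

# Illustrative toy scorer for the sentence "John lives in Paris"
words = ["John", "lives", "in", "Paris"]
def toy_score(prev, tag, i):
    s = 1.0 if (tag == "PER" and words[i] == "John") else 0.0
    s += 1.0 if (tag == "LOC" and words[i] == "Paris") else 0.0
    s += 0.5 if (tag == "O" and words[i] in {"lives", "in"}) else 0.0
    s += 0.2 if prev == "O" and tag != "O" else 0.0   # mild transition preference
    return s

print(viterbi(len(words), ["PER", "LOC", "O"], toy_score))  # ['PER', 'O', 'O', 'LOC']
```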
Relation Extraction
How to build relation extractors
Hand-written patterns
High precision and low recall
Specific domains
A lot of work
Supervised machine learning
MaxEnt, Naive Bayes, SVM
Can get high accuracies with enough training data
Labeling a large training set is expensive
Brittle, don’t generalize well to different genres
Semi-supervised and unsupervised
Bootstrapping (using seeds)
Find sentences with these pairs
Look at the context between or around the pair and generalize the context to create patterns
Use the patterns to grep for more pairs
Distant supervision
Doesn’t require iteratively expanding patterns
Uses very large amounts of unlabeled data
Not sensitive to genre issues in the training corpus
Unsupervised learning from the web
Use parsed data to train a “trustworthy tuple” classifier
Single pass: extract all relations between NPs, keep if trustworthy
Assessor ranks relations based on text redundancy
POS Tagging
Performance
About 97% currently
But the baseline is already 90%
Partly easy because:
Many words are unambiguous
You get points for them and for punctuation marks
Difficulty
Ambiguous words
Common words can be ambiguous
Sources of information
Knowledge of neighboring words
Knowledge of word probabilities
Word
Lowercased word
Prefixes
Suffixes
Capitalization
Word shapes
Summary
The change from generative to discriminative models does not by itself result in great improvement
The higher accuracy of discriminative models comes at the price of much slower training
Parsing
Treebank
Reusability of the labor
Many parsers, POS taggers, etc.
Valuable resource for linguistics
Broad coverage
Frequencies and distributional information
A way to evaluate systems
Statistical parsing applications
High-precision question answering
Improving biological named entity finding
Syntactically based sentence compression
Extracting interaction in computer games
Helping linguists find data
Source sentence analysis for machine translation
Relation extraction systems
Phrase structure grammars (= context-free grammars, CFGs) in NLP
G = (T, C, N, S, L, R)
T is a set of terminal symbols
C is a set of preterminal symbols
N is a set of nonterminal symbols
S is the start symbol
L is the lexicon, a set of items of the form X -> x
R is the grammar, a set of items of the form X -> $\gamma$
e is the empty symbol
Probabilistic Parsing
Probabilistic (or stochastic) context-free grammars (PCFGs)
G = (T, N, S, R, P)
P is a probability function
Chomsky Normal Form
Reconstructing n-aries is easy
Reconstructing unaries/empties is trickier
Binarization is crucial for cubic-time CFG parsing
Cocke-Kasami-Younger (CKY) Constituency Parsing (see the sketch after this list)
Unaries can be incorporated into the algorithm
Empties can be incorporated
Binarization is vital
Performance
Robust
Partial solution for grammar ambiguity
Gives a probabilistic language model
The problem seems to be that PCFGs lack the lexicalization of a trigram model
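A minimal sketch of probabilistic CKY over a grammar already in Chomsky Normal Form (binary rules plus a lexicon); the toy PCFG and sentence are illustrative.

```python
from collections import defaultdict

# Toy CNF PCFG: binary rules (A -> B C, prob) and lexical rules (A -> word, prob)
binary = [("S", "NP", "VP", 1.0),
          ("VP", "V", "NP", 1.0),
          ("NP", "Det", "N", 0.6)]
lexicon = {"people": [("NP", 0.4)], "fish": [("V", 1.0)],
           "the": [("Det", 1.0)], "tanks": [("N", 1.0)]}

def cky(words):
    n = len(words)
    # score[i][j][A] = best probability of A spanning words[i:j]
    score = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    back = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                      # fill the lexical cells
        for A, p in lexicon.get(w, []):
            score[i][i + 1][A] = p
    for span in range(2, n + 1):                       # longer spans, bottom-up
        for i in range(0, n - span + 1):
            j = i + span
            for split in range(i + 1, j):
                for A, B, C, p in binary:
                    cand = p * score[i][split][B] * score[split][j][C]
                    if cand > score[i][j][A]:
                        score[i][j][A] = cand
                        back[i][j][A] = (split, B, C)
    def build(i, j, A):                                # reconstruct the best tree
        if A not in back[i][j]:
            return (A, words[i])
        split, B, C = back[i][j][A]
        return (A, build(i, split, B), build(split, j, C))
    return score[0][n]["S"], build(0, n, "S")

prob, tree = cky("people fish the tanks".split())
print(prob)   # probability of the best S parse
print(tree)
```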
Lexicalized Parsing
Charniak
Probabilistic conditioning is “top-down” like a regular PCFG, but actual parsing is bottom-up, somewhat like the CKY algorithm we saw
Non-Independence
The independence assumptions of a PCFG are often too strong
We can relax independence assumptions by encoding dependencies into the PCFG symbols, by state splitting (at the cost of sparseness)
Accurate Unlexicalized Parsing
Grammar rules are not systematically specified down to the level of lexical items
Closed vs. open class words
Learning Latent Annotations
Brackets are known
Base categories are known
Induce subcategories
Clever split/merge category refinement
EM, like Forward-Backward for HMMs, but constrained by the tree
Dependency Parsing
Methods
Dynamic programming (like in the CKY algorithm)
Graph algorithms
Constraint satisfaction
Deterministic parsing
Sources of information
Bilexical affinities
Dependency distance
Intervening material
Valency of heads
MaltParser (see the sketch after this list)
Greedy
Bottom-up
Has:
A stack
A buffer
A set of dependency arcs
A set of actions
Each action is predicted by a discriminative classifier (SVM)
No search
Provides close to state-of-the-art parsing performance
Provides very fast linear-time parsing
Projective
Dependencies from a CFG tree using heads must be projective
But dependency theory normally does allow non-projective structures to account for displaced constituents
The arc-eager algorithm only builds projective dependency trees
Stanford Dependencies
Projective
Can be generated by postprocessing headed phrase-structure parses, or by dependency parsers like MaltParser or the Easy-First Parser
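A minimal sketch of the stack/buffer/arc-set state that a MaltParser-style transition parser maintains. For simplicity it applies a hand-given arc-standard action sequence instead of asking a trained SVM classifier for the next action, so it only illustrates the data structures, not MaltParser's actual arc-eager system.

```python
def parse(words, actions):
    """Apply a sequence of arc-standard transitions.

    State: a stack, a buffer of remaining words, and a set of dependency
    arcs (head, dependent). In MaltParser the next action would be chosen
    by a discriminative classifier from features of this state; here the
    action sequence is simply given.
    """
    stack, buffer, arcs = ["ROOT"], list(words), []
    for act in actions:
        if act == "SHIFT":          # move the next word onto the stack
            stack.append(buffer.pop(0))
        elif act == "LEFT-ARC":     # top of stack becomes head of the item below it
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif act == "RIGHT-ARC":    # item below becomes head of the top of stack
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

# "Book that flight": Book -> flight -> that, ROOT -> Book
actions = ["SHIFT", "SHIFT", "SHIFT", "LEFT-ARC", "RIGHT-ARC", "RIGHT-ARC"]
print(parse(["Book", "that", "flight"], actions))
# [('flight', 'that'), ('Book', 'flight'), ('ROOT', 'Book')]
```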
Information Retrieval
Used for:
Web search
E-mail search
Searching your laptop
Corporate knowledge bases
Legal information retrieval
Classic search
User task
Info need
Query & Collection
Search engine
Results
Initial stages of text processing
Tokenization
Normalization
Stemming
Stop words
Query processing
AND
“Merge” algorithm (see the sketch after this list)
Phrase queries
Biword indexes
False positives
Bigger dictionary
Positional indexes
Extract inverted index entries for each distinct term
Merge their doc:position lists to enumerate all positions
Same general method for proximity searches
A positional index is 2-4 times as large as a non-positional index
Caveat: all of this holds for “English-like” languages
These two approaches can be profitably combined
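A minimal sketch of the “merge” algorithm referenced above for an AND query: walk two sorted postings lists in lockstep (the docIDs are illustrative).

```python
def intersect(p1, p2):
    """Intersect two sorted docID lists in O(len(p1) + len(p2))."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID appears in both postings lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

# Illustrative postings lists for two query terms
print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [1, 2, 4, 5, 6, 16, 57, 132]))
# -> [1, 2, 4]
```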
Ranked Retrieval
Advantages
Free text queries
Large result sets are not an issue
Query-document matching scores
Jaccard coefficient
Doesn’t consider term frequency
Length normalization needed
Bag of words model
Term frequency (tf)
Log-frequency weighting
Inverse document frequency (idf)
tf-idf weighting
$$ W_{t,d}=(1+\log tf_{t,d})\times \log_{10}(N/df_t) $$
Best known weighting scheme in information retrieval
Similarity: cosine(query, document)
$$ \cos(\vec q,\vec d)=\tfrac{\vec q \cdot \vec d}{|\vec q||\vec d|}=\tfrac{\vec q}{|\vec q|}\cdot \tfrac{\vec d}{|\vec d|}=\tfrac{\sum_{i=1}^{|V|}q_i d_i}{\sqrt{\sum_{i=1}^{|V|}q_i^2}\sqrt{\sum_{i=1}^{|V|}d_i^2}} $$
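A minimal sketch tying the two formulas together: build tf-idf vectors with $(1+\log_{10} tf)\times\log_{10}(N/df)$, then rank documents by cosine similarity to the query. The three-document collection is made up.

```python
import math
from collections import Counter

docs = {"d1": "new york times".split(),
        "d2": "new york post".split(),
        "d3": "los angeles times".split()}
N = len(docs)
df = Counter(t for tokens in docs.values() for t in set(tokens))  # document frequency

def tfidf_vector(tokens):
    """w_{t,d} = (1 + log10 tf) * log10(N / df_t); terms unseen in the collection are dropped."""
    tf = Counter(tokens)
    return {t: (1 + math.log10(c)) * math.log10(N / df[t])
            for t, c in tf.items() if t in df}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

query = tfidf_vector("new new times".split())
for name, tokens in docs.items():
    print(name, round(cosine(query, tfidf_vector(tokens)), 3))
```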
Weighting
Many search engines allow for different weightings for queries vs. documents
A very standard weighting scheme is lnc.ltc
Document: logarithmic tf, no idf, and cosine normalization
Query: logarithmic tf, idf, cosine normalization
Evaluation
Mean average precision (MAP)
Semantics
Situation
Reminder: lemma and wordform
Homonymy
Homographs
Homophones
Polysemy
Synonyms
Antonyms
Hyponymy and Hypernymy
Hyponyms and Instances
Applications of Thesauri and Ontologies
Information Extraction
Information Retrieval
Question Answering
Bioinformatics and Medical Informatics
Machine Translation
Word Similarity
Synonymy and similarity
Similarity algorithms
Thesaurus-based algorithms
Distributional algorithms
Thesaurus-based similarity
$LCS(c_1,c_2)=$ the most informative (lowest) node in the hierarchy subsuming both $c_1$ and $c_2$
$$ Sim_{path}(c_1,c_2)=\tfrac{1}{pathlen(c_1,c_2)} $$
$$ Sim_{resnik}(c_1,c_2)=-\log P(LCS(c_1,c_2)) $$
$$ Sim_{lin}(c_1,c_2)=\tfrac{2\log P(LCS(c_1,c_2))}{\log P(c_1)+\log P(c_2)} $$
$$ Sim_{jiangconrath}(c_1,c_2)=\tfrac{1}{\log P(c_1)+\log P(c_2)-2\log P(LCS(c_1,c_2))} $$
$$ Sim_{eLesk}(c_1,c_2)=\sum_{r,q\in RELS}overlap(gloss(r(c_1)),gloss(q(c_2))) $$
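A minimal sketch of the path and Resnik measures over a tiny hand-built is-a hierarchy with made-up P(c) values; a real implementation would compute these over WordNet.

```python
import math

# Toy is-a hierarchy (child -> parent) and made-up concept probabilities P(c)
parent = {"nickel": "coin", "dime": "coin", "coin": "money",
          "credit-card": "money", "money": "entity"}
P = {"nickel": 0.01, "dime": 0.01, "coin": 0.05, "credit-card": 0.02,
     "money": 0.1, "entity": 1.0}

def ancestors(c):
    """c plus all of its hypernyms up to the root, in order."""
    chain = [c]
    while c in parent:
        c = parent[c]
        chain.append(c)
    return chain

def lcs(c1, c2):
    """Lowest common subsumer: first ancestor of c1 that also subsumes c2."""
    a2 = set(ancestors(c2))
    return next(c for c in ancestors(c1) if c in a2)

def sim_path(c1, c2):
    common = lcs(c1, c2)
    pathlen = ancestors(c1).index(common) + ancestors(c2).index(common) + 1  # edges + 1
    return 1.0 / pathlen

def sim_resnik(c1, c2):
    return -math.log(P[lcs(c1, c2)])

print(lcs("nickel", "credit-card"))            # money
print(sim_path("nickel", "dime"))              # 1/3
print(round(sim_resnik("nickel", "dime"), 3))  # -log P(coin)
```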
Evaluating
Intrinsic
Correlation between algorithm and human word similarity ratings
Extrinsic (task-based, end-to-end)
Malapropism (spelling error) detection
WSD
Essay grading
Taking TOEFL multiple-choice vocabulary tests
Problems
We don’t have a thesaurus for every language
Recall:
Missing words
Missing phrases
Missing connections between senses
Works less well for verbs and adjectives
Distributional models of meaning
For the term-document matrix: tf-idf
For the term-context matrix: Positive Pointwise Mutual Information (PPMI) is common
$$ PMI(w_1,w_2)=\log_2\tfrac{P(w_1,w_2)}{P(w_1)P(w_2)} $$
PMI is biased toward infrequent events
Various weighting schemes
Add-one smoothing
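A minimal sketch of PPMI from a word-context count table (the counts are illustrative); the add-one smoothing mentioned above would simply be applied to the counts before normalizing.

```python
import math
from collections import defaultdict

# Illustrative word-context co-occurrence counts
counts = {("apricot", "pinch"): 1, ("apricot", "sugar"): 1,
          ("pineapple", "pinch"): 1, ("pineapple", "sugar"): 1,
          ("digital", "data"): 2, ("information", "data"): 6,
          ("information", "result"): 4}

total = sum(counts.values())
p_w = defaultdict(float)   # marginal P(w)
p_c = defaultdict(float)   # marginal P(c)
for (w, c), n in counts.items():
    p_w[w] += n / total
    p_c[c] += n / total

def ppmi(w, c):
    """PPMI(w, c) = max(0, log2 [ P(w, c) / (P(w) P(c)) ])."""
    p_wc = counts.get((w, c), 0) / total
    if p_wc == 0:
        return 0.0
    return max(0.0, math.log2(p_wc / (p_w[w] * p_c[c])))

print(round(ppmi("information", "data"), 3))
print(round(ppmi("apricot", "sugar"), 3))
```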
Question Answering
Question processing
Detect question type, answer type (NER), focus, relations
Formulate queries to send to a search engine
Passage Retrieval
Retrieve ranked documents
Break into suitable passages and re-rank
Answer processing
Extract candidate answers
Rank candidates
Approaches
Knowledge-based
Build a semantic representation of the query
Map from this semantics to query structured data or resources
Hybrid
Build a shallow semantic representation of the query
Generate answer candidates using IR methods
Score each candidate using richer knowledge sources
Answer type taxonomy
6 coarse classes
Abbreviation
Entity
Description
Human
Location
Numeric
50 finer classes
Detection
Hand-written rules
Machine Learning
Hybrids
Features
Question words and phrases
Part-of-speech tags
Parse features
Named Entities
Semantically related words
Keyword selection
Non-stop words
NNP words in recognized named entities
Complex nominals with their adjectival modifiers
Other complex nominals
Nouns with their adjectival modifiers
Other nouns
Verbs
Adverbs
QFW word (skipped in all previous steps)
Other words
Passage Retrieval
IR engine retrieves documents using query terms
Segment the documents into shorter units
Passage ranking
Number of named entities of the right type in passage
Number of query words in passage
Number of question N-grams also in passage
Proximity of query keywords to each other in passage
Longest sequence of question words
Rank of the document containing passage
Features for ranking candidate answers
Answer type match
Pattern match
Question keywords
Keyword distance
Novelty factor
Apposition features
Punctuation location
Sequences of question terms
Common Evaluation Metrics
Accuracy
Mean Reciprocal Rank (MRR)
$$ MRR = \tfrac{\sum_{i=1}^N \tfrac{1}{rank_i}}{N} $$
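A small sketch of MRR: average, over questions, the reciprocal rank of the first correct answer (0 when no correct answer is returned); the ranks below are made up.

```python
def mrr(ranks_of_first_correct):
    """ranks: 1-based rank of the first correct answer per question, or None if absent."""
    return sum(1.0 / r if r else 0.0 for r in ranks_of_first_correct) / len(ranks_of_first_correct)

# Illustrative: first correct answer at rank 1, 3, none, and 2
print(mrr([1, 3, None, 2]))  # (1 + 1/3 + 0 + 1/2) / 4 ~ 0.458
```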
Summarization
Applications
Outlines or abstracts
Summaries
Action items
Simplifying
Three stages
Content selection
Information ordering
Sentence realization
Salient words
tf-idf
Topic signature
Mutual information
Log-likelihood ratio (LLR)
$$ weight(w_i)=\begin{cases}1, & \text{if } -2\log \lambda(w_i)>10 \\ 0, & \text{otherwise}\end{cases} $$
Supervised content selection problem
Hard to get labeled training data
Alignment is difficult
Performance not better than unsupervised algorithms
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Intrinsic metric for automatically evaluating summaries
Based on BLEU
Not as good as human evaluation
Much more convenient
$$ \text{ROUGE-2}=\tfrac{\sum_{S\in RefSummaries}\sum_{bigrams\ i\in S}\min(count(i,X),count(i,S))}{\sum_{S\in RefSummaries}\sum_{bigrams\ i\in S}count(i,S)} $$
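A minimal sketch of ROUGE-2 as defined above: clipped bigram matches between the candidate summary X and each reference summary S, divided by the total number of reference bigrams; the texts are made up.

```python
from collections import Counter

def bigrams(tokens):
    return Counter(zip(tokens, tokens[1:]))

def rouge_2(candidate, references):
    """Sum of clipped bigram matches over all references / total reference bigrams."""
    cand = bigrams(candidate)
    matched = total = 0
    for ref in references:
        ref_bg = bigrams(ref)
        matched += sum(min(cand[bg], n) for bg, n in ref_bg.items())
        total += sum(ref_bg.values())
    return matched / total

cand = "the cat sat on the mat".split()
refs = ["the cat was on the mat".split(), "a cat sat on a mat".split()]
print(round(rouge_2(cand, refs), 3))  # 0.5 for this toy example
```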
Maximal Marginal Relevance (MMR)
Iteratively (greedily) choose the next sentence that is:
Relevant: high cosine similarity to the query
Novel: low cosine similarity to the summary so far
Stop when the desired length is reached (a sketch of the greedy selection follows below)
Information Ordering
Chronological ordering
Coherence
Topical ordering
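A minimal sketch of the greedy MMR selection described above; the similarity function is a simple word-overlap stand-in (a real system would use cosine over tf-idf vectors), and the sentences and trade-off weight are made up.

```python
def mmr_select(sentences, query, sim, k=2, lam=0.5):
    """Greedy Maximal Marginal Relevance selection.

    At each step pick the candidate maximizing
        lam * sim(s, query) - (1 - lam) * max(sim(s, t) for t already chosen),
    i.e. relevant to the query but novel with respect to the summary so far.
    """
    summary = []
    candidates = list(sentences)
    while candidates and len(summary) < k:
        def mmr_score(s):
            redundancy = max((sim(s, t) for t in summary), default=0.0)
            return lam * sim(s, query) - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        summary.append(best)
        candidates.remove(best)
    return summary

# Illustrative word-overlap similarity (stand-in for cosine over tf-idf vectors)
def sim(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

sents = ["The storm hit the coast on Monday.",
         "The storm hit the coast early Monday morning.",
         "Thousands of residents were evacuated inland."]
print(mmr_select(sents, "storm coast evacuation", sim))
# Picks the first storm sentence, then the (novel) evacuation sentence
# rather than the near-duplicate second storm sentence.
```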