Natural Language Processing
[Based on Stanford NLP course]
Preface
Language Technology
Mostly solved:
Spam detection
Part-of-speech (POS) tagging
Named Entity Recognition (NER)
Making good progress:
Sentiment analysis
Co-reference resolution
Word sense disambiguation (WSD)
Parsing
Machine translation (MT)
Information Extraction (IE)
Still really hard:
Question answering (QA)
Paraphrase
Summarization
Dialog
Why is NLP difficult?
Non-standard language
Segmentation issues
Idioms
Neologisms
World knowledge
Tricky entity names
Basic skills
Regular Expressions
Tokenization
Word Normalization and Stemming
Classifiers:
Decision Tree
Logistic Regression
SVM
Neural Nets
Edit Distance
Used for:
Spell correction
Computational Biology
Basic operations:
Insertion
Deletion
Substitution
Algorithms:
Levenshtein
Backtrace
Needleman-Wunsch
Smith-Waterman
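A minimal sketch of the Levenshtein algorithm above, as a standard dynamic program with insertion and deletion cost 1; the substitution cost is left as a parameter (the course's variant often charges 2 for substitution).

```python
def levenshtein(source, target, sub_cost=1):
    """Minimum edit distance via dynamic programming.

    D[i][j] = cost of editing source[:i] into target[:j] using
    insertion (cost 1), deletion (cost 1), and substitution
    (cost `sub_cost`).
    """
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # delete all of source[:i]
    for j in range(1, m + 1):
        D[0][j] = j                      # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            D[i][j] = min(
                D[i - 1][j] + 1,                              # deletion
                D[i][j - 1] + 1,                              # insertion
                D[i - 1][j - 1] + (0 if same else sub_cost),  # copy / substitution
            )
    return D[n][m]

print(levenshtein("intention", "execution"))      # 5 with substitution cost 1
print(levenshtein("intention", "execution", 2))   # 8 with substitution cost 2
```

A backtrace over the same table (recording which of the three moves was taken) recovers the alignment itself.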
Language Model
Probabilistic Language Models
Machine translation
Spell correction
Speech Recognition
Summarization
Question answering
Markov Assumption
$$ P(\omega_1 \omega_2 \dots \omega_n) \approx \prod_i P(\omega_i | \omega_{i-k} \dots \omega_{i-1}) $$
Unigram Model
$$ P(\omega_1 \omega_2 \dots \omega_n) \approx \prod_i P(\omega_i) $$
Bigram Model
$$ P(\omega_i | \omega_1 \omega_2 \dots \omega_{i-1}) \approx P(\omega_i | \omega_{i-1}) $$
Add-k Smoothing
$$ P_{Add-k}(\omega_i|\omega_{i-1})=\tfrac{c(\omega_{i-1},\omega_i)+k}{c(\omega_{i-1})+kV} $$
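To make the add-k estimate concrete, a minimal sketch over a tiny illustrative corpus (the sentences and the value of k are made up for the example):

```python
from collections import Counter

corpus = ["<s> I am Sam </s>".split(),
          "<s> Sam I am </s>".split(),
          "<s> I do not like green eggs and ham </s>".split()]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))

V = len(unigrams)  # vocabulary size

def p_add_k(word, prev, k=0.5):
    """P_add-k(word | prev) = (c(prev, word) + k) / (c(prev) + k*V)."""
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * V)

print(p_add_k("am", "I"))   # smoothed P(am | I)
print(p_add_k("ham", "I"))  # an unseen bigram still gets nonzero mass
```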
Unigram prior smoothing
$$ P_{UnigramPrior}(\omega_i|\omega_{i-1})=\tfrac{c(\omega_{i-1},\omega_i)+m\,P(\omega_i)}{c(\omega_{i-1})+m} $$
Smoothing Algorithm
Good-Turing
Kneser-Ney
Witten-Bell
Spelling Correction
Tasks:
Spelling error detection
Spelling error correction
Autocorrect
Suggest a correction
Suggestion lists
Real-word spelling errors:
For each word $w$, generate a candidate set
Choose the best candidate
Find the correct word $w$
Candidate generation (see the sketch after this list):
Words with similar spelling
Words with similar pronunciation
Factors that could influence p(misspelling | word):
Source letter
Target letter
Surrounding letters
The position in the word
Nearby keys on the keyboard
Homology on the keyboard
Pronunciation
Likely morpheme transformations
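A minimal sketch of generating similar-spelling candidates by enumerating every string within one edit of the input, in the style of Norvig's well-known spell corrector; the example word and the small dictionary used for filtering are illustrative.

```python
import string

ALPHABET = string.ascii_lowercase

def edits1(word):
    """All strings one edit away from `word` (similar-spelling candidates)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutes = [L + c + R[1:] for L, R in splits if R for c in ALPHABET]
    inserts = [L + c + R for L, R in splits for c in ALPHABET]
    return set(deletes + transposes + substitutes + inserts)

# In a full corrector the candidates would be intersected with a dictionary
# and scored by a channel model p(misspelling | word) times a language model.
candidates = edits1("acress")
dictionary = {"across", "actress", "access", "acres", "caress", "cress"}
print(len(candidates))
print(candidates & dictionary)
```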
Text Classification
Used for:
Assigning subject categories, topics, or genres
Spam detection
Authorship identification
Age/gender identification
Language identification
Sentiment analysis…
Methods: Supervised Machine Learning
Naive Bayes
Logistic Regression
Support Vector Machines
k-Nearest Neighbors
Naive Bayes
$$ C_{MAP}=\arg\max_{c\in C}P(x_1,x_2,\dots,x_n|c)P(c) $$
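A minimal sketch of multinomial Naive Bayes with the Laplace (add-1) smoothing mentioned below, over bag-of-words counts; the tiny spam/ham training set is made up.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, class). Returns log-priors, add-1 smoothed
    per-class word log-likelihoods, and the vocabulary."""
    class_counts = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, c in docs:
        word_counts[c].update(tokens)
        vocab.update(tokens)
    V = len(vocab)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        log_lik[c] = {w: math.log((word_counts[c][w] + 1) / (total + V))
                      for w in vocab}
    return log_prior, log_lik, vocab

def classify(tokens, log_prior, log_lik, vocab):
    """C_MAP = argmax_c [ log P(c) + sum_i log P(x_i | c) ]; unseen words are skipped."""
    scores = {c: log_prior[c] + sum(log_lik[c][w] for w in tokens if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

train = [("buy cheap pills now".split(), "spam"),
         ("cheap viagra buy now".split(), "spam"),
         ("meeting schedule for tomorrow".split(), "ham"),
         ("project meeting notes attached".split(), "ham")]
model = train_nb(train)
print(classify("cheap pills".split(), *model))       # -> spam
print(classify("meeting tomorrow".split(), *model))  # -> ham
```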
Laplace (add-1) Smoothing
Used for Spam Filtering
Training data:
No training data: manually written rules
Very little data:
Use Naive Bayes
Get more labeled data
Try semi-supervised training methods
A reasonable amount of data:
All clever classifiers:
SVM
Regularized Logistic Regression
User-interpretable decision trees
A huge amount of data:
At a cost:
SVM (train time)
kNN (test time)
Regularized Logistic Regression can be somewhat better
So Naive Bayes can come back into its own
Tweaking performance:
Domain-specific
Collapse terms
Upweighting some words
F Measure
Precision: % of selected items that are correct
Recall: % of correct items that are selected
|              | correct | not correct |
|--------------|---------|-------------|
| selected     | tp      | fp          |
| not selected | fn      | tn          |
$$ F=\tfrac{1}{\alpha \tfrac{1}{P} +(1-\alpha)\tfrac{1}{R}}=\tfrac{(\beta^2+1)PR}{\beta^2 P+R} $$
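A small worked sketch of precision, recall, and the F measure from the contingency counts above; the tp/fp/fn numbers are made up.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """Precision, recall, and F_beta = (beta^2 + 1) P R / (beta^2 P + R)."""
    precision = tp / (tp + fp)   # fraction of selected items that are correct
    recall = tp / (tp + fn)      # fraction of correct items that are selected
    f = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

# Illustrative counts: 40 true positives, 10 false positives, 20 false negatives
print(f_measure(tp=40, fp=10, fn=20))  # P = 0.8, R ~ 0.667, F1 ~ 0.727
```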
Sentiment Analysis
Typology of Affective States
Emotion
Mood
Interpersonal stances
Attitudes (Sentiment Analysis)
Personality traits
Baseline Algorithm
Tokenization
Feature Extraction
Classification
Naive Bayes
MaxEnt (better)
SVM (better)
Issues
HTML, XML and other markup
Capitalization
Phone numbers, dates
Emoticons
Sentiment Lexicons
Semi-supervised learning of lexicons
Use a small amount of information to bootstrap a lexicon
Adjectives conjoined by “and” have the same polarity
Adjectives conjoined by “but” do not
Process:
Label a seed set of adjectives
Expand the seed set to conjoined adjectives
A supervised classifier assigns “polarity similarity” to each word pair
Clustering for partitioning
Turney Algorithm
Extract a phrasal lexicon from reviews
Learn the polarity of each phrase
Rate a review by the average polarity of its phrases
Advantages:
Domain-specific
More robust
Assuming classes have equal frequencies:
If not balanced: need to use F-scores
Severe imbalance can also degrade classifier performance
Solutions:
Resampling in training
Cost-sensitive learning
Penalize SVM more for misclassification of the rare class
Features:
Negation is important
Using all words works well for some tasks (NB)
Finding subsets of words may help in other tasks
Features
Joint and Discriminative
Joint (generative) models place probabilities over both the observed data and the hidden stuff: P(c,d)
N-gram models
Naive Bayes classifiers
Hidden Markov models
Probabilistic context-free grammars
IBM machine translation alignment models
Discriminative (conditional) models take the data as given and put a probability over hidden structure given the data: P(c|d)
Logistic regression
Conditional log-linear / Maximum Entropy models
Conditional Random Fields
SVMs
Perceptron
Features
Feature Expectations:
Empirical count
Model expectation
Feature-Based Models:
Text Categorization
Word-Sense Disambiguation
POS Tagging
Maximum Entropy
$$ \log P(C|D,\lambda)=\sum_{(c,d)\in (C,D)}\log P(c|d,\lambda)=\sum_{(c,d)\in(C,D)}\log \tfrac{\exp \sum_{i} \lambda_i f_i(c,d)}{\sum_{c'} \exp\sum_i \lambda_i f_i(c',d)} $$
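A minimal sketch of the conditional probability inside this log-likelihood: sum the weighted features for each class, exponentiate, and normalize (a softmax). The feature functions and weights below are illustrative, not from the course.

```python
import math

def maxent_prob(d, classes, features, weights):
    """P(c | d) = exp(sum_i lambda_i f_i(c, d)) / sum_c' exp(sum_i lambda_i f_i(c', d))."""
    scores = {c: sum(w * f(c, d) for f, w in zip(features, weights)) for c in classes}
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}

# Illustrative binary features for classifying a token d as LOCATION / PERSON / OTHER
features = [
    lambda c, d: 1.0 if c == "LOCATION" and d.endswith("burg") else 0.0,
    lambda c, d: 1.0 if c == "PERSON" and d[0].isupper() else 0.0,
    lambda c, d: 1.0 if c == "LOCATION" and d[0].isupper() else 0.0,
]
weights = [1.8, 0.6, 0.3]  # the lambda_i, normally learned by maximizing the log-likelihood

print(maxent_prob("Hamburg", ["LOCATION", "PERSON", "OTHER"], features, weights))
```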
Find the optimal parameters
Gradient descent (GD), stochastic gradient descent (SGD)
Iterative proportional fitting methods: Generalized Iterative Scaling (GIS) and Improved Iterative Scaling (IIS)
Conjugate gradient (CG), perhaps with preconditioning
Quasi-Newton methods: limited-memory variable metric (LMVM) methods, in particular L-BFGS
Feature Overlap
Maxent models handle overlapping features well
Unlike NB, there is no double counting
Feature Interaction
Maxent models handle overlapping features well, but do not automatically model feature interactions
If you want interaction terms, you have to add them
A disjunctive feature would also have done it
Smoothing:
Issues of scale
Lots of features
Lots of sparsity
Optimization problems
Methods:
Early stopping
Priors (MAP)
Regularization
Virtual Data
Count Cutoffs
Named Entity Recognition (NER)
The uses:
Named entities can be indexed, linked off, etc.
Sentiment can be attributed to companies or products
A lot of IE relations are associations between named entities
For question answering, answers are often named entities
Training:
Collect a set of representative training documents
Label each token for its entity class or other
Design feature extractors appropriate to the text and classes
Train a sequence classifier to predict the labels from the data
Inference
Greedy
Fast, no extra memory requirements
Very easy to implement
With rich features including observations to the right, it may perform quite well
Greedy, we may commit errors we cannot recover from
Beam
Fast, beam sizes of 3-5 are almost as good as exact inference in many cases
Easy to implement
Inexact: the globally best sequence can fall off the beam
Viterbi (see the sketch after this list)
Exact: the globally best sequence is returned
Harder to implement long-distance state-state interactions
CRFs
Training is slower, but CRFs avoid causal-competition biases
In practice they usually work much the same as MEMMs
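A minimal sketch of Viterbi decoding for sequence labeling, assuming a local log-score function score(prev_tag, tag, i) is already available (in an MEMM it would come from the trained classifier); the toy scorer and tag set below are made up.

```python
def viterbi(n_positions, tags, score):
    """Exact decoding: best[i][t] = max over prev of best[i-1][prev] + score(prev, t, i).

    `score(prev_tag, tag, i)` returns a (log-)score for assigning `tag` at
    position i given the previous tag (prev_tag is None at position 0).
    Returns the globally best tag sequence via a backtrace.
    """
    best = [{t: (score(None, t, 0), None) for t in tags}]
    for i in range(1, n_positions):
        col = {}
        for t in tags:
            prev, s = max(((p, best[i - 1][p][0] + score(p, t, i)) for p in tags),
                          key=lambda x: x[1])
            col[t] = (s, prev)
        best.append(col)
    tag = max(tags, key=lambda t: best[-1][t][0])   # best final tag
    path = [tag]
    for i in range(n_positions - 1, 0, -1):         # follow backpointers
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))

# Illustrative toy scorer for the sentence "John lives in Paris"
words = ["John", "lives", "in", "Paris"]
def toy_score(prev, tag, i):
    s = 1.0 if (tag == "PER" and words[i] == "John") else 0.0
    s += 1.0 if (tag == "LOC" and words[i] == "Paris") else 0.0
    s += 0.5 if (tag == "O" and words[i] in {"lives", "in"}) else 0.0
    s += 0.2 if prev == "O" and tag != "O" else 0.0   # mild transition preference
    return s

print(viterbi(len(words), ["PER", "LOC", "O"], toy_score))  # ['PER', 'O', 'O', 'LOC']
```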
Relation Extraction
How to build relation extractors
Hand-written patterns
High precision and low recall
Specific domains
A lot of work
Supervised machine learning
MaxEnt, Naive Bayes, SVM
Can get high accuracies with enough training data
Labeling a large training set is expensive
Brittle, don’t generalize well to different genres
Semi-supervised and unsupervised
Bootstrapping (using seeds)
Find sentences with these pairs
Look at the context between or around the pair and generalize the context to create patterns
Use the patterns to grep for more pairs
Distant supervision
Doesn’t require iteratively expanding patterns
Uses very large amounts of unlabeled data
Not sensitive to genre issues in the training corpus
Unsupervised learning from the web
Use parsed data to train a “trustworthy tuple” classifier
Single pass: extract all relations between NPs, keep if trustworthy
Assessor ranks relations based on text redundancy
POS Tagging
Performance
About 97% currently
But the baseline is already 90%
Partly easy because:
Many words are unambiguous
You get points for them and for punctuation marks
Difficulty
Ambiguous words
Common words can be ambiguous
Sources of information
Knowledge of neighboring words
Knowledge of word probabilities
Word
Lowercased word
Prefixes
Suffixes
Capitalization
Word shapes
Summary
The change from generative to discriminative models does not by itself result in great improvement
The higher accuracy of discriminative models comes at the price of much slower training
Parsing
Treebank
Reusability of the labor
Many parsers, POS taggers, etc.
Valuable resource for linguistics
Broad coverage
Frequencies and distributional information
A way to evaluate systems
Statistical parsing applications
High-precision question answering
Improving biological named entity finding
Syntactically based sentence compression
Extracting interaction in computer games
Helping linguists find data
Source sentence analysis for machine translation
Relation extraction systems
Phrase structure grammars (= context-free grammars, CFGs) in NLP
G = (T, C, N, S, L, R)
T is a set of terminal symbols
C is a set of preterminal symbols
N is a set of nonterminal symbols
S is the start symbol
L is the lexicon, a set of items of the form X -> x
R is the grammar, a set of items of the form X -> $\gamma$
e is the empty symbol
Probabilistic Parsing
Probabilistic (or stochastic) context-free grammars (PCFGs)
G = (T, N, S, R, P)
P is a probability function
Chomsky Normal Form
Reconstructing n-aries is easy
Reconstructing unaries/empties is trickier
Binarization is crucial for cubic-time CFG parsing
Cocke-Kasami-Younger (CKY) Constituency Parsing (see the sketch after this list)
Unaries can be incorporated into the algorithm
Empties can be incorporated
Binarization is vital
Performance
Robust
Partial solution for grammar ambiguity
Gives a probabilistic language model
The problem seems to be that PCFGs lack the lexicalization of a trigram model
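A minimal sketch of probabilistic CKY over a grammar already in Chomsky Normal Form (binary rules plus a lexicon); the toy PCFG and sentence are illustrative.

```python
from collections import defaultdict

# Toy CNF PCFG: binary rules (A -> B C, prob) and lexical rules (A -> word, prob)
binary = [("S", "NP", "VP", 1.0),
          ("VP", "V", "NP", 1.0),
          ("NP", "Det", "N", 0.6)]
lexicon = {"people": [("NP", 0.4)], "fish": [("V", 1.0)],
           "the": [("Det", 1.0)], "tanks": [("N", 1.0)]}

def cky(words):
    n = len(words)
    # score[i][j][A] = best probability of A spanning words[i:j]
    score = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    back = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                      # fill the lexical cells
        for A, p in lexicon.get(w, []):
            score[i][i + 1][A] = p
    for span in range(2, n + 1):                       # longer spans, bottom-up
        for i in range(0, n - span + 1):
            j = i + span
            for split in range(i + 1, j):
                for A, B, C, p in binary:
                    cand = p * score[i][split][B] * score[split][j][C]
                    if cand > score[i][j][A]:
                        score[i][j][A] = cand
                        back[i][j][A] = (split, B, C)
    def build(i, j, A):                                # reconstruct the best tree
        if A not in back[i][j]:
            return (A, words[i])
        split, B, C = back[i][j][A]
        return (A, build(i, split, B), build(split, j, C))
    return score[0][n]["S"], build(0, n, "S")

prob, tree = cky("people fish the tanks".split())
print(prob)   # probability of the best S parse
print(tree)
```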
Lexicalized Parsing
Charniak
Probabilistic conditioning is “top-down” like a regular PCFG, but actual parsing is bottom-up, somewhat like the CKY algorithm we saw
Non-Independence
The independence assumptions of a PCFG are often too strong
We can relax independence assumptions by encoding dependencies into the PCFG symbols, by state splitting (at the cost of sparseness)
Accurate Unlexicalized Parsing
Grammar rules are not systematically specified down to the level of lexical items
Closed vs. open class words
Learning Latent Annotations
Brackets are known
Base categories are known
Induce subcategories
Clever split/merge category refinement
EM, like Forward-Backward for HMMs, but constrained by the tree
Dependency Parsing
Methods
Dynamic programming (like in the CKY algorithm)
Graph algorithms
Constraint satisfaction
Deterministic parsing
Sources of information
Bilexical affinities
Dependency distance
Intervening material
Valency of heads
MaltParser (see the sketch after this list)
Greedy
Bottom-up
Has:
A stack
A buffer
A set of dependency arcs
A set of actions
Each action is predicted by a discriminative classifier (SVM)
No search
Provides close to state-of-the-art parsing performance
Provides very fast linear-time parsing
Projective
Dependencies from a CFG tree using heads must be projective
But dependency theory normally does allow non-projective structures to account for displaced constituents
The arc-eager algorithm only builds projective dependency trees
Stanford Dependencies
Projective
Can be generated by postprocessing headed phrase-structure parses, or by dependency parsers like MaltParser or the Easy-First Parser
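A minimal sketch of the stack/buffer/arc-set state that a MaltParser-style transition parser maintains. For simplicity it applies a hand-given arc-standard action sequence instead of asking a trained SVM classifier for the next action, so it only illustrates the data structures, not MaltParser's actual arc-eager system.

```python
def parse(words, actions):
    """Apply a sequence of arc-standard transitions.

    State: a stack, a buffer of remaining words, and a set of dependency
    arcs (head, dependent). In MaltParser the next action would be chosen
    by a discriminative classifier from features of this state; here the
    action sequence is simply given.
    """
    stack, buffer, arcs = ["ROOT"], list(words), []
    for act in actions:
        if act == "SHIFT":          # move the next word onto the stack
            stack.append(buffer.pop(0))
        elif act == "LEFT-ARC":     # top of stack becomes head of the item below it
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif act == "RIGHT-ARC":    # item below becomes head of the top of stack
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

# "Book that flight": Book -> flight -> that, ROOT -> Book
actions = ["SHIFT", "SHIFT", "SHIFT", "LEFT-ARC", "RIGHT-ARC", "RIGHT-ARC"]
print(parse(["Book", "that", "flight"], actions))
# [('flight', 'that'), ('Book', 'flight'), ('ROOT', 'Book')]
```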
Information Retrieval
Used for:
Web search
E-mail search
Searching your laptop
Corporate knowledge bases
Legal information retrieval
Classic search
User task
Info need
Query & Collection
Search engine
Results
Initial stages of text processing
Tokenization
Normalization
Stemming
Stop words
Query processing
AND
“Merge” algorithm (see the sketch after this list)
Phrase queries
Biword indexes
False positives
Bigger dictionary
Positional indexes
Extract inverted index entries for each distinct term
Merge their doc:position lists to enumerate all positions
Same general method for proximity searches
A positional index is 2-4 times as large as a non-positional index
Caveat: all of this holds for “English-like” languages
These two approaches can be profitably combined
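A minimal sketch of the “merge” algorithm referenced above for an AND query: walk two sorted postings lists in lockstep (the docIDs are illustrative).

```python
def intersect(p1, p2):
    """Intersect two sorted docID lists in O(len(p1) + len(p2))."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID appears in both postings lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

# Illustrative postings lists for two query terms
print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [1, 2, 4, 5, 6, 16, 57, 132]))
# -> [1, 2, 4]
```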
Ranked Retrieval
Advantages
Free text queries
Large result sets are not an issue
Query-document matching scores
Jaccard coefficient
Doesn’t consider term frequency
Length normalization needed
Bag of words model
Term frequency (tf)
Log-frequency weighting
Inverse document frequency (idf)
tf-idf weighting
$$ W_{t,d}=(1+\log tf_{t,d})\times \log_{10}(N/df_t) $$
Best known weighting scheme in information retrieval
Similarity: cosine(query, document)
$$ \cos(\vec q,\vec d)=\tfrac{\vec q \cdot \vec d}{|\vec q||\vec d|}=\tfrac{\vec q}{|\vec q|}\cdot \tfrac{\vec d}{|\vec d|}=\tfrac{\sum_{i=1}^{|V|}q_i d_i}{\sqrt{\sum_{i=1}^{|V|}q_i^2}\sqrt{\sum_{i=1}^{|V|}d_i^2}} $$
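A minimal sketch tying the two formulas together: build tf-idf vectors with $(1+\log_{10} tf)\times\log_{10}(N/df)$, then rank documents by cosine similarity to the query. The three-document collection is made up.

```python
import math
from collections import Counter

docs = {"d1": "new york times".split(),
        "d2": "new york post".split(),
        "d3": "los angeles times".split()}
N = len(docs)
df = Counter(t for tokens in docs.values() for t in set(tokens))  # document frequency

def tfidf_vector(tokens):
    """w_{t,d} = (1 + log10 tf) * log10(N / df_t); terms unseen in the collection are dropped."""
    tf = Counter(tokens)
    return {t: (1 + math.log10(c)) * math.log10(N / df[t])
            for t, c in tf.items() if t in df}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

query = tfidf_vector("new new times".split())
for name, tokens in docs.items():
    print(name, round(cosine(query, tfidf_vector(tokens)), 3))
```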
Weighting
Many search engines allow for different weightings for queries vs. documents
A very standard weighting scheme is lnc.ltc
Document: logarithmic tf, no idf, and cosine normalization
Query: logarithmic tf, idf, cosine normalization
Evaluation
Mean average precision (MAP)
Semantics
Situation
Reminder: lemma and wordform
Homonymy
Homographs
Homophones
Polysemy
Synonyms
Antonyms
Hyponymy and Hypernymy
Hyponyms and Instances
Applications of Thesauri and Ontologies
Information Extraction
Information Retrieval
Question Answering
Bioinformatics and Medical Informatics
Machine Translation
Word Similarity
Synonymy and similarity
Similarity algorithms
Thesaurus-based algorithms
Distributional algorithms
Thesaurus-based similarity
$LCS(c_1,c_2)=$ the most informative (lowest) node in the hierarchy subsuming both $c_1$ and $c_2$
$$ Sim_{path}(c_1,c_2)=\tfrac{1}{pathlen(c_1,c_2)} $$
$$ Sim_{resnik}(c_1,c_2)=-\log P(LCS(c_1,c_2)) $$
$$ Sim_{lin}(c_1,c_2)=\tfrac{2\log P(LCS(c_1,c_2))}{\log P(c_1)+\log P(c_2)} $$
$$ Sim_{jiangconrath}(c_1,c_2)=\tfrac{1}{\log P(c_1)+\log P(c_2)-2\log P(LCS(c_1,c_2))} $$
$$ Sim_{eLesk}(c_1,c_2)=\sum_{r,q\in RELS}overlap(gloss(r(c_1)),gloss(q(c_2))) $$
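A minimal sketch of the path and Resnik measures over a tiny hand-built is-a hierarchy with made-up P(c) values; a real implementation would compute these over WordNet.

```python
import math

# Toy is-a hierarchy (child -> parent) and made-up concept probabilities P(c)
parent = {"nickel": "coin", "dime": "coin", "coin": "money",
          "credit-card": "money", "money": "entity"}
P = {"nickel": 0.01, "dime": 0.01, "coin": 0.05, "credit-card": 0.02,
     "money": 0.1, "entity": 1.0}

def ancestors(c):
    """c plus all of its hypernyms up to the root, in order."""
    chain = [c]
    while c in parent:
        c = parent[c]
        chain.append(c)
    return chain

def lcs(c1, c2):
    """Lowest common subsumer: first ancestor of c1 that also subsumes c2."""
    a2 = set(ancestors(c2))
    return next(c for c in ancestors(c1) if c in a2)

def sim_path(c1, c2):
    common = lcs(c1, c2)
    pathlen = ancestors(c1).index(common) + ancestors(c2).index(common) + 1  # edges + 1
    return 1.0 / pathlen

def sim_resnik(c1, c2):
    return -math.log(P[lcs(c1, c2)])

print(lcs("nickel", "credit-card"))            # money
print(sim_path("nickel", "dime"))              # 1/3
print(round(sim_resnik("nickel", "dime"), 3))  # -log P(coin)
```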
Evaluating
Intrinsic
Correlation between algorithm and human word similarity ratings
Extrinsic (task-based, end-to-end)
Malapropism (spelling error) detection
WSD
Essay grading
Taking TOEFL multiple-choice vocabulary tests
Problems
We don’t have a thesaurus for every language
Recall:
Missing words
Missing phrases
Missing connections between senses
Works less well for verbs and adjectives
Distributional models of meaning
For the term-document matrix: tf-idf
For the term-context matrix: Positive Pointwise Mutual Information (PPMI) is common
$$ PMI(w_1,w_2)=\log_2\tfrac{P(w_1,w_2)}{P(w_1)P(w_2)} $$
PMI is biased toward infrequent events
Various weighting schemes
Add-one smoothing
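A minimal sketch of PPMI from a word-context count table (the counts are illustrative); the add-one smoothing mentioned above would simply be applied to the counts before normalizing.

```python
import math
from collections import defaultdict

# Illustrative word-context co-occurrence counts
counts = {("apricot", "pinch"): 1, ("apricot", "sugar"): 1,
          ("pineapple", "pinch"): 1, ("pineapple", "sugar"): 1,
          ("digital", "data"): 2, ("information", "data"): 6,
          ("information", "result"): 4}

total = sum(counts.values())
p_w = defaultdict(float)   # marginal P(w)
p_c = defaultdict(float)   # marginal P(c)
for (w, c), n in counts.items():
    p_w[w] += n / total
    p_c[c] += n / total

def ppmi(w, c):
    """PPMI(w, c) = max(0, log2 [ P(w, c) / (P(w) P(c)) ])."""
    p_wc = counts.get((w, c), 0) / total
    if p_wc == 0:
        return 0.0
    return max(0.0, math.log2(p_wc / (p_w[w] * p_c[c])))

print(round(ppmi("information", "data"), 3))
print(round(ppmi("apricot", "sugar"), 3))
```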
Question Answering
Question processing
Detect question type, answer type (NER), focus, relations
Formulate queries to send to a search engine
Passage Retrieval
Retrieve ranked documents
Break into suitable passages and re-rank
Answer processing
Extract candidate answers
Rank candidates
Approaches
Knowledge-based
Build a semantic representation of the query
Map from this semantics to query structured data or resources
Hybrid
Build a shallow semantic representation of the query
Generate answer candidates using IR methods
Score each candidate using richer knowledge sources
Answer type taxonomy
6 coarse classes
Abbreviation
Entity
Description
Human
Location
Numeric
50 finer classes
Detection
Hand-written rules
Machine Learning
Hybrids
Features
Question words and phrases
Part-of-speech tags
Parse features
Named Entities
Semantically related words
Keyword selection
Non-stop words
NNP words in recognized named entities
Complex nominals with their adjectival modifiers
Other complex nominals
Nouns with their adjectival modifiers
Other nouns
Verbs
Adverbs
QFW word (skipped in all previous steps)
Other words
Passage Retrieval
IR engine retrieves documents using query terms
Segment the documents into shorter units
Passage ranking
Number of named entities of the right type in passage
Number of query words in passage
Number of question N-grams also in passage
Proximity of query keywords to each other in passage
Longest sequence of question words
Rank of the document containing passage
Features for ranking candidate answers
Answer type match
Pattern match
Question keywords
Keyword distance
Novelty factor
Apposition features
Punctuation location
Sequences of question terms
Common Evaluation Metrics
Accuracy
Mean Reciprocal Rank (MRR)
$$ MRR = \tfrac{\sum_{i=1}^N \tfrac{1}{rank_i}}{N} $$
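A small sketch of MRR: average, over questions, the reciprocal rank of the first correct answer (0 when no correct answer is returned); the ranks below are made up.

```python
def mrr(ranks_of_first_correct):
    """ranks: 1-based rank of the first correct answer per question, or None if absent."""
    return sum(1.0 / r if r else 0.0 for r in ranks_of_first_correct) / len(ranks_of_first_correct)

# Illustrative: first correct answer at rank 1, 3, none, and 2
print(mrr([1, 3, None, 2]))  # (1 + 1/3 + 0 + 1/2) / 4 ~ 0.458
```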
Summarization
Applications
Outlines or abstracts
Summaries
Action items
Simplifying
Three stages
Content selection
Information ordering
Sentence realization
Salient words
tf-idf
Topic signature
Mutual information
Log-likelihood ratio (LLR)
$$ weight(w_i)=\begin{cases}1, & \text{if } -2\log \lambda(w_i)>10 \\ 0, & \text{otherwise}\end{cases} $$
Supervised content selection problem
Hard to get labeled training data
Alignment is difficult
Performance not better than unsupervised algorithms
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Intrinsic metric for automatically evaluating summaries
Based on BLEU
Not as good as human evaluation
Much more convenient
$$ \text{ROUGE-2}=\tfrac{\sum_{S\in RefSummaries}\sum_{bigrams\ i\in S}\min(count(i,X),count(i,S))}{\sum_{S\in RefSummaries}\sum_{bigrams\ i\in S}count(i,S)} $$
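A minimal sketch of ROUGE-2 as defined above: clipped bigram matches between the candidate summary X and each reference summary S, divided by the total number of reference bigrams; the texts are made up.

```python
from collections import Counter

def bigrams(tokens):
    return Counter(zip(tokens, tokens[1:]))

def rouge_2(candidate, references):
    """Sum of clipped bigram matches over all references / total reference bigrams."""
    cand = bigrams(candidate)
    matched = total = 0
    for ref in references:
        ref_bg = bigrams(ref)
        matched += sum(min(cand[bg], n) for bg, n in ref_bg.items())
        total += sum(ref_bg.values())
    return matched / total

cand = "the cat sat on the mat".split()
refs = ["the cat was on the mat".split(), "a cat sat on a mat".split()]
print(round(rouge_2(cand, refs), 3))  # 0.5 for this toy example
```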
Maximal Marginal Relevance (MMR)
Iteratively (greedily) choose the next sentence that is:
Relevant: high cosine similarity to the query
Novel: low cosine similarity to the summary so far
Stop when the desired length is reached (a sketch of the greedy selection follows below)
Information Ordering
Chronological ordering
Coherence
Topical ordering
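A minimal sketch of the greedy MMR selection described above; the similarity function is a simple word-overlap stand-in (a real system would use cosine over tf-idf vectors), and the sentences and trade-off weight are made up.

```python
def mmr_select(sentences, query, sim, k=2, lam=0.5):
    """Greedy Maximal Marginal Relevance selection.

    At each step pick the candidate maximizing
        lam * sim(s, query) - (1 - lam) * max(sim(s, t) for t already chosen),
    i.e. relevant to the query but novel with respect to the summary so far.
    """
    summary = []
    candidates = list(sentences)
    while candidates and len(summary) < k:
        def mmr_score(s):
            redundancy = max((sim(s, t) for t in summary), default=0.0)
            return lam * sim(s, query) - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        summary.append(best)
        candidates.remove(best)
    return summary

# Illustrative word-overlap similarity (stand-in for cosine over tf-idf vectors)
def sim(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

sents = ["The storm hit the coast on Monday.",
         "The storm hit the coast early Monday morning.",
         "Thousands of residents were evacuated inland."]
print(mmr_select(sents, "storm coast evacuation", sim))
# Picks the first storm sentence, then the (novel) evacuation sentence
# rather than the near-duplicate second storm sentence.
```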