Natural Language Processing Knowledge


    Natural Language Processing

    [Based on Stanford NLP course]

    Preface

    Language Technology

    Mostly solved:
    - Spam detection
    - Part-of-speech (POS) tagging
    - Named Entity Recognition (NER)

    Making good progress:
    - Sentiment analysis
    - Co-reference resolution
    - Word sense disambiguation (WSD)
    - Parsing
    - Machine translation (MT)
    - Information Extraction (IE)

    Still really hard:
    - Question answering (QA)
    - Paraphrase
    - Summarization
    - Dialog

    Why is NLP difficult?

    - Non-standard language
    - Segmentation issues
    - Idioms
    - Neologisms
    - World knowledge
    - Tricky entity names

    Basic skills

    - Regular expressions
    - Tokenization
    - Word normalization and stemming
    - Classifiers: decision trees, logistic regression, SVMs, neural nets

    Edit Distance

    Used for:
    - Spell correction
    - Computational biology

    Basic operations:
    - Insertion
    - Deletion
    - Substitution

    Algorithms (a minimal Levenshtein sketch follows this list):
    - Levenshtein distance with backtrace
    - Needleman-Wunsch
    - Smith-Waterman
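
    As a quick illustration, here is a minimal dynamic-programming sketch of Levenshtein distance with insertions, deletions, and substitutions each costing 1; the function name `levenshtein` is just for this example.

    ```python
    def levenshtein(s, t):
        """Minimum number of insertions, deletions, and substitutions to turn s into t."""
        m, n = len(s), len(t)
        prev = list(range(n + 1))          # distances between s[:i-1] and every prefix of t
        for i in range(1, m + 1):
            curr = [i] + [0] * n
            for j in range(1, n + 1):
                sub_cost = 0 if s[i - 1] == t[j - 1] else 1
                curr[j] = min(prev[j] + 1,             # deletion
                              curr[j - 1] + 1,         # insertion
                              prev[j - 1] + sub_cost)  # substitution
            prev = curr
        return prev[n]

    print(levenshtein("intention", "execution"))  # 5 with unit costs
    ```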

    Language Model

    Probabilistic Language Models

    - Machine translation
    - Spell correction
    - Speech recognition
    - Summarization
    - Question answering

    Markov Assumption

    $$ P(\omega_1 \omega_2 \dots \omega_n) \approx \prod_i P(\omega_i | \omega_{i-k} \dots \omega_{i-1}) $$

    Unigram Model

    $$ P(\omega_1 \omega_2 \dots \omega_n) \approx \prod_i P(\omega_i) $$

    Bigram Model

    $$ P(\omega_i | \omega_1 \omega_2 \dots \omega_{i-1}) \approx P(\omega_i | \omega_{i-1}) $$

    Add-k Smoothing

    $$ P_{Add-k}(\omega_i|\omega_{i-1})=\tfrac{c(\omega_{i-1},\omega_i)+k}{c(\omega_{i-1})+kV} $$
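
    To make the add-k estimate concrete, below is a small sketch of a bigram model with add-k smoothing; the toy corpus, the `bigram_addk` helper, and k = 0.5 are illustrative choices, not part of the course material.

    ```python
    from collections import Counter

    def bigram_addk(corpus, k=0.5):
        """Return a function estimating P(w_i | w_{i-1}) with add-k smoothing."""
        unigrams = Counter(corpus)
        bigrams = Counter(zip(corpus, corpus[1:]))
        vocab_size = len(unigrams)

        def prob(prev, word):
            # (c(prev, word) + k) / (c(prev) + k * V); assumes prev is in the vocabulary
            return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)

        return prob

    corpus = "the cat sat on the mat the cat ate".split()
    p = bigram_addk(corpus, k=0.5)
    print(p("the", "cat"))   # smoothed P(cat | the)
    print(p("mat", "cat"))   # unseen bigram still gets non-zero probability
    ```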

    Unigram prior smoothing

    $$ P_{UnigramPrior}(\omega_i|\omega_{i-1})=\tfrac{c(\omega_{i-1},\omega_i)+m P(\omega_i)}{c(\omega_{i-1})+m} $$

    Smoothing Algorithms

    - Good-Turing
    - Kneser-Ney
    - Witten-Bell

    Spelling Correction

    Tasks:
    - Spelling error detection
    - Spelling error correction: autocorrect, suggesting a correction, suggestion lists

    Real-word spelling errors:
    - For each word $w$, generate a candidate set
    - Choose the best candidate
    - Find the correct word $w$

    Candidate generation (see the sketch after this list):
    - Words with similar spelling
    - Words with similar pronunciation

    Factors that could influence p(misspelling | word):
    - Source letter
    - Target letter
    - Surrounding letters
    - Position in the word
    - Nearby keys on the keyboard
    - Homology on the keyboard
    - Pronunciation
    - Likely morpheme transformations
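
    As a rough sketch of candidate generation by spelling similarity, the function below enumerates all strings within one edit (deletion, transposition, substitution, insertion) of a word, in the style of Norvig's well-known spelling corrector; the restriction to the lowercase a-z alphabet is an assumption of this example.

    ```python
    import string

    def edits1(word):
        """All strings one edit away from `word` (deletes, transposes, replaces, inserts)."""
        letters = string.ascii_lowercase          # assumed alphabet for this sketch
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    # In a real corrector, candidates would be intersected with a dictionary
    # and ranked by P(word) * P(misspelling | word).
    print(len(edits1("acress")))
    ```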

    Text Classification

    Used for:

    - Assigning subject categories, topics, or genres
    - Spam detection
    - Authorship identification
    - Age/gender identification
    - Language identification
    - Sentiment analysis
    - …

    Methods: Supervised Machine Learning

    - Naive Bayes
    - Logistic Regression
    - Support Vector Machines
    - k-Nearest Neighbors

    Naive Bayes

    $$ C_{MAP}=\arg\max_{c\in C}P(x_1,x_2,\dots,x_n|c)P(c) $$

    - Laplace (add-1) smoothing (a toy implementation follows this list)
    - Used for spam filtering
    - Training data:
      - No training data: manually written rules
      - Very little data: use Naive Bayes; get more labeled data; try semi-supervised training methods
      - A reasonable amount of data: all the clever classifiers (SVM, regularized logistic regression), or user-interpretable decision trees
      - A huge amount of data: good accuracy at a cost: SVM (train time), kNN (test time); regularized logistic regression can be somewhat better than Naive Bayes
    - Tweaking performance: domain-specific features, collapsing terms, upweighting some words
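
    To tie the MAP rule and add-1 smoothing together, here is a compact multinomial Naive Bayes sketch for text classification; the tiny training set and function names are made up for illustration.

    ```python
    import math
    from collections import Counter, defaultdict

    def train_nb(docs):
        """docs: list of (token_list, label). Returns log-priors and smoothed log-likelihoods."""
        label_counts = Counter(label for _, label in docs)
        word_counts = defaultdict(Counter)          # per-class token counts
        vocab = set()
        for tokens, label in docs:
            word_counts[label].update(tokens)
            vocab.update(tokens)
        log_prior = {c: math.log(n / len(docs)) for c, n in label_counts.items()}
        log_likelihood = {}
        for c in label_counts:
            total = sum(word_counts[c].values())
            log_likelihood[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                                 for w in vocab}    # Laplace (add-1) smoothing
        return log_prior, log_likelihood, vocab

    def predict(tokens, log_prior, log_likelihood, vocab):
        scores = {c: log_prior[c] + sum(log_likelihood[c][w] for w in tokens if w in vocab)
                  for c in log_prior}
        return max(scores, key=scores.get)

    docs = [("buy cheap pills now".split(), "spam"),
            ("meeting agenda attached".split(), "ham"),
            ("cheap offer buy now".split(), "spam"),
            ("see you at the meeting".split(), "ham")]
    model = train_nb(docs)
    print(predict("cheap meeting pills".split(), *model))
    ```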

    F Measure

    Precision: % of selected items that are correct

    Recall: % of correct items that are selected

                   correct   not correct
    selected       tp        fp
    not selected   fn        tn

    $$ F=\tfrac{1}{\alpha \tfrac{1}{P} +(1-\alpha)\tfrac{1}{R}}=\tfrac{(\beta^2+1)PR}{\beta^2 P+R} $$
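
    A small sketch computing precision, recall, and the F-measure from the contingency counts above; the counts in the example call are invented.

    ```python
    def f_measure(tp, fp, fn, beta=1.0):
        """Precision, recall, and F_beta from contingency-table counts."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
        return precision, recall, f

    print(f_measure(tp=70, fp=30, fn=10))  # P = 0.7, R = 0.875, F1 ≈ 0.778
    ```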

    Sentiment Analysis

    Typology of affective states:
    - Emotion
    - Mood
    - Interpersonal stances
    - Attitudes (sentiment analysis)
    - Personality traits

    Baseline algorithm:
    - Tokenization
    - Feature extraction
    - Classification: Naive Bayes, MaxEnt (better), SVM (better)

    Issues:
    - HTML, XML and other markup
    - Capitalization
    - Phone numbers, dates
    - Emoticons

    Sentiment Lexicons

    Semi-supervised learning of lexicons:
    - Use a small amount of information to bootstrap a lexicon
    - Adjectives conjoined by "and" have the same polarity
    - Adjectives conjoined by "but" do not

    Process:
    - Label a seed set of adjectives
    - Expand the seed set to conjoined adjectives
    - A supervised classifier assigns "polarity similarity" to each word pair
    - Clustering for partitioning

    Turney algorithm:
    - Extract a phrasal lexicon from reviews
    - Learn the polarity of each phrase
    - Rate a review by the average polarity of its phrases

    Advantages:
    - Domain-specific
    - More robust

    Class frequencies (don't assume classes are balanced):
    - If not balanced, you need to use F-scores
    - Severe imbalance can also degrade classifier performance
    - Solutions: resampling in training; cost-sensitive learning (e.g. penalize the SVM more for misclassifying the rare class)

    Features:
    - Negation is important
    - Using all words works well for some tasks (Naive Bayes)
    - Finding subsets of words may help in other tasks

    Features

    Joint and Discriminative

    Joint (generative) models place probabilities over both the observed data and the hidden stuff: P(c,d)
    - N-gram models
    - Naive Bayes classifiers
    - Hidden Markov models
    - Probabilistic context-free grammars
    - IBM machine translation alignment models

    Discriminative (conditional) models take the data as given and put a probability over hidden structure given the data: P(c|d)
    - Logistic regression
    - Conditional loglinear / Maximum Entropy models
    - Conditional Random Fields
    - SVMs
    - Perceptron

    Features

    Feature expectations:
    - Empirical count
    - Model expectation

    Feature-based models:
    - Text categorization
    - Word-sense disambiguation
    - POS tagging

    Maximum Entropy

    $$ \log P(C|D,\lambda)=\sum_{(c,d)\in (C,D)}\log P(c|d,\lambda)=\sum_{(c,d)\in(C,D)}\log \tfrac{\exp \sum_{i} \lambda_i f_i(c,d)}{\sum_{c'} \exp\sum_i \lambda_i f_i(c',d)} $$

    Finding the optimal parameters (a gradient sketch follows this list):
    - Gradient descent (GD), stochastic gradient descent (SGD)
    - Iterative proportional fitting methods: Generalized Iterative Scaling (GIS) and Improved Iterative Scaling (IIS)
    - Conjugate gradient (CG), perhaps with preconditioning
    - Quasi-Newton methods: limited-memory variable metric (LMVM) methods, in particular L-BFGS

    Feature overlap:
    - Maxent models handle overlapping features well
    - Unlike Naive Bayes, there is no double counting

    Feature interaction:
    - Maxent models handle overlapping features well, but do not automatically model feature interactions
    - If you want interaction terms, you have to add them
    - A disjunctive feature would also have done it

    Smoothing:
    - Issues of scale: lots of features, lots of sparsity, optimization problems
    - Methods: early stopping, priors (MAP), regularization, virtual data, count cutoffs
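
    Below is a small numpy sketch of the maxent conditional log-likelihood and its gradient (empirical minus expected feature counts), the quantity GD/SGD or L-BFGS would optimize. It takes the multinomial-logistic-regression view with class-independent input features; the dense matrices and variable names are assumptions of this example.

    ```python
    import numpy as np

    def maxent_loglik_and_grad(lmbda, X, y):
        """Conditional log-likelihood sum_i log P(y_i | x_i, lambda) and its gradient.

        lmbda: (num_classes, num_features) weights
        X:     (num_examples, num_features) feature values
        y:     (num_examples,) integer class labels
        """
        scores = X @ lmbda.T                            # lambda . f for each class
        scores -= scores.max(axis=1, keepdims=True)     # numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)       # P(c | d, lambda)
        n = X.shape[0]
        loglik = float(np.log(probs[np.arange(n), y]).sum())
        empirical = np.zeros_like(lmbda)                # empirical feature counts
        np.add.at(empirical, y, X)
        expected = probs.T @ X                          # model expected feature counts
        return loglik, empirical - expected             # gradient of the log-likelihood

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))
    y = np.array([0, 1, 2, 1, 0])
    ll, grad = maxent_loglik_and_grad(np.zeros((3, 3)), X, y)
    print(ll, grad.shape)   # at lambda = 0, every class has probability 1/3
    ```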

    Named Entity Recognition (NER)

    Uses:
    - Named entities can be indexed, linked off, etc.
    - Sentiment can be attributed to companies or products
    - A lot of IE relations are associations between named entities
    - For question answering, answers are often named entities

    Training (a toy feature extractor is sketched below):
    - Collect a set of representative training documents
    - Label each token with its entity class or "other"
    - Design feature extractors appropriate to the text and classes
    - Train a sequence classifier to predict the labels from the data

    Inference:
    - Greedy: fast, no extra memory requirements; very easy to implement; with rich features including observations to the right, it may perform quite well; but it can make errors it cannot recover from
    - Beam: fast (beam sizes of 3-5 are almost as good as exact inference in many cases); easy to implement; inexact: the globally best sequence can fall off the beam
    - Viterbi: exact, the globally best sequence is returned; but long-distance state-state interactions are harder to incorporate
    - CRFs: training is slower, but CRFs avoid causal-competition biases; in practice they usually work much the same as MEMMs
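
    As an illustration of the "design feature extractors" step, here is a hypothetical per-token feature function of the kind a greedy or MEMM-style tagger might consume; the exact feature set is an assumption, not the course's.

    ```python
    def token_features(tokens, i, prev_label):
        """Features for token i, in the style used by MEMM/CRF sequence taggers."""
        word = tokens[i]
        return {
            "word": word,
            "lower": word.lower(),
            "prefix3": word[:3],
            "suffix3": word[-3:],
            "is_capitalized": word[0].isupper(),
            "word_shape": "".join("X" if c.isupper() else "x" if c.islower()
                                  else "d" if c.isdigit() else c for c in word),
            "prev_word": tokens[i - 1] if i > 0 else "<START>",
            "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<END>",
            "prev_label": prev_label,
        }

    print(token_features("Stanford is in California".split(), 0, "<START>"))
    ```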

    Relation Extraction

    How to build relation extractors:
    - Hand-written patterns: high precision and low recall; specific domains; a lot of work
    - Supervised machine learning (MaxEnt, Naive Bayes, SVM): can get high accuracies with enough training data; labeling a large training set is expensive; brittle, doesn't generalize well to different genres
    - Semi-supervised and unsupervised:
      - Bootstrapping (using seeds): find sentences with these pairs; look at the context between or around the pair and generalize it to create patterns; use the patterns to grep for more pairs
      - Distant supervision: doesn't require iteratively expanding patterns; uses very large amounts of unlabeled data; not sensitive to genre issues in the training corpus
      - Unsupervised learning from the web: use parsed data to train a "trustworthy tuple" classifier; in a single pass, extract all relations between NPs and keep the trustworthy ones; an assessor ranks relations based on text redundancy

    POS Tagging

    Performance:
    - About 97% currently, but the baseline is already 90%
    - Partly easy because many words are unambiguous, and you get points for them and for punctuation marks

    Difficulty:
    - Ambiguous words
    - Common words can be ambiguous

    Sources of information:
    - Knowledge of neighboring words
    - Knowledge of word probabilities
    - The word itself, the lowercased word, prefixes, suffixes, capitalization, word shapes

    Summary:
    - The change from a generative to a discriminative model does not by itself result in great improvement
    - The higher accuracy of discriminative models comes at the price of much slower training

    Parsing

    Treebank:
    - Reusability of the labor: many parsers, POS taggers, etc.; a valuable resource for linguistics
    - Broad coverage
    - Frequencies and distributional information
    - A way to evaluate systems

    Statistical parsing applications:
    - High-precision question answering
    - Improving biological named entity finding
    - Syntactically based sentence compression
    - Extracting interactions in computer games
    - Helping linguists find data
    - Source sentence analysis for machine translation
    - Relation extraction systems

    Phrase structure grammars (= context-free grammars, CFGs) in NLP: G = (T, C, N, S, L, R)
    - T is a set of terminal symbols
    - C is a set of preterminal symbols
    - N is a set of nonterminal symbols
    - S is the start symbol
    - L is the lexicon, a set of items of the form X -> x
    - R is the grammar, a set of items of the form X -> $\gamma$
    - e is the empty symbol

    Probabilistic Parsing

    Probabilistic (or stochastic) context-free grammars (PCFGs): G = (T, N, S, R, P), where P is a probability function

    Chomsky Normal Form:
    - Reconstructing n-aries is easy
    - Reconstructing unaries/empties is trickier
    - Binarization is crucial for cubic-time CFG parsing

    Cocke-Kasami-Younger (CKY) constituency parsing (a compact sketch follows this list):
    - Unaries can be incorporated into the algorithm
    - Empties can be incorporated
    - Binarization is vital

    Performance:
    - Robust
    - Partial solution for grammar ambiguity
    - Gives a probabilistic language model
    - The problem seems to be that PCFGs lack the lexicalization of a trigram model
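
    Here is a compact sketch of probabilistic CKY for a grammar already in Chomsky Normal Form (no unaries or empties, which the notes point out need extra handling); the toy grammar and data structures are invented for the example.

    ```python
    from collections import defaultdict

    def cky(words, lexicon, rules):
        """Probabilistic CKY for a CNF PCFG.

        lexicon: {terminal: [(preterminal, prob), ...]}
        rules:   {(B, C): [(A, prob), ...]} for binary rules A -> B C
        Returns best[i][j]: {label: best probability of label spanning words[i:j]}.
        """
        n = len(words)
        best = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
        for i, w in enumerate(words):                       # lexical cells
            for tag, p in lexicon.get(w, []):
                best[i][i + 1][tag] = max(best[i][i + 1][tag], p)
        for span in range(2, n + 1):                        # larger spans, bottom-up
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):                   # split point
                    for B, pb in best[i][k].items():
                        for C, pc in best[k][j].items():
                            for A, pr in rules.get((B, C), []):
                                best[i][j][A] = max(best[i][j][A], pr * pb * pc)
        return best

    lexicon = {"people": [("NP", 0.5)], "fish": [("NP", 0.2), ("V", 0.6)],
               "tanks": [("NP", 0.2)]}
    rules = {("V", "NP"): [("VP", 1.0)], ("NP", "VP"): [("S", 0.9)]}
    chart = cky("people fish tanks".split(), lexicon, rules)
    print(dict(chart[0][3]))   # best probability of an S spanning the whole sentence
    ```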

    Lexicalized Parsing

    Charniak:
    - Probabilistic conditioning is "top-down", like a regular PCFG, but the actual parsing is bottom-up, somewhat like the CKY algorithm

    Non-independence:
    - The independence assumptions of a PCFG are often too strong
    - We can relax independence assumptions by encoding dependencies into the PCFG symbols, by state splitting (at the cost of sparseness)

    Accurate unlexicalized parsing:
    - Grammar rules are not systematically specified down to the level of lexical items
    - Closed vs. open class words

    Learning latent annotations:
    - Brackets are known, base categories are known; induce the subcategories
    - Clever split/merge category refinement
    - EM, like Forward-Backward for HMMs, but constrained by the tree

    Dependency Parsing

    Methods:
    - Dynamic programming (as in the CKY algorithm)
    - Graph algorithms
    - Constraint satisfaction
    - Deterministic parsing

    Sources of information:
    - Bilexical affinities
    - Dependency distance
    - Intervening material
    - Valency of heads

    MaltParser:
    - Greedy, bottom-up
    - Has a stack, a buffer, a set of dependency arcs, and a set of actions
    - Each action is predicted by a discriminative classifier (SVM)
    - No search
    - Provides close to state-of-the-art parsing performance
    - Provides very fast, linear-time parsing

    Projectivity:
    - Dependencies from a CFG tree using heads must be projective
    - But dependency theory normally does allow non-projective structures to account for displaced constituents
    - The arc-eager algorithm only builds projective dependency trees

    Stanford Dependencies:
    - Projective
    - Can be generated by postprocessing headed phrase-structure parses, or by dependency parsers like MaltParser or the Easy-First Parser

    Information Retrieval

    Used for:
    - Web search
    - E-mail search
    - Searching your laptop
    - Corporate knowledge bases
    - Legal information retrieval

    Classic search

    User task -> Info need -> Query -> Search engine (over the document collection) -> Results

    Initial stages of text processing

    - Tokenization
    - Normalization
    - Stemming
    - Stop words

    Query processing

    AND queries: the "merge" algorithm on postings lists (a sketch follows this list)

    Phrase queries:
    - Biword indexes: false positives; bigger dictionary
    - Positional indexes: extract inverted index entries for each distinct term; merge their doc:position lists to enumerate all positions; the same general method works for proximity searches
    - A positional index is 2-4 times as large as a non-positional index
    - Caveat: all of this holds for "English-like" languages
    - These two approaches can be profitably combined
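
    To illustrate the AND "merge", here is a sketch of intersecting two sorted postings lists in linear time; the lists of document IDs are invented.

    ```python
    def intersect(p1, p2):
        """Linear-time intersection of two sorted postings lists of document IDs."""
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    brutus = [1, 2, 4, 11, 31, 45, 173, 174]
    caesar = [1, 2, 4, 5, 6, 16, 57, 132, 174]
    print(intersect(brutus, caesar))  # [1, 2, 4, 174]
    ```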

    Ranked Retrieval

    Advantages:
    - Free text queries
    - Large result sets are not an issue

    Query-document matching scores:
    - Jaccard coefficient: doesn't consider term frequency; length normalization needed
    - Bag of words model
    - Term frequency (tf), log-frequency weighting, inverse document frequency (idf)

    tf-idf weighting

    $$ W_{t,d}=(1+\log tf_{t,d})\times \log_{10}(N/df_t) $$

    Best known weighting scheme in information retrieval

    Similarity: cosine(query, document)

    $$ \cos(\vec q,\vec d)=\tfrac{\vec q \bullet \vec d}{|\vec q||\vec d|}=\tfrac{\vec q}{|\vec q|}\bullet \tfrac{\vec d}{|\vec d|}=\tfrac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\sqrt{\sum_{i=1}^{|V|} d_i^2}} $$
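
    A small sketch tying the two formulas together: tf-idf weighting of a toy corpus followed by cosine scoring of a query against each document; the corpus and helper names are illustrative.

    ```python
    import math
    from collections import Counter

    docs = ["new york times", "new york post", "los angeles times"]
    tokenized = [d.split() for d in docs]
    N = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))    # document frequency

    def tfidf_vector(tokens):
        tf = Counter(tokens)
        # W_{t,d} = (1 + log tf) * log10(N / df); terms unseen in the corpus are skipped
        return {t: (1 + math.log10(tf[t])) * math.log10(N / df[t])
                for t in tf if df.get(t)}

    def cosine(u, v):
        dot = sum(u[t] * v[t] for t in u if t in v)
        norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
        return dot / (norm(u) * norm(v)) if u and v else 0.0

    query = tfidf_vector("new new times".split())
    for text, doc in zip(docs, tokenized):
        print(text, round(cosine(query, tfidf_vector(doc)), 3))
    ```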

    Weighting

    - Many search engines allow different weightings for queries vs. documents
    - A very standard weighting scheme is lnc.ltc:
      - Document: logarithmic tf, no idf, and cosine normalization
      - Query: logarithmic tf, idf, cosine normalization

    Evaluation

    Mean average precision (MAP)

    Semantics

    Word Senses and Word Relations

    - Reminder: lemma and wordform
    - Homonymy: homographs, homophones
    - Polysemy
    - Synonyms
    - Antonyms
    - Hyponymy and hypernymy
    - Hyponyms and instances

    Applications of Thesauri and Ontologies

    - Information Extraction
    - Information Retrieval
    - Question Answering
    - Bioinformatics and Medical Informatics
    - Machine Translation

    Word Similarity

    - Synonymy and similarity
    - Similarity algorithms: thesaurus-based algorithms, distributional algorithms

    Thesaurus-based similarity

    $LCS(c_1,c_2)$ = the most informative (lowest) node in the hierarchy subsuming both $c_1$ and $c_2$

    $$ Sim_{path}(c_1,c_2)=\tfrac{1}{pathlen(c_1,c_2)} $$

    $$ Sim_{resnik}(c_1,c_2)=-\log P(LCS(c_1,c_2)) $$

    $$ Sim_{lin}(c_1,c_2)=\tfrac{2\log P(LCS(c_1,c_2))}{\log P(c_1)+\log P(c_2)} $$

    $$ Sim_{jiangconrath}(c_1,c_2)=\tfrac{1}{\log P(c_1)+\log P(c_2)-2\log P(LCS(c_1,c_2))} $$

    $$ Sim_{eLesk}(c_1,c_2)=\sum_{r,q\in RELS}overlap(gloss(r(c_1)),gloss(q(c_2))) $$
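
    If NLTK and its WordNet data are available, path-based and information-content similarities like those above can be computed along these lines; this sketch assumes the `wordnet` and `wordnet_ic` corpora have already been downloaded and is not part of the notes.

    ```python
    from nltk.corpus import wordnet as wn, wordnet_ic

    # Assumes nltk.download('wordnet') and nltk.download('wordnet_ic') have been run.
    brown_ic = wordnet_ic.ic('ic-brown.dat')

    dog = wn.synset('dog.n.01')
    cat = wn.synset('cat.n.01')

    print(dog.path_similarity(cat))          # 1 / pathlen
    print(dog.res_similarity(cat, brown_ic)) # Resnik: -log P(LCS)
    print(dog.lin_similarity(cat, brown_ic)) # Lin
    ```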

    Evaluation:
    - Intrinsic: correlation between the algorithm and human word similarity ratings
    - Extrinsic (task-based, end-to-end): malapropism (spelling error) detection, WSD, essay grading, taking TOEFL multiple-choice vocabulary tests

    Problems:
    - We don't have a thesaurus for every language
    - Recall: missing words, missing phrases, missing connections between senses; works less well for verbs and adjectives

    Distributional models of meaning

    - For the term-document matrix: tf-idf
    - For the term-context matrix: Positive Pointwise Mutual Information (PPMI) is common

    $$ PMI(w_1,w_2)=\log_2\tfrac{P(w_1,w_2)}{P(w_1)P(w_2)} $$

    PMI is biased toward infrequent events; common fixes are various weighting schemes and add-one smoothing.
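
    A short numpy sketch of PPMI computed from a word-context co-occurrence count matrix, with optional add-k smoothing of the counts; the tiny matrix is invented.

    ```python
    import numpy as np

    def ppmi(counts, add_k=0.0):
        """PPMI matrix from a (words x contexts) co-occurrence count matrix."""
        counts = counts + add_k                     # optional add-k smoothing of counts
        total = counts.sum()
        p_wc = counts / total
        p_w = p_wc.sum(axis=1, keepdims=True)       # row marginals P(w)
        p_c = p_wc.sum(axis=0, keepdims=True)       # column marginals P(c)
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log2(p_wc / (p_w * p_c))
        return np.maximum(pmi, 0.0)                 # keep only positive PMI

    counts = np.array([[0, 2, 1],
                       [3, 0, 0],
                       [1, 1, 4]], dtype=float)
    print(ppmi(counts, add_k=1.0).round(2))
    ```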

    Question Answering

    Question processing:
    - Detect question type, answer type (NER), focus, relations
    - Formulate queries to send to a search engine

    Passage retrieval:
    - Retrieve ranked documents
    - Break them into suitable passages and re-rank

    Answer processing:
    - Extract candidate answers
    - Rank candidates

    Approaches

    Knowledge-based:
    - Build a semantic representation of the query
    - Map from this semantics to query structured data or resources

    Hybrid:
    - Build a shallow semantic representation of the query
    - Generate answer candidates using IR methods
    - Score each candidate using richer knowledge sources

    Answer type taxonomy

    - 6 coarse classes: Abbreviation, Entity, Description, Human, Location, Numeric
    - 50 finer classes
    - Detection: hand-written rules, machine learning, hybrids
    - Features: question words and phrases, part-of-speech tags, parse features, named entities, semantically related words

    Keyword selection

    - Non-stop words
    - NNP words in recognized named entities
    - Complex nominals with their adjectival modifiers
    - Other complex nominals
    - Nouns with their adjectival modifiers
    - Other nouns
    - Verbs
    - Adverbs
    - QFW word (skipped in all previous steps)
    - Other words

    Passage Retrieval

    - The IR engine retrieves documents using query terms
    - Segment the documents into shorter units
    - Passage ranking:
      - Number of named entities of the right type in the passage
      - Number of query words in the passage
      - Number of question N-grams also in the passage
      - Proximity of query keywords to each other in the passage
      - Longest sequence of question words
      - Rank of the document containing the passage

    Features for ranking candidate answers

    - Answer type match
    - Pattern match
    - Question keywords
    - Keyword distance
    - Novelty factor
    - Apposition features
    - Punctuation location
    - Sequences of question terms

    Common Evaluation Metrics

    - Accuracy
    - Mean Reciprocal Rank (MRR)

    $$ MRR = \tfrac{\sum_{i=1}^N \tfrac{1}{rank_i}}{N} $$
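
    A tiny sketch of MRR over per-question ranks of the first correct answer (None meaning no correct answer was returned); the example ranks are made up.

    ```python
    def mean_reciprocal_rank(ranks):
        """ranks: rank of the first correct answer per question, or None if none was correct."""
        return sum(1.0 / r for r in ranks if r) / len(ranks)

    print(mean_reciprocal_rank([1, 3, None, 2]))  # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458
    ```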

    Summarization

    Applications:
    - Outlines or abstracts
    - Summaries
    - Action items
    - Simplifying

    Three stages:
    - Content selection
    - Information ordering
    - Sentence realization

    Salient words:
    - tf-idf
    - Topic signature: mutual information, log-likelihood ratio (LLR)

    $$ weight(w_i)=\begin{cases}1, & \text{if } -2\log \lambda(w_i)>10 \\ 0, & \text{otherwise}\end{cases} $$

    Supervised content selection is problematic:
    - Hard to get labeled training data
    - Alignment is difficult
    - Performance is not better than unsupervised algorithms

    ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
    - Intrinsic metric for automatically evaluating summaries, based on BLEU
    - Not as good as human evaluation, but much more convenient

    $$ ROUGE\text{-}2=\tfrac{\sum_{S\in RefSummaries}\ \sum_{bigram\ i\in S}\min(count(i,X),count(i,S))}{\sum_{S\in RefSummaries}\ \sum_{bigram\ i\in S}count(i,S)} $$

    where $X$ is the candidate summary being evaluated.
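
    A minimal sketch of ROUGE-2 as defined above, treating summaries as token lists and counting clipped bigram overlap against a set of reference summaries; names and example data are invented.

    ```python
    from collections import Counter

    def bigrams(tokens):
        return Counter(zip(tokens, tokens[1:]))

    def rouge_2(candidate, references):
        """Clipped bigram recall of `candidate` against a list of reference summaries."""
        cand = bigrams(candidate)
        overlap = total = 0
        for ref in references:
            ref_bg = bigrams(ref)
            overlap += sum(min(cand[bg], c) for bg, c in ref_bg.items())
            total += sum(ref_bg.values())
        return overlap / total if total else 0.0

    refs = ["the cat sat on the mat".split()]
    print(rouge_2("the cat lay on the mat".split(), refs))  # 3 of 5 reference bigrams matched
    ```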

    Maximal Marginal Relevance (MMR), sketched after this list:
    - Iteratively (greedily) pick sentences that are relevant (high cosine similarity to the query) and novel (low cosine similarity to the summary so far)
    - Stop when the desired length is reached

    Information ordering:
    - Chronological ordering
    - Coherence
    - Topical ordering
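
    A minimal greedy MMR sketch over precomputed sentence vectors; the cosine helper, the trade-off parameter `lam`, and the toy vectors are all assumptions of this example.

    ```python
    import numpy as np

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def mmr_select(sentences, vectors, query_vec, k, lam=0.7):
        """Greedily pick k sentences balancing query relevance against redundancy."""
        selected = []
        remaining = list(range(len(sentences)))
        while remaining and len(selected) < k:
            def score(i):
                relevance = cosine(vectors[i], query_vec)
                redundancy = max((cosine(vectors[i], vectors[j]) for j in selected), default=0.0)
                return lam * relevance - (1 - lam) * redundancy
            best = max(remaining, key=score)
            selected.append(best)
            remaining.remove(best)
        return [sentences[i] for i in selected]

    sentences = ["cats sleep a lot", "dogs bark loudly", "cats nap often"]
    vectors = [np.array([1.0, 0.1, 0.0]), np.array([0.0, 1.0, 0.2]), np.array([0.9, 0.2, 0.1])]
    print(mmr_select(sentences, vectors, query_vec=np.array([1.0, 0.0, 0.0]), k=2))
    ```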