Stanford CoreNLP 3.6.0 使用入门

xiaoxiao2022-06-28 56

Stanford CoreNLP由Java语言开发，是斯坦福大学自然语言处理小组的研究成果。包含自然语言处理的基本任务：分词、词性标注、依存句法分析、命名实体识别等等，支持多语言。项目地址：GitHub

本文主要记录使用注意事项。

##Stage 1 首先我们要下载 CoreNLP Java包。

然后在Eclipse里面的Classpath里面引入jar文件。如下图红色框所示： CoreNLP只需要引入jar文件就可以了，不用再引入其他文件。

然后代码送上：

package ... import java.util.ArrayList; import java.util.List; import java.util.Properties; import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation; import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation; import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation; import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation; import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation; import edu.stanford.nlp.ling.CoreLabel; import edu.stanford.nlp.pipeline.Annotation; import edu.stanford.nlp.pipeline.StanfordCoreNLP; import edu.stanford.nlp.util.CoreMap; public class NlpUtil { public static void main(String[] args) { // creates a StanfordCoreNLP object, with POS tagging, lemmatization, // NER, parsing, and coreference resolution Properties props = new Properties(); props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref"); StanfordCoreNLP pipeline = new StanfordCoreNLP(props); // read some text in the text variable String text = "She went to America last week.";// Add your text here! // create an empty Annotation just with the given text Annotation document = new Annotation(text); // run all Annotators on this text pipeline.annotate(document); // these are all the sentences in this document // a CoreMap is essentially a Map that uses class objects as keys and // has values with custom types List<CoreMap> sentences = document.get(SentencesAnnotation.class); List<String> words = new ArrayList<>(); List<String> posTags = new ArrayList<>(); List<String> nerTags = new ArrayList<>(); for (CoreMap sentence : sentences) { // traversing the words in the current sentence // a CoreLabel is a CoreMap with additional token-specific methods for (CoreLabel token : sentence.get(TokensAnnotation.class)) { // this is the text of the token String word = token.get(TextAnnotation.class); words.add(word); // this is the POS tag of the token String pos = token.get(PartOfSpeechAnnotation.class); posTags.add(pos); // this is the NER label of the token String ne = token.get(NamedEntityTagAnnotation.class); nerTags.add(ne); } } System.out.println(words.toString()); System.out.println(posTags.toString()); System.out.println(nerTags.toString()); } }

输出： [She, went, to, America, last, week, .] [PRP, VBD, TO, NNP, JJ, NN, .] [O, O, O, LOCATION, DATE, DATE, O]

就这样愉快的搞定。

##Stage 2 然而… 上面说的都是默认配置，默认是处理英文的，如果要进行中文处理，则还需要引入一个jar文件，下载stanford-chinese-corenlp-2016-01-19-models.jar：

Classpath：

然后代码送上：

package ... import edu.stanford.nlp.pipeline.Annotation; import edu.stanford.nlp.pipeline.StanfordCoreNLP; public class NlpUtil { public static void main(String[] args) { String text = "马飚向伦古转达了***主席的诚挚祝贺和良好祝愿。马飚表示，中国和赞比亚建交50多年来，双方始终真诚友好、平等相待，友好合作结出累累硕果，给两国人民带来了实实在在的利益。中方高度重视中赞关系发展，愿以落实两国元首共识和中非合作论坛约翰内斯堡峰会成果为契机，推动中赞关系再上新台阶。"; Annotation document = new Annotation(text); StanfordCoreNLP corenlp = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties"); corenlp.annotate(document); corenlp.prettyPrint(document, System.out); } }

部分输出（新窗口打开可查看大图）：

处理中文时经常出现OutOfMemoryError异常，设置运行时虚拟机堆大小就好了：

就这样，用Stanford CoreNLP处理中文也愉快地搞定了。

转载请注明原文地址: https://ju.6miu.com/read-1124587.html

专利

最新回复(0)