python--stanfordcorenlp
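Stanford CoreNLP is an NLP toolkit. It is written in Java, but a Python wrapper is now available as well. A while ago I tried using it from Python.
First, bring in the stanfordcorenlp package.
Import it in your Python file: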
from stanfordcorenlp import StanfordCoreNLP
The stanfordcorenlp package exposes a single class, StanfordCoreNLP.
Creating a StanfordCoreNLP object:
The constructor takes a path to the folder that holds the CoreNLP jar files. The folder can be downloaded from: https://stanfordnlp.github.io/CoreNLP/download.html
I used: stanford-corenlp-full-2016-10-31
nlp = StanfordCoreNLP(path) # path is the location of the stanford-corenlp-full-2016-10-31 folder
Usage
Using it is very simple:
from stanfordcorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP(path)
sentence = "i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor ."
print(nlp.dependency_parse(sentence))
nlp.close()
Running it directly, however, may fail:
$ python WordFormation.py
Traceback (most recent call last):
File "WordFormation.py", line 1, in <module>
from stanfordcorenlp import StanfordCoreNLP
ModuleNotFoundError: No module named 'stanfordcorenlp'
or:
PermissionError: [Errno 1] Operation not permitted
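Running with root privileges got past this for me, and the dependencies were returned. (A ModuleNotFoundError generally means the package is not installed for the interpreter being used; installing it with pip install stanfordcorenlp addresses that case.)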
$ sudo python WordFormation.py
[('ROOT', 0, 3), ('nsubj', 3, 1), ('aux', 3, 2), ('det', 5, 4), ('dobj', 3, 5), ('case', 9, 6), ('advmod', 8, 7), ('nummod', 9, 8), ('nmod', 5, 9), ('advmod', 3, 10), ('cc', 3, 11), ('nsubj', 14, 12), ('advmod', 14, 13), ('conj', 3, 14), ('advmod', 14, 15), ('case', 18, 16), ('det', 18, 17), ('nmod', 14, 18), ('case', 23, 19), ('det', 23, 20), ('amod', 23, 21), ('compound', 23, 22), ('nmod', 18, 23), ('case', 26, 24), ('det', 26, 25), ('nmod', 23, 26), ('punct', 3, 27)]
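Each tuple is (relation, governor index, dependent index). The numbers are word positions in the sentence, counted from 1; the 0 in ('ROOT', 0, 3) does not refer to a word in the sentence but to the artificial root node.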
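To make the indexing concrete, here is a small sketch that turns the first few tuples from the output above into a readable form. It assumes the sentence is already space-tokenized, so that str.split() yields the same tokens CoreNLP indexed; for ordinary text that assumption would not hold. The helper names word_at and readable are illustrative.

```python
# The example sentence and the first five triples from the output above.
sentence = ("i 've had the player for about 2 years now and it still performs "
            "nicely with the exception of an occasional wwhhhrrr sound from the motor .")
deps = [('ROOT', 0, 3), ('nsubj', 3, 1), ('aux', 3, 2), ('det', 5, 4), ('dobj', 3, 5)]

# Assumption: whitespace splitting reproduces CoreNLP's tokens for this sentence.
tokens = sentence.split()

def word_at(index):
    """Map a 1-based index to its token; index 0 is the artificial ROOT node."""
    return "ROOT" if index == 0 else tokens[index - 1]

def readable(triples):
    """Render (relation, governor, dependent) triples as relation(gov-i, dep-j)."""
    return [f"{rel}({word_at(gov)}-{gov}, {word_at(dep)}-{dep})"
            for rel, gov, dep in triples]

for line in readable(deps):
    print(line)
```

The first printed line is ROOT(ROOT-0, had-3), which shows that index 3 points at "had" and index 0 at no word at all.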
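StanfordCoreNLP offers other features too, such as part-of-speech tagging.
However, I could not find a method on the StanfordCoreNLP class that returns finer-grained dependencies. For example, for a nominal modifier (nmod) I only get the bare label nmod, not the form nmod:for that includes the preposition.
Since I found no suitable method, I decided to try Java instead.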
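For readers who want to stay in Python, one possible workaround is to rebuild the nmod:preposition labels from the basic dependencies: each nmod dependent carries its preposition as a case child, so the two tuples can be joined. This is only a post-processing sketch, not something the wrapper provides, and it only approximates CoreNLP's collapsed output (multi-word prepositions, for instance, are not handled). The function name attach_case_markers is hypothetical, and the tokens are again obtained by whitespace splitting, which is an assumption that holds for this pre-tokenized sentence.

```python
sentence = ("i 've had the player for about 2 years now and it still performs "
            "nicely with the exception of an occasional wwhhhrrr sound from the motor .")
tokens = sentence.split()  # assumption: matches CoreNLP's tokenization here

# A subset of the basic dependencies printed above: (relation, governor, dependent).
basic = [('ROOT', 0, 3), ('dobj', 3, 5),
         ('case', 9, 6), ('nmod', 5, 9),
         ('case', 18, 16), ('nmod', 14, 18),
         ('case', 23, 19), ('nmod', 18, 23),
         ('case', 26, 24), ('nmod', 23, 26)]

def attach_case_markers(deps, tokens):
    """Relabel each nmod as nmod:<word>, using the case child of its dependent."""
    # Map: index of the noun -> the word of its case marker (the preposition).
    case_word = {gov: tokens[dep - 1] for rel, gov, dep in deps if rel == 'case'}
    relabeled = []
    for rel, gov, dep in deps:
        if rel == 'nmod' and dep in case_word:
            rel = f"nmod:{case_word[dep]}"
        relabeled.append((rel, gov, dep))
    return relabeled

enhanced = attach_case_markers(basic, tokens)
for triple in enhanced:
    print(triple)
```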
JAVA--stanfordcorenlp
In Java the code is somewhat more involved.
First, add the required jars.
Since my project is a Maven project, I added the following to pom.xml:
<properties>
<corenlp.version>3.9.2</corenlp.version>
</properties>
<dependencies>
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>${corenlp.version}</version>
</dependency>
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>${corenlp.version}</version>
<classifier>models</classifier>
</dependency>
</dependencies>
Then the code to run:
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class StanfordEnglishNlpExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Configure the pipeline
        props.put("annotators", "tokenize,ssplit,pos,parse,depparse");
        props.put("tokenize.options", "ptb3Escaping=false");
        props.put("parse.maxlen", "10000");
        props.put("depparse.extradependencies", "SUBJ_ONLY");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props); // create the StanfordCoreNLP object
        String str = "i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .";
        Annotation document = new Annotation(str);
        pipeline.annotate(document);
        // Take the first (and only) sentence of the document
        CoreMap sentence = document.get(CoreAnnotations.SentencesAnnotation.class).get(0);
        // Get the dependency graph
        SemanticGraph dependency_graph = sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
        // Print the relations directly
        System.out.println("\n\nDependency Graph: " + dependency_graph.toString(SemanticGraph.OutputFormat.LIST));
    }
}
Result:
Dependency Graph: root(ROOT-0, had-3)
nsubj(had-3, i-1)
aux(had-3, 've-2)
det(player-5, the-4)
dobj(had-3, player-5)
case(years-9, for-6)
advmod(2-8, about-7)
nummod(years-9, 2-8)
nmod:for(player-5, years-9)
advmod(had-3, now-10)
cc(had-3, and-11)
nsubj(performs-14, it-12)
advmod(performs-14, still-13)
conj:and(had-3, performs-14)
advmod(performs-14, nicely-15)
case(exception-18, with-16)
det(exception-18, the-17)
nmod:with(performs-14, exception-18)
case(sound-23, of-19)
det(sound-23, an-20)
amod(sound-23, occasional-21)
compound(sound-23, wwhhhrrr-22)
nmod:of(exception-18, sound-23)
case(motor-26, from-24)
det(motor-26, the-25)
nmod:from(sound-23, motor-26)
punct(had-3, .-27)
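Here we can see dependencies such as nmod:with and nmod:of.
But there is still a problem.
Suppose we want the individual dependency relations from the SemanticGraph object dependency_graph in the code above:
List<SemanticGraphEdge> list = dependency_graph.edgeListSorted();
It turns out that this list does not contain the root relation.
To get the root relation, you have to fetch the roots from dependency_graph separately, so the root no longer sits at its natural position in the ordering.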
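The ordering issue can be illustrated independently of the Java API. The sketch below uses plain Python tuples rather than CoreNLP objects: the edge list lacks the root, the root indices arrive separately, and the two are merged and re-sorted by dependent position. The data is the first part of the example sentence, and the function name with_roots is hypothetical.

```python
# Edges as (relation, governor index, dependent index); the root is missing,
# as it would be in a list built from graph edges only.
edges = [('nsubj', 3, 1), ('aux', 3, 2), ('det', 5, 4), ('dobj', 3, 5)]

# Root word indices, obtained separately from the graph.
roots = [3]

def with_roots(edges, roots):
    """Add a root(ROOT-0, word) entry per root and sort by dependent position."""
    merged = edges + [('root', 0, r) for r in roots]
    return sorted(merged, key=lambda e: e[2])

ordered = with_roots(edges, roots)
for rel, gov, dep in ordered:
    print(rel, gov, dep)
```

After the merge, the root entry lands between aux and det, at the position of the word it points to.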
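So I used a different approach.
To make the work more complete, I also do the part-of-speech tagging here.
Using the POS tagger requires the english-left3words-distsim.tagger file. It ships with stanford-corenlp-full-2016-10-31 and can be used directly, but a version mismatch between the referenced jars and the tagger file is likely to cause errors.
In fact, the stanford-corenlp models jar we added as a dependency already contains this tagger file, but reading it out takes a little work: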
URL url = new URL("jar:file:"+ path +
"!/edu/stanford/nlp/models/pos-tagger/english-left3words/" +
"english-left3words-distsim.tagger");
// path is the path to the jar; the part after "!" is the tagger file's path inside the jar
JarURLConnection jarURLConnection = (JarURLConnection) url.openConnection();
The MaxentTagger constructor accepts either a path or an InputStream, so the stream from the jar connection can be passed in directly:
MaxentTagger tagger = new MaxentTagger(jarURLConnection.getInputStream());
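The jar:file:...!/... URL used above has two parts: the archive on disk and the entry inside it. Since a jar is a zip archive, the same lookup can be sketched in Python with the standard zipfile module. This is only an illustration of the URL structure, not a way to load the tagger; the helper names and the throwaway demo archive are made up for the example.

```python
import os
import tempfile
import zipfile

def split_jar_url(url):
    """Split 'jar:file:<archive>!/<entry>' into (archive path, entry name)."""
    prefix = "jar:file:"
    if not url.startswith(prefix):
        raise ValueError("not a jar:file: URL")
    archive, _, entry = url[len(prefix):].partition("!/")
    return archive, entry

def read_jar_entry(url):
    """Return the bytes of the entry that the jar URL points to."""
    archive, entry = split_jar_url(url)
    with zipfile.ZipFile(archive) as jar:
        return jar.read(entry)

# Demo with a small archive standing in for the models jar.
with tempfile.TemporaryDirectory() as tmp:
    demo_jar = os.path.join(tmp, "demo-models.jar")
    with zipfile.ZipFile(demo_jar, "w") as jar:
        jar.writestr("edu/stanford/nlp/models/demo.tagger", b"placeholder bytes")
    demo_url = "jar:file:" + demo_jar + "!/edu/stanford/nlp/models/demo.tagger"
    demo_bytes = read_jar_entry(demo_url)
    print(demo_bytes)
```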
With the tagger object in hand, the full program looks like this:
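import java.io.IOException;
import java.io.StringReader;
import java.net.JarURLConnection;
import java.net.URL;
import java.util.List;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.parser.nndep.DependencyParser;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.TypedDependency;

public class TaggerDependencyExample { // class name is illustrative
    // NOTE: path must hold the location of the stanford-corenlp models jar; it is not defined in this snippet.
    public static void main(String[] args) throws IOException {
        URL url = new URL("jar:file:"+ path +
                "!/edu/stanford/nlp/models/pos-tagger/english-left3words/" +
                "english-left3words-distsim.tagger");
        JarURLConnection jarURLConnection = (JarURLConnection) url.openConnection();
        MaxentTagger tagger = new MaxentTagger(jarURLConnection.getInputStream());
        DependencyParser parser = DependencyParser.loadFromModelFile(DependencyParser.DEFAULT_MODEL); // dependency parser
        String review = "i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .";
        String result = "[";
        DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(review)); // split the text into sentences
        for(List<HasWord> sentence: tokenizer){
            List<TaggedWord> tagged = tagger.tagSentence(sentence); // tag each word in the sentence
            GrammaticalStructure gs = parser.predict(tagged);
            List<TypedDependency> tdl = gs.typedDependenciesCCprocessed(); // get the dependencies
            for(TypedDependency td: tdl){
                result = result.concat(td.reln()+"("+td.gov()+", "+td.dep()+"),");
            }
        }
        System.out.println(result.substring(0,result.length()-1)+"]"); // drop the trailing comma
    }
}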
Result:
[nsubj(had/VBD, i/FW),aux(had/VBD, 've/VBP),root(ROOT, had/VBD),det(player/NN, the/DT),dobj(had/VBD, player/NN),case(years/NNS, for/IN),advmod(2/CD, about/IN),nummod(years/NNS, 2/CD),nmod:for(player/NN, years/NNS),advmod(had/VBD, now/RB),cc(had/VBD, and/CC),nsubj(performs/VBZ, it/PRP),advmod(performs/VBZ, still/RB),conj:and(had/VBD, performs/VBZ),advmod(performs/VBZ, nicely/RB),case(exception/NN, with/IN),det(exception/NN, the/DT),nmod:with(performs/VBZ, exception/NN),case(sound/NN, of/IN),det(sound/NN, an/DT),amod(sound/NN, occasional/JJ),compound(sound/NN, wwhhhrrr/NN),nmod:of(exception/NN, sound/NN),case(motor/NN, from/IN),det(motor/NN, the/DT),nmod:from(sound/NN, motor/NN),punct(had/VBD, ./.)]