The tokenization logic used by sklearn's sklearn.feature_extraction.text, extracted as a standalone function:
import re

def build_tokenizer(doc):
    # sklearn's default token_pattern: match runs of two or more word characters
    token_pattern = re.compile(r"(?u)\b\w\w+\b")
    return token_pattern.findall(doc)

tokens = build_tokenizer("you like who? who-is-you")
print(tokens)
Output:
['you', 'like', 'who', 'who', 'is', 'you']
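Note that single-character tokens (and punctuation) are dropped, and hyphens split words, because \w\w+ only matches runs of two or more word characters. For comparison, a minimal sketch of getting the same tokenizer through sklearn's public API; it assumes the default token_pattern of CountVectorizer, which is this same regex:

# Sketch: obtain the equivalent tokenizer from CountVectorizer itself.
from sklearn.feature_extraction.text import CountVectorizer

# build_tokenizer() returns a callable built from the vectorizer's token_pattern,
# which defaults to r"(?u)\b\w\w+\b", matching the hand-written function above.
tokenize = CountVectorizer().build_tokenizer()
print(tokenize("you like who? who-is-you"))
# expected: ['you', 'like', 'who', 'who', 'is', 'you']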