Counting Words

The difference between sort and sorted:

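sort is a method defined on list, while sorted can sort any iterable object.
The list sort method works on the existing list in place and returns None, whereas the built-in function sorted returns a new list and leaves the original untouched.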
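
To make the difference concrete, here is a minimal sketch. The list name nums and the sample values are made up for illustration.

```python
nums = [3, 1, 2]

# sorted() accepts any iterable and returns a new list; the input is unchanged
ordered = sorted(nums)
print(ordered)   # [1, 2, 3]
print(nums)      # [3, 1, 2]

# list.sort() reorders the list in place and returns None
result = nums.sort()
print(result)    # None
print(nums)      # [1, 2, 3]

# sorted() also works on non-list iterables, such as a tuple or a dict's items
print(sorted(('b', 'a', 'c')))   # ['a', 'b', 'c']
pairs = sorted({'x': 2, 'y': 5}.items(), key=lambda kv: kv[1], reverse=True)
print(pairs)     # [('y', 5), ('x', 2)]
```

A common slip is writing nums = nums.sort(), which replaces the list with None.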

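Here we read the text files in a folder, find the word that occurs most often in each one, treat it as that file's key word, and print it together with its count.
Steps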

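Check that the path is a folder.
Define a list of common words and prepositions to exclude, such as and, is and so on.
Use translate together with the string module to strip the punctuation, then split on whitespace to get a list of words. Walk through the list and store each word in a dictionary, adding 1 to its count each time it appears. If these two methods are unfamiliar, see my earlier article.
Besides using os.path.isfile to check that the entry is a file, use splitext (mentioned in an earlier article) to check that its extension is .txt (or whatever extension you choose).
Finally, sort the dictionary entries in descending order of count and take the first one.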
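
The steps above can be sketched as a few small helpers. The names split_words, is_txt and most_common_word, and the sample sentence, are made up for illustration and are not taken from the full listing. Note that is_txt only looks at the extension and does not check that the file exists.

```python
import os
import string

def split_words(text):
    """Remove ASCII punctuation, lowercase, then split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).lower().split()

def is_txt(path):
    """True when the path ends with the .txt extension."""
    return os.path.splitext(path)[1] == '.txt'

def most_common_word(words, exclude):
    """Count the words not in `exclude`; return the top (word, count) pair."""
    counts = {}
    for word in words:
        if word in exclude:
            continue
        counts[word] = counts.get(word, 0) + 1
    # Descending sort by count; the first entry is the most frequent word
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0] if ranked else None

words = split_words("The cat, the hat and the cat!")
print(words)
print(is_txt("diary/day1.txt"), is_txt("diary/day1.md"))
print(most_common_word(words, exclude={'the', 'and'}))
```

With the sample sentence, the last line prints ('cat', 2), since "the" and "and" are excluded before counting.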

import os
import string

def count_words(dirpath):
    if not os.path.isdir(dirpath):
        print('please input legal dirpath!')
        return

    exclude_words = ['a', 'an', 'the', 'and', 'or', 'of', 'in', 'at', 'to', 'is','…' ]
    # Strip all punctuation except '-', which is needed below to rejoin split words
    table = str.maketrans("", "", string.punctuation.replace('-', ''))
    for root, dirs, files in os.walk(dirpath):
        for name in files:
            filename = os.path.join(root, name)
            # Skip anything that is not a regular .txt file instead of aborting the walk
            if not os.path.isfile(filename) or os.path.splitext(filename)[1] != '.txt':
                print('diary < %s > format is not .txt' % filename)
                continue
            # The with block closes the file even if reading fails
            with open(filename, 'r', encoding='utf-8') as f:
                data = f.read()
            words = data.translate(table).split()
            word_dict = dict()

            # Word joining: a word ending in '-' is merged with the next word
            prefix = ''
            for word in words:
                word = prefix + word.lower()
                prefix = ''
                if word.endswith('-'):
                    prefix = word[:-1]
                    continue
                if word in exclude_words:
                    continue
                word_dict[word] = word_dict.get(word, 0) + 1

            if not word_dict:
                print('diary < %s > contains no countable words' % name)
                continue
            # sorted() returns a new list of (word, count) tuples, highest count first
            ranked = sorted(word_dict.items(), key=lambda x: x[1], reverse=True)
            print("ranked", type(ranked))
            print('The most frequent word in diary < %s > is: %s' % (name, ranked[0]))

if __name__ == '__main__':
    count_words('diary')
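
The counting loop in count_words builds the dictionary by hand. As an alternative, the standard library's collections.Counter can do the counting and the ranking. This is a swapped-in technique rather than the approach used in the listing, and the helper name top_words and its parameters are hypothetical.

```python
from collections import Counter

def top_words(words, exclude=(), n=1):
    """Return the n most frequent words, skipping those in `exclude`."""
    counts = Counter(w for w in words if w not in exclude)
    # most_common(n) returns a list of (word, count) pairs, highest count first
    return counts.most_common(n)

sample = ['cat', 'hat', 'cat', 'the', 'the', 'the']
print(top_words(sample, exclude={'the'}))   # [('cat', 2)]
```

most_common returns an empty list when there is nothing to count, so the caller does not need a separate check before sorting.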

For more code, see my GitHub.

Last edited: 2018.12.06 10:35:41