Problem
- Word Frequency
Write a bash script to calculate the frequency of each word in a text file words.txt.
For simplicity's sake, you may assume:
words.txt contains only lowercase characters and space ' ' characters.
Each word must consist of lowercase characters only.
Words are separated by one or more whitespace characters.
Example:
Assume that words.txt has the following content:
the day is sunny the the
the sunny is is
Your script should output the following, sorted by descending frequency:
the 4
is 3
sunny 2
day 1
Notes
Note:
- Don't worry about handling ties; it is guaranteed that each word's frequency count is unique.
- Could you write it in one line using Unix pipes?
Solution 1
# Read from the file words.txt and output the word frequency list to stdout.
cat words.txt | awk -F ' ' '{ for(i=1; i<=NF; i++) print $i }' | sort | uniq -c | sort -n -r | awk -F ' ' '{ print $2, $1}'
The awk variable NF holds the number of fields in the current input record, so the loop prints every word of the line on its own line.
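To make the role of NF concrete, here is a minimal sketch that prints NF for each input line. The helper name count_fields is hypothetical and only used for illustration; the sketch relies on awk's default field splitting, which treats a run of spaces as a single separator.

```shell
# count_fields is a hypothetical helper: it prints the number of
# fields (NF) that awk sees on each line read from stdin.
count_fields() {
  awk '{ print NF }'
}

# The second line contains two consecutive spaces, which awk's
# default splitting treats as one separator.
printf 'the day is sunny the the\nthe  sunny is is\n' | count_fields
# prints 6, then 4
```

An empty line has no fields, so NF is 0 and the for loop in Solution 1 prints nothing for it.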
Solution 2
cat words.txt | tr -s ' ' '\n' | sort | uniq -c | sort -n -r | awk '{ print $2, $1 }'
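- tr -s ' ' '\n': translates each ' ' to \n and squeezes every run of repeated \n into a single \n, so several consecutive spaces produce only one line break.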
tr - translate or delete characters
-s, --squeeze-repeats replace each sequence of a repeated character that is listed in the last specified SET, with a single occurrence of that character
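The effect of -s can be checked in isolation. The sketch below wraps the tr call in a hypothetical helper named split_words and feeds it a line with repeated spaces and a trailing space.

```shell
# split_words is a hypothetical helper: it turns spaces into newlines
# and squeezes repeated newlines, leaving one word per line.
split_words() {
  tr -s ' ' '\n'
}

printf 'the   day  is \nsunny\n' | split_words
# prints:
# the
# day
# is
# sunny
```

Because the squeeze is applied to the translated newlines, the trailing space before the original line break does not leave an empty line behind.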
Solution 3: compared with Solution 2
cat words.txt | sed 's/\s/\n/g' | sort | uniq -c | sort -n -r | awk '{ if($2 != "") print $2, $1 }'
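- sed 's/\s/\n/g': replaces every single whitespace character with a single \n.
- A run of several spaces therefore produces several \n characters, although only one is needed.
- The extra \n characters create empty lines, which sort and uniq -c count like any other line.
- if($2 != ""): when printing with awk, we skip the entry for empty lines, whose word field is empty.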
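The difference from tr -s can be observed by counting the empty lines that sed leaves behind. This is a small sketch assuming GNU sed, where \s in the pattern and \n in the replacement are supported; the helper name blank_lines_after_sed is hypothetical.

```shell
# blank_lines_after_sed is a hypothetical helper: it splits stdin on
# single whitespace characters (GNU sed assumed) and counts the
# resulting empty lines.
blank_lines_after_sed() {
  sed 's/\s/\n/g' | grep -c '^$'
}

# Two consecutive spaces become two newlines, i.e. one empty line.
printf 'the  day\n' | blank_lines_after_sed
# prints 1
```

Those empty lines are what the if($2 != "") check in the final awk step filters out.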
Solution 4
awk '{ for (i=1; i<=NF; i++) { ++D[$i]; } } END { for (i in D) { print i, D[i] } }' words.txt | sort -n -r -k 2
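For readability, the same idea can be spelled out over several lines with comments. This is only a sketch of the one-liner above: the function name word_freq and the array name count are hypothetical, and the file to read is passed as the first argument.

```shell
# word_freq is a hypothetical wrapper around the awk approach:
# count every word in an associative array, print "word count"
# pairs, then sort numerically on the second column, descending.
word_freq() {
  awk '
    # For every line, bump the counter of each field (word).
    { for (i = 1; i <= NF; i++) count[$i]++ }
    # After the last line, print each word with its total.
    END { for (word in count) print word, count[word] }
  ' "$1" | sort -n -r -k 2
}

# Usage: word_freq words.txt
```

The order of for (word in count) is unspecified in awk, which is why the final sort is needed. The relative order of words with equal counts is not guaranteed, but the problem statement rules out ties.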
References and recommended reading:
https://leetcode.com/problems/word-frequency/
https://unix.stackexchange.com/a/378550/323210
https://leetcode.com/problems/word-frequency/discuss/55443/My-simple-solution-(one-line-with-pipe)
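This article is released under the Creative Commons license CC BY-NC 4.0, which requires attribution, non-commercial use, and sharing under the same terms. It may be reposted as long as the terms of CC BY-NC 4.0 are met, but please credit the source with a hyperlink. The article reflects only the author's knowledge and views; if you see things differently, feel free to reply and discuss.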