awk

awk is one of the most useful commands for handling text files. It processes a file line by line and, by default, splits each line into fields on whitespace. The most common syntax for the awk command is:

awk '/search_pattern/ { action_to_take_if_pattern_matches; }' file_to_parse

awk is an interpreted language: it finishes processing one line, then moves on to the next.

Print a Text File

awk '{ print }' /etc/passwd

awk '{ print $0 }' /etc/passwd

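The statements in curly braces inside the single quotes are the awk program. Enclose the program in single quotes so that the shell does not expand $1, $2 and so on before awk sees them.
$1..$n refer to the nth column. Note: $0 refers to the entire line.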
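
As a minimal sketch of field access, the following feeds two made-up lines to awk and prints the first field, the last field (through the built-in NF), and the whole line. The sample data and the variable name out are assumptions chosen for illustration.

```shell
# Two made-up input lines; fields are split on whitespace by default.
out=$(printf 'alice 30 paris\nbob 25 rome\n' |
  awk '{ print $1, $NF, "[" $0 "]" }')
echo "$out"
```

Because NF holds the number of fields, $NF always names the last field, however many columns a line has.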

awk's formatted output works much like printf in C:

$ awk '{printf "%-8s %-8s %-8s %-18s %-22s %-15s\n", $1,$2,$3,$4,$5,$6}' netstat.txt

tcp      0        0        0.0.0.0:3306       0.0.0.0:*              LISTEN
tcp      0        0        0.0.0.0:80         0.0.0.0:*              LISTEN
tcp      0        0        127.0.0.1:9000     0.0.0.0:*              LISTEN
tcp      0        0        coolshell.cn:80    124.205.5.146:18245    TIME_WAIT

Print Specific Field

Use : as the input field separator and print only the first field, i.e. the usernames (all other fields are ignored):

awk -F':' '{ print $1 }' /etc/passwd

Send output to sort command using a shell pipe:

awk -F':' '{ print $1 }' /etc/passwd | sort

Pattern Matching

You can print a line only if it matches a pattern. For example, display all lines from an Apache log file where the HTTP status code is 500 (the 9th field holds the status code of each HTTP request):

awk '$9 == 500 { print $0}' /var/log/httpd/access.log

The part outside the curly braces is called the “pattern”, and the part inside is the “action”. The comparison operators include the ones from C, and the C conditional operator ?: is also available:

== != < > <= >= ?:

If no pattern is given, then the action applies to all lines. If no action is given, then the entire line is printed. If “print” is used all by itself, the entire line is printed. Thus, the following are equivalent:

awk '$9 == 500 ' /var/log/httpd/access.log
awk '$9 == 500 {print} ' /var/log/httpd/access.log
awk '$9 == 500 {print $0} ' /var/log/httpd/access.log
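
The equivalence can be checked on a small input. This is only a sketch: the three log-like lines are made up, with field 9 standing in for the HTTP status code, and the variable names r1, r2 and r3 are chosen for illustration.

```shell
# Made-up log-like lines; field 9 plays the role of the HTTP status code.
data='a b c d e f g h 200 x
a b c d e f g h 500 y
a b c d e f g h 404 z'

r1=$(printf '%s\n' "$data" | awk '$9 == 500')
r2=$(printf '%s\n' "$data" | awk '$9 == 500 { print }')
r3=$(printf '%s\n' "$data" | awk '$9 == 500 { print $0 }')

# All three variables hold the same single matching line.
echo "$r1"
```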

Print Lines Containing tom, jerry OR vivek
Print every line that matches any of the names (the matches may be on separate lines):

awk '/tom|jerry|vivek/' /etc/passwd

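Print 1st Line From File (if we need the header line, we can use the built-in variable NR). The second command prints the line whose number is held in the shell variable $line, which is why the program is in double quotes: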

awk "NR==1{print;exit}" /etc/resolv.conf
awk "NR==$line{print;exit}" /etc/resolv.conf
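
The second command relies on the shell expanding $line inside double quotes before awk sees the program. An alternative sketch passes the value with -v so that the awk program can stay in single quotes. It assumes a shell variable named line, as in the command above; the awk variable name n and the sample input are made up.

```shell
line=2   # line number to print (assumed to be set by the caller)
out=$(printf 'first\nsecond\nthird\n' |
  awk -v n="$line" 'NR == n { print; exit }')
echo "$out"   # prints the second input line
```

The exit stops awk as soon as the wanted line has been printed, so the rest of the input is not scanned.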

String Matching

$ awk '$6 ~ /FIN/ || NR==1 {print NR,$4,$5,$6}' OFS="\t" netstat.txt
1       Local-Address   Foreign-Address State
6       coolshell.cn:80 61.140.101.185:37538    FIN_WAIT2
9       coolshell.cn:80 116.234.127.77:11502    FIN_WAIT2
13      coolshell.cn:80 124.152.181.209:26825   FIN_WAIT1
18      coolshell.cn:80 117.136.20.85:50025     FIN_WAIT2

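The example above matches the FIN states; using /WAIT/ as the pattern would instead match the states containing WAIT. The ~ operator introduces a pattern match, and the pattern itself goes between the slashes / /. This is a regular-expression match.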

Negating a Pattern

$ awk '$6 !~ /WAIT/ || NR==1 {print NR,$4,$5,$6}' OFS="\t" netstat.txt
or, matching against the whole line (this prints entire lines rather than selected fields):
awk '!/WAIT/' netstat.txt

1       Local-Address   Foreign-Address State
2       0.0.0.0:3306    0.0.0.0:*       LISTEN
3       0.0.0.0:80      0.0.0.0:*       LISTEN
4       127.0.0.1:9000  0.0.0.0:*       LISTEN
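
Both operators can be tried on a few made-up lines. In this sketch the connection names and states are invented, and fin and nowait are illustrative variable names.

```shell
# Made-up connection names with a state in column 2.
data='c1 FIN_WAIT1
c2 LISTEN
c3 TIME_WAIT
c4 FIN_WAIT2'

# ~ selects lines whose field matches the regular expression ...
fin=$(printf '%s\n' "$data" | awk '$2 ~ /FIN/ { print $1 }')
# ... and !~ selects lines whose field does not match it.
nowait=$(printf '%s\n' "$data" | awk '$2 !~ /WAIT/ { print $1 }')
echo "$fin"
echo "$nowait"
```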

Formatted Output

$ awk '$3==0 && $6=="LISTEN" || NR==1 {printf "%-20s %-20s %s\n",$4,$5,$6}' netstat.txt

Local-Address        Foreign-Address      State
0.0.0.0:3306         0.0.0.0:*            LISTEN
0.0.0.0:80           0.0.0.0:*            LISTEN
127.0.0.1:9000       0.0.0.0:*            LISTEN
:::22                :::*                 LISTEN

Built-in Variables

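$0 The current record (this variable holds the contents of the whole line)
$1..$n The nth field of the current record; fields are separated by FS
FS Input field separator; by default, space or Tab
NF Number of fields in the current record, i.e. how many columns it has
NR Number of records read so far, i.e. the line number, starting at 1. With multiple input files, this value keeps accumulating across files.
FNR Record number within the current file; unlike NR, it restarts for each file
RS Input record separator; by default, a newline
OFS Output field separator; by default, a single space
ORS Output record separator; by default, a newline
FILENAME Name of the current input file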
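
The difference between NR and FNR is easiest to see with two input files. The sketch below creates two small made-up files (one.txt and two.txt are hypothetical names) in a temporary directory and prints FILENAME, NR, FNR and NF for every line.

```shell
# Create two small made-up files in a temporary directory.
tmp=$(mktemp -d)
printf 'a b\nc d e\n' > "$tmp/one.txt"
printf 'x\n' > "$tmp/two.txt"

# NR keeps counting across files, FNR restarts for each file,
# NF is the number of fields on the current line.
out=$(cd "$tmp" && awk '{ print FILENAME, NR, FNR, NF }' one.txt two.txt)
echo "$out"
rm -rf "$tmp"
```

On the single line of two.txt, NR is 3 while FNR is back to 1.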

Specifying the Separator

$ awk 'BEGIN{FS=":"} {print $1,$3,$6}' /etc/passwd
// equivalent
$ awk -F: '{print $1,$3,$6}' /etc/passwd
// multiple separators (either ; or :)
awk -F '[;:]'
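
A complete command using the bracket expression might look like the following sketch. The input record is made up; the program prints the second field and the field count to show that both ; and : split the line.

```shell
# Fields separated by either ';' or ':' (made-up record).
out=$(echo 'alpha;beta:gamma' | awk -F '[;:]' '{ print $2, NF }')
echo "$out"
```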

Splitting a File

Splitting a file with awk is simple: just use redirection. The example below splits the file by column 6 (NR!=1 means the header line is skipped).

$ awk 'NR!=1{print > $6}' netstat.txt
 
$ ls
ESTABLISHED  FIN_WAIT1  FIN_WAIT2  LAST_ACK  LISTEN  netstat.txt  TIME_WAIT
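
The same idea can be tried safely in a scratch directory. This sketch uses a made-up two-column table (conn.txt is a hypothetical file name) and appends a .txt suffix to each output file; the parentheses around the file name expression make its grouping explicit, which matters once the name is built by concatenation.

```shell
tmp=$(mktemp -d)
# Made-up table: a header line followed by name/state pairs.
printf 'NAME STATE\nc1 LISTEN\nc2 TIME_WAIT\nc3 LISTEN\n' > "$tmp/conn.txt"

# Write each data line to a file named after column 2.
(cd "$tmp" && awk 'NR != 1 { print > ($2 ".txt") }' conn.txt)

listen=$(cat "$tmp/LISTEN.txt")
count=$(ls "$tmp" | wc -l | tr -d ' ')   # conn.txt plus one file per state
echo "$listen"
rm -rf "$tmp"
```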

Statistics

Compute the total size of all the C, CPP and H files:

$ ls -l  *.cpp *.c *.h | awk '{sum+=$5} END {print sum}'

// Count the connections in each state; note the use of the array
$ awk 'NR!=1{a[$6]++;} END {for (i in a) print i ", " a[i];}' netstat.txt

TIME_WAIT, 3
FIN_WAIT1, 1
ESTABLISHED, 6
FIN_WAIT2, 3
LAST_ACK, 1
LISTEN, 4
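
The order in which for (i in a) visits the keys is not specified, so the lines above can come out in any order. A sketch that pipes the result through sort gives a stable listing; the input table and the names count and s are made up for illustration.

```shell
# Made-up states in column 2; the first line is a header.
out=$(printf 'ID STATE\n1 LISTEN\n2 TIME_WAIT\n3 LISTEN\n4 LISTEN\n' |
  awk 'NR != 1 { count[$2]++ }
       END { for (s in count) print s ", " count[s] }' |
  sort)
echo "$out"
```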

// Total the memory used by each user's processes
$ ps aux | awk 'NR!=1{a[$1]+=$6;} END { for(i in a) print i ", " a[i]"KB";}'
dbus, 540KB
mysql, 99928KB
www, 3264924KB

Sum all the numbers in a column:

awk '{total += $1} END {print total}' earnings.txt

Shells such as bash cannot do floating-point arithmetic natively, but awk can:

awk 'BEGIN {printf "%.3f\n", 2005.50 / 3}'
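
Combining the column sum with floating-point division gives an average. This is a sketch on made-up numbers; the NR > 0 check is an added guard so that empty input does not cause a division by zero.

```shell
# Average of a column of made-up decimal numbers, to two places.
avg=$(printf '1.5\n2.25\n3\n' |
  awk '{ total += $1 } END { if (NR > 0) printf "%.2f\n", total / NR }')
echo "$avg"
```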

awk Scripts

The keywords BEGIN and END mark code that runs before the input is read and after it has all been processed. The syntax is:

BEGIN { statements to run before any input is read }
END { statements to run after all lines have been processed }
{ statements to run for each line }
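
A minimal sketch of the three kinds of block working together, on two made-up input lines (the counter name n is an assumption):

```shell
out=$(printf 'x\ny\n' |
  awk 'BEGIN { print "start" }              # runs once, before any input
       { n++; print "line:", $0 }           # runs for every input line
       END { print "lines seen:", n }')     # runs once, after the last line
echo "$out"
```

The order in which the blocks appear in the program does not change when they run.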

Suppose we have a file like this (a table of student scores):
$ cat score.txt
Marry 2143 78 84 77
Jack 2321 66 78 45
Tom 2122 48 77 71
Mike 2537 87 97 95
Bob 2415 40 57 62

The awk script is as follows:

#!/bin/awk -f
# Before processing
BEGIN {
    math = 0
    english = 0
    computer = 0
 
    printf "NAME    NO.   MATH  ENGLISH  COMPUTER   TOTAL\n"
    printf "---------------------------------------------\n"
}
# For each line
{
    math+=$3
    english+=$4
    computer+=$5
    printf "%-6s %-6s %4d %8d %8d %8d\n", $1, $2, $3,$4,$5, $3+$4+$5
}
# After processing
END {
    printf "---------------------------------------------\n"
    printf "  TOTAL:%10d %8d %8d \n", math, english, computer
    printf "AVERAGE:%10.2f %8.2f %8.2f\n", math/NR, english/NR, computer/NR
}

// Make the script executable (chmod +x cal.awk), then run it like this: ./cal.awk score.txt

Environment Variables

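Since we are on the subject of scripts, let's see how to interact with shell and environment variables (use the -v option and ENVIRON; a variable read through ENVIRON must be exported):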

$ x=5
 
$ y=10
$ export y
 
$ echo $x $y
5 10
 
$ awk -v val=$x '{print $1, $2, $3, $4+val, $5+ENVIRON["y"]}' OFS="\t" score.txt
Marry   2143    78      89      87
Jack    2321    66      83      55
Tom     2122    48      82      81
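
The two mechanisms can also be compared without an input file. This sketch reuses the names x, y and val from the session above and does all its work in a BEGIN block; adding 1 to each value is only there to show that both arrive as usable numbers.

```shell
x=5            # plain shell variable: awk sees it only if passed with -v
y=10
export y       # exported, so awk can read it through ENVIRON

out=$(awk -v val="$x" 'BEGIN {
  print val + 1            # value passed with -v
  print ENVIRON["y"] + 1   # value read from the environment
}')
echo "$out"
```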

Call AWK From Shell Script

A shell script that lists all IP addresses accessing your website. The script uses awk to process the log file, and the checks are done with ordinary shell commands.

#!/bin/bash
# Check the arguments first; exit statuses above 255 wrap around, so use 1
[ $# -eq 0 ] && { echo "Usage: $0 domain-name"; exit 1; }
d=$1
OUT=/tmp/spam.ip.$$
HTTPDLOG="/www/$d/var/log/httpd/access.log"
if [ -f "$HTTPDLOG" ]
then
    # Copy the log, then count the requests per client IP (first field)
    awk '{print}' "$HTTPDLOG" > "$OUT"
    awk '{ print $1 }' "$OUT" | sort -n | uniq -c | sort -n
else
    echo "$HTTPDLOG not found. Make sure the domain exists and is set up correctly."
fi
/bin/rm -f "$OUT"

AWK and Shell Functions
Here is another example. chrootCpSupportFiles() finds the shared libraries required by a program (such as perl or php-cgi), or by a shared library named on the command line, and copies them to the destination. The code calls awk to print selected fields from the ldd output:

chrootCpSupportFiles() {
# Set CHROOT directory name
local BASE="$1"         # JAIL ROOT
local pFILE="$2"        # copy bin file libs
 
[ ! -d $BASE ] && mkdir -p $BASE || :
 
FILES="$(ldd $pFILE | awk '{ print $3 }' | grep -v '^(')"
for i in $FILES
do
  dcc="$(dirname $i)"
  [ ! -d $BASE$dcc ] && mkdir -p $BASE$dcc || :
  /bin/cp $i $BASE$dcc
done
 
sldl="$(ldd $pFILE | grep 'ld-linux' | awk '{ print $1}')"
sldlsubdir="$(dirname $sldl)"
if [ ! -f $BASE$sldl ];
then
        /bin/cp $sldl $BASE$sldlsubdir
else
        :
fi
}

This function can be called as follows:
chrootCpSupportFiles /lighttpd-jail /usr/local/bin/php-cgi

AWK and Shell Pipes

List your top 10 favorite commands:

history | awk '{print $2}' | sort | uniq -c | sort -rn | head

Sample Output:

172 ls
144 cd
69 vi
62 grep
41 dsu
36 yum
29 tail
28 netstat
21 mysql
20 cat
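
Since history only has content in an interactive shell, the pipeline can be tried on made-up lines in the same format (an entry number followed by the command). In this sketch the final awk is an addition that strips the padding uniq -c puts in front of each count, and head -n 2 keeps the example short.

```shell
# Made-up lines in the format printed by history: number, then command.
out=$(printf '1 ls -l\n2 cd /tmp\n3 ls\n4 grep x f\n5 ls -a\n6 cd ..\n' |
  awk '{ print $2 }' | sort | uniq -c | sort -rn | head -n 2 |
  awk '{ print $1, $2 }')   # normalize the padding added by uniq -c
echo "$out"
```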

Another example finds a domain's expiry date:

$ whois cyberciti.com | awk '/Registry Expiry Date:/ { print $4 }'

2018-07-31T18:42:58Z