awk is especially useful for producing reports that summarize and format information. Suppose you want to produce a report from the file countries that lists the continents alphabetically and, after each continent, its countries in decreasing order of population, like this:
Africa:
Sudan 19
Algeria 18
Asia:
China 866
India 637
CIS 262
Australia:
Australia 14
North America:
USA 219
Canada 24
South America:
Brazil 116
Argentina 26
As with many data processing tasks, it is much easier to produce
this report in several stages. First, create a list of
continent-country-population triples in which the fields are
separated by colons. To do this, use the following program,
triples, which stores the population of each country in an array
pop indexed by subscripts of the form 'continent:country'.
The print statement in the END section of the program writes the
triples and pipes them to sort:
BEGIN { FS = "\t" }
      { pop[$4 ":" $1] += $3 }
END   { for (cc in pop)
            print cc ":" pop[cc] | "sort -t: +0 -1 +2nr" }
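Because the program accumulates with +=, several input lines for the same continent and country are summed into a single entry. The following is a minimal sketch of that step only. The function name accumulate and the rows for Xland and Yland are made up for illustration, and the plain sort at the end is there only to give the output a fixed order; it is not the keyed sort that triples uses.

```shell
#!/bin/sh
# Hypothetical helper: the accumulation step of triples, without the keyed sort.
accumulate() {
    awk '
    BEGIN { FS = "\t" }
          { pop[$4 ":" $1] += $3 }      # subscript is "continent:country"
    END   { for (cc in pop) print cc ":" pop[cc] }
    ' | sort                            # plain sort, only to fix the order
}

# Made-up input: tab-separated country, area, population, continent.
# Xland appears twice on purpose; the area column (0) is a placeholder.
printf '%s\t%s\t%s\t%s\n' \
    Xland 0 10 Asia \
    Yland 0 7  Asia \
    Xland 0 5  Asia | accumulate
```

The two Xland lines collapse into one triple whose population is their sum, because both produce the same subscript Asia:Xland.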
The arguments for sort deserve special mention. The
-t: argument tells sort to use : as
its field separator. The +0 -1 arguments make the first
field the primary sort key. In general, +i -j makes
fields i+1, i+2, ..., j the
sort key. If -j is omitted, the fields from i+1
to the end of the record are used. The +2nr argument makes
the third field, numerically decreasing, the secondary sort key
(n is for numeric, r for reverse order). The +i -j notation is
obsolete, and many current versions of sort no longer accept it;
the POSIX equivalent of these keys is -k1,1 -k3nr. Invoked
on the file countries, this program produces as output:
Africa:Sudan:19
Africa:Algeria:18
Asia:China:866
Asia:India:637
Asia:CIS:262
Australia:Australia:14
North America:USA:219
North America:Canada:24
South America:Brazil:116
South America:Argentina:26
This output is in the right order but the wrong format. To transform the output into the desired form, run it through a second awk program, format:
BEGIN { FS = ":" }
{       if ($1 != prev) {
                print "\n" $1 ":"
                prev = $1
        }
        printf "\t\t%-10s %6d\n", $2, $3
}
This is a control-break program: it prints a continent name only
when it differs from the one on the previous line, and formats each
country-population line under that continent in the desired manner.
The following command line produces the report:
awk -f triples countries | awk -f format
As this example suggests, complex data transformation and formatting
tasks can often be reduced to a few simple awk and
sort operations.
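
The two stages can be tried end to end with the sketch below. It is only an illustration under stated assumptions: the shell functions make_countries, triples, and format are hypothetical wrappers around the two awk programs, the sample rows reuse the populations from the report with a placeholder 0 in the area column (which neither program reads), and the sort keys are written in POSIX -k form (-k1,1 -k3nr) instead of +0 -1 +2nr, since many current versions of sort no longer accept the older notation.

```shell
#!/bin/sh
# Stand-in for the file countries: tab-separated country, area,
# population, continent. Areas are placeholders (0); they are never read.
make_countries() {
    printf '%s\t%s\t%s\t%s\n' \
        CIS       0 262 Asia \
        Canada    0 24  "North America" \
        China     0 866 Asia \
        USA       0 219 "North America" \
        Brazil    0 116 "South America" \
        Australia 0 14  Australia \
        India     0 637 Asia \
        Argentina 0 26  "South America" \
        Sudan     0 19  Africa \
        Algeria   0 18  Africa
}

# Stage 1: build continent:country:population triples and sort them by
# continent, then by decreasing population (assumed -k equivalent keys).
triples() {
    awk '
    BEGIN { FS = "\t" }
          { pop[$4 ":" $1] += $3 }
    END   { for (cc in pop)
                print cc ":" pop[cc] | "sort -t: -k1,1 -k3nr" }
    '
}

# Stage 2: control break on the continent field.
format() {
    awk '
    BEGIN { FS = ":" }
    {
        if ($1 != prev) {           # continent changed: print a new heading
            print "\n" $1 ":"
            prev = $1
        }
        printf "\t\t%-10s %6d\n", $2, $3
    }
    '
}

make_countries | triples | format
```

Keeping each stage in its own function makes it easy to inspect the intermediate triples by running make_countries | triples alone before adding the formatting step.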