Using awk

Arrays

awk provides one-dimensional arrays. An array is a collection of variables that share a common name and are distinguished from one another by a subscript. You do not need to declare arrays or array elements; like variables, they spring into existence when you use them. An array subscript can be a number or a string.

As an example of a conventional numeric subscript, the following statement assigns the current input line to the NRth element of the array x:

   x[NR] = $0

In fact, it is possible in principle (though perhaps slow) to read the entire input into an array with an awk program like this:

        { x[NR] = $0 }
   END  { ... processing ... }

The first action records each input line in the array x, indexed by line number; processing is done in the END action.
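
As one possible form of the END processing, the following sketch prints the input lines in reverse order. It is wrapped in a small shell script so that it carries its own input; the three-line sample input and the shell variable out are made up for illustration.

```shell
# Sample three-line input is made up for illustration.
out=$(printf 'first\nsecond\nthird\n' | awk '
    { x[NR] = $0 }                 # store each line under its line number
END { for (i = NR; i >= 1; i--)    # walk the subscripts from last to first
          print x[i] }
')
printf '%s\n' "$out"
```

Because the subscripts are the line numbers 1 through NR, an ordinary counting loop reaches every element in a known order.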

Array elements can also be named by nonnumeric values. An array whose subscripts are strings rather than numeric positions is called an associative array. For example, the following program accumulates the total populations of Asia and Africa in the associative array pop. The END action prints the total population of each of these two continents.

   /Asia/    { pop["Asia"] += $3 }
   /Africa/  { pop["Africa"] += $3 }
   END       { print "Asian population in millions is", pop["Asia"]
               print "African population in millions is", pop["Africa"] }

On the file countries, this program generates the following output:

   Asian population in millions is 1765
   African population in millions is 37

In this program, if we used pop[Asia] instead of pop["Asia"], the expression would use the value of the variable Asia as the subscript. Because that variable is uninitialized, its value is the empty string, and the values would be accumulated in pop[""].
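
The difference can be seen in a minimal sketch like the one below. The array names quoted and bare, the two input lines, and the shell variable out are hypothetical and chosen only for this illustration.

```shell
# Two made-up input lines; only the third field matters here.
out=$(printf 'x 1 10\ny 2 20\n' | awk '
    { quoted["Asia"] += $3    # string constant used as the subscript
      bare[Asia] += $3 }      # Asia is an uninitialized variable, so the subscript is the empty string
END { print "quoted:", quoted["Asia"]
      print "bare:", bare[""]
      found = ("Asia" in bare) ? "yes" : "no"
      print "Asia in bare:", found }
')
printf '%s\n' "$out"
```

Both arrays hold the same total, but in bare it is stored under the empty subscript, and no element named Asia exists.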

Suppose our task is to determine the total area of each continent in the file countries. Any expression can be used as a subscript in an array reference. Consider the following statement:

   area[$4] += $2

This statement uses the string in the fourth field of the current input record to index the array area, and accumulates the value of the second field in that entry. The following program applies the statement to every input record and then prints the totals:

   BEGIN  { FS = "\t" }
          { area[$4] += $2 }
   END    { for (name in area)
                print name, area[name] }

When run on the countries file, this program produces the following output:

   Africa 1888
   North America 7467
   South America 4358
   Asia 13611
   Australia 2968

Setting FS to a tab is necessary to prevent awk from splitting names that contain white space ("South America", for example) into two separate fields. With the default field separator, the program produces the following output instead:

   Africa 1888
   South 4358
   Asia 13611
   North 7467
   Australia 2968

(Note that the order of the output lines is different from that of the previous example. This illustrates an important quality of associative arrays: unlike the elements of conventional arrays, their elements are not stored in any particular order. Numeric subscripts can be used in associative arrays, but they do not necessarily refer to sequentially ordered locations. To process the elements in sequence, use a loop that steps a numeric subscript through the values in the order you want.)

The last example uses a form of the for statement that iterates over all defined subscripts of an array:

   for (i in array) statement

This executes statement with the variable i set in turn to each subscript for which array[i] has been defined. The loop is executed once for each defined subscript; the order in which the subscripts are visited is unspecified and can differ between implementations. Results are unpredictable when i or array is altered during the loop.
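
When a fixed output order matters, one common approach is to record each subscript in a second, numerically indexed array as it is first seen, and then to loop over that array with a counter. The following is a minimal sketch of that idea; the helper array order, the counter n, the shell variable out, and the three tab-separated sample rows are assumptions made for illustration.

```shell
# Three made-up rows: name, area, population, continent (tab-separated).
out=$(printf 'Sudan\t968\t19\tAfrica\nUSA\t3615\t219\tNorth America\nAlgeria\t920\t18\tAfrica\n' | awk '
BEGIN { FS = "\t" }
      { if (!($4 in area))        # first time this continent is seen
            order[++n] = $4       # remember its position in a numeric list
        area[$4] += $2 }
END   { for (i = 1; i <= n; i++)  # counting loop gives a fixed order
            print order[i], area[order[i]] }
')
printf '%s\n' "$out"
```

The continents are printed in order of first appearance, regardless of how the awk implementation walks the array area internally.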

awk does not provide multidimensional arrays, but it does permit a list of subscripts. They are combined into a single subscript with the values separated by an unlikely string (stored in the variable SUBSEP). For example, the following code creates an array that behaves like a two-dimensional array; the subscript is the concatenation of i, SUBSEP, and j:

   for (i = 1; i <= 10; i++)
       for (j = 1; j <= 10; j++)
           arr[i,j] = . . .
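
The following sketch shows, under the assumption that each element simply holds the product i * j, how such a key can be tested, rebuilt by hand, and taken apart again. The names key, part, and n and the shell variable out are hypothetical.

```shell
out=$(awk '
BEGIN {
    for (i = 1; i <= 2; i++)
        for (j = 1; j <= 3; j++)
            arr[i,j] = i * j          # stored under the key i SUBSEP j
    if ((2,3) in arr)                 # membership test with a subscript list
        print "arr[2,3] =", arr[2,3]
    key = 2 SUBSEP 3                  # the same key built by hand
    print "by hand:", arr[key]
    n = split(key, part, SUBSEP)      # recover the individual subscripts
    print n, part[1], part[2]
}')
printf '%s\n' "$out"
```

Because the combined key is an ordinary string, a subscript value that itself contains SUBSEP would make the key ambiguous.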

You can determine whether a particular subscript i occurs in an array arr by testing the condition i in arr:

   if ("Africa" in area)...

This condition performs the test without the side effect of creating area["Africa"], which would happen if we used the following:

   if (area["Africa"] != "")...

Note that neither is a test of whether the array area contains an element with value "Africa".
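
The side effect can be observed by counting the elements after each kind of test, as in this minimal sketch. The counter n, the loop variable k, and the shell variable out are made up for illustration.

```shell
out=$(awk '
BEGIN {
    area["Asia"] = 13611
    if ("Africa" in area) print "unexpected"      # does not create an element
    n = 0; for (k in area) n++
    print "after in test:", n
    if (area["Africa"] != "") print "unexpected"  # the reference creates area["Africa"]
    n = 0; for (k in area) n++
    print "after reference:", n
}')
printf '%s\n' "$out"
```

The second count is larger by one because the comparison created an empty element with the subscript "Africa".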

It is also possible to split a string into fields and store them in the elements of an array using the built-in function split (see "Using strings and string functions"). The following call splits the string s1:s2:s3 into three fields (using the separator :) and stores s1 in a[1], s2 in a[2], and s3 in a[3].

   split("s1:s2:s3", a, ":")

The number of fields found, here three, is returned as the value of split. The third argument of split is a regular expression to be used as the field separator. If the third argument is missing, FS is used as the field separator.
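
The three forms described above can be tried with a short sketch like the following. The sample strings, the array names b and c, and the variables n, m, k, and out are chosen only for illustration.

```shell
out=$(awk '
BEGIN {
    n = split("s1:s2:s3", a, ":")          # single-character separator
    print n, a[1], a[2], a[3]
    m = split("10, 20;30", b, /[,;] */)    # regular expression: comma or semicolon, then optional blanks
    print m, b[1], b[2], b[3]
    FS = "-"
    k = split("x-y", c)                    # no third argument, so FS is used
    print k, c[1], c[2]
}')
printf '%s\n' "$out"
```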

An array element can be deleted with the delete statement:

   delete arrayname[subscript]
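
As a minimal sketch of its effect, the following program deletes one of two elements and then checks what remains. The variables found, n, and k and the shell variable out are hypothetical.

```shell
out=$(awk '
BEGIN {
    area["Asia"] = 13611; area["Africa"] = 1888
    delete area["Africa"]                      # remove a single element
    found = ("Africa" in area) ? "yes" : "no"
    n = 0; for (k in area) n++                 # count the elements that remain
    print "Africa present:", found
    print "elements left:", n
}')
printf '%s\n' "$out"
```

After the deletion, the in test reports that the subscript no longer exists, and the for loop no longer visits it.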