Awk and sed

Now the fun begins

awk and sed are two programs that come with Unix and let you do fairly complicated manipulations of data from the command line. Strictly speaking, each one is an interpreter for its own small language, but you can think of them simply as programs. They each act on more or less any kind of regular plaintext file.

ASCII files

As an aside, these plaintext files are called ASCII files -- you'll hear me use this expression sometimes. ASCII is the American Standard Code for Information Interchange, a standard set of regular, normal characters: letters, numbers, a small number of punctuation marks, etc. An ASCII file is simply a file that contains only those characters. ASCII files are human readable -- you can look at one and understand it. A Word file, for example, is not human readable -- you need a special program (Word itself) to read it. An ASCII file is much more portable and flexible, as it can be read by almost any piece of software, in addition to your eyes. It can be emailed without corruption, and the file sizes are small (unlike Word documents). There are lots of other advantages to ASCII files. For now, the biggest advantage is that any program can read and write them.

grep

grep is a search tool that operates on ASCII files. If you want to search for a certain string within a file, grep is one way to do this. Of course, you could do it by opening the file in Emacs (for instance) and doing a bunch of clicking around -- no problem. But if you had to do that for 1000 files, you'd probably be better off writing a one-line grep command to do it for you.

An example:
%> grep apple fruitlist.txt

will return each line of the file fruitlist.txt that contains the string apple. If there are multiple lines that contain "apple" you'll get them all.
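
To try this yourself, here is a minimal sketch. The contents of fruitlist.txt below are invented for illustration; any small text file will do. Note that grep matches the string anywhere in a line, so "pineapple" counts as a match for "apple".

```shell
# Build a small sample file (contents are made up for this example).
printf '%s\n' 'apple' 'banana' 'pineapple' 'cherry' 'green apple' > fruitlist.txt

# Print every line that contains the string "apple".
grep apple fruitlist.txt
# prints:
# apple
# pineapple
# green apple
```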

Another nice thing about grep is that you can do a "not found" search. An example:
%> grep -v apple fruitlist.txt

will return each line of "fruitlist.txt" that does not contain "apple." This will come in handy.

If you want to exclude, for example, all the comment lines, you need to put quotes around the # so that the shell passes it to grep instead of treating it as the start of a comment:
%> grep -v "#" fruitlist.txt

will print out the entire fruitlist.txt except for the comment lines (beginning with "#"). In fact, any other line that contains "#" will also be excluded, so be careful. If you've got a wandering "#" somewhere else in the file, it could cause trouble for you. For this reason, I tend to stay clear of "#" except at the beginnings of comment lines.
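
If a stray "#" does worry you, one option is to anchor the pattern with "^", which in grep's pattern syntax means "start of the line". The sketch below uses an invented fruitlist.txt with one comment line and one data line that happens to contain "#", and compares the two searches.

```shell
# Sample file (contents are made up): one comment line, one stray "#".
printf '%s\n' '# my fruit list' 'apple' 'banana #2' 'cherry' > fruitlist.txt

# Drops every line containing "#" anywhere, including "banana #2".
grep -v "#" fruitlist.txt
# prints:
# apple
# cherry

# Drops only the lines that begin with "#"; "banana #2" survives.
grep -v "^#" fruitlist.txt
# prints:
# apple
# banana #2
# cherry
```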

awk

awk is a programming language whose strength is handling data in ASCII files. It can do a million different complicated things, but mostly I use it for operating on data files that have columns that I want to do something to. The best way to explain it is simply to show you some examples.

I have modified the example file from Tuesday's class; get the new file here. Note that I have included a comment line to explain what the two columns are.

There's a lot that can be said about awk. The basics are pretty basic, though. Columns are indicated by "$" like so: $1,$2,$3.

awk will go through and do the same thing to each row of data, whether you've got 1 row, or 1000. Remember the rule of 10!

So, an example, using data2.dat. If I just want the data in the first column, I can now type this:
%> awk '{print $1}' data2.dat

Let's do something more complicated.

What if you wanted to do some math on these data? Let's say you wanted to add (column 1 x 10) and (sqrt of column 2). Just to be safe, you also want to print out the original columns 1 and 2 as well. You would write
%> awk '{print $1,$2,($1*10)+sqrt($2)}' data2.dat

Notice what happened to the very first line. You got the column headers for columns 1 and 2, and then the third column has a zero in it. This is because you tried to do math on a non-math value ("#column1", "#column2"), and awk handled it gracefully -- instead of crashing, it just put a zero there. On the other hand, it didn't give you a warning either. Something to be cognizant of.
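
One way to avoid the silent zero is to tell awk to skip the comment line. This is only a sketch: the data2.dat below is a made-up stand-in for the class file (its real contents are not reproduced here), and it assumes comment lines begin with "#". The pattern !/^#/ in front of the braces means "only do this for lines that do not start with #".

```shell
# Stand-in data file (values invented for this example).
printf '%s\n' '#column1 #column2' '1 4' '-2 9' '3 16' > data2.dat

# The original command: the header line produces a 0 in the third column.
awk '{print $1,$2,($1*10)+sqrt($2)}' data2.dat
# prints:
# #column1 #column2 0
# 1 4 12
# -2 9 -17
# 3 16 46

# Same math, but skipping any line that starts with "#".
awk '!/^#/ {print $1,$2,($1*10)+sqrt($2)}' data2.dat
# prints:
# 1 4 12
# -2 9 -17
# 3 16 46
```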

You can do conditional/equality statements too. The condition goes in front of the braces, and awk only runs the print for rows where it is true. To print all values of column 1 that are greater than zero you would do this:
%> awk '$1>0 {print $1}' data2.dat
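
Conditions can also be combined with && (and) and || (or). The sketch below again uses an invented stand-in for data2.dat, and adds !/^#/ so the header line is never compared as if it were a number.

```shell
# Stand-in data file (values invented for this example).
printf '%s\n' '#column1 #column2' '1 4' '-2 9' '3 16' > data2.dat

# Print both columns, but only for non-comment rows where column 1 is positive.
awk '!/^#/ && $1>0 {print $1,$2}' data2.dat
# prints:
# 1 4
# 3 16
```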

There are a ton of other things that you can do with awk, but this is the core of it. More exercises lie ahead.

More awk information can be found here, among other places.

sed

sed is similar to awk in that it operates on a stream of ASCII data, typically input as a file.

Instead of doing math on an input line, though, sed does in-line (on the fly) editing.

Again, the best way to show the basic use of sed is with some examples.

The thing that I find most useful that sed does is search-and-replace.

What if I wanted to change the column headers of my input file? I could do
%> sed 's/column/buddha/g' data2.dat

Let's take this apart. The single quotes set apart your sed string and keep the shell from interpreting it; though not always necessary, they are good practice. "s" means substitute. The "/" characters are the delimiters. The first thing -- here it is "column" -- is what is being changed from. The second thing -- here it is "buddha" -- is what is being changed to. "g" means to do this for every instance of "column" in the line -- without the "g", sed will only substitute the first instance. (Try it without the "g" to see if I'm right!)
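
A quick way to see the effect of the "g" without touching any file is to pipe a test line into sed. The input string here is just an illustration.

```shell
# Without "g": only the first "column" on the line is replaced.
echo 'column column column' | sed 's/column/buddha/'
# prints: buddha column column

# With "g": every "column" on the line is replaced.
echo 'column column column' | sed 's/column/buddha/g'
# prints: buddha buddha buddha
```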

Yes, you could do this by opening the file, doing a search and replace, and then closing the file, but sed can do all this in a single command. More powerfully, sed can do this to 1000 files that are each 1000 lines long in a single command or set of commands; this would take you all day to do interactively.
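
Here is a rough sketch of the many-files case, using a shell loop. The file names (run1.dat and so on) and their contents are hypothetical. Keep in mind that sed, as used here, prints the edited text to the screen and leaves the input file untouched, so the loop redirects each result into a new file.

```shell
# Make three small sample files (names and contents invented).
for i in 1 2 3; do
    printf '%s\n' '#column1 #column2' "$i 4" > "run$i.dat"
done

# Run the same substitution on every file, saving each result
# next to the original with a .new extension.
for f in run*.dat; do
    sed 's/column/buddha/g' "$f" > "${f%.dat}.new"
done

# The originals are unchanged; the .new files hold the edited text.
head -n 1 run2.dat run2.new
```

Writing to new files rather than overwriting means a mistake in the sed string costs nothing: fix the string and run the loop again.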

This is the key to learning these programming languages and scripting tools: you could do any of these steps manually, if you wanted to, but it would take you forever. Instead, learn a powerful tool and learn it well, and save lots of frustrating manual work later on. As a worst-case example, let's say you did the manual search-and-replace, and then realized you had the wrong string in the replace, and had to go back and do it all over again! A whole day wasted. The same mistake with sed means you just run the command again -- total cost about 5 seconds.

Of course, sed has lots of other functionality as well, but this is the basic stuff.

More sed info can be found here, among many other places.

Exercise

Now go do the exercise.