Grokking grep

And probably gawking at awk while we are at it, which means regular expressions, too. Now we have two problems.

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." - Jamie Zawinski

If the find command is useful for finding file system entries based on their attributes, the grep command is good for finding files whose contents match a regular expression. You already know a close cousin of regular expressions, the wildcard * character from the CMD.EXE prompt and Windows Explorer, where it means "match zero or more characters." Regular expressions use * in a related but slightly different way. We'll cover more on regular expressions, or "regexes," in a moment. \drfnd{grep}{search files} \drshl{CMD.EXE} \index{*@\texttt{*} (match zero or more characters)} \index{regular expressions}

First, an example of grep, showing all files in a directory with the pattern "is" in them:

\drcap{\texttt{grep} example}

~ $ touch a b c
~ $ echo This sequence of characters is called a \"string\". > d
~ $ cat d
This sequence of characters is called a "string".
~ $ ls
a  b  c  d
~ $ grep is *
d:This sequence of characters is called a "string".
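One detail worth seeing for yourself: the * in that command is handled by the shell, not by grep. Here is a minimal sketch, run in a scratch directory created just for the purpose, that uses the -l parameter to list only the names of matching files:

```shell
# Work in a scratch directory so nothing real is touched.
cd "$(mktemp -d)"
touch a b c
echo 'This sequence of characters is called a "string".' > d

# The shell expands * to "a b c d" before grep ever runs.
# -l prints only the names of files with at least one match.
matches=$(grep -l is *)
echo "$matches"    # prints: d
```

Because the shell gets first crack at characters like *, it is a good habit to put quotes around any pattern that contains them.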

Expressing Yourself Regularly{.unnumbered}

So what are "regular expressions?" Simply, they are patterns for matching "strings," which are sequences of "characters," e.g.: \index{regular expressions}

\drcap{A string}

This sequence of characters is called a "string".

That is a string. So is, "That is a string." And "That" and "T" and so on. In general (with many exceptions), the UNIX world view is that everything is composed of text (or "strings"), and that creating, changing, finding and passing around text is the primary mode of operation.

In the grep example, we can see a regular expression can be as simple as "is". It can also be as complicated as:

\drcap{Complex regular expression}

(?bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@f

That shows at least one attempt at being a very complete parser of valid HTTP URLs. Wow! What is all that? Now you see why you have two problems. Even if you get that all figured out, or if you actually sit and create something like that from scratch yourself (and it works!), imagine coming back six months later and trying to decipher it again.

There are literally whole web sites and books just on regular expressions. With variations, they are used in all "UNIX" shells, Perl, Python, JavaScript, Java, C# and more. So obviously (a) they are really useful, and (b) we're not going to cover all of regexes here.

There are so many things you can do that the only thing to remember is this: when you think "I need to find things based on a pattern," think "regular expressions," and then research what it will take to define the pattern you want.

In the meantime, following are a few simple regex examples. Consider the file invoices:

\drcap{Invoices file}

~ $ cat invoices
Combine brakes  400
Combine motor   1500
Combine tires   2500
Tractor brakes  300
Tractor motor   1000
Tractor tires   2000
Truck   brakes  200
Truck   tires   400
Truck   tires   400
Truck   tires   400
Truck   winch   100

Let's find all lines with "tractor":

\drcap{Trying to find tractors}

~ $ grep tractor invoices

Huh, nothing was found. But this is UNIX-land, so we know it is sensitive - about case anyway:

\drcap{Trying to find tractors, part two}

~ $ grep Tractor invoices
Tractor brakes  300
Tractor motor   1000
Tractor tires   2000

Or we could just tell grep we are insensitive (to case, anyway):

\drcap{Let's be insensitive}

~ $ grep -i tractor invoices
Tractor brakes  300
Tractor motor   1000
Tractor tires   2000

And just to remind you about long-style parameters:

\drcap{Spelling out our insensitivity}

~ $ grep --ignore-case tractor invoices
Tractor brakes  300
Tractor motor   1000
Tractor tires   2000

But what lines are those on?

\drcap{Print the line numbers of matches}

~ $ grep -i -n tractor invoices
4:Tractor brakes  300
5:Tractor motor   1000
6:Tractor tires   2000

To get more complicated, we can pass the -E parameter (for extended regular expressions) and start doing some really fun stuff. Let's look for lines with either "Tractor" or "Truck":

\drcap{Extended regular expressions}

~ $ grep -E "Tractor|Truck" invoices
Tractor brakes  300
Tractor motor   1000
Tractor tires   2000
Truck   brakes  200
Truck   tires   400
Truck   tires   400
Truck   tires   400
Truck   winch   100
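Parentheses let you limit how far the | reaches. The following is a small sketch, assuming a scratch copy of the invoices file with columns separated by spaces, that finds only the tires invoices for tractors and trucks:

```shell
# Build a scratch copy of the invoices file (space-separated columns assumed).
cd "$(mktemp -d)"
cat > invoices <<'EOF'
Combine brakes  400
Combine motor   1500
Combine tires   2500
Tractor brakes  300
Tractor motor   1000
Tractor tires   2000
Truck   brakes  200
Truck   tires   400
Truck   tires   400
Truck   tires   400
Truck   winch   100
EOF

# (Tractor|Truck) groups the alternatives; " +" then matches one or more spaces.
tires=$(grep -E '(Tractor|Truck) +tires' invoices)
echo "$tires"
```

Without the parentheses, the pattern would read as "Tractor" or "Truck +tires", which is a different question.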

For me, the following keep coming up when using regular expressions:

  • one|other - find one pattern or the other. \index{*@\texttt{"|} (or)} \index{regular expressions!\texttt{"|} (or)}

  • ^ - pattern for the beginning of a line. \index{*@\texttt{^{}} (beginning of line)} \index{regular expressions!\texttt{^{}} (beginning of line)}

  • $ - pattern for the end of a line. \index{*@\texttt{$} (end of line)} \index{regular expressions!\texttt{$} (end of line)}

  • . - match exactly one character. \index{*@\texttt{.} (match one character)} \index{regular expressions!\texttt{.} (match one character)}

  • ? - match zero or one of the preceding item. \index{*@\texttt{?} (match zero or one)} \index{regular expressions!\texttt{?} (match zero or one)}

  • * - match zero or more of the preceding item, so .* matches zero or more characters. \index{*@\texttt{*} (match zero or more)} \index{regular expressions!\texttt{*} (match zero or more)}

  • + - match one or more of the preceding item, so .+ matches one or more characters. \index{*@\texttt{+} (match one or more)} \index{regular expressions!\texttt{+} (match one or more)}

  • [A-Z] - match any character in a range (in this case any uppercase Latin alphabetic character). \index{*@\texttt{[A-Z]} (match a character in range)} \index{regular expressions!\texttt{[A-Z]} (match a character in range)}

  • [ny] - match one character or another (such as n or y here). No | is needed inside the brackets, where it would just be one more character to match. \index{*@\texttt{[ny]} (match one character or other)} \index{regular expressions!\texttt{[ny]} (match one character or other)}
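To get a feel for these, here is a sketch that tries several of them against a scratch copy of the invoices file (columns separated by spaces is an assumption). It uses the -c parameter so grep prints a count of matching lines rather than the lines themselves:

```shell
# Build a scratch copy of the invoices file (space-separated columns assumed).
cd "$(mktemp -d)"
cat > invoices <<'EOF'
Combine brakes  400
Combine motor   1500
Combine tires   2500
Tractor brakes  300
Tractor motor   1000
Tractor tires   2000
Truck   brakes  200
Truck   tires   400
Truck   tires   400
Truck   tires   400
Truck   winch   100
EOF

# -c prints the number of matching lines instead of the lines themselves.
grep -c -E 'Tractor|Truck' invoices   # one word or the other: 8
grep -c -E '^Combine' invoices        # begins with Combine: 3
grep -c -E '1000$' invoices           # ends with 1000: 1
grep -c -E '20?00$' invoices          # ends with 200 or 2000: 2
grep -c -E '[12]500$' invoices        # ends with 1500 or 2500: 2
grep -c -E '^T[a-z]+ ' invoices       # T, lowercase letters, then a space: 8
```

The single quotes keep the shell from touching characters such as $, ? and *, so grep sees the pattern exactly as typed.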

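For example, to find the lines that end in 400 (.* matches any run of characters, so the pattern reads "start of line, anything at all, then 400 at the end of the line"):

\drcap{Find lines ending with 400}

~ $ grep -E "^.*400$" invoices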
Combine brakes  400
Truck   tires   400
Truck   tires   400
Truck   tires   400

Groveling With grep{.unnumbered}

To recursively find all files that contain the string "pdfinfo":

\drcap{Recursive \texttt{grep}}

~ $ grep -R -i pdfinfo .
./FileCheckers/otschecker:# pdfinfo, too. If pdfinfo thinks it's junk, ...
./FileCheckers/otschecker:        pdfinfo=`pdfinfo -opw foo "$1" 2>&1 1...
./FileCheckers/otschecker:        if [ $rc != 0 -a "$pdfinfo" != "Comma...
./FileCheckers/pdfchecker:        # pdfinfo, too. If pdfinfo thinks it'...
./FileCheckers/pdfchecker:                pdfinfo=`pdfinfo "$1" > /dev/...
./FileCheckers/pdfpwdchecker:# pdfinfo, too. If pdfinfo thinks it's jun...
./FileCheckers/pdfpwdchecker:        pdfinfo=`pdfinfo -opw foo "$1" 2>&...
./FileCheckers/pdfpwdchecker:        if [ $rc != 0 -a "$pdfinfo" = "Com...
./FileCheckers/README.md:* ***[pdfinfo(1)](http://linux.die.net/man/1/p...

The above is functionally equivalent to, but much quicker than:

\drcap{Recursive \texttt{grep} is faster than \texttt{find ... -exec grep}}

~ $ find . -type f -exec grep -H -i pdfinfo \{\} \; 
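There is also a middle ground, assuming your find supports ending -exec with + instead of \; (the directory and file names below are made up for the example). With +, find hands many file names to each grep process rather than starting one grep per file:

```shell
# Build a small scratch tree to search.
cd "$(mktemp -d)"
mkdir -p FileCheckers/sub
echo 'runs pdfinfo first' > FileCheckers/pdfchecker
echo 'nothing here'       > FileCheckers/sub/other
echo 'PDFINFO again'      > FileCheckers/sub/notes

# One grep process per file (the \; form).
per_file=$(find . -type f -exec grep -H -i pdfinfo {} \; | sort)

# Many files per grep process (the + form).
batched=$(find . -type f -exec grep -H -i pdfinfo {} + | sort)

# Both forms report the same matches; only the number of grep processes differs.
echo "$batched"
```

On a big directory tree the + form usually closes most of the speed gap, since starting a process for every single file is where the time goes.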

Note: In general, if a command has its own "recursive" option (such as -R with grep), it is quicker to use that than to invoke the command repeatedly using find. \drfnd{find}{find files}

However, sometimes you can use find to narrow down the files to be checked before grep reads through them, and get your results much more quickly.

For example, if you only wanted to check files containing "pdfinfo" that have been created or modified since the last time you checked, it could be quicker to run something like:

\drcap{A better example of when to use \texttt{find ... -exec grep}}

~ $ find . ! -name pdfinfo.log -newer pdfinfo.log -type f -exec grep -H \
    -i pdfinfo \{\} \; > pdfinfo.log

This says to ignore files named pdfinfo.log (! -name pdfinfo.log) and otherwise look for files (-type f) containing "pdfinfo" (-exec grep -H -i pdfinfo) that have changed since the last time pdfinfo.log was modified (-newer pdfinfo.log). In my tests the first run (which initially creates the pdfinfo.log file) ran in 30 seconds, but subsequent runs took just a few seconds. This was because the number of files to be searched through all directories was big enough that it paid to pre-filter the results with find before handing them to grep.
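One thing to watch with a command like the one above: the shell opens pdfinfo.log for writing, and so resets its modification time, before find gets a chance to compare against it. Here is a hedged sketch of one way around that, assuming a separate marker file (the names pdfinfo.stamp and pdfinfo.stamp.next are made up for this example) and a scratch directory:

```shell
# Scratch directory with one file from "before" the last check.
cd "$(mktemp -d)"
echo 'calls pdfinfo' > old_script

# pdfinfo.stamp (hypothetical name) marks when the previous check ran.
touch pdfinfo.stamp
sleep 1   # make sure the next file is strictly newer than the marker
echo 'also calls pdfinfo' > new_script

# Note when this run starts, search only files newer than the old marker,
# and append the matches to the log instead of overwriting it.
touch pdfinfo.stamp.next
find . -type f ! -name 'pdfinfo.*' -newer pdfinfo.stamp \
    -exec grep -H -i pdfinfo {} + >> pdfinfo.log

# Only now move the marker forward, ready for the next run.
mv pdfinfo.stamp.next pdfinfo.stamp

cat pdfinfo.log    # prints: ./new_script:also calls pdfinfo
```

Keeping the marker separate from the log means the redirection can do whatever it likes to the log without disturbing the time find compares against.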

Gawking at awk{.unnumbered}

I don't have much to say about awk other than: \drscr{awk}

  1. It is named after its three authors, Aho, Weinberger and Kernighan, all three of whom are computer science greats from Bell Labs. The GNU version is called gawk, of course!

  2. It is a "data driven scripting language." That's a fancy way of saying it was written specifically with slicing and dicing text in mind.

  3. It generally is broken out when the typical "UNIX" commands and shell features like pipes and redirection aren't enough.

  4. Usually, if I start thinking of awk, I start thinking of a way to program the answer in another language such as Python, or reframe the question to get an answer not requiring awk.

That said, it is a powerful knife in the tool belt, and you should be aware it exists. If you are searching the internet and find an answer using awk that you can quickly adapt to your needs, use it.

To whet your appetite, here is the type of "one-liner" for which awk is famous, in this case formatting and printing a report on user ids from /etc/passwd:

\drcap{awk example}

~ $ awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd
username: root      uid:0
username: daemon    uid:1
username: bin       uid:2
username: sys       uid:3
username: sync      uid:4
username: games     uid:5
username: man       uid:6
username: lp        uid:7
username: mail      uid:8
username: news      uid:9
username: uucp      uid:10
...and so on...
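Closer to home, here is a minimal sketch of awk doing arithmetic on the invoices data from earlier, assuming a scratch copy of the file whose columns are separated by spaces. It adds up the third column for each vehicle named in the first column:

```shell
# Build a scratch copy of the invoices file (space-separated columns assumed).
cd "$(mktemp -d)"
cat > invoices <<'EOF'
Combine brakes  400
Combine motor   1500
Combine tires   2500
Tractor brakes  300
Tractor motor   1000
Tractor tires   2000
Truck   brakes  200
Truck   tires   400
Truck   tires   400
Truck   tires   400
Truck   winch   100
EOF

# By default awk splits each line into fields on runs of whitespace:
# $1 is the vehicle and $3 is the amount. The END block runs after the
# last line; sort is there because awk's "for (v in sum)" order is arbitrary.
totals=$(awk '{ sum[$1] += $3 } END { for (v in sum) print v, sum[v] }' invoices | sort)
echo "$totals"
# prints:
# Combine 4400
# Tractor 3300
# Truck 1500
```

This is the sort of job where awk earns its place on the tool belt: grep can find the lines, but it can't add them up.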