Linux Text Processing Commands

These are Linux commands affecting text and text files. Original source: http://linux.die.net/abs-guide/textproc.html


sort

File sort utility, often used as a filter in a pipe. This command sorts a text stream or file forwards or backwards, or according to various keys or character positions. Using the -m option, it merges presorted input files. The info page lists its many capabilities and options. See Example 10-9, Example 10-10, and Example A-8.
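For instance, a few typical invocations (file names hypothetical):

    sort names.txt                        # Sort lines alphabetically, forwards.
    sort -r names.txt                     # Sort in reverse order.
    sort -k 2 -n data.txt                 # Numeric sort on the second field.
    sort -m list1.sorted list2.sorted     # Merge two presorted files.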

tsort

Topological sort, reading in pairs of whitespace-separated strings and sorting according to input patterns. The original purpose of tsort was to sort a list of dependencies for an obsolete version of the ld linker in an "ancient" version of UNIX.

The results of a tsort will usually differ markedly from those of the standard sort command, above.
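A quick illustration with made-up dependency pairs (each pair means the left item must precede the right):

    printf '%s\n' 'kernel libc' 'libc app' | tsort
    # kernel
    # libc
    # app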

uniq

This filter removes duplicate lines from a sorted file. It is often seen in a pipe coupled with sort.



The useful -c option prefixes each line of the input file with its number of occurrences.



The sort INPUTFILE | uniq -c | sort -nr command string produces a frequency of occurrence listing on the INPUTFILE file (the -nr options to sort cause a reverse numerical sort). This template finds use in analysis of log files and dictionary lists, and wherever the lexical structure of a document needs to be examined.

Example 15-11. Word Frequency Analysis
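The original script is not reproduced here; a minimal sketch of the same idea (script name hypothetical):

    #!/bin/bash
    # wf.sh: crude word-frequency analysis of a text file.

    file=${1:?Usage: $(basename "$0") filename}

    # Lowercase everything, split into one word per line, then count.
    tr 'A-Z' 'a-z' < "$file" | tr -cs 'a-z' '\n' |
    sort | uniq -c | sort -nr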





expand, unexpand

The expand filter converts tabs to spaces. It is often used in a pipe.

The unexpand filter converts spaces to tabs. This reverses the effect of expand.
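For example (GNU options; file names hypothetical):

    expand -t 4 source.txt > spaced.txt     # Each tab becomes up to 4 spaces.
    unexpand -a spaced.txt > tabbed.txt     # Convert runs of blanks back to tabs.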

cut

A tool for extracting fields from files. It is similar to the print $N command set in awk, but more limited. It may be simpler to use cut in a script than awk. Particularly important are the -d (delimiter) and -f (field specifier) options.

Using cut to obtain a listing of the mounted filesystems:
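One plausible way (field positions assume the usual /etc/mtab layout):

    cut -d ' ' -f1,2 /etc/mtab     # Device and mount point.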



Using cut to list the OS and kernel version:
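For example (the field numbers depend on your system's exact uname -a output):

    uname -a | cut -d ' ' -f1,3     # OS name and kernel release.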



Using cut to extract message headers from an e-mail folder:
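Something along these lines (mailbox file name hypothetical):

    grep '^Subject:' inbox.mbox | cut -c10-80     # Drop the "Subject: " prefix.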



Using cut to parse a file:
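A simple illustration, using the system password file:

    cut -d: -f1,7 /etc/passwd     # User names and their login shells.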



cut -d ' ' -f2,3 filename is equivalent to awk -F'[ ]' '{ print $2, $3 }' filename

 

Note

It is even possible to specify a linefeed as a delimiter. The trick is to actually embed a linefeed (RETURN) in the command sequence.
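It looks something like this (testfile is hypothetical; when typing it, the closing quote must come at the very start of the continuation line, with no leading spaces):

    cut -d'
    ' -f3,7,19 testfile     # Prints lines 3, 7, and 19 of testfile.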



Thank you, Jaka Kranjc, for pointing this out.

See also Example 15-43.

paste

Tool for merging together different files into a single, multi-column file. In combination with cut, useful for creating system log files.
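For instance (file names hypothetical):

    paste list1 list2     # Line N of list1 and line N of list2, joined by a tab.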

join

Consider this a special-purpose cousin of paste. This powerful utility allows merging two files in a meaningful fashion, which essentially creates a simple version of a relational database.

The join command operates on exactly two files, but pastes together only those lines with a common tagged field (usually a numerical label), and writes the result to stdout. The files to be joined should be sorted according to the tagged field for the matchups to work properly.
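A small illustration with made-up data files, both sorted on the join field:

    $ cat 1.data
    100 Shoes
    200 Laces

    $ cat 2.data
    100 $40.00
    200 $1.00

    $ join 1.data 2.data
    100 Shoes $40.00
    200 Laces $1.00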







 

Note

The tagged field appears only once in the output.

head

Lists the beginning of a file — the default is 10 lines, but this can be changed — to stdout. The command has a number of interesting options.

Example 15-12. Which files are scripts?
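Not the original script, but a sketch of the obvious test (a script normally begins with the two characters #!):

    #!/bin/bash
    # For each regular file here, check whether it starts with "#!".
    for file in *; do
        [ -f "$file" ] || continue
        if [ "$(head -c 2 "$file")" = '#!' ]; then
            echo "$file is a script"
        fi
    done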



Example 15-13. Generating 10-digit random numbers
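One plausible approach (a 32-bit unsigned integer has at most 10 digits, sometimes fewer):

    head -c 4 /dev/urandom | od -N 4 -t u4 | awk 'NR==1 { print $2 }'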



See also Example 15-35.

 

tail

Lists the end of a file — the default is 10 lines — to stdout. Commonly used to keep track of changes to a system logfile, using the -f option, which outputs lines appended to the file.

Example 15-14. Using tail to monitor the system log
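The heart of it is a one-liner (the log file path varies by distro, and reading it may require root):

    tail -f /var/log/messages     # Press Control-C to stop.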



 

Tip

To list a specific line of a text file, pipe the output of head to tail -n 1. For example, head -n 8 database.txt | tail -n 1 lists the 8th line of the file database.txt.

To set a variable to a given block of a text file:
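For example, to capture lines 11 through 20 of a (hypothetical) data file in a variable:

    var=$(head -n 20 datafile | tail -n 10)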



 

Note

Newer implementations of tail deprecate the older tail -$LINES filename usage. The standard tail -n $LINES filename is correct.

See also Example 15-5, Example 15-35 and Example 29-6.

grep

A multi-purpose file search tool that uses Regular Expressions. It was originally a command/filter in the venerable ed line editor: g/re/p — global – regular expression – print.

 

grep pattern [file…]

Search the target file(s) for occurrences of pattern, where pattern may be literal text or a Regular Expression.
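For instance (file names hypothetical):

    grep error logfile.txt          # Literal text pattern.
    grep '[Ss]ystem' osinfo.txt     # Regular expression: "System" or "system".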

 



If no target file(s) are specified, grep works as a filter on stdin, as in a pipe.
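For example:

    ps ax | grep clock     # Lists processes whose ps entry contains "clock".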



The -i option causes a case-insensitive search.

The -w option matches only whole words.

The -l option lists only the files in which matches were found, but not the matching lines.

The -r (recursive) option searches files in the current working directory and all subdirectories below it.

The -n option lists the matching lines, together with line numbers.
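For instance, combining a few of these options (file names hypothetical):

    grep -i -n sneakers shoes.txt       # Case-insensitive, with line numbers.
    grep -w -l Linux *.txt              # Whole-word match; file names only.
    grep -r copyright /usr/share/doc    # Recursive search below a directory.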



The -v (or --invert-match) option filters out matches.
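For example:

    grep -v '^#' setup.conf     # Print every line that is not a comment.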



The -c (--count) option gives a numerical count of matches, rather than actually listing the matches.
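For instance:

    grep -c txt *.sgml     # A per-file count of matching lines.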



When invoked with more than one target file, grep specifies which file contains matches.
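A hypothetical run:

    $ grep Linux osinfo.txt misc.txt
    osinfo.txt:This is a file containing information about Linux.
    misc.txt:The Linux operating system is steadily gaining in popularity.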



 

Tip

To force grep to show the filename when searching only one target file, simply give /dev/null as the second file.
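For example:

    grep Linux osinfo.txt /dev/null
    # osinfo.txt:This is a file containing information about Linux.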



If there is a successful match, grep returns an exit status of 0, which makes it useful in a condition test in a script, especially in combination with the -q option to suppress output.
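For example:

    if grep -q Bash file
    then
        echo "File contains at least one occurrence of Bash."
    fi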



Example 29-6 demonstrates how to use grep to search for a word pattern in a system logfile.

Example 15-15. Emulating grep in a script



How can grep search for two (or more) separate patterns? What if you want grep to display all lines in a file or files that contain both "pattern1" and "pattern2"?

One method is to pipe the result of grep pattern1 to grep pattern2.

For example, given the following file:
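(A made-up sample:)

    $ cat tstfile
    This is a sample file.
    This is an ordinary text file.
    This file does not contain any unusual text.
    This file is not unusual.
    Here is some text.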



Now, let’s search this file for lines containing both "file" and "text" . . .
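With the sample file above:

    $ grep file tstfile | grep text
    This is an ordinary text file.
    This file does not contain any unusual text.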



egrep — extended grep — is the same as grep -E. This uses a somewhat different, extended set of Regular Expressions, which can make the search a bit more flexible. It also allows the boolean | (or) operator.
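For example:

    egrep 'matches|Matches' file.txt     # Lines containing "matches" or "Matches".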



fgrep — fast grep — is the same as grep -F. It does a literal string search (no Regular Expressions), which usually speeds things up a bit.

 

Note

On some Linux distros, egrep and fgrep are symbolic links to, or aliases for, grep, but invoked with the -E and -F options, respectively.

Example 15-16. Looking up definitions in Webster’s 1913 Dictionary



agrep (approximate grep) extends the capabilities of grep to approximate matching. The search string may differ by a specified number of characters from the resulting matches. This utility is not part of the core Linux distribution.

 

Tip

To search compressed files, use zgrep, zegrep, or zfgrep. These also work on non-compressed files, though slower than plain grep, egrep, fgrep. They are handy for searching through a mixed set of files, some compressed, some not.

To search bzipped files, use bzgrep.

look

The command look works like grep, but does a lookup on a "dictionary," a sorted word list. By default, look searches for a match in /usr/dict/words, but a different dictionary file may be specified.
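For instance (the dictionary's location varies by distro):

    look necess     # Prints dictionary words beginning with "necess".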

Example 15-17. Checking words in a list for validity



sed, awk

Scripting languages especially suited for parsing text files and command output. May be embedded singly or in combination in pipes and shell scripts.

sed

Non-interactive "stream editor", permits using many ex commands in batch mode. It finds many uses in shell scripts.

awk

Programmable file extractor and formatter, good for manipulating and/or extracting fields (columns) in structured text files. Its syntax is similar to C.

wc

wc gives a "word count" on a file or I/O stream:
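A hypothetical run:

    $ wc myfile.txt
     13  70 447 myfile.txt
    # 13 lines, 70 words, 447 bytes.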



wc -w gives only the word count.

wc -l gives only the line count.

wc -c gives only the byte count.

wc -m gives only the character count.

wc -L gives only the length of the longest line.

Using wc to count how many .txt files are in the current working directory:
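For example:

    ls *.txt | wc -l     # Miscounts if a file name itself contains a newline.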



Using wc to total up the size of all the files whose names begin with letters in the range d – h:
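One way to do it (wc appends a "total" line when given two or more files):

    wc -c [d-h]* | tail -n 1 | awk '{ print $1 }'     # Total bytes.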



Using wc to count the instances of the word "Linux" in the main source file for this book.
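One way, using GNU grep's -o option to print each match on its own line:

    grep -o Linux abs-book.sgml | wc -l     # Occurrences, not just matching lines.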



See also Example 15-35 and Example 19-8.

Certain commands include some of the functionality of wc as options.
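For example, grep can count matching lines by itself:

    grep -c pattern somefile     # Equivalent to: grep pattern somefile | wc -l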



tr – convert text from one form to another

Character translation filter.

 

Caution

Must use quoting and/or brackets, as appropriate. Quotes prevent the shell from reinterpreting the special characters in tr command sequences. Brackets should be quoted to prevent expansion by the shell.

Either tr "A-Z" "*" <filename or tr A-Z \* <filename changes all the uppercase letters in filename to asterisks (writes to stdout). On some systems this may not work, but tr A-Z '[**]' will.

The -d option deletes a range of characters.
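For example:

    echo "abcdef123" | tr -d 0-9     # abcdef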



The --squeeze-repeats (or -s) option deletes all but the first instance of a string of consecutive characters. This option is useful for removing excess whitespace.
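For instance:

    echo "XXXXX" | tr --squeeze-repeats 'X'     # X
    echo "too   many   spaces" | tr -s ' '      # too many spaces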



The -c "complement" option inverts the character set to match. With this option, tr acts only upon those characters not matching the specified set.
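For example:

    echo "acfdeb123" | tr -c b-d +     # +c+d+b++++
    # Every character not in the range b-d, including the
    # trailing newline, becomes a "+".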



Note that tr recognizes POSIX character classes. [1]
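For instance:

    echo "ABCD" | tr '[:upper:]' '[:lower:]'     # abcd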



Example 15-18. toupper: Transforms a file to all uppercase.
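Not the book's script, but its essence (script name hypothetical):

    #!/bin/bash
    # toupper.sh: write a file to stdout in all uppercase.
    tr 'a-z' 'A-Z' < "${1:?Usage: $0 filename}"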



Example 15-19. lowercase: Changes all filenames in working directory to lowercase.
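A sketch of the idea (caution: it does not check for name collisions):

    #!/bin/bash
    # lowercase.sh: rename files in the working directory to lowercase.
    for f in *; do
        new=$(echo "$f" | tr 'A-Z' 'a-z')
        [ "$f" != "$new" ] && mv -- "$f" "$new"
    done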



Example 15-20. du: DOS to UNIX text file conversion.
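The essential operation is stripping carriage returns (file names hypothetical):

    tr -d '\r' < dosfile.txt > unixfile.txt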



Example 15-21. rot13: ultra-weak encryption.
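The core of it is a single tr invocation:

    echo "Shell" | tr 'a-zA-Z' 'n-za-mN-ZA-M'     # Furyy
    # Running the output through the same command recovers the original.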



Example 15-22. Generating "Crypto-Quote" Puzzles



fold

A filter that wraps lines of input to a specified width. This is especially useful with the -s option, which breaks lines at word spaces (see Example 15-23 and Example A-1).
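For example (file name hypothetical):

    fold -s -w 60 longlines.txt     # Wrap at 60 columns, breaking at word spaces.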

fmt

Simple-minded file formatter, used as a filter in a pipe to "wrap" long lines of text output.

Example 15-23. Formatted file listing.
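Not the original listing, but the gist of it:

    ls /usr/bin | fmt -w 60     # Repack the one-per-line listing into ~60-column lines.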



See also Example 15-5.

 

Tip

A powerful alternative to fmt is Adam M. Costello's par utility, available from http://www.cs.berkeley.edu/~amc/Par/.

col

This deceptively named filter removes reverse line feeds from an input stream. It also attempts to replace whitespace with equivalent tabs. The chief use of col is in filtering the output from certain text processing utilities, such as groff and tbl.

column

Column formatter. This filter transforms list-type text output into a "pretty-printed" table by inserting tabs at appropriate places.

Example 15-24. Using column to format a directory listing
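A sketch along the same lines:

    (printf "PERMISSIONS LINKS OWNER GROUP SIZE DATE TIME NAME\n" ; \
     ls -l | sed 1d) | column -t
    # "sed 1d" deletes the "total N" line that ls -l prints first.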



colrm

Column removal filter. This removes columns (characters) from a file and writes the file, lacking the range of specified columns, back to stdout. colrm 2 4 <filename removes the second through fourth characters from each line of the text file filename.

 

Caution

If the file contains tabs or nonprintable characters, this may cause unpredictable behavior. In such cases, consider using expand and unexpand in a pipe preceding colrm.

nl

Line numbering filter: nl filename lists filename to stdout, but inserts consecutive numbers at the beginning of each non-blank line. If filename is omitted, nl operates on stdin.

The output of nl is very similar to cat -b, since, by default, nl does not number blank lines.

Example 15-25. nl: A self-numbering script.
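Not the original script, but the trick it rests on:

    #!/bin/bash
    # self-number.sh: a script that lists itself, with line numbers.
    nl "$0"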



pr

Print formatting filter. This will paginate files (or stdout) into sections suitable for hard copy printing or viewing on screen. Various options permit row and column manipulation, joining lines, setting margins, numbering lines, adding page headers, and merging files, among other things. The pr command combines much of the functionality of nl, paste, fold, column, and expand.

pr -o 5 --width=65 fileZZZ | more gives a nice paginated listing to screen of fileZZZ with margins set at 5 and 65.

A particularly useful option is -d, forcing double-spacing (same effect as sed -G).

gettext

The GNU gettext package is a set of utilities for localizing and translating the text output of programs into foreign languages. While originally intended for C programs, it now supports quite a number of programming and scripting languages.

The gettext program works on shell scripts. See the info page.

msgfmt

A program for generating binary message catalogs. It is used for localization.

iconv

A utility for converting file(s) to a different encoding (character set). Its chief use is for localization.
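For example (file names hypothetical):

    iconv -f UTF-8 -t ISO-8859-1 input.txt > output.txt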



recode

Consider this a fancier version of iconv, above. This very versatile utility converts files to a different encoding scheme. Note that recode is not part of the standard Linux installation.

TeX, gs

TeX and Postscript are text markup languages used for preparing copy for printing or formatted video display.

TeX is Donald Knuth’s elaborate typesetting system. It is often convenient to write a shell script encapsulating all the options and arguments passed to one of these markup languages.

Ghostscript (gs) is a GPL-ed Postscript interpreter.

texexec – TeX to pdf

Utility for processing TeX and pdf files. Found in /usr/bin on many Linux distros, it is actually a shell wrapper that calls Perl to invoke TeX.
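For example (texexec is part of the ConTeXt package; behavior may vary by version):

    texexec article.tex     # Typesets article.tex, producing article.pdf.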



enscript – convert text to ps

Utility for converting a plain text file to PostScript.

For example, enscript filename.txt -p filename.ps produces the PostScript output file filename.ps.

groff, tbl, eqn

Yet another text markup and display formatting language is groff. This is the enhanced GNU version of the venerable UNIX roff/troff display and typesetting package. Manpages use groff.

The tbl table processing utility is considered part of groff, as its function is to convert table markup into groff commands.

The eqn equation processing utility is likewise part of groff, and its function is to convert equation markup into groff commands.

Example 15-26. manview: Viewing formatted manpages
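Not the original script, but its essential line:

    groff -Tascii -man "$1" | less     # Format raw man page source for viewing.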



lex, yacc

The lex lexical analyzer produces programs for pattern matching. This has been replaced by the nonproprietary flex on Linux systems.

The yacc utility creates a parser based on a set of specifications. This has been replaced by the nonproprietary bison on Linux systems.

Notes

[1]

This is only true of the GNU version of tr, not the generic version often found on commercial UNIX systems.