Simple Text Processing


Count Number of Files in Each Subdirectory

START=$HOME
# use the directory passed on the command line, if any;
# otherwise use the home directory
[ $# -eq 1 ] && START=$1 || :
if [ ! -d "$START" ]
then
    echo "$START not a directory!"
    exit 1
fi

# use the find command to collect all subdirectory names in the DIRS variable
DIRS=$(find "$START" -type d)

# loop through each directory to count the entries in it
# (ls -A avoids counting the "total" line that ls -l prints;
#  word splitting on $DIRS breaks directory names that contain spaces)
for d in $DIRS
do
   [ "$d" != "." ] && [ "$d" != ".." ] && echo "$d directory has $(ls -A "$d" | wc -l) files" || :
done
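
The loop above splits the output of find on whitespace, so it miscounts directories whose names contain spaces. A minimal sketch of one way around that, assuming a find that supports -maxdepth (GNU and BSD find do); count_files_per_dir is a hypothetical helper name, not part of the original script:

```shell
#!/bin/sh
# Hypothetical helper: print "<dir> directory has <n> files" for every
# directory under the start directory (default: $HOME).
count_files_per_dir() {
    start=${1:-$HOME}
    if [ ! -d "$start" ]; then
        echo "$start not a directory!" >&2
        return 1
    fi
    # find hands each directory to the inner shell as a separate argument,
    # so names containing spaces stay intact
    find "$start" -type d -exec sh -c '
        for d in "$@"; do
            # count regular files directly inside this directory only
            n=$(find "$d" -maxdepth 1 -type f | wc -l | tr -d " ")
            echo "$d directory has $n files"
        done
    ' sh {} +
}
```

Usage: count_files_per_dir /some/dir. Unlike ls, this counts regular files only, so subdirectories are not included in the count; file names containing newlines would still be miscounted.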

Count the number of lines in a file or a set of files

Count lines in one file, ignoring blank lines:

$ sed '/^[[:space:]]*$/d' myfile.txt | wc -l

Count lines in a set of files (in a given directory), ignoring blank lines:

$ cat mydir/* | sed '/^[[:space:]]*$/d' | wc -l
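
The same count can be obtained with grep alone, by counting lines that contain at least one non-whitespace character. A small sketch; count_nonblank is a hypothetical helper name:

```shell
# Hypothetical helper: count non-blank lines across all files given as arguments.
count_nonblank() {
    # cat joins the files so grep reports one total instead of per-file counts;
    # -c prints the number of matching lines
    cat -- "$@" | grep -c '[^[:space:]]'
}
```

Usage: count_nonblank myfile.txt, or count_nonblank mydir/* for a whole directory.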

 

Count number of words in a file

Count the number of words in each file in the current directory (wc also prints a total):
$ wc -w *

Count the number of words in a given file:
$ wc -w input.txt

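Count open files for a user

Replace user with the user name. Note that grep matches the name anywhere in the line, so unrelated lines that contain the same text are counted too.

/usr/sbin/lsof | grep 'user' | awk '{print $NF}' | sort | wc -l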

 

Convert a text file to all uppercase/lowercase

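Using dd (conv=lcase converts to lowercase; conv=ucase converts to uppercase):
$ dd if=input.txt of=output.txt conv=lcase

Using tr (swap the two character classes to convert to uppercase):
$ tr '[:upper:]' '[:lower:]' < input.txt > output.txt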
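
Both commands above convert to lowercase. For the uppercase direction, a short sketch with tr; to_upper and to_lower are hypothetical helper names:

```shell
# Hypothetical helpers: print a file converted to upper or lower case.
to_upper() { tr '[:lower:]' '[:upper:]' < "$1"; }
to_lower() { tr '[:upper:]' '[:lower:]' < "$1"; }
```

Usage: to_upper input.txt > output.txt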

Compress entire directory with files

Compress:

$ tar -zcvf archive-name.tar.gz directory-name

Uncompress:

$ tar -zxvf archive-name.tar.gz

-z: Compress or decompress the archive using gzip
-c: Create archive
-x: Extract archive
-v: Verbose, i.e. display progress while processing the archive
-f: Archive file name
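
Two further options are often useful alongside the ones listed: -t lists the contents of an archive without extracting it, and -C switches to a directory before archiving or extracting. A sketch of a full round trip; the temporary work directory and the restore directory are assumptions made for the demonstration:

```shell
# Set up a throwaway directory tree to archive (demo data only).
workdir=$(mktemp -d)
mkdir -p "$workdir/directory-name" "$workdir/restore"
echo "hello" > "$workdir/directory-name/a.txt"

# -C changes into workdir first, so the archive stores relative paths
tar -zcf "$workdir/archive-name.tar.gz" -C "$workdir" directory-name

# -t lists the archive contents without extracting anything
tar -ztf "$workdir/archive-name.tar.gz"

# extract into a separate directory instead of the current one
tar -zxf "$workdir/archive-name.tar.gz" -C "$workdir/restore"
```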

 

Delete line feeds and carriage returns in a text/HTML file

tr -d '\n\r' < my_text > test.txt
 

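Remove all HTML tags from a file

The command below replaces each tag with a space and also drops empty lines. Tags that span several lines are not removed.

sed -n '/^$/!{s/<[^>]*>/ /g;p;}'  in.txt > out_nohtml.txt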

Export a MySQL dump

There is no space between -p and the password:

mysqldump -u username -ppassword database_name > dump.sql

Remove empty lines in a file

Remove empty lines in all files in a directory (replace mydir with the directory name):
sed -i -e '/^$/d' mydir/*

Remove empty lines in a specific file:
sed -i -e '/^$/d' filename.txt
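
The pattern /^$/ matches only completely empty lines, and -i overwrites the file in place. A sketch of a variant that also removes whitespace-only lines and keeps a backup copy; strip_blank_lines is a hypothetical helper name, and the .bak suffix is an arbitrary choice:

```shell
# Hypothetical helper: delete empty and whitespace-only lines in place,
# saving the original of each file as <name>.bak.
strip_blank_lines() {
    # -i.bak (suffix attached to -i) is accepted by both GNU and BSD sed
    sed -i.bak -e '/^[[:space:]]*$/d' "$@"
}
```

Usage: strip_blank_lines filename.txt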

Select specific lines from a file

grep -E "your selection pattern: regex or text" in.txt > out.txt
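
A few grep options that commonly accompany -E: -i ignores case and -v inverts the selection. A small sketch on made-up log lines; the sample data and the helper names select_problems and select_others are assumptions for illustration:

```shell
# Demo data: four made-up log lines in a temporary file.
sample=$(mktemp)
printf 'error: disk full\nWarning: low memory\ninfo: started\nERROR: timeout\n' > "$sample"

# Hypothetical helper: lines containing "error" or "warning", in any letter case.
select_problems() { grep -E -i 'error|warning' "$1"; }

# Hypothetical helper: all remaining lines (-v inverts the match).
select_others() { grep -E -i -v 'error|warning' "$1"; }

select_problems "$sample"
select_others "$sample"
```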

 

Sort the contents of a text file

Descending order:
$ sort -r in.txt > out.txt

Ascending order:
$ sort in.txt > out.txt
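
By default sort compares lines as text, which gives surprising results for numbers. A short sketch of the difference, using made-up demo data; -n sorts numerically and -u drops duplicate lines:

```shell
# Demo data: a few numbers, one per line.
data=$(mktemp)
printf '10\n9\n100\n9\n' > "$data"

sort "$data"        # text order:    10, 100, 9, 9
sort -n "$data"     # numeric order: 9, 9, 10, 100
sort -n -u "$data"  # numeric order without duplicates: 9, 10, 100
```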
 

Generate counts of words in descending order of term frequency

1. First convert all capital letters to lowercase.
$ tr '[A-Z]' '[a-z]' < my_text_file > my_text_file.lowercase

2. Split the words on each line so that each line has only one word.
$ awk '{for (i=1;i<=NF;i++) print $i;}' my_text_file.lowercase > my_text_file.onewordperline

3. Sort all the words and then count the number of occurrences of each word.
$ sort my_text_file.onewordperline | uniq -c > my_text_file.count

4. Sort the words in descending order of counts so you see the high-frequency words.
$ sort -rn -k1 my_text_file.count > my_text_file.countsorted

All steps above combined:
$ tr '[A-Z]' '[a-z]' < my_text_file | awk '{for (i=1;i<=NF;i++) print $i;}' | sort | uniq -c | sort -rn -k1 > my_text_file.countsorted
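
The pipeline above splits on whitespace only, so "cat" and "cat." are counted as different words. A sketch of a variant that also strips punctuation by turning every run of non-letters into a line break; word_freq is a hypothetical helper name, and treating only letters as word characters is an assumption:

```shell
# Hypothetical helper: print word counts for a file, most frequent first.
word_freq() {
    # -c complements the set (everything that is not a letter),
    # -s squeezes each run of replacements into a single newline
    tr -cs '[:alpha:]' '\n' < "$1" |
        tr '[:upper:]' '[:lower:]' |
        sed '/^$/d' |
        sort | uniq -c | sort -rn
}
```

Usage: word_freq my_text_file > my_text_file.countsorted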
 

 

Move files from one extension to another

$ for f in *.unwanted-ext; do mv "$f" "`basename "$f" .unwanted-ext`.wanted-ext"; done;

Example: move from .txt to .properties
$ for f in *.txt; do mv "$f" "`basename "$f" .txt`.properties"; done;
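
The same rename can be written without basename by using the shell's ${f%suffix} expansion, which removes a trailing suffix from a variable. A sketch; change_ext is a hypothetical helper name:

```shell
# Hypothetical helper: change_ext OLD NEW
# Renames every *.OLD file in the current directory to *.NEW
# (extensions are given without the leading dot).
change_ext() {
    old=$1
    new=$2
    for f in *."$old"; do
        # when nothing matches, the pattern stays literal; skip it
        [ -e "$f" ] || continue
        # ${f%."$old"} is the file name with the trailing .OLD removed
        mv -- "$f" "${f%."$old"}.$new"
    done
}
```

Usage: change_ext txt properties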

Import a MySQL dump

$ mysql -u root -p -h <hostname> <db_name> < <dump_file>

Example:
$ mysql -u root -p -h localhost my_db < dump.txt

 

SSH Tunnel to connect to a remote MySQL Server

Simply open a bash prompt on Cygwin and type:

ssh -N -L 5000:localhost:3306 your-server &

For example:

ssh -N -L 5000:0.0.0.0:3306 user@tim1.cs.uiuc.edu

Replace your-server with the host name of the MySQL server. You can then connect to the MySQL database at 127.0.0.1:5000 with a mysql client.

 

References

CS 410 Text Information Systems