Sed, Grep, Awk

Sed, Grep, and Awk are tools widely used by Linux experts to extract data from text files, the output of other Linux commands, web pages, and so on. We are going to use them to retrieve useful information from a web page.

First, we will use sed to convert the contents of an HTML page into an ASCII file. The Internet Movie Database (www.imdb.com) contains a list of the top 250 movies of all time in rank order. We will write a pipeline of sed commands to convert the raw HTML into a simpler, pipe-delimited ASCII text file. (To see what the raw HTML looks like, open http://www.imdb.com/chart/top and choose "View Source" in your web browser.) The ASCII file we create will have the following format:

Rank|Rating|Title|Year|Votes
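
To make the layout concrete, here is a minimal sketch of splitting one record of this format into named fields in the shell. The record and its values are made up for illustration; they are not taken from the real chart.

```shell
# Hypothetical record in the Rank|Rating|Title|Year|Votes layout (values are made up).
record='1|9.2|The Shawshank Redemption|1994|612345'

# Split the record on "|" into five named shell variables.
IFS='|' read -r rank rating title year votes <<EOF
$record
EOF

# Build a short summary from the separate fields.
summary="$title ($year) is ranked $rank with $votes votes"
printf '%s\n' "$summary"
```

Because "|" rarely appears in a title, it is a convenient delimiter for cut and awk later on.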

While there are many ways to strip the unnecessary HTML content from the raw HTML file, below is an outline of the steps that my sed script applies.

First I downloaded the Top 250 page with the Unix GET command and piped its output into my sed commands. For convenience I saved all the sed commands in a file named HtmltoAscii.sed; its content is shown below, with a comment explaining what each command does. The following pipeline downloads the page at http://www.imdb.com/chart/top, parses it with sed into the ASCII format described above, and saves the result in a file named part1.dat.

----------------------------------------------------------------------------
GET -n http://www.imdb.com/chart/top | sed -f HtmltoAscii.sed > part1.dat
----------------------------------------------------------------------------


HtmltoAscii.sed file content:
----------------------------------------------------------------------------
#delete all lines except the one containing "Top 250 movies"
/Top 250 movies/!d

#remove everything up to first "</b></font></td></tr>"
s/^.*<\/b><\/font><\/td><\/tr>//

#remove everything after the last "</tr>"
s/\(.*\)<\/tr>.*$/\1 /

#replace all "</tr>" with "#"
s/<\/tr>/#/g

#replace all html tags with a space
s/<[^>]*>/ /g

#replace two or more spaces with "|"
s/  [ ]*/|/g

#remove the trailing dots of all the rank fields
s/\([1-9][0-9]*\)\.|/\1|/g

#remove the parentheses in all the year fields
s/(\([1-9][0-9][0-9][0-9]\))/\1/g

#remove any commas in the votes fields (loop, so that a value such as 1,234,567 loses every comma)
:a
s/\([0-9]\),\([0-9][0-9][0-9]\)/\1\2/
ta

#delete the first and last "|" delimiter
s/^|\(.*\)|$/\1/

#replace each occurrence of the string |#| with a newline character
s/|#|/\n/g

#replace each occurrence of &#x27; with the single quote "'" (the trailing ";" is matched if present)
s/&#x27;*/'/g

#replace all of the form &#xHH;, where HH is hex number, with an "*"
s/&#x[0-9A-F][0-9A-F];*/*/g
----------------------------------------------------------------------------
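
The core substitutions can be tried on a small input before pointing them at the real page. The sketch below runs a subset of the steps on a made-up, two-row HTML fragment; the markup and the vote counts are assumptions for illustration, and the real IMDb source will differ. The final newline step uses tr instead of "\n" in the sed replacement, since the latter is a GNU sed extension.

```shell
# Hypothetical one-line HTML fragment (invented markup and vote counts).
html='<tr><td>1.</td><td>9.2</td><td><a href="/t1">The Shawshank Redemption</a> (1994)</td><td><font>612,345</font></td></tr><tr><td>2.</td><td>9.1</td><td><a href="/t2">The Godfather</a> (1972)</td><td><font>480,112</font></td></tr>'

ascii=$(printf '%s\n' "$html" | sed \
  -e 's/\(.*\)<\/tr>.*$/\1 /' \
  -e 's/<\/tr>/#/g' \
  -e 's/<[^>]*>/ /g' \
  -e 's/  [ ]*/|/g' \
  -e 's/\([1-9][0-9]*\)\.|/\1|/g' \
  -e 's/(\([1-9][0-9][0-9][0-9]\))/\1/g' \
  -e 's/\([1-9][0-9]*\),\([0-9]*\)/\1\2/g' \
  -e 's/^|\(.*\)|$/\1/' \
  -e 's/|#|/#/g' | tr '#' '\n')   # "#" marks the row boundaries; tr turns them into newlines

printf '%s\n' "$ascii"
```

Note that the "#" row marker only ends up as "|#|" when at least two tags precede each "</tr>", which is why the sample wraps the votes in an extra "<font>" element.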

Now our ASCII file is ready, in a format we can easily query. Let's answer the following questions using grep and other basic Unix commands, but not awk or sed.

1. List the titles of all the 2011 movies in the top 100.
     head -n 100 part1.dat | grep '|2011|' | cut -d "|" -f 3
2. Print the number of movies that use the same word twice in the title.
      cut -d "|" -f 3 < part1.dat | grep -cE '(^| )([^ ]+) (.* )?\2( |$)'
3. Print the rank of each movie that contains a non-alphabetic character in its title (excluding spaces).  
   cut -d "|" -f 1,3 < part1.dat | grep '[1-9][0-9]*|.*[^a-zA-Z ].*' | cut -d "|" -f 1
4. Print the number of movies with fewer than 50000 votes.
      cut -d "|" -f 5 < part1.dat | grep -c '^[1-9][0-9]\{0,3\}$\|^[1-4][0-9]\{4,4\}$'
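
As a quick check, queries like these can be run against a handful of sample rows. The following is a sketch only: the rows (including the placeholder title "Example Film" and all vote counts) are made up, and the file is a temporary stand-in for part1.dat.

```shell
# Temporary stand-in for part1.dat with made-up rows (not real IMDb figures).
data=$(mktemp)
cat > "$data" <<'EOF'
1|9.2|The Shawshank Redemption|1994|612345
2|9.1|The Godfather|1972|480112
3|8.9|12 Angry Men|1957|48210
4|8.8|Example Film|2011|9500
5|8.7|WALL-E|2008|250000
EOF

# Titles of the 2011 movies among the first 100 rows.
titles_2011=$(head -n 100 "$data" | grep '|2011|' | cut -d '|' -f 3)

# Ranks of movies whose title has a character other than a letter or a space.
ranks=$(cut -d '|' -f 1,3 < "$data" | grep '[1-9][0-9]*|.*[^a-zA-Z ].*' | cut -d '|' -f 1)

# Number of movies with fewer than 50000 votes (1-9999 or 10000-49999).
few_votes=$(cut -d '|' -f 5 < "$data" | grep -c '^[1-9][0-9]\{0,3\}$\|^[1-4][0-9]\{4,4\}$')

printf '%s\n' "$titles_2011" "$ranks" "$few_votes"
rm -f "$data"
```

Keeping the pipe characters in the '|2011|' pattern stops the year from matching inside a votes or title field by accident.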

Now let's write awk scripts to do the following tasks with the movie database. We will use a single awk script per question, and no other tools.

1- Print the total number of votes across all movies.
      awk -F"|" '{ tot += $5 }END{ print tot }' part1.dat 
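2- Print the year that had the greatest number of votes (summed over all of that year's movies).
      awk -F"|" '{ votes[$4] += $5 } END{ for (year in votes) { if (votes[year] > greatest) { greatest = votes[year]; best = year } }; print best }' part1.dat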
3- Print the average number of votes for movies above an 8.5 rating and the average number of votes for movies below an 8.5 rating on a single line.
      awk -F"|" '{ if($2<8.5){nBelow++; BelowSum+=$5}else if($2>8.5){nAbove++; AboveSum+=$5}} 
      END{ printf("%i %i\n",AboveSum/nAbove,BelowSum/nBelow) }' part1.dat
4- Print the average number of words in each title.  A word is any string of non-whitespace characters.
     awk -F"|" '{ wordCount += split ($3,wordsArray," ")} END{printf("%.3f\n", 
   wordCount/NR)}' part1.dat 
5- Print the most commonly used word in titles besides "The" and "the".
     awk -F"|" '{ split($3, wordsArr, " "); for (i in wordsArr) {
     wordCountArr[wordsArr[i]]++ } }
     END{ for (word in wordCountArr) { if (wordCountArr[word] > maxCount &&
     word != "The" && word != "the") { maxCount = wordCountArr[word]; winner = word } };
     printf("%s\n", winner) }' part1.dat
6- Print the movies with the longest and shortest titles on two lines.
     awk -F"|" 'BEGIN {minLength=10000} { titleLen=length($3); if(titleLen<=minLength)
    {minLength=titleLen;minTitle=$3}if(titleLen>=maxLength){maxLength=titleLen;maxTitle=$3}}
     END{ printf("%s\n",maxTitle);printf("%s\n",minTitle)}' part1.dat
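
The same kind of dry run works for the awk scripts. This sketch uses a temporary file with made-up rows (the placeholder title "Example Film" and all vote counts are invented) and computes three of the answers: the total votes, the year with the most votes when votes are summed per year, and the average number of words per title.

```shell
# Temporary stand-in for part1.dat with made-up rows (not real IMDb figures).
data=$(mktemp)
cat > "$data" <<'EOF'
1|9.2|The Shawshank Redemption|1994|612345
2|9.1|The Godfather|1972|480112
3|8.9|Pulp Fiction|1994|450000
4|8.8|Example Film|2011|9500
EOF

# Total votes across all rows.
total=$(awk -F'|' '{ tot += $5 } END { print tot }' "$data")

# Year with the greatest vote total, summing the votes of all movies in that year.
best_year=$(awk -F'|' '
  { votes[$4] += $5 }
  END {
    for (y in votes)
      if (votes[y] > max) { max = votes[y]; best = y }
    print best
  }' "$data")

# Average number of whitespace-separated words per title.
avg_words=$(awk -F'|' '{ words += split($3, parts, " ") }
  END { printf("%.3f\n", words / NR) }' "$data")

printf '%s\n' "$total" "$best_year" "$avg_words"
rm -f "$data"
```

Summing into an array keyed by year is what lets 1994 win here, even though neither of its two movies would need to have the single highest vote count.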


 
