Using grep, curl, and tail to scrape data from a Web page

Posted on: Sunday, Feb 04, 2018

Linux has cool utilities!

Another post on this blog provides a Bash script that automates the installation of the most recent version of Firefox Developer Edition (FFDE). The original version of that script required the manual input of FFDE's latest version number. Looking up that number was a hassle, to say the least, and added a lot of friction to a process that should be simple and fast.

Rather than require you to look up the most recent version number and then provide that value as an argument to the Bash script, the script now uses three of the really handy utilities that lurk within Linux: curl, grep, and tail work together to fetch the most recent version number from the FFDE downloads page. This post goes into the detail of how that script uses these utilities to get the latest FFDE version number. With these in place, running that script is now quite simple.

You can read more about curl, grep, and tail here:

  • curl - transfer the contents of a URL
  • grep - find lines matching a pattern
  • tail - output the last part of files

Scraping data from a Web page

Mozilla provides a "releases" download page that lists the available versions of FFDE. The most recent version number is the last one in the list. Visit the releases page to see it. There isn't much to it; it's mostly just a list of version numbers.

Follow along by copying each command below and pasting it into a terminal session to run it.

In an open terminal session, pull down the FFDE releases page's HTML with curl:

curl https://download-installer.cdn.mozilla.net/pub/devedition/releases/

Use curl to see the release page's HTML.

This script needs that HTML in a text file, so it uses curl's -o flag to specify an output file:

curl -o releases.txt https://download-installer.cdn.mozilla.net/pub/devedition/releases/

Use curl to save the release page's HTML in the releases.txt file.

With the releases.txt file available, we'll run grep against that file to extract the version numbers from it. To do so, grep uses a simple regular expression that matches an FFDE version number (59.0b6, for example), where [0-9] matches a single digit, \. matches a literal period (unescaped, the . matches any character in a regular expression), and [a-z] matches a single lowercase letter from a to z. The -o flag tells grep to print only the matched text rather than the whole line.

grep -o '[0-9][0-9]\.[0-9][a-z][0-9]' releases.txt    

Use grep to see the version numbers in a terminal.
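If you want to see what the pattern picks out without hitting the network, here is a small sketch that runs the same grep against a made-up HTML fragment. The sample markup is an assumption, only loosely modeled on a directory listing, so the real page may look different. It also suggests one plausible reason each number shows up twice: once in the link's href and once in the link text.

```shell
# Hypothetical stand-in for the releases page; the real markup may differ.
sample='<a href="/pub/devedition/releases/59.0b5/">59.0b5/</a>
<a href="/pub/devedition/releases/59.0b6/">59.0b6/</a>'

# -o prints only the matched text, one match per output line.
matches=$(printf '%s\n' "$sample" | grep -o '[0-9][0-9]\.[0-9][a-z][0-9]')

# Each sample line contains the version twice, so four lines are printed.
printf '%s\n' "$matches"
```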

This list contains all of the version numbers (each one appears twice), but we only need the last number in the list (the most recent version). To get it, pipe grep's output into the tail command.

grep -o '[0-9][0-9]\.[0-9][a-z][0-9]' releases.txt | tail -1

Pipe grep's output into the tail command to get the last number (the last line of grep's output).
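Two things are worth knowing about this approach. First, tail only returns the newest version if the page lists versions in ascending order. Second, the pattern matches exactly two digits, a period, one digit, a letter, and one digit, so a version such as 59.0b10 would come back as 59.0b1. The following is a more defensive sketch, not what the script does. It assumes GNU grep and GNU sort (for the -V version-sort flag), and the sample HTML is made up.

```shell
# Hypothetical listing, deliberately out of order and with a two-digit beta.
sample='<a href="59.0b10/">59.0b10/</a>
<a href="59.0b2/">59.0b2/</a>
<a href="59.0b9/">59.0b9/</a>'

# -E enables extended regular expressions, so + means "one or more".
# sort -u drops duplicates; -V compares the numeric parts as numbers.
latest=$(printf '%s\n' "$sample" \
    | grep -oE '[0-9]+\.[0-9]+[a-z][0-9]+' \
    | sort -u -V \
    | tail -n 1)

echo "$latest"
```

With this sample, the version sort places 59.0b10 after 59.0b9, so the order of the listing no longer matters.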

The last bit of this step is to get the most recent version number into a Bash variable for use in a Bash script. This is done with Bash's command substitution operator, $(...).

VERSION=$(grep -o '[0-9][0-9]\.[0-9][a-z][0-9]' releases.txt | tail -1) && echo "$VERSION"

Capture the most recent version number in a Bash variable and show it.
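Command substitution runs the command inside $(...) and replaces it with that command's output, minus any trailing newlines. Here is a minimal sketch of capturing a value and guarding against an empty result. The printf is only a stand-in for the grep and tail pipeline, and the value it emits is made up.

```shell
# Stand-in for the grep | tail pipeline; emits a fixed, hypothetical value.
VERSION=$(printf '59.0b6\n')

# Quoting "$VERSION" prevents word splitting; -z tests for an empty string.
if [ -z "$VERSION" ]; then
    echo "No version number found" >&2
else
    echo "Latest version: $VERSION"
fi
```

A check like this is useful in a script because grep prints nothing when the pattern doesn't match, which would leave VERSION empty.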

While that was a long explanation, it distills down to three lines (including a line to delete the releases.txt file).

curl -o releases.txt https://download-installer.cdn.mozilla.net/pub/devedition/releases/
VERSION=$(grep -o '[0-9][0-9]\.[0-9][a-z][0-9]' releases.txt | tail -1)
rm releases.txt
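If you'd rather skip the temporary file, the same steps can be chained into a single pipeline. The sketch below wraps the grep and tail steps in a hypothetical helper function (latest_ffde_version is a name invented here, not part of the script) that reads HTML on standard input. That makes it easy to try offline with made-up HTML.

```shell
# Hypothetical helper: reads HTML on stdin, prints the last version-like match.
latest_ffde_version() {
    grep -o '[0-9][0-9]\.[0-9][a-z][0-9]' | tail -n 1
}

# Offline check, with made-up HTML standing in for the real releases page.
html='<a href="59.0b5/">59.0b5/</a>
<a href="59.0b6/">59.0b6/</a>'

VERSION=$(printf '%s\n' "$html" | latest_ffde_version)
echo "$VERSION"
```

Against the live page, the call would be VERSION=$(curl -s https://download-installer.cdn.mozilla.net/pub/devedition/releases/ | latest_ffde_version), where curl's -s flag silences the progress meter.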

The general technique here of pulling down a page with curl and then parsing it with grep, tail, and whatever other Linux utilities you need is very handy. Please let me know in the comments what tasks you're using Linux utilities for.





© Copyright 2017 by Roger Pence. All rights reserved.