
Saturday, August 20, 2016

Web scraping using lynx and shell utilities

In 2016, many people would probably think of using Python modules such as BeautifulSoup, urllib, or requests for scraping and parsing web pages. While this is a good choice, in some cases it can be quicker to scrape web pages with the text browser lynx and parse the results using grep, awk, and sed.

My use case is as follows: I want to programmatically generate a list of rpm packages from Fedora's EPEL X (5, 6, 7), CentOS vault, CentOS mirror, and HP DL server firmware sites. I want this list to be comparable to the output of rpm -qa on RHEL machines. Here are some sample URLs for sites showing rpm package lists:

http://vault.centos.org/5.7/updates/x86_64/RPMS/
http://mirror.centos.org/centos-5/5.11/os/x86_64/CentOS/
https://dl.fedoraproject.org/pub/epel/6/x86_64/
http://mirror.centos.org/centos-7/7.2.1511/updates/x86_64/Packages/
http://downloads.linux.hpe.com/repo/spp/rhel/6/x86_64/2016.04.0_supspp_rhel6.8_x86_64/

If you visit any of these links you will find that the basic format is the same -- from the left, the first field is an icon, the second field is the rpm filename, the third field is the date in YYYY-MM-DD, the fourth field is time in HH:MM, and the fifth field is file size.

Here is my bash script which parses file list html pages into a simple text file:

#!/bin/bash
# http-rpmlist-parser.sh
# Copyright (C) 2016 Jun Go
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
# Jun Go gojun077@gmail.com
# Last Updated: 2016-08-18
# This script uses lynx to render an html page containing a list
# of rpm filenames and output the raw text without html tags to
# a file. Then the raw text will be parsed using grep, awk, and
# sed to return a list of filenames that can be directly compared
# with the output of the RHEL command 'rpm -qa'
# USAGE: ./http-rpmlist-parser.sh [URL] [output file]
# EXAMPLE:
# ./http-rpmlist-parser.sh \
# http://vault.centos.org/6.6/updates/x86_64/Packages/ \
# cent66-errata-list-clean.txt
F0="lynx-temp0.txt"
F1="lynx-temp1.txt"
F2="lynx-temp2.txt"
F3="lynx-temp3.txt"
TEMP=("${F0}"
      "${F1}"
      "${F2}"
      "${F3}"
     )
########################################
### Function for removing temp files ###
cleanup()
{
  # Quote the array expansion so each filename stays a single word
  for i in "${TEMP[@]}"; do
    if [ -f "$i" ]; then
      rm "$i"
    else
      echo "Cannot find temp file $i"
    fi
  done
}
########################################
if [ -z "$1" ]; then
  echo "Please enter a URL to parse"
  exit 1
elif [ -z "$2" ]; then
  echo "Please specify an output file name"
  exit 1
fi
# Check that lynx is installed on the system
if ! which lynx > /dev/null 2>&1; then
  echo "This script requires lynx. Please install lynx and try again"
  exit 1
fi
# Parse html into tagless text using lynx browser
lynx -dump -dont_wrap_pre -width=990 -nolist "$1" > "${F0}"
# Return lines containing the string '.rpm'
grep -F ".rpm" "${F0}" > "${F1}"
# Replace all tabs with 4 spaces because
# awk will interpret [:space:] as FS
sed "s:\t:    :g" "${F1}" > "${F2}"
# Extract the third field containing the filename
# Note that html pages containing file lists from EPEL, CentOS Vault,
# and HP all use the same format which consists of square brackets,
# package name, date, and file size (optional)
# [ ] fibreutils-3.2-6.x86_64.rpm 07-Jun-20
awk '{ print $3 }' "${F2}" > "${F3}"
# Remove the ".rpm" extension from each filename so that the file
# list is directly comparable to the output of 'rpm -qa'
sed "s:\.rpm::g" "${F3}" > "$2"
# remove temp files
cleanup

You can see that lynx renders the page from HTML into regular text and dumps this output to a file if you pass the -dump option. But this is not enough, because by default lynx inserts a newline character into lines longer than 79 characters. To avoid this problem, you must manually set the line width to something larger. The maximum width in lynx is 990 characters, so I specified this value through the option -width=990. Finally, the -nolist option removes the list of links that lynx inserts at the bottom of the page.

Using grep I then extract just the lines containing the string ".rpm". Next I replace all tabs with 4 spaces using sed and then use awk to print just the filename field. Finally I use sed to remove the ".rpm" extension from the filenames to make the output identical to the format of rpm -qa. Note that the last sed statement might not render correctly in your browser because I use mathjax on my blog, and the backslash-escaped parentheses I use for grouping are also the delimiters of a mathjax expression. The sed snippet should appear as follows:

sed "s:\(\.rpm\)::g" "${F3}" > "$2"

The group is never referenced in the replacement, so the form without parentheses, s:\.rpm::g, behaves identically.
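The grep, awk, and sed steps can also be chained into a single pipeline with no temporary files. The following is only a minimal sketch: the sample lines are made up in the format shown in the script's comments (they are not real lynx output), and the function name parse_rpm_list is hypothetical. The tab-replacement step is left out here on the assumption that awk's default field splitting, which treats runs of spaces and tabs alike, is sufficient.

```shell
#!/bin/bash
# Hypothetical sample of 'lynx -dump' output; filenames, dates, and sizes
# are invented for illustration only.
sample=$(printf '%s\n' \
  'Index of /repo' \
  '[ ] fibreutils-3.2-6.x86_64.rpm 07-Jun-2016 10:15 120K' \
  '[ ] hp-health-10.40-1777.17.rhel6.x86_64.rpm 07-Jun-2016 10:15 1.2M' \
  '[ ] README.txt 07-Jun-2016 10:15 1K')

# Keep lines containing the literal string '.rpm', print the third
# whitespace-separated field ('[' and ']' are fields 1 and 2), then
# strip the trailing '.rpm' extension.
parse_rpm_list() {
  grep -F '.rpm' | awk '{ print $3 }' | sed 's:\.rpm$::'
}

result=$(printf '%s\n' "$sample" | parse_rpm_list)
printf '%s\n' "$result"
```

Against a live page, the same function could be fed directly from lynx, for example lynx -dump -dont_wrap_pre -width=990 -nolist "$1" | parse_rpm_list > "$2".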

If you don't escape the '.' in ".rpm" with a backslash, it will be interpreted as the regex "match any character", so the pattern would match not only ".rpm" but also strings like "-rpm" (as in "redhat-rpm-config"). This is undesirable.
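To make the difference concrete, here is a small sketch that runs both forms of the substitution on one filename. The filename is a made-up example chosen because it contains "-rpm" in the package name.

```shell
#!/bin/bash
# Hypothetical sample filename, used only for illustration.
name='redhat-rpm-config-9.0.3-42.el6.noarch.rpm'

# Unescaped dot: '.rpm' also matches '-rpm', so part of the package
# name is removed along with the extension.
unescaped=$(printf '%s\n' "$name" | sed 's:.rpm::g')

# Escaped dot: only the literal '.rpm' is removed.
escaped=$(printf '%s\n' "$name" | sed 's:\.rpm::g')

printf 'unescaped: %s\n' "$unescaped"   # redhat-config-9.0.3-42.el6.noarch
printf 'escaped:   %s\n' "$escaped"     # redhat-rpm-config-9.0.3-42.el6.noarch
```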

BTW, this script is for informational and educational purposes only. It would actually be easier to invoke lynx with lynx -dump -listonly ... and skip the data munging steps such as replacing tabs with spaces using sed. If you do it this way, you will get just the links to rpm files from EPEL, CentOS mirror, etc. Then you can extract just the filename from each link's path with awk:

awk -F'/' '{ print $NF }'
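A minimal sketch of that shorter approach follows. The link list is a hypothetical stand-in for the output of lynx -dump -listonly (the package filenames are invented), and it assumes the list may also contain non-rpm links such as sort links or parent directories, which is why a grep filter is added before awk.

```shell
#!/bin/bash
# Hypothetical stand-in for 'lynx -dump -listonly URL' output;
# the package filenames are invented for illustration.
links=$(printf '%s\n' \
  'References' \
  '' \
  '   1. http://vault.centos.org/6.6/updates/x86_64/Packages/?C=N;O=D' \
  '   2. http://vault.centos.org/6.6/updates/x86_64/' \
  '   3. http://vault.centos.org/6.6/updates/x86_64/Packages/bash-4.1.2-29.el6.x86_64.rpm' \
  '   4. http://vault.centos.org/6.6/updates/x86_64/Packages/curl-7.19.7-40.el6_6.4.x86_64.rpm')

# Keep only links ending in '.rpm', take the last '/'-separated field
# (the filename), and strip the extension to match 'rpm -qa' output.
names=$(printf '%s\n' "$links" \
  | grep '\.rpm$' \
  | awk -F'/' '{ print $NF }' \
  | sed 's:\.rpm$::')

printf '%s\n' "$names"
```

With lynx installed, the printf line would be replaced by lynx -dump -listonly followed by the URL of the package directory.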



