Jun's Pocket Plane: Web scraping using lynx and shell utilities

Saturday, August 20, 2016

In 2016, many people would probably reach for Python modules such as BeautifulSoup, urllib, or requests to scrape and parse web pages. While this is a good choice, in some cases it can be quicker to scrape web pages with the text browser lynx and parse the results with grep, awk, and sed.

My use case is as follows: I want to programmatically generate a list of rpm packages from Fedora's EPEL X (5, 6, 7), CentOS vault, CentOS mirror, and HP DL server firmware sites. I want this list to be comparable to the output of rpm -qa on RHEL machines. Here are some sample URLs for sites showing rpm package lists:

http://vault.centos.org/5.7/updates/x86_64/RPMS/
http://mirror.centos.org/centos-5/5.11/os/x86_64/CentOS/
https://dl.fedoraproject.org/pub/epel/6/x86_64/
http://mirror.centos.org/centos-7/7.2.1511/updates/x86_64/Packages/
http://downloads.linux.hpe.com/repo/spp/rhel/6/x86_64/2016.04.0_supspp_rhel6.8_x86_64/

If you visit any of these links you will find that the basic format is the same -- from the left, the first field is an icon, the second field is the rpm filename, the third field is the date in YYYY-MM-DD, the fourth field is time in HH:MM, and the fifth field is file size.
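To make that layout concrete, here is a minimal sketch that splits one rendered row into its parts. The sample row is made up for illustration and `parse_row` is a hypothetical helper name. Because the icon's alt text can itself contain spaces, the sketch locates the filename by its .rpm suffix instead of assuming a fixed field position, and it assumes the date, time, and size follow the filename directly.

```shell
#!/usr/bin/env bash
# parse_row: read listing rows on stdin and print "filename date time size".
# Assumption: date, time and size are the three fields right after the filename.
parse_row() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /\.rpm$/) {
                print $i, $(i + 1), $(i + 2), $(i + 3)
                break
            }
        }
    }'
}

# Made-up sample row in the shape of a rendered listing line
printf '%s\n' '[   ] zlib-1.2.3-4.el5.x86_64.rpm  2011-09-13 12:00  52K' | parse_row
# prints: zlib-1.2.3-4.el5.x86_64.rpm 2011-09-13 12:00 52K
```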

Here is my bash script, which parses file-list HTML pages into a simple text file:

You can see that lynx renders the page from HTML into plain text and dumps the result if you pass the -dump option, so the output can be redirected to a file. But this is not enough, because by default lynx wraps lines longer than 79 characters by inserting a newline. To avoid this problem, you must manually set the line width to something larger. The maximum width in lynx is 990 characters, so I specified this value through the option -width=990. Finally, the -nolist option removes the list of links that lynx otherwise appends at the bottom of the page.
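The lynx invocation described above might look roughly like the following. This is a sketch under assumptions, not the original script: `dump_listing` is a hypothetical helper name, and only the three options named in the text are used.

```shell
#!/usr/bin/env bash
# dump_listing: render a directory-listing page to a plain-text file.
#   $1 = URL of the listing page
#   $2 = output file
dump_listing() {
    # -dump       render the page as plain text on standard output
    # -width=990  wide enough that long rows are not wrapped
    # -nolist     omit the list of links appended to the dump
    lynx -dump -width=990 -nolist "$1" > "$2"
}
```

For example, `dump_listing "http://vault.centos.org/5.7/updates/x86_64/RPMS/" listing.txt` would write the rendered page to listing.txt, which the later grep, sed, and awk steps can then filter.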

Using grep, I then extract just the lines containing the string ".rpm". Next I replace all tabs with 4 spaces using sed and then use awk to print just the filename field. Finally I use sed to remove the ".rpm" extension from the filenames so that the output matches the format of rpm -qa. Note that the last sed statement might not render correctly in your browser because I use MathJax on my blog. Unfortunately, the characters I am trying to express are also the delimiters of a MathJax expression. The sed snippet should appear as follows:

sed "s:\openparens\.rpm\closeparens::g" "${F3}" > "$2"

I have replaced '(' and ')' with openparens and closeparens, respectively, because my blog's MathJax plugin would otherwise interpret the above expression as a MathJax statement.
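Put together, the filtering steps could be sketched as one pipeline reading the lynx dump on standard input. This is a reconstruction under assumptions, not the original script: `extract_names` is a hypothetical helper, the awk step finds the filename by its .rpm suffix because the exact field number is not stated, and the final sed anchors the match to the end of the name instead of using the global replacement shown above.

```shell
#!/usr/bin/env bash
# extract_names: read a lynx text dump on stdin and print package names
# without the ".rpm" extension, one per line.
extract_names() {
    tab=$(printf '\t')          # literal tab character, portable across sed versions
    grep -F '.rpm' |            # keep only rows that mention an rpm file
        sed "s/${tab}/    /g" | # replace each tab with 4 spaces
        awk '{ for (i = 1; i <= NF; i++) if ($i ~ /\.rpm$/) { print $i; break } }' |
        sed 's:\.rpm$::'        # drop the extension so names look like rpm -qa output
}
```

Usage would be along the lines of `extract_names < listing.txt > packages.txt`, where listing.txt is a dump produced by lynx.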

If you don't escape the '.' in ".rpm" with a backslash, it will be interpreted as the regex "match any character", so the pattern would match not only ".rpm" but also "-rpm" inside names such as "redhat-rpm-config". This is undesirable.
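A quick way to see the difference is to run both patterns on the same name. The filename below is made up for illustration (the package name is real, the version string is arbitrary), and the grouping parentheses are left out because they do not affect what gets matched.

```shell
#!/usr/bin/env bash
# Illustrative filename in which "-rpm" appears before the real extension
name='redhat-rpm-config-9.0.3-42.el5.noarch.rpm'

# Unescaped dot: "." matches any character, so "-rpm" is removed as well
unescaped=$(printf '%s\n' "$name" | sed 's:.rpm::g')

# Escaped dot: only the literal ".rpm" is removed
escaped=$(printf '%s\n' "$name" | sed 's:\.rpm::g')

printf '%s\n' "$unescaped" "$escaped"
# prints:
# redhat-config-9.0.3-42.el5.noarch
# redhat-rpm-config-9.0.3-42.el5.noarch
```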

BTW this script is for informational and educational purposes only. It would actually be easier to just invoke lynx with lynx -dump -listonly ... and skip the data munging steps of replacing tabs with spaces using sed. If you do it this way you will get just the links to rpm files from EPEL, CentOS mirror, etc. Then you can return just the filename from each link's path with awk:

awk -F'/' '{ print $NF }'
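As a rough sketch of this simpler route, the helper below takes the link lines that lynx -dump -listonly prints and reduces them to package names. `names_from_links` is a hypothetical name, and stripping the extension at the end is my addition so that the result lines up with rpm -qa.

```shell
#!/usr/bin/env bash
# names_from_links: read link lines on stdin and print rpm basenames
# without the ".rpm" extension, one per line.
names_from_links() {
    grep '\.rpm$' |                 # keep only links that end in .rpm
        awk -F'/' '{ print $NF }' | # the last path component is the filename
        sed 's:\.rpm$::'            # strip the extension to match rpm -qa
}
```

It could be used as `lynx -dump -listonly "https://dl.fedoraproject.org/pub/epel/6/x86_64/" | names_from_links > epel6.txt`. Note that any percent-encoded characters in the link URLs are passed through unchanged.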
