Created: 15 August 2013
This tutorial is written for an Ubuntu installation and requires the following packages:
# apt-get install wget
# apt-get install perl
# apt-get install unzip
The following steps will get you started on building a simple web scraper in Perl.
1. Create a target list
You can create your own custom list, or you can download a file from alexa.com containing the top 1 million most-visited websites:
# wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
Unzip the file:
# unzip top-1m.csv.zip
The unzipped file top-1m.csv should have the following format:
1,google.com
2,facebook.com
3,youtube.com
4,yahoo.com
5,baidu.com
...
2. Perl code to read a file
In order to read a file, you need to open it and loop through each line:
open(my $fh, '<', 'top-1m.csv') or die "Cannot open top-1m.csv: $!";
while (my $line = <$fh>) {
    chomp $line;
    print "$line\n";
}
close($fh);
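If you save this code as, say, read_list.pl (the filename here is arbitrary), running it should print the list back out:
# perl read_list.pl
1,google.com
2,facebook.com
...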
3. Perl code to parse each line
Inside the while loop, use a regular expression to parse each line:
if ($line =~ /^(\d+),(.*)$/) {
    my $index   = $1;   # the rank
    my $website = $2;   # the domain name
}
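For example, given the line 1,google.com, $index is set to 1 and $website to google.com. Because these variables are scoped to the if block, the wget call in the next step must go inside it.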
4. Download webpages
Inside the if statement, call wget:
my $command = "wget -P $website/ $website";   # -P saves the download into a directory named after the site
print "$index: $command\n";
`$command`;
You may not want to scrape all 1 million websites, so you can add the following:
if ($index > 100) { exit; }  # this will limit the scrape to the first 100 websites
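Putting all four steps together, a minimal version of the complete scraper looks like this (saved, for example, as scraper.pl):
#!/usr/bin/perl
use strict;
use warnings;

# Open the target list downloaded from alexa.com
open(my $fh, '<', 'top-1m.csv') or die "Cannot open top-1m.csv: $!";

while (my $line = <$fh>) {
    chomp $line;

    # Each line has the form "rank,domain"
    if ($line =~ /^(\d+),(.*)$/) {
        my $index   = $1;
        my $website = $2;

        # Stop after the first 100 websites
        if ($index > 100) { exit; }

        # Download the site into its own directory
        my $command = "wget -P $website/ $website";
        print "$index: $command\n";
        `$command`;
    }
}

close($fh);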