This tutorial is written for an Ubuntu installation and requires the following packages:

# apt-get install wget
# apt-get install perl
# apt-get install unzip

The following steps will get you started on building a simple web scraper in Perl.

1. Create a target list

You can create your own custom list, or you can download a file from alexa.com containing the top 1 million most-visited websites:

# wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

Unzip the file:

# unzip top-1m.csv.zip

The unzipped file top-1m.csv should have the following format:

1,google.com
2,facebook.com
3,youtube.com
4,yahoo.com
5,baidu.com
...

2. Perl code to read a file

To read a file, open it with a lexical filehandle and loop through each line (the "or die" catches a missing or unreadable file):

open my $fh, '<', 'top-1m.csv' or die "Cannot open top-1m.csv: $!";
while (my $line = <$fh>) {
  chomp $line;        # strip the trailing newline
  print "$line\n";
}
close $fh;
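
Running this should echo the file back one line at a time (1,google.com, 2,facebook.com, and so on), which is a quick sanity check that the file opened correctly before you add any parsing.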

3. Perl code to parse each line

Inside the while loop, use a regular expression to parse each line:

if ($line =~ /^(\d+),(.*)$/) {
  my $index   = $1;   # rank, e.g. 1
  my $website = $2;   # domain, e.g. google.com
}
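
For a line like 1,google.com, the first capture group ($1) is the rank and the second ($2) is the domain, so $index becomes 1 and $website becomes google.com. Note that the "my" declarations scope both variables to the if block, so any code that uses them must go inside it.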

4. Download webpages

Inside the if statement, call wget:

my $command = "wget -P $website/ $website";
print "$index: $command\n";
`$command`;
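
The -P option sets wget's directory prefix, so each page is saved under a directory named after the domain rather than cluttering the current directory. One caveat: $website is interpolated directly into a shell command, which is fine for a trusted list like the Alexa file, but for arbitrary input you would want to validate the domain first.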

You may not want to scrape all 1 million websites, so you can add the following:

if ($index > 100) { exit; } # this will limit the scrape to the first 100 websites
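
Putting it all together, here is a minimal sketch of the complete scraper. It assumes top-1m.csv sits in the current working directory and that wget is on your PATH:

#!/usr/bin/perl
use strict;
use warnings;

# Read the target list downloaded from alexa.com.
open my $fh, '<', 'top-1m.csv' or die "Cannot open top-1m.csv: $!";

while (my $line = <$fh>) {
  chomp $line;

  # Each line is "rank,domain", e.g. "1,google.com".
  if ($line =~ /^(\d+),(.*)$/) {
    my $index   = $1;
    my $website = $2;

    # Limit the scrape to the first 100 websites.
    exit if $index > 100;

    # Save each site's page into a directory named after the domain.
    my $command = "wget -P $website/ $website";
    print "$index: $command\n";
    `$command`;
  }
}

close $fh;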