Scraping?

by volkanuzun 6/13/2008 8:12:00 AM

In our campus environment, we have a banner that we strongly encourage other sites on campus to embed in an iframe. I had to write an application that gets the URLs from the database and checks whether each site is using our banner or logo. Here is what we expect to see in the sites:

<iframe src ="http://www.csusb.edu/banner2007" width="100%" height="90" frameborder="0" 
    scrolling="No" title="CSUSB main navigation">

So I need a regular expression that checks for the above signature. Below is the function I wrote:

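// requires: using System.Text.RegularExpressions;
public bool IsUsingIFrame(string htmlSource)
    {
        // Nothing to match in a null or empty page
        if (string.IsNullOrEmpty(htmlSource))
            return false;
        // [^>]* keeps the match inside one <iframe ...> tag; the dots in the host name are escaped
        string strRegExIFrameCheck = "<iframe\\s+[^>]*src\\s*=\\s*[\"']http://www\\.csusb\\.edu/banner2007/?[\"']";
        // Ignore case so that <IFRAME SRC=...> is recognized as well
        Regex regexIFrameCheck = new Regex(strRegExIFrameCheck, RegexOptions.IgnoreCase);
        return regexIFrameCheck.IsMatch(htmlSource);
    }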
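
A quick way to sanity-check a pattern like this is to run it against a few hand-written tags. The following is only a minimal sketch: the class name BannerCheckDemo, the helper HasBanner, and the sample strings are made up for illustration. The pattern in it escapes the dots in the host name, uses [^>]* to stay inside a single tag, and ignores case.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BannerCheckDemo
{
    // Hypothetical helper, same idea as IsUsingIFrame: look for an <iframe>
    // tag whose src attribute points at the campus banner URL.
    private static readonly Regex BannerPattern = new Regex(
        @"<iframe\s+[^>]*src\s*=\s*[""']http://www\.csusb\.edu/banner2007/?[""']",
        RegexOptions.IgnoreCase);

    public static bool HasBanner(string html)
    {
        return !string.IsNullOrEmpty(html) && BannerPattern.IsMatch(html);
    }

    public static void Main()
    {
        // Sample tags (made up): the expected signature, an upper-case
        // variant with a trailing slash, and an iframe from another host.
        string expected = "<iframe src =\"http://www.csusb.edu/banner2007\" width=\"100%\">";
        string upperCase = "<IFRAME SRC='http://www.csusb.edu/banner2007/'>";
        string otherSite = "<iframe src=\"http://example.com/banner2007\">";

        Console.WriteLine(HasBanner(expected));   // True
        Console.WriteLine(HasBanner(upperCase));  // True
        Console.WriteLine(HasBanner(otherSite));  // False
    }
}
```

Only the first two samples should count as using the banner; the third has the right path but the wrong host.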

It is a simple function that uses a regular expression to see whether the page contains an iframe with the specified source. The second part of the problem is connecting to the sites; for this I use the WebRequest class, as below:

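// requires: using System; using System.IO; using System.Net; using System.Text;
public string GetSource(string Url)
    {
        if (string.IsNullOrEmpty(Url))
        {
            this.htmlSource = String.Empty;
            return String.Empty;
        }

        string htmlSource = String.Empty;

        // Prepend a scheme only when the URL does not already start with one
        if (!Url.StartsWith("http://", StringComparison.OrdinalIgnoreCase) &&
            !Url.StartsWith("https://", StringComparison.OrdinalIgnoreCase))
            Url = "http://" + Url;

        try
        {
            // Create a web request
            WebRequest request = WebRequest.Create(Url);
            // The using blocks close the response and its stream even if reading fails
            using (WebResponse response = request.GetResponse())
            using (Stream responseStream = response.GetResponseStream())
            // Assuming UTF-8 here; pages served in another encoding may decode incorrectly
            using (StreamReader readStream = new StreamReader(responseStream, Encoding.UTF8))
            {
                htmlSource = readStream.ReadToEnd();
            }
        }
        catch
        {
            // Treat any failure (bad URL, DNS error, timeout) as an empty page
            htmlSource = String.Empty;
        }

        // The field keeps a lowercased copy; the caller gets the original casing
        this.htmlSource = htmlSource.ToLower();
        return htmlSource;
    }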
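
To show how the two pieces could fit together, here is a rough sketch of the checking loop. It is an assumption about the overall shape, not the actual application: SiteAuditSketch and FindSitesMissingBanner are hypothetical names, the fetch and check delegates stand in for GetSource and IsUsingIFrame, and the canned pages replace the URLs that would really come from the database.

```csharp
using System;
using System.Collections.Generic;

public static class SiteAuditSketch
{
    // Returns the URLs whose fetched source does not pass the banner check.
    // fetch stands in for GetSource and check for IsUsingIFrame; both are
    // passed in so the sketch runs without a network connection or database.
    public static List<string> FindSitesMissingBanner(
        IEnumerable<string> urls,
        Func<string, string> fetch,
        Func<string, bool> check)
    {
        List<string> missing = new List<string>();
        foreach (string url in urls)
        {
            string html = fetch(url);
            if (!check(html))
                missing.Add(url);
        }
        return missing;
    }

    public static void Main()
    {
        // Canned pages instead of real HTTP responses (made-up data).
        Dictionary<string, string> pages = new Dictionary<string, string>();
        pages["a.example"] = "<iframe src=\"http://www.csusb.edu/banner2007\">";
        pages["b.example"] = "<p>no banner here</p>";

        List<string> missing = FindSitesMissingBanner(
            pages.Keys,
            url => pages[url],
            html => html.Contains("csusb.edu/banner2007"));

        foreach (string url in missing)
            Console.WriteLine(url);  // prints "b.example"
    }
}
```

Since GetSource returns an empty string when a request fails, a site that cannot be reached would land in the same list as a site without the banner; telling those two cases apart would need an extra flag.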

Tomorrow I'll try to write a unit test against this :)

Have fun.
