Monday, 9 June 2014

Scraping HTML with Selenium Server

Introduction:

In this post I try to build a server that can extract values from an HTML page. Here I describe the first part of what is needed.

The first part is, of course, fetching the HTML from the URL. You can do this with Selenium Server, a Java library that integrates Selenium into the Java virtual machine. This is actually not the hardest part. The hardest part, which I am working on at the moment, is making sloppy HTML reasonably XML-compliant.
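As a rough idea of a first step towards XML-friendlier HTML, the sketch below isolates the body element from a page source using only the Java standard library, so it can be tried without Selenium. The class name BodyExtractor and the method extractBody are hypothetical names chosen for illustration, and the assumption is that the data of interest always sits inside the body.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; it is not part of Selenium.
final class BodyExtractor {
    // Matches from the first "<body" tag up to the last "</body>",
    // ignoring case and letting '.' span line breaks.
    private static final Pattern BODY = Pattern.compile(
            "<body\\b.*</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private BodyExtractor() {
    }

    // Returns the body element including its tags, or the unchanged
    // input when no complete body element is found.
    static String extractBody(String html) {
        Matcher matcher = BODY.matcher(html);
        return matcher.find() ? matcher.group() : html;
    }
}
```

This only trims the document down to the body. It does not repair unclosed tags or unescaped characters, so the result is still not guaranteed to be well-formed XML.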

The example:

import com.thoughtworks.selenium.DefaultSelenium;
import org.openqa.selenium.server.SeleniumServer;

public class Scraper {
    private static final String BROWSER = "*firefox";
    private static final String SERVER = "localhost";
    private static final int PORT = 4444;
    private static final String SLASHES = "//";
    private static final String SLASH = "/";

    private DefaultSelenium selenium;
    private String url;
    public Scraper(String url) {
        this.url = url;
    }

    public void startScraper() {
        // Split the URL into the server part and the page you want to start on
        int startPosition = url.indexOf(SLASHES) + SLASHES.length();
        // this turns http://google.nl/search into google.nl/search
        String tmpUrl = url.substring(startPosition);
        int endPosition = tmpUrl.indexOf(SLASH);
        if (endPosition < 0) {
            // no path after the host, e.g. http://google.nl
            endPosition = tmpUrl.length();
        }
        // this turns http://google.nl/search into http://google.nl
        String baseUrl = url.substring(0, endPosition + startPosition);
        // this turns http://google.nl/search into /search
        String pageUrl = url.substring(endPosition + startPosition);
        if (pageUrl.isEmpty()) {
            pageUrl = SLASH;
        }
        // instantiate Selenium with the server, port, browser and baseUrl
        selenium = new DefaultSelenium(SERVER, PORT, BROWSER, baseUrl);
        try {
            // instantiate and start the Selenium server
            SeleniumServer server = new SeleniumServer();
            server.start();
            selenium.start();
            // open the page you want, /search in this example
            selenium.open(pageUrl);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // This method is the first step towards XML compliance.
    // In my case the data I want is always in the body.
    public String createHTmlOutPutStream() {
        String htmlSource = selenium.getHtmlSource();
        String closingTag = "</body>";
        int start = htmlSource.indexOf("<body");
        int stop = htmlSource.indexOf(closingTag);
        // fall back to the full source when no body element is found
        if (start < 0 || stop < start) {
            return htmlSource;
        }
        return htmlSource.substring(start, stop + closingTag.length());
    }

    // This method fills in a text field, provided you know its name
    public void populateTextField(String fieldName, String text) {
        selenium.type(fieldName, text);
    }

    // This method clicks the button for you
    public void clickButton(String buttonName) {
        selenium.click(buttonName);
    }

    // This method selects every option in a dropdown list, one after another.
    public void chooseDropDownAll(String dropDownName) {
        String[] options = selenium.getSelectOptions(dropDownName);
        for (String option : options) {
            selenium.select(dropDownName, option);
        }
    }

    // This method picks only one option in the dropdown list.
    public void chooseDropDownSpecific(String dropDownName, String optionName) {
        selenium.select(dropDownName, optionName);
    }

}
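
The URL splitting in startScraper can also be left to java.net.URI from the standard library instead of index arithmetic. The sketch below is one possible way to do that, not the code used above. The class name UrlSplitter and the method split are hypothetical, and it assumes the input is an absolute URL such as http://google.nl/search.

```java
import java.net.URI;

// Hypothetical helper for illustration; it is not part of the Scraper class.
final class UrlSplitter {
    private UrlSplitter() {
    }

    // Returns a two-element array: { baseUrl, pageUrl }.
    // Throws IllegalArgumentException when the input is not an absolute URL with a host.
    static String[] split(String url) {
        URI uri = URI.create(url);
        if (uri.getScheme() == null || uri.getRawAuthority() == null) {
            throw new IllegalArgumentException("Expected an absolute URL with a host: " + url);
        }
        // scheme plus host (and port, if any), e.g. http://google.nl
        String baseUrl = uri.getScheme() + "://" + uri.getRawAuthority();
        // path on that server, defaulting to "/" when the URL has none
        String path = uri.getRawPath();
        String pageUrl = (path == null || path.isEmpty()) ? "/" : path;
        // keep the query string; a fragment, if present, is dropped
        if (uri.getRawQuery() != null) {
            pageUrl += "?" + uri.getRawQuery();
        }
        return new String[] { baseUrl, pageUrl };
    }
}
```

The first element could then be passed to the DefaultSelenium constructor as the base URL and the second to selenium.open. Unlike the manual version, this also covers URLs without a path and URLs with a port number.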

Conclusion:

With these two libraries:
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-firefox-driver</artifactId>
            <version>2.42.0</version>
        </dependency>
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-server</artifactId>
            <version>2.42.0</version>
        </dependency>

you can do this trick. This is only a small subset of what is possible.
You should try it. The next thing I want to do is write a test that performs the login and the first search for me on the project I am currently working on, because I like programming, not logging in :).


Have fun!
