Tag Archives: Java

Jython and my zipping drama (or really my unzip problem)

This post is really an effort to comfort myself after experiencing something of a late-night conniption fit while on a business trip. The goals of the trip were many, but my role was simple: assist our Solution Architect in un-buggering a few problems at a client site in North Carolina. The buggering I’m referring to was the frustratingly bad performance of our product. Said product is an internal enterprise web-based solution for solving complex printer workflow problems with data. Ever wonder how your bank statements, cell phone bills, or even your IRS statements are printed, stuffed, mailed, and tracked? Well, those are the sorts of strangely exciting problems we work with. I’m not being facetious here… this can be some seriously honest-fun geekery.

I was sitting in my hotel room bed with my laptop propped open on my lap. It was around 2 am, I had a bad horror movie playing on the TV, a Colorado beer near at hand, and my in-room jacuzzi was slowly filling with hot water… don’t ask, I tend to get into the zone when there’s background distraction (I’ve always blamed it on my ADD). At 9 am my colleague wanted to begin a series of load tests to start zeroing in on the problem areas. The script I was working on would provide the means to throw seriously large amounts of data at the customer’s systems; we wanted to observe the systems when they were churning hard.

import os

# ... do a bunch of fairly nifty stuff...
os.popen("unzip large_zip_file.zip")
# ... do a whole lot more nifty stuff, before being mean to the server...

The script was meant to unzip an archive, modify several of the unzipped files, and then do nasty things to the servers by injecting the files into parts of our workflow. What perplexed me was that the script seemed to work fine most of the time, yet large zip files (2+ GB) would sometimes elicit odd behavior. After being confused for a while, I worked out that the unzipping wasn’t quite finishing before the remainder of the script started to run. Before I go much further I should add some constraints to this exercise: I am stuck with Java 1.4 and Jython 2.5.0.
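The race described above can be observed directly. What follows is a minimal sketch, assuming a Unix-like shell and a reasonably recent Python (it has not been tried on Jython 2.5.0 itself); `time_popen` is a hypothetical helper, not part of the original script. It measures how long `os.popen` takes to hand control back versus how long the command really runs, which is only known once the pipe has been read to the end and closed.

```python
import os
import time


def time_popen(command):
    """Return (seconds until os.popen returned, seconds until the command exited)."""
    start = time.time()
    pipe = os.popen(command)        # starts the command and returns right away
    returned = time.time() - start

    pipe.read()                     # drain output until the command closes its stdout
    pipe.close()                    # wait for the command to exit
    finished = time.time() - start
    return returned, finished
```

Calling `time_popen("unzip large_zip_file.zip")` on a big archive should show the first number staying tiny while the second grows with the size of the archive. That gap is what the rest of the script was racing into.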

It made sense to try for a solution confined to Python’s API instead of reaching out to the OS. That solution still didn’t work for my needs (code snippet credit goes to Corey Goldberg): Jython (at least 2.5.1) cannot handle large files, see http://bugs.jython.org/issue1253. A Java OutOfMemoryError is thrown.

import zipfile

file_handler = open('foo.zip', 'rb')
zip_files = zipfile.ZipFile(file_handler)
for name in zip_files.namelist():
    outfile = open(name, 'wb')
    outfile.write(zip_files.read(name))
    outfile.close()
file_handler.close()

Time to try hacking something together in Java, since I can harness the power of Java in Jython (code snippet credit goes to java_geek on StackOverflow). Again I was faced with an out-of-memory error due to the limitations of the runtime environment… aargh!

import java.io.*;
import java.util.zip.*;

public class UnZip {
   static final int BUFFER = 4096;
   public static void main (String argv[]) {
      try {
         BufferedOutputStream dest = null;
         FileInputStream fis = new FileInputStream(argv[0]);
         ZipInputStream zis = new ZipInputStream(new BufferedInputStream(fis));
         ZipEntry entry;
         while((entry = zis.getNextEntry()) != null) {
            System.out.println("Extracting: " +entry);
            int count;
            byte data[] = new byte[BUFFER];
            // write the files to the disk
            FileOutputStream fos = new FileOutputStream(entry.getName());
            dest = new BufferedOutputStream(fos, BUFFER);
            while ((count = zis.read(data, 0, BUFFER)) != -1) {
               dest.write(data, 0, count);
            }
            dest.flush();
            dest.close();
         }
         zis.close();
      } catch(Exception e) {
         e.printStackTrace();
      }
   }
}

The quick fix that I ended up implementing was to explicitly shell out to a subprocess and wait on it, ensuring the command finished running before the script moved on. This was suboptimal, but I was tired.

import subprocess

unzip_file = subprocess.Popen("unzip " + "large_zip_file.zip", shell=True)
unzip_file.wait()
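
A slightly more defensive variant of the same idea is sketched below, assuming only the standard `subprocess` module; `run_and_wait` is a hypothetical helper name. Passing the command as a list avoids going through the shell, and checking the exit status means a failed unzip stops the script instead of letting it carry on with half-extracted files.

```python
import subprocess


def run_and_wait(args):
    """Run a command given as an argument list, block until it exits, and
    raise if the exit status is non-zero."""
    process = subprocess.Popen(args)
    status = process.wait()         # blocks until the child process has exited
    if status != 0:
        raise RuntimeError("%s exited with status %d" % (args[0], status))
    return status
```

For the archive in this post the call would be `run_and_wait(["unzip", "-o", "large_zip_file.zip"])`, where `-o` tells unzip to overwrite existing files without prompting.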

After getting home from the trip I came across this solution (credit goes to S.Lott on StackOverflow). Much cleaner and OS agnostic.

import struct
import zipfile
import zlib

doc = "large_zip_file.zip"  # placeholder: path to the archive to extract

src = open(doc, "rb")
zf = zipfile.ZipFile(src)
for m in zf.infolist():

    # Examine the header
    print("%s %d %d %r %r" % (m.filename, m.header_offset, m.compress_size,
                              m.extra, m.comment))
    src.seek(m.header_offset)
    header = src.read(30)  # fixed-size part of the local file header
    # Bytes 26-29 hold the lengths of the name and extra field that follow.
    # The entry comment lives only in the central directory, so there is
    # nothing else to skip here.
    name_len, extra_len = struct.unpack("<HH", header[26:30])
    src.seek(name_len + extra_len, 1)

    # Build a raw-deflate decompression object (assumes deflated file entries)
    decomp = zlib.decompressobj(-15)

    # Read and inflate in blocks so a large member is never held in memory whole
    out = open(m.filename, "wb")
    remaining = m.compress_size
    while remaining > 0:
        block = src.read(min(remaining, 64 * 1024))
        if not block:
            break
        remaining -= len(block)
        out.write(decomp.decompress(block))
    out.write(decomp.flush())
    out.close()

zf.close()
src.close()
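
On interpreters newer than Jython 2.5, the standard library can do the block-by-block copy itself. A minimal sketch, assuming Python 2.7 or 3 (where `ZipFile.open` returns a closable file-like object); `extract_streaming` is a hypothetical helper name and 64 KB is an arbitrary block size:

```python
import os
import shutil
import zipfile


def extract_streaming(archive_path, dest_dir, block_size=64 * 1024):
    """Extract every member of a zip archive, copying block_size bytes at a
    time so that no member is ever held in memory whole."""
    dest_root = os.path.abspath(dest_dir)
    zf = zipfile.ZipFile(archive_path)
    try:
        for info in zf.infolist():
            target = os.path.abspath(os.path.join(dest_root, info.filename))
            # Refuse member names that would land outside dest_dir
            if not target.startswith(dest_root + os.sep):
                raise ValueError("unsafe member name: %r" % info.filename)
            if info.filename.endswith("/"):
                # Directory entry: create it, there is nothing to copy
                if not os.path.isdir(target):
                    os.makedirs(target)
                continue
            parent = os.path.dirname(target)
            if not os.path.isdir(parent):
                os.makedirs(parent)
            member = zf.open(info)
            out = open(target, "wb")
            try:
                shutil.copyfileobj(member, out, block_size)
            finally:
                out.close()
                member.close()
    finally:
        zf.close()
```

Unlike `zip_files.read(name)`, which returns a whole member as one string, `shutil.copyfileobj` only ever holds one block at a time, so memory use should stay flat regardless of archive size.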

V2 – a little WebDriver toss in some CraigsList

So I revisited the code that I originally wrote and did quite a bit of refactoring, along with adding a stats counter that keeps track of pages visited and results found. The greatest inspiration came when my wife took an interest and wanted to use it to do some job searches of her own; her only caveat was that she wanted the results to look a little prettier.

My next mini goal is to add the ability to parse arguments from the command line so that it becomes a more general-purpose tool. I think that I’ll write a Jython wrapper to do this; I really like the simplicity of Python’s optparse library.
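
A minimal sketch of what that optparse front end might look like. The option names (`--term`, `--category`, `--telecommute`, `--output`) and the `jobs.html` default are assumptions made up for illustration; only the `sof` and `web` category codes come from the first draft of the Java code.

```python
from optparse import OptionParser


def parse_args(argv):
    """Parse job-search options from a list of command-line arguments."""
    parser = OptionParser(usage="usage: %prog [options]")
    parser.add_option("-t", "--term", dest="term", default=None,
                      help="search field value")
    parser.add_option("-c", "--category", dest="category", default="sof",
                      type="choice", choices=["sof", "web"],
                      help="job category code: sof or web")
    parser.add_option("-r", "--telecommute", dest="telecommute",
                      action="store_true", default=False,
                      help="only list telecommute jobs")
    parser.add_option("-o", "--output", dest="output", default="jobs.html",
                      help="file to write the results to")
    options, _ = parser.parse_args(argv)
    return options
```

In a wrapper script this would be called as `options = parse_args(sys.argv[1:])`, with `options.term`, `options.category`, and `options.telecommute` then handed to the search code.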

Git repository | Example output

Mix a little WebDriver toss in some CraigsList

A while ago I caught a wild hair to create a program to scrape CraigsList for tech jobs. Chris McMahon, friend and creative QA extraordinaire, is credited with turning me on to this idea. He created essentially the same script in Ruby; what a crazy simple notion. I tend to spend most of my time in Python, so I thought I’d jump ship back to my Java roots and try my hand at my own implementation.

I’ll be honest… I was shocked at how reliant I’ve become on Python’s built-in libraries and how much Java I’ve forgotten.  This was a super fun side project and I bet I’ll toss it into use. As I add updates to the code (this was my first draft) I’ll make the code available for download.

download code

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

/**
 * The CraigsList class exposes a few simple services for doing job searches
 * across all craigslist sites in the US.
 */
public class CraigsList {

	private String logFileName;
	private final String URL = "http://geo.craigslist.org/iso/us";
	private FileWriter file;
	private PrintWriter out;

	public CraigsList(String logFileName) {
		this.logFileName = logFileName;
	}

	/**
	 * Searches all craigslist sites in the US for "software / qa / dba" jobs.
	 *
	 * @param searchTerm Search field value
	 * @param telecommuteJobs True, look for only telecommute jobs.
	 * False, to look for all jobs (telecommute or otherwise).
	 */
	public void doSoftwareJobSearch(String searchTerm,
									Boolean telecommuteJobs) {
		writePageHeader(searchTerm, "software / qa / dba");
		doJobSearch(searchTerm, telecommuteJobs, "sof/");
	}

	/**
	 * Searches all craigslist sites in the US for "web / info design" jobs.
	 *
	 * @param searchTerm Search field value
	 * @param telecommuteJobs True, look for only telecommute jobs.
	 * False, to look for all jobs (telecommute or otherwise).
	 */
	public void doWebDeveloperJobSearch(String searchTerm,
										Boolean telecommuteJobs) {
		writePageHeader(searchTerm, "web / info design");
		doJobSearch(searchTerm, telecommuteJobs, "web/");
	}

	/**
	 * Search all craigslist sites in US for specific jobs.
	 *
	 * @param searchTerm Search field value
	 * @param telecommuteJobs True, look for only telecommute jobs.
	 * False, to look for all jobs (telecommute or otherwise).
	 * @param resource The resource locator
	 */
	private void doJobSearch(String searchTerm, Boolean telecommuteJobs,
												String resource) {
		WebDriver driver = new HtmlUnitDriver();
		driver.get(URL);
		ArrayList<String> links = findAllLinks(driver);
		for (String link : links) {
			driver.get(link + resource);
			driver.findElement(By.id("query")).sendKeys(searchTerm);
			if (telecommuteJobs) {
				driver.findElement(By.name("addOne")).click();
			}
			driver.findElement(By.xpath("//input[@value='Search']")).click();

			writeToLog("Page: " + link + "<br />");
			logAllLinksFromParagraphs(driver);
			writeToLog("<br />");
		}
		writeToLog("");
	}

	/**
	 * Pulls out the URI from an <a> tag.
	 *
	 * @param href the href tag to get the URI from
	 * @return the URI. Example - http://freecog.com
	 */
	private String getURI(String href) {
		href = href.replaceAll("", "");

		return href;
	}

	/**
	 * Finds all of the links on http://geo.craigslist.org/iso/us
	 * starter page except for: "craigslist", "w", and "or suggest
	 * a new one" links.
	 *
	 * @param driver A WebDriver object instantiated to a webpage.
	 * @return A list of all URIs on the webpage.
	 */
	private ArrayList<String> findAllLinks(WebDriver driver) {
		ArrayList<String> results = new ArrayList<String>();
		List<WebElement> hrefs = driver.findElements(By.tagName("a"));
		for (WebElement href : hrefs) {
			String text = href.getText();
			if (!text.equals("craigslist") && !text.equals("w")
					&& !text.equals("or suggest a new one")) {
				String link = getURI(href.toString());
				results.add(link);
			}
		}

		return results;
	}

	/**
	 * Pulls all links contained in paragraphs and writes them to the log.
	 *
	 * @param driver A WebDriver object instantiated to a webpage.
	 */
	private void logAllLinksFromParagraphs(WebDriver driver) {
		List<WebElement> paragraphs = driver.findElements(By.tagName("p"));
		for (WebElement paragraph : paragraphs) {
			String href = paragraph.findElement(By.tagName("a")).toString();
			if (href.length() != 0) {
				String link = href + paragraph.getText() + "<br />";
				writeToLog(link);
			}
		}
	}

	/**
	 * Write the top of the html page.
	 *
	 * @param searchTerm Adds the searchTerm to the page title and
	 * creates a h3 element with the same information.
	 * @param jobCategory Adds the job category being searched to
	 * an h4 element
	 */
	private void writePageHeader(String searchTerm, String jobCategory) {
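		// Page title and h3 carry the search term; h4 carries the job category
		String header = "<html><head><title>" + searchTerm + "</title></head><body>"
				+ "<h3>" + searchTerm + "</h3>"
				+ "<h4>" + jobCategory + "</h4>";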
		writeToLog(header);
	}

	/**
	 * Write a string to the log file.
	 *
	 * @param text The String to record.
	 */
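	private void writeToLog(String text) {
		try {
			// Open in append mode so each call adds to the existing log
			file = new FileWriter(logFileName, true);
			out = new PrintWriter(file);
			out.println(text);
			out.close();
		} catch (IOException e) {
			// Report the failure instead of hitting a null writer later
			System.err.println("Could not write to " + logFileName + ": "
					+ e.getMessage());
		}
	}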

	public static void main(String args[]) {
		//CraigsList qa = new CraigsList("qa_jobs.html");
		CraigsList py = new CraigsList("python_jobs.html");
		//qa.doSoftwareJobSearch("qa", true);
		py.doWebDeveloperJobSearch("python", true);
	}
}