Keywords: Hadoop, Map Reduce, Java, Python, data processing

Title: Programming Pig

Author: Alan Gates

Publisher: O'Reilly

ISBN: 978-1449302641

Media: Book

Level: No Pig assumed, but some Hadoop knowledge required

Verdict: A solid introduction


Apache Pig is an open source platform for running parallel data processing tasks on large datasets. It uses a programming language called Pig Latin to generate map-reduce sequences that are executed on an underlying parallel processing system, generally the Hadoop platform. Using Pig considerably eases the job of programming and executing complex, parallel processing tasks on massive data sets. As you'd expect from the title, this slim volume introduces the reader to both the platform and the language, and shows how to extend things using User Defined Functions coded in Java.

No previous experience of Pig is assumed, and the book starts with a quick intro to the history, background and philosophy of Pig. Some knowledge of Hadoop is assumed, but not much, as the intention is to abstract away the hard stuff and leave the Pig user to concentrate on writing scripts in Pig Latin. Once the intro is done, the next chapter moves on to installing, configuring and running Pig. This includes instructions on how to run Pig scripts locally, as well as on a Hadoop cluster or on a cloud service.

The next couple of short chapters look at Grunt, the Pig command shell environment (you've got to love the way the porcine references are slotted into place), and then it's on to Pig's data model.

The introduction to Pig Latin really starts in chapter 5 and continues in chapter 6. At heart this is a batch programming language that uses SQL-like operations to filter and collapse streams of data. The relational operators - filter, group, order, etc. - should be familiar to anyone who's had to work with relational data in other environments. Although the underlying platform is Java, Pig Latin is clearly a scripting language and there's little in the way of boilerplate code to worry about.
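
To make the data-flow style a little more concrete, here is a minimal sketch in plain Python (not Pig Latin, and not taken from the book) of the filter, group and order steps mentioned above. The records, field names and threshold are all invented for illustration, and the comments only indicate roughly what the corresponding Pig Latin statements might look like.

```python
from collections import defaultdict

# Hypothetical input: (user, url, seconds_on_page) records, invented for this sketch.
visits = [
    ("alice", "/home", 12),
    ("bob", "/home", 3),
    ("alice", "/docs", 40),
    ("carol", "/docs", 25),
    ("bob", "/docs", 8),
]


def filter_visits(records, min_seconds):
    # Roughly: long_visits = FILTER visits BY seconds >= min_seconds;
    return [r for r in records if r[2] >= min_seconds]


def group_and_sum(records):
    # Roughly: grouped = GROUP long_visits BY user;
    #          totals  = FOREACH grouped GENERATE group, SUM(long_visits.seconds);
    totals = defaultdict(int)
    for user, _url, seconds in records:
        totals[user] += seconds
    return list(totals.items())


def order_desc(totals):
    # Roughly: ranked = ORDER totals BY total DESC;
    return sorted(totals, key=lambda pair: pair[1], reverse=True)


def run_pipeline(records, min_seconds=10):
    # Each step consumes the output of the previous one, as in a Pig script.
    return order_desc(group_and_sum(filter_visits(records, min_seconds)))


if __name__ == "__main__":
    for user, total in run_pipeline(visits):
        print(user, total)
```

The point of the sketch is the shape of the program: a short chain of named, whole-dataset transformations. In Pig the same chain would be compiled into map-reduce jobs and run in parallel, rather than executed in memory on one machine as it is here.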

Aside from developing scripts, and showing how User Defined Functions and static Java functions can be added into the mix, the author doesn't neglect to discuss the testing and debugging of scripts. More advanced topics include using Pig with Python (now there's an interesting picture to paint), writing evaluation or filter functions in Java or Python, Load and Store functions, and finally a look at how Pig fits in with the rest of the Hadoop family of projects.

There's a certain no-nonsense feel about this book - this in spite of the temptation to make hog-related wisecracks. The writing is clear and concise, and the code straightforward to follow (which is what you'd expect given the stated aim of the Pig project). Given that the author is a member of the original Pig development team at Yahoo!, and has been involved in the transition to a successful Apache project, there's a depth of experience in the text that adds weight to it.

So, if you'd like to boast that you're a Pig programmer, this is a good place to get an introduction.

Contents © TechBookReport 2012. Published April 11 2012