Perl, Hive and Pig, Oh My! Hadoop for Perl programmers
By Jud Dagnall (jud)
Date: Wednesday, 5 June 2013 09:00
Duration: 45 minutes
Target audience: Intermediate
Language: English
Tags: etl hadoop hive perl pig sql
At WhiteHat, we aggregate hierarchical time-series data as part of a data warehouse ETL pipeline. Hive and Pig are higher-level languages for creating MapReduce jobs in Hadoop. Hive provides an SQL-like interface to your data, while Pig is a simple but powerful data processing language. By taking advantage of the streaming capability built into both Hive and Pig, Perl developers can easily hook into an existing Hadoop infrastructure from the comfort and safety of their own language.
In this talk, we will present a brief overview of Hive, Pig and MapReduce, and then explore practical examples of how we can use Perl + Hadoop to solve some real-world problems. Along the way, we'll encounter tips for packaging your Perl code for Hadoop, see how different data types appear to our scripts, use the MapReduce framework to simplify our tasks, and learn to avoid some common pitfalls.
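To give a flavor of the streaming approach described above, here is a minimal sketch of a Perl script that Hive or Pig could stream rows through. It is not taken from the talk: the three-column layout (site, timestamp, count), the `transform_row` name, and the day-level rollup are assumptions made for illustration. It relies only on the default streaming convention, in which each row arrives on STDIN as one line of tab-separated fields and each line printed to STDOUT becomes an output row.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical transform for illustration: takes one tab-separated row
# (site, timestamp, count) and returns the row with the timestamp
# truncated to its day. The column layout is an assumption.
sub transform_row {
    my ($line) = @_;
    chomp $line;

    # A limit of -1 keeps trailing empty fields instead of dropping them.
    my ($site, $timestamp, $count) = split /\t/, $line, -1;
    return unless defined $timestamp;    # skip rows with too few columns

    # Hive streaming writes NULL values as the two characters \N.
    $count = 0 if !defined $count || $count eq '\N';

    # Keep only the leading YYYY-MM-DD part of the timestamp.
    my ($day) = $timestamp =~ /^(\d{4}-\d{2}-\d{2})/;
    return unless defined $day;          # skip malformed timestamps

    return join "\t", $site, $day, $count;
}

# Read rows from STDIN only when run as a script, so the function
# can also be loaded and tested on its own.
unless (caller) {
    while (my $line = <STDIN>) {
        my $out = transform_row($line);
        print "$out\n" if defined $out;
    }
}
```

In Hive, a script like this would typically be shipped with `ADD FILE` and invoked through `SELECT TRANSFORM(...) USING '...' AS (...)`; in Pig, through `STREAM ... THROUGH`. The script can be checked locally first by piping a tab-separated sample file into it.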
- Tim Bunce
- John Anderson (genehack)
- Craig Treptow (ctreptow)
- Karen Etheridge (Ether)
- Bryan Rivera
- Mateu Hunter (mateu)
- Robert Blackwell (rblackwe)
- Al Newkirk (alnewkirk)
- Jeremy Fluhmann (jfluhmann)
- Dana Jacobsen (danaj)
- Bradley Andersen (elohmrow)
- Athena Yao
- Stephen Wilcoxon (wilcoxon)
- Mike Fragassi (frag)
- Jeff Benton
- Gary Denslow
- Daniel Norman (abnorman)
- Steve Nolte (mcsnolte)
- Evan Staton
- Ricardo Signes (rjbs)
- parv
- Mike Covington
- Graham Knop (haarg)
- Matt Nash (mnb)
- Joe Papperello
- Stan Schwertly (stan_theman)
- Stephen McManus
- dean burnham
- Andrew Dougherty (aindilis)
- vroom
- Logan Bell (epochbell)
- Harika Tandra
- James Wilkus
- David Blumenthal
- Tim Heaney (oylenshpeegul)
- Ted Lanman
- Brad Adkins (badkins)
- Georgy Vladimirov
- Andrew Grangaard (spazm)
- Pippa Bindel
- Chuck Hardin
- David Delikat
- Rusty Bourland (saki)
- (Samuel) Kurt Newman