Hadoop Summit
James's notes :
Hadoop: A Brief History
¡¤ Doug Cutting
¡¤ Started with
Nutch in 2002 to 2004
o Initial goal was
web-scale, crawler-based search
o Distributed by necessity
o Sort/merge based
processing
o Demonstrated on 4 nodes
over 100M web pages.
o Was operational onerous.
¡°Real¡± Web scale was a ways away yet
¡¤ 2004 through
2006: Gestation period
o GFS & MapReduce
papers published (addressed the scale problems we were having)
o Add DFS and MapReduce to
Nutch
o Two part-time developers
over two years
o Ran on 20 nodes at
Internet Archive (IA) and UW
o Much easier to program
and run
o Scaled to several 100m
web pages
¡¤ 2006 to 2008:
Childhood
o Y! hired Doug Cutting
and a dedicated team to work on it reporting to E14 (Eric Baldeschwieler)
o Hadoop project split out
of Nutch
o Hit web scale in 2008
Yahoo Grid Team Perspective: Eric Baldeschwieler
¡¤ Grid is
Eric¡¯s team internal name
¡¤ Focus:
o On-demand, shared access
to vast pools of resources
o Support massive parallel
execution (2k nodes and roughly 10k processors)
o Data Intensive Super
Computing (DISC)
o Centrally provisioned
and managed
o Service-oriented,
elastic
o Utility for user and
researchers inside Y!
¡¤ Open Source
Stack
o Committed to open source
development
o Y! is Apache Platinum
Sponsor
¡¤ Project on
Eric¡¯s team:
o Hadoop:
¡ì Distributed
File System
¡ì MapReduce
Framework
¡ì Dynamic
Cluster Management (HOD)
¡¤ Allows
sharing of a Hadoop cluster with 100¡¯s of users at the same time.
¡¤ HOD: Hadoop
on Demand. Creates virtual clusters using Torq (open source resource
managers). Allocates cluster into many
virtual clusters.
o PIG
¡ì Parallel
Programming Language and Runtime
o Zookeeper:
¡ì High-availability
directory and configuration service
o Simon:
¡ì Cluster and
application monitoring
¡ì Collects
stats from 100¡¯s of clusters in parallel (fairly new so far). Also will be open sourced.
¡ì All will
eventually be part of Apache
¡ì Similar to
Ganglia but more configurable
¡ì Builds real
time reports.
¡ì Goal is to
use Hadoop to monitor Hadoop.
¡¤ Largest
production clusters are currently 2k nodes.
Working on more scaling. Don¡¯t
want to have just one cluster but want to run much bigger clusters. We¡¯re
investing heavily in scheduling to handle more concurrent jobs.
¡¤ Using 2 data
centers and moving to three soon.
¡¤ Working with
Carnegie Mellon University (Yahoo provided a container of 500 systems ¨C it
appears to be a Rackable Systems container)
¡¤ We¡¯re running
Megawatts of Hadoop
¡¤ Over 400
people express interest in this conference.
o About ½ the room running
Hadoop
o Just about the same
number running over 20 nodes
¡¤
About 15 to 20% running over 100 nodes
PIG: Web-Scale Processing
¡¤ Christopher
Olston
¡¤ The project
originated in Y! Research.
¡¤ Example data
analysis task: Find users that visit ¡°good¡± web pages.
¡¤ Christopher
points out that joins are hard to write in Hadoop and there are many ways of
writing joins and choosing a join technique is actually a problem that requires
some skill. Basically the same point
made by the DB community years ago. PIG
is a dataflow language that describes what you want to happen logically and
then map it to map/reduce. The language
of PIG is called Pig Latin
¡¤ Pig Latin
allows the declaration of ¡°views¡± (late bound queries)
¡¤ Pig Latin is
essentially a text form of a data flow graph.
It generates Hadoop Map/Reduce jobs.
o Operators: filter,
foreach ¡ generate, & group
o Binary operators: join,
cogroup (¡°more customizable type of join¡±), & union
o Also support split
operator
¡¤ How different
from SQL?
o It¡¯s a sequence of
simple steps rather than a declarative expression. SQL is declarative whereas Pig Latin says
what steps you want done in what order.
Much closer to imperative programming and, consequently, they argue it
is simpler.
o They argue that it¡¯s
easier to build a set of steps and work with each one at a time and slowly
build them up to a complete and correct language.
¡¤ PIG is
written as a language processing layer over Map/Reduce
¡¤ He propose
writing SQL as a processing layer over PIG but this code isn¡¯t yet written
¡¤ Is PIG+Hadoop
a DBMS? (there have been lots of blogs on this question :-))
o P+H only support
sequential scans super efficiently (no indexes or other access methods)
o P+H operate on any data
format (PIGS eat anything) whereas DBMS only run on data that they store
o P+H is a sequence of
steps rather than a sequence of constraints as used in DBMS
o P+H has custom
processing as a ¡°first class object¡± whereas UDFs were added to DBMSs later
¡¤ They want an
Eclipse development environment but don¡¯t have it running yet. Planning an
Eclipse Plugin.
¡¤ Team of 10
engineers currently working on it.
¡¤ New version
of PIG to come out next week will include ¡°explain¡± (shows mapping to
map/reduce jobs to help debug).
¡¤ Today PIG
does joins exactly one way. They are adding more join techniques. There aren¡¯t explicit stats tracked other
than file size. Next version will allow
user to specify. They will explore optimization.
JAQL: A Query Language for Jason
¡¤ Kevin Beyer
from IBM (did the DB2 Xquery implementation)
¡¤ Why use JSON?
o Want complete entities
in one place (non-normalized)
o Want evolvable schema
o Want standards support
o Didn¡¯t want a DOC markup
language (XML)
¡¤ Designed for
JSON data
¡¤ Functional
query language (few side effects)
¡¤ Core
operators: iteration, grouping, joining, combining, sorting, projection,
constructors (arrays, records, values), unesting, ..
¡¤ Operates on
anything that is JSON format or can be transformed to JSON and produces JSON or
any format that can be transformed from JSON.
¡¤ Planning to
o add indexing support
o Open source next summer
o Adding schema and
integrity support
DryadLINQ: Michael Isard (Msft Research)
¡¤ Implementation
performance:
o Rather than temp between
every stage, join them together and stream
o Makes failure recovery
more difficult but it¡¯s a good trade off
¡¤ Join and
split can be done with Map/Reduce but ugly to program and hard to avoid
performance penalty
¡¤ Dryad is more
general than Map/Reduce and addresses the above two issues
o Implements a uniform
state machine for scheduling and fault tolerance
¡¤ LINQ addresses
the programming model and makes it more access able
¡¤ Dryad
supports changing the resource allocation (number of servers used) dynamically
during job execution
¡¤ Generally,
Map/Reduce is complex so front-ends are being built to make it easier: e.g. PIG
& Sawzall
¡¤ Linq: General
purpose data-parallel programming constructs
¡¤ LINQ+C#
provides parsing, thype-checking, & is a lazy evaluator
o It builds an expression
tree and materializes data only when requested
¡¤ PLINQ:
supports parallelizing LINQ queries over many cores
¡¤ Lots of
interest in seeing this code out there in open source and interest in the
community to building upon it. Some
comments very positive about how far along the work is matched with more
negative comments on this being closed rather than open source available for
other to innovate upon.
X-Tracing Hadoop: Andy Konwinski
¡¤ Berkeley
student with the Berkeley RAD Lab
¡¤ Motivation:
Make Hadoop map/reduce jobs easier to understand and debug
¡¤ Approach:
X-trace Hadoop (500 lines of code)
¡¤ X-trace is a
path based tracing framework
¡¤ Generates an
event graph to capture causality of events across a network.
¡¤ Xtrace
collects: Report label, trace id, report id, hostname, timestamp, etc.
¡¤ What we get
from Xtrace:
o Deterministic causality
and concurrency
o Control over which
events get traced
o Cross-layer
o Low overhead (modest
sized traces produced)
o Modest implementation
complexity
¡¤ Want real,
high scale production data sets. Facebook has been very helpful but Andy is
after more data to show the value of the xtrace approach to Hadoop
debugging. Contact andyk@cs.berkeley.edu if you want
to contribute data.
ZooKeeper: Benjamin Reed (Yahoo Research)
¡¤ Distributed
consensus service
¡¤ Observation:
o Distributed systems need
coordination
o Programmers can¡¯t use
locks correctly
o Message based
coordination can be hard to use in some applications
¡¤ Wishes:
o Simple, robust, good
performance
o Tuned for read dominant
workloads
o Familiar models and
interface
o Wait-free
o Need to be able to wait
efficiently
¡¤ Google uses
Locks (Chubby) but we felt this was too complex an approach
¡¤ Design point:
start with a file system API model and strip out what is not needed
¡¤ Don¡¯t need:
o Partial reads &
writes
o Rename
¡¤ What we do
need:
o Ordered updates with
strong persistence guarantees
o Conditional updates
o Watches for data changes
o Ephemeral nodes
o Generated file names
(mktmp)
¡¤ Data model:
o Hierarchical name space
o Each znode has data and
children
o Data is read and written
in its entirety
¡¤ All API take
a path (no file handles and no open and close)
¡¤ Quorum based
updates with reads from any servers (you may get old data ¨C if you call sync
first, the next read will be current as of the point of time when the sync was
run at the oldest. All updates flow
through an elected leader (re-elected on failure).
¡¤ Written in
Java
¡¤ Started
oct/2006. Prototyped fall 2006. Initial implementation March 2007. Open sourced in Nov 2007.
¡¤ A Paxos
variant (modified multi-paxos)
¡¤ Zookeeper is
a software offering in Yahoo whereas Hadoop
HBase: Michael Stack (Powerset)
¡¤ Distributed
DB built on Hadoop core
¡¤ Modeled on
BigTable
¡¤ Same
advantages as BigTable:
o Column store
¡ì Efficient
compression
¡ì Support for
very wide tables when most columns aren¡¯t looked at together
o Nulls stored for free
o Cells are versioned
(cells addressed by row, col, and timestamp)
¡¤ No join
support