Database and MapReduce
RoadMap
• Role of relational databases in today’s organizations
– Where does MapReduce fit in?
• MapReduce algorithms for processing relational data
– How do I perform a join, etc.?
• Evolving roles of relational databases and MapReduce
– What’s in store for the future?
Relational Database Basics
Basic Structure
• Formally, given sets D1, D2, …, Dn, a relation r is a subset of
D1 × D2 × … × Dn
Thus, a relation is a set of n-tuples (a1, a2, …, an) where each ai ∈ Di
• Example:
customer_name = {Jones, Smith, Curry, Lindsay}
customer_street = {Main, North, Park}
customer_city = {Harrison, Rye, Pittsfield}
Then r = { (Jones, Main, Harrison),
(Smith, North, Rye),
(Curry, North, Rye),
(Lindsay, Park, Pittsfield) }
is a relation over
customer_name , customer_street, customer_city
Relation Schema
• A1, A2, …, An are attributes
• R = (A1, A2, …, An ) is a relation schema
Example:
Customer_schema = (customer_name, customer_street, customer_city)
• r(R) is a relation on the relation schema R
Example:
customer (Customer_schema)
Relation Instance
• The current values (relation instance) of a relation are specified by a table
• An element t of r is a tuple, represented by a row in a table
Database
• A database consists of multiple relations
• Information about an enterprise is broken up into parts, with each relation storing one part of the information
account : stores information about accounts
depositor : stores information about which customer owns which account
customer : stores information about customers
• Storing all information as a single relation such as
bank(account_number, balance, customer_name, ..)
results in repetition of information (e.g., when two customers own the same account) and the need for null values (e.g., to represent a customer without an account)
Banking Example
branch (branch-name, branch-city, assets)
customer (customer-name, customer-street, customer-city)
account (account-number, branch-name, balance)
loan (loan-number, branch-name, amount)
depositor (customer-name, account-number)
borrower (customer-name, loan-number)
Relational Algebra
• Primitives
– Projection (π)
– Selection (σ)
– Cartesian product (×)
– Set union (∪)
– Set difference (−)
– Rename (ρ)
• Other operations
– Join (⋈)
– Group by… aggregation
– …
Big Data Analysis
• Peta-scale datasets are everywhere:
– Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
– eBay has 6.5 PB of user data + 50 TB/day (5/2009)
– …
• A lot of these datasets are (mostly) structured
– Query logs
– Point-of-sale records
– User data (e.g., demographics)
– …
• How do we perform data analysis at scale?
– Relational databases and SQL
– MapReduce (Hadoop)
Relational Databases vs. MapReduce
• Relational databases:
– Multipurpose: analysis and transactions; batch and interactive
– Data integrity via ACID transactions
– Lots of tools in software ecosystem (for ingesting, reporting, etc.)
– Supports SQL (and SQL integration, e.g., JDBC)
– Automatic SQL query optimization
• MapReduce (Hadoop):
– Designed for large clusters, fault tolerant
– Data is accessed in “native format”
– Supports many query languages
– Programmers retain control over performance
– Open source
Database Workloads
• OLTP (online transaction processing)
– Typical applications: e-commerce, banking, airline reservations
– User facing: real-time, low latency, highly-concurrent
– Tasks: relatively small set of “standard” transactional queries
– Data access pattern: random reads, updates, writes (involving relatively small amounts of data)
• OLAP (online analytical processing)
– Typical applications: business intelligence, data mining
– Back-end processing: batch workloads, less concurrency
– Tasks: complex analytical queries, often ad hoc
– Data access pattern: table scans, large amounts of data involved per query
One Database or Two?
• Downsides of co-existing OLTP and OLAP workloads
– Poor memory management
– Conflicting data access patterns
– Variable latency
• Solution: separate databases
– User-facing OLTP database for high-volume transactions
– Data warehouse for OLAP workloads
– How do we connect the two?
OLTP/OLAP Architecture
OLTP/OLAP Integration
• OLTP database for user-facing transactions
– Retain records of all activity
– Periodic ETL (e.g., nightly)
• Extract-Transform-Load (ETL)
– Extract records from source
– Transform: clean data, check integrity, aggregate, etc.
– Load into OLAP database
• OLAP database for data warehousing
– Business intelligence: reporting, ad hoc queries, data mining, etc.
– Feedback to improve OLTP services
Warehouse Models & Operators
• Data Models
– relations
– stars & snowflakes
– cubes
• Operators
– slice & dice
– roll-up, drill down
– pivoting
– other
Star
Star Schema
Terms
• Fact table
• Dimension tables
• Measures
Dimension Hierarchies
Cube
3-D Cube
ROLAP vs. MOLAP
• ROLAP: Relational On-Line Analytical Processing
• MOLAP: Multi-Dimensional On-Line Analytical Processing
Typical OLAP Queries
• The typical OLAP query will:
– Start with a star join
– Select interesting tuples, based on dimension data
– Group by one or more dimensions
– Aggregate certain attributes of the result
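In SQL, such a query might look like the following sketch over a hypothetical star schema (the sales fact table, the store and date_dim dimension tables, and all column names are assumptions for illustration):

SELECT s.region, d.month, SUM(f.amount) AS total_sales
FROM sales f
JOIN store s ON f.store_id = s.store_id        -- star join
JOIN date_dim d ON f.date_id = d.date_id
WHERE s.region = 'Northeast'                   -- select interesting tuples
GROUP BY s.region, d.month                     -- group by dimensions; aggregate with SUM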
Aggregates
Another Example
Aggregates
• Operators: sum, count, max, min, median, avg
• “Having” clause
• Using dimension hierarchy
– average by region (within store)
– maximum by month (within date)
Cube Aggregation
Cube Operators
Extended Cube
Aggregation Using Hierarchies
Pivoting
Business Intelligence
• Premise: more data leads to better business decisions
– Periodic reporting as well as ad hoc queries
– Analysts, not programmers (importance of tools and dashboards)
• Examples:
– Slicing-and-dicing activity by different dimensions to better understand the marketplace
– Analyzing log data to improve OLTP experience
– Analyzing log data to better optimize ad placement
– Analyzing purchasing trends for better supply-chain management
– Mining for correlations between otherwise unrelated activities
OLTP/OLAP Architecture: Hadoop?
OLTP/OLAP/Hadoop Architecture
ETL Bottleneck
• Reporting is often a nightly task:
– ETL is often slow: why?
– What happens if processing 24 hours of data takes longer than 24 hours?
• Hadoop is perfect:
– Most likely, you already have some data warehousing solution
– Ingest is limited by speed of HDFS
– Scales out with more nodes
– Massively parallel
– Ability to use any processing tool
– Much cheaper than parallel databases
– ETL is a batch process anyway!
MapReduce algorithms
for processing relational data
Design Pattern: Secondary Sorting
• MapReduce sorts input to reducers by key
– Values are arbitrarily ordered
• What if we want to sort the values as well?
– E.g., k → (v1, r), (v3, r), (v4, r), (v8, r)…
Secondary Sorting: Solutions
• Solution 1:
– Buffer values in memory, then sort
– Why is this a bad idea?
• Solution 2:
– “Value-to-key conversion” design pattern: form a composite intermediate key (k, v1); see the sketch after this list
– Let the execution framework do the sorting
– Preserve state across multiple key-value pairs in the reducer so all values for the same natural key are processed together
– Anything else we need to do?
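A minimal sketch of value-to-key conversion in Hadoop; class and field names are illustrative assumptions, not from the slides. The three pieces are the composite key, a partitioner on the natural key only, and a grouping comparator so one reduce call sees all values for a natural key:

// Composite key (k, v): the natural key plus the value to sort by.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

class CompositeKey implements WritableComparable<CompositeKey> {
  Text naturalKey = new Text();
  LongWritable order = new LongWritable();

  public void write(DataOutput out) throws IOException {
    naturalKey.write(out);
    order.write(out);
  }
  public void readFields(DataInput in) throws IOException {
    naturalKey.readFields(in);
    order.readFields(in);
  }
  // The framework sorts by natural key first, then by the embedded value.
  public int compareTo(CompositeKey o) {
    int c = naturalKey.compareTo(o.naturalKey);
    return c != 0 ? c : order.compareTo(o.order);
  }
}

// Partition on the natural key only, so every (k, *) pair reaches the
// same reducer regardless of the embedded value.
class NaturalKeyPartitioner extends Partitioner<CompositeKey, Text> {
  public int getPartition(CompositeKey key, Text value, int numPartitions) {
    return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

// Group all composite keys sharing a natural key into one reduce call.
class NaturalKeyGroupingComparator extends WritableComparator {
  protected NaturalKeyGroupingComparator() { super(CompositeKey.class, true); }
  @SuppressWarnings("rawtypes")
  public int compare(WritableComparable a, WritableComparable b) {
    return ((CompositeKey) a).naturalKey.compareTo(((CompositeKey) b).naturalKey);
  }
}

The pieces are wired together in the driver with job.setPartitionerClass(NaturalKeyPartitioner.class) and job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class).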
Value-to-Key Conversion
Working Scenario
• Two tables:
– User demographics (gender, age, income, etc.)
– User page visits (URL, time spent, etc.)
• Analyses we might want to perform:
– Statistics on demographic characteristics
– Statistics on page visits
– Statistics on page visits by URL
– Statistics on page visits by demographic characteristic
– …
Relational Algebra
• Primitives
– Projection (π)
– Selection (σ)
– Cartesian product (×)
– Set union (∪)
– Set difference (−)
– Rename (ρ)
• Other operations
– Join (⋈)
– Group by… aggregation
– …
Projection
Projection in MapReduce
• Easy!
– Map over tuples, emit new tuples with appropriate attributes
– No reducers, unless for regrouping or resorting tuples
– Alternatively: perform in reducer, after some other processing
• Basically limited by HDFS streaming speeds
– Speed of encoding/decoding tuples becomes important
– Relational databases take advantage of compression
– Semistructured data? No problem!
Selection
Selection in MapReduce
• Easy!
– Map over tuples, emit only tuples that meet criteria
– No reducers, unless for regrouping or resorting tuples
– Alternatively: perform in reducer, after some other processing
• Basically limited by HDFS streaming speeds
– Speed of encoding/decoding tuples becomes important
– Relational databases take advantage of compression
– Semistructured data? No problem!
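A minimal map-only sketch covering both of the previous slides, combining selection and projection in one pass (the tab-delimited layout, field positions, and the 60-second threshold are assumptions for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SelectProjectMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {
  @Override
  public void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] f = line.toString().split("\t");   // e.g., url, user, time
    long time = Long.parseLong(f[2]);
    if (time > 60) {                            // selection: σ(time > 60)
      // projection: π(url, time)
      ctx.write(new Text(f[0] + "\t" + f[2]), NullWritable.get());
    }
  }
  // No reducer: set job.setNumReduceTasks(0) so map output is final.
}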
Group by… Aggregation
• Example: What is the average time spent per URL?
• In SQL:
– SELECT url, AVG(time) FROM visits GROUP BY url
• In MapReduce:
– Map over tuples, emit time, keyed by url
– Framework automatically groups values by keys
– Compute average in reducer
– Optimize with combiners
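One subtlety with the combiner: averages are not directly combinable, so partial results must be (sum, count) pairs that the combiner and the reducer both merge. A hedged sketch, assuming tab-delimited url/time input (names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AvgTimePerUrl {
  public static class AvgMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable k, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");        // url \t time
      ctx.write(new Text(f[0]), new Text(f[1] + ",1")); // emit (sum, count)
    }
  }
  // The combiner merges partial (sum, count) pairs without losing accuracy.
  public static class AvgCombiner extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text url, Iterable<Text> vals, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0, cnt = 0;
      for (Text v : vals) {
        String[] p = v.toString().split(",");
        sum += Long.parseLong(p[0]);
        cnt += Long.parseLong(p[1]);
      }
      ctx.write(url, new Text(sum + "," + cnt));
    }
  }
  public static class AvgReducer
      extends Reducer<Text, Text, Text, DoubleWritable> {
    public void reduce(Text url, Iterable<Text> vals, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0, cnt = 0;
      for (Text v : vals) {
        String[] p = v.toString().split(",");
        sum += Long.parseLong(p[0]);
        cnt += Long.parseLong(p[1]);
      }
      ctx.write(url, new DoubleWritable((double) sum / cnt));
    }
  }
  // Driver: job.setCombinerClass(AvgCombiner.class)
}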
Relational Joins
Natural Join Operation – Example
• Relations r, s:
Natural Join Example
Types of Relationships
Join Algorithms in MapReduce
• Reduce-side join
• Map-side join
• In-memory join
– Striped variant
– Memcached variant
Reduce-side Join
• Basic idea: group by join key
– Map over both sets of tuples
– Emit tuple as value with join key as the intermediate key
– Execution framework brings together tuples sharing the same key
– Perform actual join in reducer
– Similar to a “sort-merge join” in database terminology
• Two variants
– 1-to-1 joins
– 1-to-many and many-to-many joins
Reduce-side Join: 1-to-1
Reduce-side Join: 1-to-many
Reduce-side Join: V-to-K Conversion
Reduce-side Join: many-to-many
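A hedged sketch of the one-to-many case, joining user demographics with page visits on user id. The tags, file-name convention, and field layout are assumptions; note that buffering values in the reducer is exactly what value-to-key conversion avoids:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceSideJoin {
  public static class JoinMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable k, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");   // join key in field 0
      // Tag each tuple with its source relation (inferred from file name).
      String tag = ((FileSplit) ctx.getInputSplit()).getPath().getName()
                   .startsWith("users") ? "U" : "V";
      ctx.write(new Text(f[0]), new Text(tag + "\t" + line));
    }
  }

  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> vals, Context ctx)
        throws IOException, InterruptedException {
      String user = null;
      List<String> visits = new ArrayList<>();    // buffered in memory
      for (Text v : vals) {
        String s = v.toString();
        if (s.startsWith("U")) user = s.substring(2);
        else visits.add(s.substring(2));
      }
      if (user == null) return;                   // inner join semantics
      for (String visit : visits)
        ctx.write(key, new Text(user + "\t" + visit));
    }
  }
}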
Map-side Join: Basic Idea
Assume two datasets are sorted by the join key:
Map-side Join: Parallel Scans
• If datasets are sorted by join key, join can be accomplished by a scan over both datasets
• How can we accomplish this in parallel?
– Partition and sort both datasets in the same manner
• In MapReduce:
– Map over one dataset, read from other corresponding partition
– No reducers necessary (unless to repartition or resort)
• Consistently partitioned datasets: realistic to expect?
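A minimal sketch of a map-side merge join under these assumptions: both datasets were partitioned and sorted identically by an earlier job, partition files share names across directories, keys compare lexicographically, and each split covers a whole partition file. The join.other.dir property and one-to-one matching are further simplifications:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MergeJoinMapper
    extends Mapper<LongWritable, Text, Text, Text> {
  private BufferedReader other;  // matching partition of the other dataset
  private String otherLine;

  @Override
  protected void setup(Context ctx) throws IOException {
    // Open the identically named partition of the other dataset,
    // e.g. .../A/part-00007 joins against .../B/part-00007.
    Path mine = ((FileSplit) ctx.getInputSplit()).getPath();
    Path theirs = new Path(ctx.getConfiguration().get("join.other.dir"),
                           mine.getName());
    FileSystem fs = theirs.getFileSystem(ctx.getConfiguration());
    other = new BufferedReader(new InputStreamReader(fs.open(theirs)));
    otherLine = other.readLine();
  }

  @Override
  public void map(LongWritable k, Text line, Context ctx)
      throws IOException, InterruptedException {
    String myKey = line.toString().split("\t")[0];
    // Both inputs are sorted, so advance the other side until it
    // catches up with the current key (one-to-one matching assumed).
    while (otherLine != null
           && otherLine.split("\t")[0].compareTo(myKey) < 0)
      otherLine = other.readLine();
    if (otherLine != null && otherLine.split("\t")[0].equals(myKey))
      ctx.write(new Text(myKey), new Text(line + "\t" + otherLine));
  }

  @Override
  protected void cleanup(Context ctx) throws IOException {
    other.close();
  }
}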
In-Memory Join
• Basic idea: load one dataset into memory, stream over other dataset
– Works if R << S and R fits into memory
– Called a “hash join” in database terminology
• MapReduce implementation
– Distribute R to all nodes
– Map over S, each mapper loads R in memory, hashed by join key
– For every tuple in S, look up join key in R
– No reducers, unless for regrouping or resorting tuples
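A minimal sketch, assuming the small relation R has been shipped to every node (e.g., via the distributed cache) and its location is passed in a hypothetical join.small.table property:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMemoryJoinMapper
    extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> r = new HashMap<>();

  @Override
  protected void setup(Context ctx) throws IOException {
    // Load the small relation R, hashed by join key (field 0).
    Path p = new Path(ctx.getConfiguration().get("join.small.table"));
    FileSystem fs = p.getFileSystem(ctx.getConfiguration());
    try (BufferedReader in =
             new BufferedReader(new InputStreamReader(fs.open(p)))) {
      String line;
      while ((line = in.readLine()) != null)
        r.put(line.split("\t")[0], line);
    }
  }

  @Override
  public void map(LongWritable k, Text line, Context ctx)
      throws IOException, InterruptedException {
    // Stream over S: probe the in-memory hash table for each tuple.
    String key = line.toString().split("\t")[0];
    String match = r.get(key);
    if (match != null)
      ctx.write(new Text(key), new Text(match + "\t" + line));
  }
}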
In-Memory Join: Variants
• Striped variant:
– R too big to fit into memory?
– Divide R into R1, R2, R3, … s.t. each Rn fits into memory
– Perform in-memory join: ∀n, Rn ⋈ S
– Take the union of all join results
• Memcached join:
– Load R into memcached
– Replace in-memory hash lookup with memcached lookup
Memcached
Memcached Join
• Memcached join:
– Load R into memcached
– Replace in-memory hash lookup with memcached lookup
• Capacity and scalability?
– Memcached capacity >> RAM of individual node
– Memcached scales out with cluster
• Latency?
– Memcached is fast (basically, speed of network)
– Batch requests to amortize latency costs
Which join to use?
• In-memory join > map-side join > reduce-side join
– Why?
• Limitations of each?
– In-memory join: memory
– Map-side join: sort order and partitioning
– Reduce-side join: general purpose
Processing Relational Data: Summary
• MapReduce algorithms for processing relational data:
– Group by, sorting, partitioning are handled automatically by shuffle/sort in MapReduce
– Selection, projection, and other computations (e.g., aggregation) are performed either in the mapper or the reducer
– Multiple strategies for relational joins
• Complex operations require multiple MapReduce jobs
– Example: top ten URLs in terms of average time spent
– Opportunities for automatic optimization
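For the top-ten example, job 1 computes the per-URL averages (as sketched earlier) and a second job finds the ten largest; a hedged sketch of job 2, where each mapper keeps a local top ten and a single reducer merges the candidates (names are illustrative, and ties on the average collapse in this simplification):

import java.io.IOException;
import java.util.TreeMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TopTenMapper
    extends Mapper<LongWritable, Text, NullWritable, Text> {
  // Keeps the ten largest averages seen by this mapper.
  private final TreeMap<Double, String> top = new TreeMap<>();

  @Override
  public void map(LongWritable k, Text line, Context ctx) {
    String[] f = line.toString().split("\t");   // url \t avg_time
    top.put(Double.parseDouble(f[1]), f[0]);
    if (top.size() > 10)
      top.remove(top.firstKey());               // evict the smallest
  }

  @Override
  protected void cleanup(Context ctx)
      throws IOException, InterruptedException {
    // Emit this mapper's local top ten; a single reducer applies the
    // same TreeMap logic to all candidates for the global top ten.
    for (java.util.Map.Entry<Double, String> e : top.entrySet())
      ctx.write(NullWritable.get(),
                new Text(e.getValue() + "\t" + e.getKey()));
  }
}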
Evolving roles for
relational database and MapReduce
OLTP/OLAP/Hadoop Architecture
Need for High-Level Languages
• Hadoop is great for large-data processing!
– But writing Java programs for everything is verbose and slow
– Analysts don’t want to (or can’t) write Java
• Solution: develop higher-level data processing languages
– Hive: HQL is like SQL
– Pig: Pig Latin is a bit like Perl
Hive and Pig
• Hive: data warehousing application in Hadoop
– Query language is HQL, variant of SQL
– Tables stored on HDFS as flat files
– Developed by Facebook, now open source
• Pig: large-scale data processing system
– Scripts are written in Pig Latin, a dataflow language
– Developed by Yahoo!, now open source
– Roughly 1/3 of all Yahoo! internal jobs
• Common idea:
– Provide higher-level language to facilitate large-data processing
– Higher-level language “compiles down” to Hadoop jobs