Database and MapReduce
RoadMap
• Role of relational databases in today’s organizations
– Where does MapReduce fit in?
• MapReduce algorithms for processing relational data
– How do I perform a join, etc.?
• Evolving roles of relational databases and MapReduce
– What’s in store for the future?
Relational Database Basics
Basic Structure
• Formally, given sets D1, D2, …, Dn, a relation r is a subset of
D1 × D2 × … × Dn
Thus, a relation is a set of n-tuples (a1, a2, …, an) where each ai ∈ Di
• Example:
customer_name = {Jones, Smith, Curry, Lindsay}
customer_street = {Main, North, Park}
customer_city = {Harrison, Rye, Pittsfield}
Then r = { (Jones, Main, Harrison),
(Smith, North, Rye),
(Curry, North, Rye),
(Lindsay, Park, Pittsfield) }
is a relation over
customer_name , customer_street, customer_city
Relation Schema
• A1, A2, …, An are attributes
• R = (A1, A2, …, An ) is a relation schema
Example:
Customer_schema = (customer_name, customer_street, customer_city)
• r(R) is a relation on the relation schema R
Example:
customer (Customer_schema)
Relation Instance
• The current values (relation instance) of a relation are specified by a table
• An element t of r is a tuple, represented by a row in a table
Database
• A database consists of multiple relations
• Information about an enterprise is broken up into parts, with each relation storing one part of the information
account : stores information about accounts
depositor : stores information about which customer owns which account
customer : stores information about customers
• Storing all information as a single relation such as
bank(account_number, balance, customer_name, ..)
results in repetition of information (e.g., when two customers own the same account) and the need for null values (e.g., to represent a customer without an account)
Banking Example
branch (branch-name, branch-city, assets)
customer (customer-name, customer-street, customer-city)
account (account-number, branch-name, balance)
loan (loan-number, branch-name, amount)
depositor (customer-name, account-number)
borrower (customer-name, loan-number)
Relational Algebra
• Primitives
– Projection (π)
– Selection (σ)
– Cartesian product (×)
– Set union (∪)
– Set difference (−)
– Rename (ρ)
• Other operations
– Join (⋈)
– Group by… aggregation
– …
Big Data Analysis
• Peta-scale datasets are everywhere:
– Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
– eBay has 6.5 PB of user data + 50 TB/day (5/2009)
– …
• A lot of these datasets are (mostly) structured
– Query logs
– Point-of-sale records
– User data (e.g., demographics)
– …
• How do we perform data analysis at scale?
– Relational databases and SQL
– MapReduce (Hadoop)
Relational Databases vs. MapReduce
• Relational databases:
– Multipurpose: analysis and transactions; batch and interactive
– Data integrity via ACID transactions
– Lots of tools in software ecosystem (for ingesting, reporting, etc.)
– Supports SQL (and SQL integration, e.g., JDBC)
– Automatic SQL query optimization
• MapReduce (Hadoop):
– Designed for large clusters, fault tolerant
– Data is accessed in “native format”
– Supports many query languages
– Programmers retain control over performance
– Open source
Database Workloads
• OLTP (online transaction processing)
– Typical applications: e-commerce, banking, airline reservations
– User facing: real-time, low latency, highly-concurrent
– Tasks: relatively small set of “standard” transactional queries
– Data access pattern: random reads, updates, writes (involving relatively small amounts of data)
• OLAP (online analytical processing)
– Typical applications: business intelligence, data mining
– Back-end processing: batch workloads, less concurrency
– Tasks: complex analytical queries, often ad hoc
– Data access pattern: table scans, large amounts of data involved per query
One Database or Two?
• Downsides of co-existing OLTP and OLAP workloads
– Poor memory management
– Conflicting data access patterns
– Variable latency
• Solution: separate databases
– User-facing OLTP database for high-volume transactions
– Data warehouse for OLAP workloads
– How do we connect the two?
OLTP/OLAP Architecture
OLTP/OLAP Integration
• OLTP database for user-facing transactions
– Retain records of all activity
– Periodic ETL (e.g., nightly)
• Extract-Transform-Load (ETL)
– Extract records from source
– Transform: clean data, check integrity, aggregate, etc.
– Load into OLAP database
• OLAP database for data warehousing
– Business intelligence: reporting, ad hoc queries, data mining, etc.
– Feedback to improve OLTP services
Warehouse Models & Operators
• Data Models
– relations
– stars & snowflakes
– cubes
• Operators
– slice & dice
– roll-up, drill down
– pivoting
– other
Star
Star Schema
Terms
• Fact table
• Dimension tables
• Measures
Dimension Hierarchies
Cube
3-D Cube
ROLAP vs. MOLAP
• ROLAP: Relational On-Line Analytical Processing
• MOLAP: Multi-Dimensional On-Line Analytical Processing
Typical OLAP Queries
• The typical OLAP query will:
– Start with a star join
– Select interesting tuples, based on dimension data
– Group by one or more dimensions
– Aggregate certain attributes of the result
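In SQL, such a query might look like the following sketch over a hypothetical star schema (the sales fact table, the store and date_dim dimension tables, and all column names are assumptions for illustration):

SELECT s.region, d.month, SUM(f.amount) AS total_sales
FROM sales f
JOIN store s ON f.store_id = s.store_id        -- star join
JOIN date_dim d ON f.date_id = d.date_id
WHERE s.region = 'Northeast'                   -- select interesting tuples
GROUP BY s.region, d.month                     -- group by dimensions; aggregate with SUM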
Aggregates
Another Example
Aggregates
• Operators: sum, count, max, min, median, avg
• “Having” clause
• Using dimension hierarchy
– average by region (within store)
– maximum by month (within date)
Cube Aggregation
Cube Operators
Extended Cube
Aggregation Using Hierarchies
Pivoting
Business Intelligence
• Premise: more data leads to better business decisions
– Periodic reporting as well as ad hoc queries
– Analysts, not programmers (importance of tools and dashboards)
• Examples:
– Slicing-and-dicing activity by different dimensions to better understand the marketplace
– Analyzing log data to improve OLTP experience
– Analyzing log data to better optimize ad placement
– Analyzing purchasing trends for better supply-chain management
– Mining for correlations between otherwise unrelated activities
OLTP/OLAP Architecture: Hadoop?
OLTP/OLAP/Hadoop Architecture
ETL Bottleneck
• Reporting is often a nightly task:
– ETL is often slow: why?
– What happens if processing 24 hours of data takes longer than 24 hours?
• Hadoop is perfect:
– Most likely, you already have some data warehousing solution
– Ingest is limited by speed of HDFS
– Scales out with more nodes
– Massively parallel
– Ability to use any processing tool
– Much cheaper than parallel databases
– ETL is a batch process anyway!
MapReduce algorithms
for processing relational data
Design Pattern: Secondary Sorting
• MapReduce sorts input to reducers by key
– Values are arbitrarily ordered
• What if we want to sort the values as well?
– E.g., k → (v1, r), (v3, r), (v4, r), (v8, r)…
Secondary Sorting: Solutions
• Solution 1:
– Buffer values in memory, then sort
– Why is this a bad idea?
• Solution 2:
– “Value-to-key conversion” design pattern: form a composite intermediate key (k, v1); see the sketch after this list
– Let the execution framework do the sorting
– Preserve state across multiple key-value pairs in the reducer so all values for the same natural key are processed together
– Anything else we need to do?
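A minimal sketch of value-to-key conversion in Hadoop; class and field names are illustrative assumptions, not from the slides. The three pieces are the composite key, a partitioner on the natural key only, and a grouping comparator so one reduce call sees all values for a natural key:

// Composite key (k, v): the natural key plus the value to sort by.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

class CompositeKey implements WritableComparable<CompositeKey> {
  Text naturalKey = new Text();
  LongWritable order = new LongWritable();

  public void write(DataOutput out) throws IOException {
    naturalKey.write(out);
    order.write(out);
  }
  public void readFields(DataInput in) throws IOException {
    naturalKey.readFields(in);
    order.readFields(in);
  }
  // The framework sorts by natural key first, then by the embedded value.
  public int compareTo(CompositeKey o) {
    int c = naturalKey.compareTo(o.naturalKey);
    return c != 0 ? c : order.compareTo(o.order);
  }
}

// Partition on the natural key only, so every (k, *) pair reaches the
// same reducer regardless of the embedded value.
class NaturalKeyPartitioner extends Partitioner<CompositeKey, Text> {
  public int getPartition(CompositeKey key, Text value, int numPartitions) {
    return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

// Group all composite keys sharing a natural key into one reduce call.
class NaturalKeyGroupingComparator extends WritableComparator {
  protected NaturalKeyGroupingComparator() { super(CompositeKey.class, true); }
  @SuppressWarnings("rawtypes")
  public int compare(WritableComparable a, WritableComparable b) {
    return ((CompositeKey) a).naturalKey.compareTo(((CompositeKey) b).naturalKey);
  }
}

The pieces are wired together in the driver with job.setPartitionerClass(NaturalKeyPartitioner.class) and job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class).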
Value-to-Key Conversion
Working Scenario
• Two tables:
– User demographics (gender, age, income, etc.)
– User page visits (URL, time spent, etc.)
• Analyses we might want to perform:
– Statistics on demographic characteristics
– Statistics on page visits
– Statistics on page visits by URL
– Statistics on page visits by demographic characteristic
– …
Relational Algebra
• Primitives
– Projection (π)
– Selection (σ)
– Cartesian product (×)
– Set union (∪)
– Set difference (−)
– Rename (ρ)
• Other operations
– Join (⋈)
– Group by… aggregation
– …
Projection
Projection in MapReduce
• Easy!
– Map over tuples, emit new tuples with appropriate attributes
– No reducers, unless for regrouping or resorting tuples
– Alternatively: perform in reducer, after some other processing
• Basically limited by HDFS streaming speeds
– Speed of encoding/decoding tuples becomes important
– Relational databases take advantage of compression
– Semistructured data? No problem!
Selection
Selection in MapReduce
• Easy!
– Map over tuples, emit only tuples that meet criteria
– No reducers, unless for regrouping or resorting tuples
– Alternatively: perform in reducer, after some other processing
• Basically limited by HDFS streaming speeds
– Speed of encoding/decoding tuples becomes important
– Relational databases take advantage of compression
– Semistructured data? No problem!
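A minimal map-only sketch covering both of the previous slides, combining selection and projection in one pass (the tab-delimited layout, field positions, and the 60-second threshold are assumptions for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SelectProjectMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {
  @Override
  public void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] f = line.toString().split("\t");   // e.g., url, user, time
    long time = Long.parseLong(f[2]);
    if (time > 60) {                            // selection: σ(time > 60)
      // projection: π(url, time)
      ctx.write(new Text(f[0] + "\t" + f[2]), NullWritable.get());
    }
  }
  // No reducer: set job.setNumReduceTasks(0) so map output is final.
}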
Group by… Aggregation
• Example: What is the average time spent per URL?
• In SQL:
– SELECT url, AVG(time) FROM visits GROUP BY url
• In MapReduce:
– Map over tuples, emit time, keyed by url
– Framework automatically groups values by keys
– Compute average in reducer
– Optimize with combiners
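One subtlety with the combiner: averages are not directly combinable, so partial results must be (sum, count) pairs that the combiner and the reducer both merge. A hedged sketch, assuming tab-delimited url/time input (names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AvgTimePerUrl {
  public static class AvgMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable k, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");        // url \t time
      ctx.write(new Text(f[0]), new Text(f[1] + ",1")); // emit (sum, count)
    }
  }
  // The combiner merges partial (sum, count) pairs without losing accuracy.
  public static class AvgCombiner extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text url, Iterable<Text> vals, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0, cnt = 0;
      for (Text v : vals) {
        String[] p = v.toString().split(",");
        sum += Long.parseLong(p[0]);
        cnt += Long.parseLong(p[1]);
      }
      ctx.write(url, new Text(sum + "," + cnt));
    }
  }
  public static class AvgReducer
      extends Reducer<Text, Text, Text, DoubleWritable> {
    public void reduce(Text url, Iterable<Text> vals, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0, cnt = 0;
      for (Text v : vals) {
        String[] p = v.toString().split(",");
        sum += Long.parseLong(p[0]);
        cnt += Long.parseLong(p[1]);
      }
      ctx.write(url, new DoubleWritable((double) sum / cnt));
    }
  }
  // Driver: job.setCombinerClass(AvgCombiner.class)
}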
Relational Joins
Natural Join Operation – Example
• Relations r, s:
Natural Join Example
Types of Relationships
Join Algorithms in MapReduce
• Reduce-side join
• Map-side join
• In-memory join
– Striped variant
– Memcached variant
Reduce-side Join
• Basic idea: group by join key
– Map over both sets of tuples
– Emit tuple as value with join key as the intermediate key
– Execution framework brings together tuples sharing the same key
– Perform actual join in reducer
– Similar to a “sort-merge join” in database terminology
• Two variants
– 1-to-1 joins
– 1-to-many and many-to-many joins
Reduce-side Join: 1-to-1
Reduce-side Join: 1-to-many
Reduce-side Join: V-to-K Conversion
Reduce-side Join: many-to-many
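A hedged sketch of the one-to-many case, joining user demographics with page visits on user id. The tags, file-name convention, and field layout are assumptions; note that buffering values in the reducer is exactly what value-to-key conversion avoids:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceSideJoin {
  public static class JoinMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable k, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");   // join key in field 0
      // Tag each tuple with its source relation (inferred from file name).
      String tag = ((FileSplit) ctx.getInputSplit()).getPath().getName()
                   .startsWith("users") ? "U" : "V";
      ctx.write(new Text(f[0]), new Text(tag + "\t" + line));
    }
  }

  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> vals, Context ctx)
        throws IOException, InterruptedException {
      String user = null;
      List<String> visits = new ArrayList<>();    // buffered in memory
      for (Text v : vals) {
        String s = v.toString();
        if (s.startsWith("U")) user = s.substring(2);
        else visits.add(s.substring(2));
      }
      if (user == null) return;                   // inner join semantics
      for (String visit : visits)
        ctx.write(key, new Text(user + "\t" + visit));
    }
  }
}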
Map-side Join: Basic Idea
Assume two datasets are sorted by the join key:
Map-side Join: Parallel Scans
• If datasets are sorted by join key, join can be accomplished by a scan over both datasets
• How can we accomplish this in parallel?
– Partition and sort both datasets in the same manner
• In MapReduce:
– Map over one dataset, read from other corresponding partition
– No reducers necessary (unless to repartition or resort)
• Consistently partitioned datasets: realistic to expect?
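A minimal sketch of a map-side merge join under these assumptions: both datasets were partitioned and sorted identically by an earlier job, partition files share names across directories, keys compare lexicographically, and each split covers a whole partition file. The join.other.dir property and one-to-one matching are further simplifications:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MergeJoinMapper
    extends Mapper<LongWritable, Text, Text, Text> {
  private BufferedReader other;  // matching partition of the other dataset
  private String otherLine;

  @Override
  protected void setup(Context ctx) throws IOException {
    // Open the identically named partition of the other dataset,
    // e.g. .../A/part-00007 joins against .../B/part-00007.
    Path mine = ((FileSplit) ctx.getInputSplit()).getPath();
    Path theirs = new Path(ctx.getConfiguration().get("join.other.dir"),
                           mine.getName());
    FileSystem fs = theirs.getFileSystem(ctx.getConfiguration());
    other = new BufferedReader(new InputStreamReader(fs.open(theirs)));
    otherLine = other.readLine();
  }

  @Override
  public void map(LongWritable k, Text line, Context ctx)
      throws IOException, InterruptedException {
    String myKey = line.toString().split("\t")[0];
    // Both inputs are sorted, so advance the other side until it
    // catches up with the current key (one-to-one matching assumed).
    while (otherLine != null
           && otherLine.split("\t")[0].compareTo(myKey) < 0)
      otherLine = other.readLine();
    if (otherLine != null && otherLine.split("\t")[0].equals(myKey))
      ctx.write(new Text(myKey), new Text(line + "\t" + otherLine));
  }

  @Override
  protected void cleanup(Context ctx) throws IOException {
    other.close();
  }
}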
In-Memory Join
• Basic idea: load one dataset into memory, stream over other dataset
– Works if R << S and R fits into memory
– Called a “hash join” in database terminology
• MapReduce implementation
– Distribute R to all nodes
– Map over S, each mapper loads R in memory, hashed by join key
– For every tuple in S, look up join key in R
– No reducers, unless for regrouping or resorting tuples
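A minimal sketch, assuming the small relation R has been shipped to every node (e.g., via the distributed cache) and its location is passed in a hypothetical join.small.table property:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMemoryJoinMapper
    extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> r = new HashMap<>();

  @Override
  protected void setup(Context ctx) throws IOException {
    // Load the small relation R, hashed by join key (field 0).
    Path p = new Path(ctx.getConfiguration().get("join.small.table"));
    FileSystem fs = p.getFileSystem(ctx.getConfiguration());
    try (BufferedReader in =
             new BufferedReader(new InputStreamReader(fs.open(p)))) {
      String line;
      while ((line = in.readLine()) != null)
        r.put(line.split("\t")[0], line);
    }
  }

  @Override
  public void map(LongWritable k, Text line, Context ctx)
      throws IOException, InterruptedException {
    // Stream over S: probe the in-memory hash table for each tuple.
    String key = line.toString().split("\t")[0];
    String match = r.get(key);
    if (match != null)
      ctx.write(new Text(key), new Text(match + "\t" + line));
  }
}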
In-Memory Join: Variants
• Striped variant:
– R too big to fit into memory?
– Divide R into R1, R2, R3, … s.t. each Rn fits into memory
– Perform in-memory join: ∀n, Rn ⋈ S
– Take the union of all join results
• Memcached join:
– Load R into memcached
– Replace in-memory hash lookup with memcached lookup
Memcached
Memcached Join
• Memcached join:
– Load R into memcached
– Replace in-memory hash lookup with memcached lookup
• Capacity and scalability?
– Memcached capacity >> RAM of individual node
– Memcached scales out with cluster
• Latency?
– Memcached is fast (basically, speed of network)
– Batch requests to amortize latency costs
Which join to use?
• In-memory join > map-side join > reduce-side join
– Why?
• Limitations of each?
– In-memory join: memory
– Map-side join: sort order and partitioning
– Reduce-side join: general purpose
Processing Relational Data: Summary
• MapReduce algorithms for processing relational data:
– Group by, sorting, partitioning are handled automatically by shuffle/sort in MapReduce
– Selection, projection, and other computations (e.g., aggregation) are performed either in the mapper or the reducer
– Multiple strategies for relational joins
• Complex operations require multiple MapReduce jobs
– Example: top ten URLs in terms of average time spent
– Opportunities for automatic optimization
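For the top-ten example, job 1 computes the per-URL averages (as sketched earlier) and a second job finds the ten largest; a hedged sketch of job 2, where each mapper keeps a local top ten and a single reducer merges the candidates (names are illustrative, and ties on the average collapse in this simplification):

import java.io.IOException;
import java.util.TreeMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TopTenMapper
    extends Mapper<LongWritable, Text, NullWritable, Text> {
  // Keeps the ten largest averages seen by this mapper.
  private final TreeMap<Double, String> top = new TreeMap<>();

  @Override
  public void map(LongWritable k, Text line, Context ctx) {
    String[] f = line.toString().split("\t");   // url \t avg_time
    top.put(Double.parseDouble(f[1]), f[0]);
    if (top.size() > 10)
      top.remove(top.firstKey());               // evict the smallest
  }

  @Override
  protected void cleanup(Context ctx)
      throws IOException, InterruptedException {
    // Emit this mapper's local top ten; a single reducer applies the
    // same TreeMap logic to all candidates for the global top ten.
    for (java.util.Map.Entry<Double, String> e : top.entrySet())
      ctx.write(NullWritable.get(),
                new Text(e.getValue() + "\t" + e.getKey()));
  }
}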
Evolving roles for
relational database and MapReduce
OLTP/OLAP/Hadoop Architecture
Need for High-Level Languages
• Hadoop is great for large-data processing!
– But writing Java programs for everything is verbose and slow
– Analysts don’t want to (or can’t) write Java
• Solution: develop higher-level data processing languages
– Hive: HQL is like SQL
– Pig: Pig Latin is a bit like Perl
Hive and Pig
• Hive: data warehousing application in Hadoop
– Query language is HQL, variant of SQL
– Tables stored on HDFS as flat files
– Developed by Facebook, now open source
• Pig: large-scale data processing system
– Scripts are written in Pig Latin, a dataflow language
– Developed by Yahoo!, now open source
– Roughly 1/3 of all Yahoo! internal jobs
• Common idea:
– Provide higher-level language to facilitate large-data processing
– Higher-level language “compiles down” to Hadoop jobs