Database and MapReduce

RoadMap

Role of relational databases in today’s organizations

Where does MapReduce fit in?

MapReduce algorithms for processing relational data

How do I perform a join, etc.?

Evolving roles of relational databases and MapReduce

What’s in store for the future?

Relational Database Basics

Basic Structure

Formally, given sets D1, D2, …, Dn, a relation r is a subset of

        D1 × D2 × … × Dn

Thus, a relation is a set of n-tuples (a1, a2, …, an) where each ai ∈ Di

Example:

customer_name   = {Jones, Smith, Curry, Lindsay}

customer_street = {Main, North, Park}

customer_city   = {Harrison, Rye, Pittsfield}

Then r = {   (Jones, Main, Harrison), 

                   (Smith, North, Rye),

                   (Curry, North, Rye),

                   (Lindsay, Park, Pittsfield) }

is a relation over customer_name, customer_street, customer_city

Relation Schema

A1, A2, …, An are attributes


R = (A1, A2, …, An ) is a relation schema

Example:

Customer_schema = (customer_name, customer_street, customer_city)


r(R) is a relation on the relation schema R

Example:

customer (Customer_schema)

Relation Instance

The current values (relation instance) of a relation are specified by a table

An element t of r is a tuple, represented by a row in a table

Database

A database consists of multiple relations

Information about an enterprise is broken up into parts, with each relation storing one part of the information

account: stores information about accounts

depositor: stores information about which customer owns which account

customer: stores information about customers

Storing all information as a single relation such as

   bank(account_number, balance, customer_name, …)

results in repetition of information (e.g., when two customers own the same account) and the need for null values (e.g., to represent a customer without an account)

Banking Example

branch (branch-name, branch-city, assets)


customer (customer-name, customer-street, customer-city)


account (account-number, branch-name, balance)


loan (loan-number, branch-name, amount)


depositor (customer-name, account-number)


borrower (customer-name, loan-number)

Relational Algebra

Primitives

Projection ()

Selection ()

Cartesian product ()

Set union ()

Set difference ()

Rename ()

Other operations

Join (⋈)

Group by… aggregation

Big Data Analysis

Peta-scale datasets are everywhere:

Facebook has 2.5 PB of user data + 15 TB/day (4/2009) 

eBay has 6.5 PB of user data + 50 TB/day (5/2009)

A lot of these datasets are (mostly) structured

Query logs

Point-of-sale records

User data (e.g., demographics)

How do we perform data analysis at scale?

Relational databases and SQL

MapReduce (Hadoop)

Relational Databases vs. MapReduce

Relational databases:

Multipurpose: analysis and transactions; batch and interactive

Data integrity via ACID transactions

Lots of tools in software ecosystem (for ingesting, reporting, etc.)

Supports SQL (and SQL integration, e.g., JDBC)

Automatic SQL query optimization

MapReduce (Hadoop):

Designed for large clusters, fault tolerant

Data is accessed in “native format”

Supports many query languages

Programmers retain control over performance

Open source

Database Workloads

OLTP (online transaction processing)

Typical applications: e-commerce, banking, airline reservations

User-facing: real-time, low-latency, highly concurrent

Tasks: relatively small set of “standard” transactional queries

Data access pattern: random reads, updates, writes (involving relatively small amounts of data)

OLAP (online analytical processing)

Typical applications: business intelligence, data mining

Back-end processing: batch workloads, less concurrency

Tasks: complex analytical queries, often ad hoc

Data access pattern: table scans, large amounts of data involved per query


One Database or Two?

Downsides of co-existing OLTP and OLAP workloads

Poor memory management

Conflicting data access patterns

Variable latency

Solution: separate databases

User-facing OLTP database for high-volume transactions

Data warehouse for OLAP workloads

How do we connect the two?

OLTP/OLAP Architecture

OLTP/OLAP Integration

OLTP database for user-facing transactions

Retain records of all activity

Periodic ETL (e.g., nightly)

Extract-Transform-Load (ETL)

Extract records from source

Transform: clean data, check integrity, aggregate, etc.

Load into OLAP database

OLAP database for data warehousing

Business intelligence: reporting, ad hoc queries, data mining, etc.

Feedback to improve OLTP services

Warehouse Models & Operators

Data Models

relations

stars & snowflakes

cubes

Operators

slice & dice

roll-up, drill down

pivoting

other

Star

Star Schema

Terms

Fact table

Dimension tables

Measures

Dimension Hierarchies

Cube

3-D Cube

ROLAP vs. MOLAP

ROLAP:

Relational On-Line Analytical Processing

MOLAP:

Multi-Dimensional On-Line Analytical Processing

Typical OLAP Queries 

The typical OLAP query will:

Start with a star join.

Select for interesting tuples, based on dimension data.

Group by one or more dimensions.

Aggregate certain attributes of the result.

Aggregates

Aggregates

Another Example

Aggregates

Operators: sum, count, max, min, median, avg

“Having” clause

Using dimension hierarchy

average by region (within store)

maximum by month (within date)

Cube Aggregation

Cube Operators

Extended Cube

Aggregation Using Hierarchies

Pivoting

Business Intelligence

Premise: more data leads to better business decisions

Periodic reporting as well as ad hoc queries

Analysts, not programmers (importance of tools and dashboards)

Examples:

Slicing-and-dicing activity by different dimensions to better understand the marketplace

Analyzing log data to improve OLTP experience

Analyzing log data to better optimize ad placement

Analyzing purchasing trends for better supply-chain management

Mining for correlations between otherwise unrelated activities


OLTP/OLAP Architecture: Hadoop?

OLTP/OLAP/Hadoop Architecture

ETL Bottleneck

Reporting is often a nightly task:

ETL is often slow: why?

What happens if processing 24 hours of data takes longer than 24 hours?

Hadoop is perfect:

Most likely, you already have some data warehousing solution

Ingest is limited by speed of HDFS

Scales out with more nodes

Massively parallel

Ability to use any processing tool

Much cheaper than parallel databases

ETL is a batch process anyway!

MapReduce algorithms 

for processing relational data

Design Pattern: Secondary Sorting

MapReduce sorts input to reducers by key

Values are arbitrarily ordered

What if want to sort value also?

E.g., k → (v1, r), (v3, r), (v4, r), (v8, r)…


Secondary Sorting: Solutions

Solution 1:

Buffer values in memory, then sort

Why is this a bad idea?

Solution 2:

“Value-to-key conversion” design pattern: form composite intermediate key, (k, v1)

Let execution framework do the sorting

Preserve state across multiple key-value pairs to handle processing

Anything else we need to do?
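
A minimal local sketch of this value-to-key conversion pattern in Python (the records and composite keys are illustrative assumptions, not Hadoop API calls); sorted()/groupby stand in for the framework's shuffle and sort:

```python
# Sketch of "value-to-key conversion", simulated locally. In a real Hadoop
# job the composite key would be a custom WritableComparable, the
# partitioner would hash only on k, and the framework would do the sorting.
from itertools import groupby

records = [                      # hypothetical (k, v, r) input tuples
    ("sensor1", 9, "r4"),
    ("sensor1", 3, "r1"),
    ("sensor2", 5, "r2"),
    ("sensor1", 7, "r3"),
]

def mapper(k, v, r):
    yield (k, v), r              # move the value v into a composite key

intermediate = [kv for rec in records for kv in mapper(*rec)]

# The framework sorts by the full composite key (k, then v)...
intermediate.sort(key=lambda pair: pair[0])

# ...but partitions/groups only on the original key k, so each reducer
# sees one k with its values already ordered by v.
for k, group in groupby(intermediate, key=lambda pair: pair[0][0]):
    print(k, [(ck[1], val) for ck, val in group])
```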

Value-to-Key Conversion

Working Scenario

Two tables:

User demographics (gender, age, income, etc.)

User page visits (URL, time spent, etc.)

Analyses we might want to perform:

Statistics on demographic characteristics

Statistics on page visits

Statistics on page visits by URL

Statistics on page visits by demographic characteristic

Relational Algebra

Primitives

Projection ()

Selection ()

Cartesian product ()

Set union ()

Set difference ()

Rename ()

Other operations

Join (⋈)

Group by… aggregation

Projection 

Projection in MapReduce

Easy!

Map over tuples, emit new tuples with appropriate attributes

No reducers, unless for regrouping or resorting tuples

Alternatively: perform in reducer, after some other processing

Basically limited by HDFS streaming speeds

Speed of encoding/decoding tuples becomes important

Relational databases take advantage of compression

Semistructured data? No problem!
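
A minimal sketch of such a map-only projection as a Hadoop Streaming style Python mapper; the tab-separated layout and the columns kept are assumptions, and the job would run with zero reducers:

```python
#!/usr/bin/env python3
# Map-only projection (Hadoop Streaming style): read tab-separated tuples
# from stdin, keep only selected attributes, emit the new tuples.
# The column layout (user_id, url, time, referrer) is an assumption.
import sys

KEEP = [0, 1]                    # project onto (user_id, url)

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    print("\t".join(fields[i] for i in KEEP))
```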


Selection

Selection in MapReduce

Easy!

Map over tuples, emit only tuples that meet criteria

No reducers, unless for regrouping or resorting tuples

Alternatively: perform in reducer, after some other processing

Basically limited by HDFS streaming speeds

Speed of encoding/decoding tuples becomes important

Relational databases take advantage of compression

Semistructured data? No problem!
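
Analogously, a hedged sketch of selection as a map-only Streaming script; the record layout and the predicate (time spent over 60 seconds) are assumptions:

```python
#!/usr/bin/env python3
# Map-only selection (Hadoop Streaming style): emit only tuples that
# satisfy the predicate. Layout (user_id, url, time_spent) is an assumption.
import sys

for line in sys.stdin:
    user_id, url, time_spent = line.rstrip("\n").split("\t")
    if float(time_spent) > 60.0:   # selection predicate
        print("\t".join((user_id, url, time_spent)))
```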

Group by… Aggregation

Example: What is the average time spent per URL?

In SQL:

SELECT url, AVG(time) FROM visits GROUP BY url

In MapReduce:

Map over tuples, emit time, keyed by url

Framework automatically groups values by keys

Compute average in reducer

Optimize with combiners
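
A local Python sketch of this query; the visits tuples are made up, and emitting (sum, count) pairs rather than raw times is what makes the aggregation combiner-friendly:

```python
# GROUP BY url / AVG(time) in MapReduce, simulated locally.
from collections import defaultdict

visits = [("a.com", 10), ("b.com", 20), ("a.com", 30), ("b.com", 40)]

def mapper(url, time_spent):
    yield url, (time_spent, 1)     # partial (sum, count), combiner-friendly

def reducer(url, values):
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    yield url, total / count       # AVG(time) per url

groups = defaultdict(list)         # shuffle/sort: framework groups by key
for url, t in visits:
    for k, v in mapper(url, t):
        groups[k].append(v)

for url, values in groups.items():
    print(list(reducer(url, values)))   # e.g. [('a.com', 20.0)]
```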


Relational Joins

Natural Join Operation – Example

Relations r, s:

Natural Join Example

Types of Relationships

Join Algorithms in MapReduce

Reduce-side join

Map-side join

In-memory join

Striped variant

Memcached variant

Reduce-side Join

Basic idea: group by join key

Map over both sets of tuples

Emit tuple as value with join key as the intermediate key

Execution framework brings together tuples sharing the same key

Perform actual join in reducer

Similar to a “sort-merge join” in database terminology

Two variants

1-to-1 joins

1-to-many and many-to-many joins
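
A local sketch of the reduce-side join on the working scenario; the demographics and visits tuples are assumptions, and tagging each tuple with its source relation is what lets the reducer tell the two sides apart:

```python
# Reduce-side join, simulated locally: map over both relations, tag each
# tuple with its source, group by the join key, join in the reducer.
from collections import defaultdict

users  = [("u1", "male", 25), ("u2", "female", 32)]           # demographics
visits = [("u1", "a.com", 10), ("u1", "b.com", 5), ("u2", "a.com", 7)]

def map_users(uid, gender, age):
    yield uid, ("U", (gender, age))

def map_visits(uid, url, time_spent):
    yield uid, ("V", (url, time_spent))

groups = defaultdict(list)         # shuffle/sort: group tuples by join key
for t in users:
    for k, v in map_users(*t):
        groups[k].append(v)
for t in visits:
    for k, v in map_visits(*t):
        groups[k].append(v)

def reduce_join(uid, values):
    left  = [v for tag, v in values if tag == "U"]
    right = [v for tag, v in values if tag == "V"]
    for l in left:                 # handles 1-to-many and many-to-many
        for r in right:
            yield (uid,) + l + r

for uid, values in groups.items():
    print(list(reduce_join(uid, values)))
```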


Reduce-side Join: 1-to-1

Reduce-side Join: 1-to-many

Reduce-side Join: V-to-K Conversion

Reduce-side Join: many-to-many

Map-side Join: Basic Idea

Assume two datasets are sorted by the join key:

Map-side Join: Parallel Scans

If datasets are sorted by join key, join can be accomplished by a scan over both datasets

How can we accomplish this in parallel?

Partition and sort both datasets in the same manner

In MapReduce:

Map over one dataset, read from other corresponding partition

No reducers necessary (unless to repartition or resort)

Consistently partitioned datasets: realistic to expect?
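
A minimal sketch of the per-partition merge a map-side join performs; both inputs are assumed already sorted and identically partitioned by the join key, and the keys in R are assumed unique (1-to-1 or 1-to-many):

```python
# Map-side (merge) join over one pair of consistently partitioned, sorted
# inputs. In Hadoop each mapper would run this over its own partition.

def merge_join(sorted_r, sorted_s):
    """Both inputs are lists of (join_key, payload), sorted by join_key;
    join keys in sorted_r are assumed unique."""
    i, j = 0, 0
    while i < len(sorted_r) and j < len(sorted_s):
        kr, ks = sorted_r[i][0], sorted_s[j][0]
        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            yield kr, sorted_r[i][1], sorted_s[j][1]
            j += 1                 # S may hold more matches for this key

R = [("u1", "male"), ("u2", "female")]
S = [("u1", "a.com"), ("u2", "b.com"), ("u3", "c.com")]
print(list(merge_join(R, S)))      # unmatched u3 tuple is dropped
```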

In-Memory Join

Basic idea: load one dataset into memory, stream over other dataset

Works if R << S and R fits into memory

Called a “hash join” in database terminology

MapReduce implementation

Distribute R to all nodes

Map over S, each mapper loads R in memory, hashed by join key

For every tuple in S, look up join key in R

No reducers, unless for regrouping or resorting tuples
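
A sketch of the hash-join lookup each mapper would perform, with made-up table contents; in a real job R would be shipped to every node (e.g., via the distributed cache) and loaded once per mapper:

```python
# In-memory (hash) join: the small relation R lives in a hash table keyed
# by the join key; the mapper streams over S and probes the table.

R = {"u1": ("male", 25), "u2": ("female", 32)}   # small side, in memory

S = [("u1", "a.com", 10), ("u2", "b.com", 5), ("u3", "c.com", 7)]

def map_join(uid, url, time_spent):
    match = R.get(uid)             # hash lookup on the join key
    if match is not None:          # inner join: drop unmatched S tuples
        yield (uid, url, time_spent) + match

for tup in S:
    for joined in map_join(*tup):
        print(joined)
```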


In-Memory Join: Variants

Striped variant:

R too big to fit into memory? 

Divide R into R1, R2, R3, … s.t. each Rn fits into memory

Perform in-memory join: ∀n, Rn ⋈ S

Take the union of all join results (see the sketch after this list)

Memcached join:

Load R into memcached

Replace in-memory hash lookup with memcached lookup
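
A local sketch of the striped variant described above; the stripe size and table contents are assumptions, and each pass over a stripe would be a separate MapReduce job in practice:

```python
# Striped in-memory join: split R into stripes that fit in memory,
# hash-join each stripe against S, and union the per-stripe results.

def stripes(r_items, stripe_size):
    for i in range(0, len(r_items), stripe_size):
        yield dict(r_items[i:i + stripe_size])   # one in-memory stripe Rn

R_items = [("u1", "male"), ("u2", "female"), ("u3", "male"), ("u4", "other")]
S = [("u2", "a.com"), ("u4", "b.com")]

results = []
for Rn in stripes(R_items, stripe_size=2):       # each Rn fits in memory
    for uid, url in S:                           # stream over S per stripe
        if uid in Rn:
            results.append((uid, url, Rn[uid]))

print(results)                                   # union of all stripe joins
```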

Memcached

Memcached Join

Memcached join:

Load R into memcached

Replace in-memory hash lookup with memcached lookup

Capacity and scalability?

Memcached capacity >> RAM of individual node

Memcached scales out with cluster

Latency?

Memcached is fast (basically, speed of network)

Batch requests to amortize latency costs


Which join to use?

In-memory join > map-side join > reduce-side join

Why?

Limitations of each?

In-memory join: memory

Map-side join: sort order and partitioning

Reduce-side join: general purpose

Processing Relational Data: Summary

MapReduce algorithms for processing relational data:

Group by, sorting, partitioning are handled automatically by shuffle/sort in MapReduce

Selection, projection, and other computations (e.g., aggregation), are performed either in mapper or reducer

Multiple strategies for relational joins

Complex operations require multiple MapReduce jobs

Example: top ten URLs in terms of average time spent

Opportunities for automatic optimization


Evolving roles for 

relational database and MapReduce

OLTP/OLAP/Hadoop Architecture

Need for High-Level Languages

Hadoop is great for large-data processing!

But writing Java programs for everything is verbose and slow

Analysts don’t want to (or can’t) write Java

Solution: develop higher-level data processing languages

Hive: HQL is like SQL

Pig: Pig Latin is a bit like Perl

Hive and Pig

Hive: data warehousing application in Hadoop

Query language is HQL, variant of SQL

Tables stored on HDFS as flat files

Developed by Facebook, now open source

Pig: large-scale data processing system

Scripts are written in Pig Latin, a dataflow language

Developed by Yahoo!, now open source

Roughly 1/3 of all Yahoo! internal jobs

Common idea:

Provide higher-level language to facilitate large-data processing

Higher-level language “compiles down” to Hadoop jobs





