Apache Pig Tutorial - Tuple & Bag - Big Data In Real World

Apache Pig Tutorial – Tuple & Bag

The goal of this tutorial is to learn Apache Pig concepts at a fast pace, so don’t expect lengthy posts. All posts will be short and sweet, and most will have a (very short) “see it in action” video.

So far we have been using simple datatypes in Pig, such as chararray, float and int. In this post we will look at two complex types in Pig: Tuple and Bag.

To demonstrate this, look at the set of instructions below, which group the stock records from the year 2003 by symbol. Let’s describe grp_by_sym to look at its structure.

grunt> stocks = LOAD '/user/hirw/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);

grunt> filter_by_yr = FILTER stocks by GetYear(date) == 2003;

grunt> grp_by_sym = GROUP filter_by_yr BY symbol;

grunt> DESCRIBE grp_by_sym;

Dissect grp_by_sym

grunt> DESCRIBE grp_by_sym;
grp_by_sym: {group: chararray,filter_by_yr: {(exchange: chararray,symbol: chararray,date: datetime,open: float,high: float,low: float,close: float,volume: int,adj_close: float)}}

From the output, we can see that the datatype of group is chararray. What is the datatype of filter_by_yr?

You can see that the structure of filter_by_yr has a curly brace { followed by a parenthesis (. In DESCRIBE output, curly braces denote a Bag and parentheses denote a Tuple.

Tuple vs. Bag

A Tuple is nothing but a record: an ordered collection of columns (fields), in our case exchange, symbol, date etc. A Bag is nothing but a collection of records, or Tuples. So if you look at the structure of filter_by_yr below, you can see that filter_by_yr is a bag; in other words, filter_by_yr is a collection of records with the columns exchange, symbol, date, open etc.

grunt> DESCRIBE grp_by_sym;
grp_by_sym: {group: chararray,filter_by_yr: {(exchange: chararray,symbol: chararray,date: datetime,open: float,high: float,low: float,close: float,volume: int,adj_close: float)}}
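Because filter_by_yr is a bag inside each grp_by_sym record, functions that accept a bag can be applied to it directly. The statements below are a minimal sketch of that idea, assuming the grp_by_sym relation defined earlier is still available in the session. The alias vol_by_sym and the output column names are made up for illustration and are not part of the original tutorial.

```pig
-- Sketch only: vol_by_sym, num_records and avg_volume are illustrative names.
-- group is the chararray key of each record in grp_by_sym.
-- COUNT takes the whole bag and returns the number of tuples in it.
-- filter_by_yr.volume projects the volume column out of every tuple in the bag,
-- giving a bag of single-column tuples that AVG can consume.
vol_by_sym = FOREACH grp_by_sym GENERATE group AS symbol, COUNT(filter_by_yr) AS num_records, AVG(filter_by_yr.volume) AS avg_volume;

-- Inspect the resulting structure: each record now holds simple types only.
DESCRIBE vol_by_sym;
```

Each record of vol_by_sym should hold one symbol together with two values computed from that symbol's bag. COUNT returns a long and AVG returns a double, so the nested bag no longer appears in the described schema.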

You can also say that grp_by_sym is a bag because it has curly braces. Although the parentheses are omitted from the display in this case, grp_by_sym represents a bag, that is, a collection of records or Tuples. We can define grp_by_sym as a collection of Tuples with 2 columns in each Tuple: group, which is of type chararray, and filter_by_yr, which is of type bag.
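The same convention can be checked on the relation before it was grouped. As a small sketch, assuming the filter_by_yr alias from the earlier statements is still defined, describing it on its own should show a top-level relation printed with curly braces only, while the identical set of columns appears wrapped in parentheses once it is nested inside grp_by_sym.

```pig
-- filter_by_yr as a top-level relation: an outer bag of records,
-- displayed with curly braces and without the tuple parentheses.
DESCRIBE filter_by_yr;

-- Expected shape of the output, derived from the schema given in the LOAD statement:
-- filter_by_yr: {exchange: chararray,symbol: chararray,date: datetime,open: float,high: float,low: float,close: float,volume: int,adj_close: float}
```

Comparing this with the DESCRIBE output for grp_by_sym makes the display rule visible: the parentheses show up only where the tuple is nested inside another structure.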


Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related Big Data technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real time streaming big data platforms. We have seen a wide range of real world big data problems, implemented some innovative and complex (or simple, depending on how you look at it) solutions.
