Monday, May 19, 2014

How To Gain Business Value From Hadoop


Apache Hadoop is the hottest thing in Big Data technology: a software framework that allows distributed processing of large data sets across clusters of commodity servers using a simple programming model. This open source platform is managed by the Apache Software Foundation and was named after a toy elephant. It has proven very helpful for storing and managing vast amounts of data efficiently and cheaply. Hadoop is designed to scale up from a single machine to thousands, each offering local storage and computation, and is complemented by subprojects such as HDFS, MapReduce, Hive, Pig and HBase.
Wondering what makes Hadoop special?
The inspiration for Hadoop came from Google's work on MapReduce, a programming model for distributed computing that allowed big data to be stored, managed and accessed across multiple servers. Doug Cutting then created Hadoop to manage data that is too big for conventional databases.
The library is designed with a high degree of fault tolerance: instead of relying on high-end hardware, the resiliency of clusters comes from the framework's ability to detect and handle failures at the application level.
Two important qualities of Apache Hadoop are its efficiency and robustness. A Big Data application will continue to run even when individual servers fail, and Hadoop does not require the application to shuttle large volumes of data across the network.
If you look deeper, you will realise that Hadoop is also modular: you can swap any of its components for a different software tool, making the whole architecture flexible, robust and highly efficient.
One of the reasons for Hadoop's popularity is that companies like Google, Yahoo, Facebook and Amazon use it on huge data sets. It makes short work of large tasks, managing the big data flowing through these online giants. But they aren't the only ones who can benefit from it; enterprises are also adopting it on larger scales, as the technology flows out of the big internet companies and is adapted for the enterprise environment.
                  
How Does Hadoop Work?
Unlike traditional applications and hardware architectures, Hadoop applications are more like batch jobs that transform data from a database into a data warehouse. In general terms, Hadoop takes chunks of data, performs actions on them and hands the results over to another application to use, and all of this happens at massive scale. That scale is achieved through distributed computing, i.e. a large number of commodity computers processing the data at the same time. The only way to achieve this type of computing in a quick timeframe is to distribute the work to thousands of commodity servers. These server nodes are grouped into racks and further into clusters, all connected by a high-speed network.
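To make that split-process-combine idea concrete before looking at Hadoop's own components, here is a toy, single-machine sketch in plain Java (no Hadoop involved): a few strings stand in for data blocks, and a small thread pool stands in for the cluster's worker nodes. All names and data here are illustrative, not from the original post.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy illustration of split -> process -> combine on one machine.
// Hadoop applies the same pattern, but across thousands of servers,
// with HDFS holding the chunks and MapReduce running the work.
public class SplitProcessCombine {
    public static void main(String[] args) throws Exception {
        // "Chunks" of data, standing in for HDFS blocks.
        List<String> chunks = Arrays.asList(
                "hadoop stores big data",
                "hadoop processes big data",
                "bigdata needs hadoop");

        // A small thread pool stands in for the cluster's worker nodes.
        ExecutorService workers = Executors.newFixedThreadPool(3);
        List<Future<Map<String, Integer>>> partials = new ArrayList<>();

        // Each worker processes only its own chunk (the "map" idea).
        for (String chunk : chunks) {
            partials.add(workers.submit(() -> {
                Map<String, Integer> counts = new HashMap<>();
                for (String word : chunk.split("\\s+")) {
                    counts.merge(word, 1, Integer::sum);
                }
                return counts;
            }));
        }

        // Partial results are handed over and merged (the "reduce" idea).
        Map<String, Integer> total = new HashMap<>();
        for (Future<Map<String, Integer>> partial : partials) {
            partial.get().forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        workers.shutdown();

        System.out.println(total); // e.g. {hadoop=3, big=2, data=2, ...}
    }
}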

Hadoop has two main subprojects, plus the component that coordinates work between them:

1.  HDFS (Hadoop Distributed File System)
This component manages the data: it splits it, places it on different nodes and replicates it. HDFS spans all nodes in the cluster for data storage and links the file systems on the local nodes into one big file system. It achieves reliability by replicating data across multiple nodes.
HDFS runs on a cluster of nodes and can handle a very large number of files; it breaks large files into blocks spread across multiple nodes for efficient storage.
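As a small illustration (not from the original post), the sketch below uses Hadoop's Java FileSystem API to copy a local file into HDFS and then list a directory. The file and directory paths are made-up examples, and it assumes a reachable cluster configured through the usual client configuration files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS and the rest of the client configuration
        // (core-site.xml / hdfs-site.xml) from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS. The NameNode records the metadata;
        // the file's blocks are written to, and replicated across, DataNodes.
        Path localFile = new Path("/tmp/blog-posts.txt");   // hypothetical local path
        Path hdfsFile  = new Path("/data/blog-posts.txt");  // hypothetical HDFS path
        fs.copyFromLocalFile(localFile, hdfsFile);

        // List the target directory: to the client it looks like one big
        // file system, even though the blocks live on many nodes.
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        fs.close();
    }
}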

                  

2.  MapReduce
This component understands the job and assigns work to the nodes in the cluster. It performs parallel data processing, tracks progress and assembles the result of the job.
Hadoop is supplemented by an ecosystem of Apache projects such as Pig, Hive and ZooKeeper that extend its overall value and improve usability.
A relational database is analyzed using SQL queries. Non-relational databases also use queries but are not constrained to SQL and can use other query languages to pull information from the database. Hadoop, however, is more of a data warehousing system than a query engine, so it needs a framework like MapReduce to process and analyze its data.
MapReduce provides an efficient and fast way of running computations over big data. Instead of copying files to a central system for querying, it runs the computation on the same nodes on which the data is stored; after the reduce phase, the results are sent back to the user program.
MapReduce runs a series of jobs, each being a separate Java application. There are two steps in the MapReduce process: Map and Reduce. Suppose you have a collection of blog posts about Big Data and want to count the number of times the words "hadoop" and "bigdata" are mentioned. First the input files are split across HDFS, then every node runs the map computation over its own portion of the data, counting the number of times those words show up. The output of each map is a list of key-value pairs, which is then sent on to other nodes as input for the reduce step. Before the reduce step begins, these key-value pairs are shuffled and sorted. The reduce step then sums each word's list of counts into a single entry per word.
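A minimal sketch of that word-count example using the standard Hadoop MapReduce Java API is shown below. The class names, the two keywords and the simple tokenization are illustrative assumptions, not part of the original post; the driver that actually submits this job is sketched under the Job Tracker section that follows.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KeywordCount {

    // Map step: each node scans its own split of the input and emits a
    // (word, 1) pair for every occurrence of the keywords we care about.
    public static class KeywordMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final Set<String> KEYWORDS =
                new HashSet<>(Arrays.asList("hadoop", "bigdata"));
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().toLowerCase().split("\\W+")) {
                if (KEYWORDS.contains(token)) {
                    word.set(token);
                    context.write(word, ONE); // emit (keyword, 1)
                }
            }
        }
    }

    // Reduce step: after the shuffle, all counts for a given word arrive at
    // the same reducer, which sums them into a single entry per word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum)); // emit (keyword, total)
        }
    }
}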
          

3.  Job Tracker
Part of the MapReduce layer, the JobTracker is one of the most important components of Hadoop and is responsible for managing everything described above. If we had to manually divide terabytes of data and copy it to thousands of computers, it would take forever just to kick a job off, so a set of components automates this step. The entire process is a Job in Hadoop, and the JobTracker divides each job into tasks and schedules them to run on nodes. It keeps track of all participating nodes, handles failures and monitors progress. A TaskTracker on each node runs the tasks and reports back to the JobTracker. With this arrangement Hadoop can distribute jobs across a large number of nodes in parallel.
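To tie this back to the word-count sketch above, here is a hypothetical driver that configures the job, submits it and polls the framework for map and reduce progress while the cluster's scheduling layer farms the tasks out to the nodes. The class names come from the earlier sketch, and the input and output HDFS paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeywordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "keyword count");

        // Wire in the mapper and reducer from the KeywordCount sketch above.
        job.setJarByClass(KeywordCount.class);
        job.setMapperClass(KeywordCount.KeywordMapper.class);
        job.setCombinerClass(KeywordCount.SumReducer.class); // local pre-aggregation
        job.setReducerClass(KeywordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS, taken from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job, let the scheduler split it into tasks, and poll
        // for progress while those tasks run on the nodes holding the data.
        job.submit();
        while (!job.isComplete()) {
            System.out.printf("map %3.0f%%  reduce %3.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        System.exit(job.isSuccessful() ? 0 : 1);
    }
}

Packaged into a jar, such a driver would typically be launched with the hadoop jar command, pointing it at an input directory and a not-yet-existing output directory in HDFS.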
Hadoop is the most widely used system for managing large amounts of data and has surely changed the dynamics and economics of large-scale computing.

So let's boil down some of its powerful capabilities and salient characteristics -
1. Flexibility -
Hadoop can absorb any type of structured or unstructured data and is schema-less. Data from any number of sources can be aggregated and combined in arbitrary ways for deeper analysis.

2.  Scalability -
New nodes can be added easily without changing data formats, how jobs are written or how data is loaded. Unlike a traditional data warehouse, Hadoop does not require data to be converted into a different format, so nothing is lost in translation, and analysts remain free to choose when and how to perform their analysis.

3. Cost Efficiency -
It offers massively parallel computing on commodity hardware, which results in a sizeable decrease in cost per terabyte of storage, and nodes can be added or removed as project demands change. This ability is driving organizations to harness data for projects that previously never made any business sense.

4.  Complex Data Analysis -
Hadoop handles complex and diverse data such as images, videos, text, real-time feeds, device data and scientific sensor readings. While it is often associated with petabytes of data, many organizations process data at the terabyte scale; either way, Hadoop's MapReduce framework abstracts away the complexity of distributed parallel processing across multiple nodes, providing huge benefits of scale.

5.  Fault Tolerance  -
It is highly robust: jobs continue to run even when individual machines in a cluster fail, and if you lose a node the system redirects its work to another location without a break in processing.

Some important sectors where its powerful features can be leveraged include -
1.  Social Media Data - With Hadoop you can mine social media data and conversations in real time and make decisions that increase market share. It has been used to track media consumption and engagement, advertising, customer retention and operations. The video gaming industry is one of its heaviest users, analyzing player performance and tracking behaviour during gameplay.
2.  Clickstream Data - Hadoop can help a lot with customer segmentation, making it easier to visualize and understand how visitors behave on your website. This clickstream data can be used to improve conversions and reduce bounce rates.
3.  Server Log Data - Hadoop provides a low-cost platform for analyzing server logs, speeding up and improving security forensics. With Hadoop it is easy to store logs, identify and refine patterns, and surface insights that ease the business decision process.
4.  Financial & Automotive Industry - The financial sector uses Hadoop to analyze large-scale investments and to make better financial decisions. In the automotive industry it is used to identify issues with travel arrangements, lower maintenance costs, avoid collisions and more.
Hadoop's biggest advantage is its speed: it can generate comprehensive reports that would otherwise have taken weeks. Its possibilities in the enterprise are endless, and that's why it will remain the big elephant in the Big Data room for some time to come.
To discuss how we can help you, please contact our team at info@oodlestechnologies.com or on Skype at oodles.tech