Apache Spark – Comparing RDD, Dataframe and Dataset APIs

 

 

At Ideata analytics we have been using Apache Spark since 2013 to build data pipelines. One reason why we love Apache Spark so much is the rich abstraction of its developer API to build complex data workflows and perform data analysis with minimal development effort.

Spark 1.6 introduced the Datasets API, which provides type safety for building complex data workflows. This had so far been missing in the Dataframe API, which limited your ability to catch data manipulation errors at compile time. The Datasets API continues to take advantage of Spark’s Catalyst optimizer and fast Tungsten in-memory encoding. This API is in addition to Spark’s existing APIs (RDDs and Dataframes). In this article, we will review these APIs and understand when to use each of them.

RDD(Resilient Distributed Dataset)

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. An RDD is Spark’s representation of a set of data, spread across multiple machines in the cluster, with API to let you act on it. An RDD could come from any datasource, e.g. text files, a database via JDBC, etc. and can easily handle data with no predefined structure.

Creating an RDD

RDDs can be created by either parallelizing an existing collection of objects, or by referencing a dataset in an external storage system.

1. Parallelized Collections

SparkContext’s parallelize method allows you to create an RDD from an existing collection of objects. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how you can create a parallelized collection holding the numbers 1 to 5:

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

2. External Datasets

Spark leverages Hadoop InputFormats to create RDDs from Hadoop-supported storage systems, including the local file system, HDFS, SequenceFiles, Cassandra, HBase, Amazon S3, etc.

Text file RDDs can be created using SparkContext’s textFile method. This method takes a URI for the file (either a local path on the machine, or an hdfs://, s3n://, etc. URI) and reads it as a collection of lines. Here is an example invocation:

 val distFile: RDD[String] = sc.textFile("data.txt")
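
Once loaded, these lines can be operated on in parallel. For example, the following minimal sketch (standard RDD operations, not specific to this article) sums the lengths of all lines:

// Transformation: map each line to its length (lazy, nothing is computed yet).
val lineLengths = distFile.map(line => line.length)

// Action: sum the lengths; this triggers the distributed computation and
// returns the result to the driver.
val totalLength = lineLengths.reduce((a, b) => a + b)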

 

RDD Features:

  • Distributed collection of JVM objects: RDDs follow the MapReduce model, which is widely adopted for processing and generating large datasets with a parallel, distributed algorithm on a cluster. They let users write parallel computations using a set of high-level operators, without having to worry about work distribution and fault tolerance.
  • Immutable: An RDD is composed of a collection of records that are partitioned. A partition is the basic unit of parallelism in an RDD, and each partition is one logical, immutable division of data created through transformations on existing partitions. Immutability helps to achieve consistency in computations.
  • Fault tolerant: If we lose a partition of an RDD, we can replay the transformations on that partition from its lineage to achieve the same computation, rather than replicating data across multiple nodes. This characteristic is the biggest benefit of RDDs, because it saves a lot of effort in data management and replication and thus yields faster computations.
  • Lazy evaluations: All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset. The transformations are only computed when an action requires a result to be returned to the driver program.
  • Functional transformations: RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset (see the sketch after this list).
  • Data processing formats: RDDs can easily and efficiently process both structured and unstructured data.
  • Programming Languages supported:  RDD API is available in Java, Scala, Python and R.
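
The brief sketch below (a minimal, hypothetical example on a parallelized collection) ties several of these properties together: the transformations return immediately without running anything, toDebugString prints the lineage Spark would replay to rebuild a lost partition, and only the final action triggers the distributed computation.

val nums    = sc.parallelize(1 to 1000000)      // distributed collection of JVM objects
val squares = nums.map(n => n.toLong * n)       // transformation: lazy, nothing computed yet
val evens   = squares.filter(_ % 2 == 0)        // another lazy transformation

println(evens.toDebugString)                    // lineage used for fault tolerance

val count = evens.count()                       // action: triggers the actual computation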

RDD Limitations:

  • No inbuilt optimization engine: When working with structured data, RDDs cannot take advantage of Spark’s advanced optimizers, including the Catalyst optimizer and the Tungsten execution engine. Developers need to optimize each RDD based on its attributes.
  • Handling structured data: Unlike Dataframes and Datasets, RDDs do not infer the schema of the ingested data and require the user to specify it.

Dataframes

Spark introduced Dataframes in the Spark 1.3 release. Dataframes overcome the key challenges that RDDs had.

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or an R/Python dataframe. Along with Dataframes, Spark also introduced the Catalyst optimizer, which leverages advanced programming features to build an extensible query optimizer.

 

Creating a Spark Dataframe

You can create a Spark Dataframe using an existing RDD or from external datasources.

1. Creating Dataframe from RDD

A Dataframe can be created from an existing RDD using the toDF() method, which comes with the sqlContext.implicits package. The following code shows how it is done:

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// rdd must be an RDD of case classes, tuples or other Product types
val df = rdd.toDF()

2. Creating Dataframe from datasource

A Dataframe can also be created directly from data sources such as relational databases, file systems, etc. As an example, the following creates a DataFrame based on the content of a JSON file:

val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val df = sqlContext.read.json("examples/src/main/resources/people.json")
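
Once loaded, the named columns can be queried directly. A minimal sketch, assuming the people.json file shipped with the Spark examples (with name and age columns):

df.printSchema()                    // schema is inferred from the JSON
df.select("name").show()            // project a single column
df.filter(df("age") > 21).show()    // filter rows by a column expression
df.groupBy("age").count().show()    // aggregate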

 

Dataframe Features:

  • Distributed collection of Row Object: A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database, but with richer optimizations under the hood.
  • Data Processing: Dataframes can process structured and unstructured data formats (Avro, CSV, Elasticsearch, Cassandra) and storage systems (HDFS, Hive tables, MySQL, etc.). They can read from and write to all these various data sources.
  • Optimization using the Catalyst optimizer: Catalyst powers both SQL queries and the DataFrame API (a short sketch follows this list). Dataframes use Catalyst’s tree transformation framework in four phases:
    • Analyzing a logical plan to resolve references
    • Logical plan optimization
    • Physical planning
    • Code generation to compile parts of the query to Java bytecode.
  • Hive Compatibility: Using Spark SQL, you can run unmodified Hive queries on your existing Hive warehouses. It reuses Hive frontend and MetaStore and gives you full compatibility with existing Hive data, queries, and UDFs.
  • Tungsten: Tungsten provides a physical execution backend which explicitly manages memory and dynamically generates bytecode for expression evaluation.
  • Programming Languages supported:  Dataframe API is available in Java, Scala, Python, and R.
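
To see the Catalyst phases at work, any DataFrame query can print its plans. The sketch below reuses the df created from people.json above and assumes it has name and age columns:

val adults = df.filter(df("age") > 21).select("name")

// explain(true) prints the parsed, analyzed and optimized logical plans plus the
// physical plan produced by the four Catalyst phases listed above.
adults.explain(true)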

 

Dataframe Limitations

  • Compile-time type safety: As discussed, the Dataframe API does not support compile-time safety, which limits your ability to manipulate data when the structure is not known. The following example compiles successfully; however, you will get a runtime exception when executing it:
case class Person(name: String, age: Int)
val dataframe = sqlContext.read.json("people.json")
dataframe.filter("salary > 10000").show
// runtime exception: cannot resolve 'salary' given input columns age, name

This is especially challenging when you are working with several transformation and aggregation steps.

  • Cannot operate on domain objects (lost domain object): Once you have transformed a domain object into a Dataframe, you cannot regenerate the original object from it. In the following example, once we have created personDF from personRDD, we cannot recover the original RDD of the Person class (RDD[Person]):
case class Person(name: String, age: Int)
val personRDD = sc.makeRDD(Seq(Person("A", 10), Person("B", 20)))
val personDF = sqlContext.createDataFrame(personRDD)
personDF.rdd // returns RDD[Row], not RDD[Person]
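
To get typed objects back, you would have to rebuild them from each generic Row by hand, for example (a minimal sketch that assumes the field order name, age from the case class above):

// Manual, error-prone reconstruction of the domain objects from generic Rows.
val recoveredRDD = personDF.rdd.map(row => Person(row.getString(0), row.getInt(1)))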

 

Datasets API

Dataset API is an extension to DataFrames that provides a type-safe, object-oriented programming interface. It is a strongly-typed, immutable collection of objects that are mapped to a relational schema.

At the core of the Dataset API is a new concept called an encoder, which is responsible for converting between JVM objects and the tabular representation. The tabular representation is stored using Spark’s internal Tungsten binary format, allowing for operations on serialized data and improved memory utilization. Spark 1.6 comes with support for automatically generating encoders for a wide variety of types, including primitive types (e.g. String, Integer, Long), Scala case classes, and Java Beans.
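
For illustration, an encoder for a case class can also be obtained explicitly via the Encoders factory (a minimal sketch; Person is the case class used in the earlier examples):

import org.apache.spark.sql.Encoders

case class Person(name: String, age: Int)

// The encoder maps Person objects to and from the internal Tungsten tabular representation.
val personEncoder = Encoders.product[Person]
println(personEncoder.schema)   // the relational schema derived from the case class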

 

Creating a Dataset

Following are some ways in which you can create a Dataset:

1. Creating Dataset from JVM object collection:

Encoders for most common types are provided automatically by importing the sqlContext.implicits._ package. One can then use the .toDS() method to create a Dataset.

Below code shows how to create it from a sequence of integers:

val ds = Seq(1, 2, 3).toDS()

2. Creating Dataset from datasource

A Dataset can also be created directly from external data sources like relational databases, the local file system, HDFS, etc. As an example, the following creates a Dataset based on the content of a JSON file:

val path = "examples/src/main/resources/people.json"
val people = sqlContext.read.json(path).as[Person]

 

Dataset Features

  • Provides the best of both RDDs and Dataframes: from RDDs, functional programming and type safety; from Dataframes, the relational model, query optimization, Tungsten execution, and sorting and shuffling.
  • Encoders: With the use of Encoders, it is easy to convert any JVM object into a Dataset, allowing users to work with both structured and unstructured data unlike Dataframe.
  • Programming Languages supported: Datasets API is currently only available in Scala and Java. Python and R are currently not supported in version 1.6. Python support is slated for  version 2.0.
  • Type Safety: The Datasets API provides compile-time safety, which was not available in Dataframes. In the example below, we can see how a Dataset can operate on domain objects with compiled lambda functions.
case class Person(name: String, age: Int)
val personRDD = sc.makeRDD(Seq(Person("A", 10), Person("B", 20)))
val personDF = sqlContext.createDataFrame(personRDD)
val ds: Dataset[Person] = personDF.as[Person]
ds.filter(p => p.age > 25)
ds.filter(p => p.salary > 25)
// compile-time error: value salary is not a member of Person
ds.rdd // returns RDD[Person]

 

  • Interoperable: Datasets allow you to easily convert your existing RDDs and Dataframes into Datasets without boilerplate code.

Datasets API Limitations:

  • Requires type casting to String: Querying data from Datasets currently requires us to specify the fields in the class as strings. Once we have queried the data, we are forced to cast the columns to the required data type. On the other hand, if we use a map operation on Datasets, it does not use the Catalyst optimizer (see the comparison after this list). Example:
ds.select(col("name").as[String], $"age".as[Int]).collect()
  • No support for Python and R: As of release 1.6, Datasets only support Scala and Java. Python support will be introduced in Spark 2.0.
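
For comparison with the casting limitation above, the same projection written as a typed map keeps compile-time safety but runs as an opaque lambda that Catalyst cannot optimize. A minimal sketch, reusing the ds: Dataset[Person] defined earlier:

import org.apache.spark.sql.functions.col
import sqlContext.implicits._

// String-based column references with explicit casts: planned and optimized by Catalyst.
val projected = ds.select(col("name").as[String], col("age").as[Int]).collect()

// Typed lambda: compile-time safe, but opaque to the Catalyst optimizer.
val mapped = ds.map(p => (p.name, p.age)).collect()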

The Datasets API brings several advantages over the existing RDD and Dataframe APIs, with better type safety and functional programming. However, given the type casting requirements of the API, you still do not get all of the type safety you need, which can make your code brittle.

With the performance improvements and support enhancements to be introduced in Spark 2.0, Datasets will take center stage in the Spark application development process. However, the API is currently in the early phases of development and will require more work before it can be put into production.

Traditional Versus Self Service Data Preparation

Organizations are generating and collecting huge amounts of data on a daily basis. These data sets are generated from numerous sources including website clickstreams, sensor logs, point-of-sale systems, etc. Analyzing these datasets can reveal insights that can greatly improve business and operations. However, most organizations are analyzing less than 20% of their data. Why? Challenges with data preparation: users are still following conventional data preparation processes which were not designed to work with newer data types and data volumes.

Self Service Data Preparation Infographic

Data Preparation – The Traditional Way

The data preparation process is tedious, time-consuming and messy. Business users often find it hard to handle it themselves. They end up offloading data preparation to IT teams and data analysts, who build technology-centric solutions ranging from SQL scripts to complex ETL jobs. Due to the complexity involved, these implementations typically take weeks to months to execute and result in significant cost and time delays.

With the huge demand for data analysts to help with data preparation and analysis, it is also hard for organizations to find the right talent for the job. Moreover, the data cleaning process is very rigid and not agile enough for an analyst to handle requests of adding newer datasets or making modifications to the existing ones.

The Data Preparation Tool of the Future: Self Service Data Preparation

Modern data analytics and cloud data preparation tools empower users to quickly prepare data for analysis allowing them to spend more time on analyzing data and less time in preparing it. With a self service data preparation interface, businesses can bridge the missing gap between data analysts and business users by giving them full control and insight into the data preparation methodology. Moreover, with empowered business users, who can perform self service data cleaning, enrichment and data discovery, IT teams can now focus on IT centric processes.

In the current big data landscape, with organizations building data lakes using modern data storage, self service data preparation and analytics tools are playing a pivotal role. Organizations can now prepare data for analysis at scale from data lakes. With the ability to work with semi-structured and unstructured data, users can now analyze data with much more ease. They can easily connect and blend data from a number of data sources and data types.

Modern data preparation and analytics tools have changed the way businesses have been dealing with data problems. With an ability to perform visual data preparation and analytics, users can now gain instant insights into their data which empowers them to make faster and better decisions.

Self-service data preparation tools are rapidly being recognized as a necessary element of every data discovery or advanced analytics implementation. At Ideata Analytics, we’ve been delivering this self-service capability to prepare and enrich data yourself, allowing users to spend less time on the mundane tasks of cleaning and enrichment and more time on analysis. Now you no longer have to rely on IT teams for data preparation activities:

Try It Yourself and see the difference!


Modernize your data warehouse to deal with big data

In today’s digital data landscape, we are generating and accumulating data of huge variety, volume, and velocity. This is a major opportunity in terms of the insights we can derive from the data to assist decision making. However, many organizations are still facing challenges in implementing a standardised process to store, prepare and analyze these data sets.

The solution is to seek a smarter Big Data management solution.

Data Warehouse modernization: A simple solution to big data management

In order to deal with these newer, high-volume datasets, organizations are implementing enterprise-grade data lakes on advanced big data technologies like Hadoop, Amazon Redshift and Google BigQuery. These data lakes are used as a parallel storage and processing platform alongside the existing data warehouse systems. With new-age advanced analytics tools, users can integrate and analyze their existing data warehouse together with the new data lakes.

Preventing data overload

Data lakes can also come to the rescue when organizations are faced with any of the situations below:

  • The company’s current data warehouse cannot scale to support the amount of data being recorded
  • Unstructured data received from sources like social media, machine logs, sensors, web sources etc. cannot be handled by the underlying data warehouse
  • The associated cost of storing and maintaining such datasets in the existing warehouse system is high

Benefits:
Data lakes pave the path for unrestricted analytics and help in capturing information that was not possible to capture earlier due to data warehouse restrictions.

When used and applied correctly, organizations can see 3 key benefits:

  • Data at your finger tips: Data lake makes current and historical data available for running analysis and thereby enabling business users to make more informative decisions.
  • Centralized storage to all of your data: Analysts can integrate data coming from multiple data sources and of different structures and include them in their analysis. They can build correlations and patterns and derive deep dive insights to get a consolidated view of their enterprise.
  • Faster analytics: The newer big data technologies are optimized for parallel processing and faster query response times, even on petabyte-scale data. Compared to traditional data warehouses, where queries can run for hours or days, big data technologies deliver timely query performance.
data warehouse modernization pipeline

 

Finding the right solution

Migrating data from a data warehouse to big data platform is easier said than done. It can very well be extremely expensive and time consuming, depending upon the technology you choose.

Following are the three data lake platforms that are seeing good traction currently:

Apache Spark and Hadoop

Apache Hadoop is an open source framework that excels in distributed storage and processing of big data, with the ability to scale up to several petabytes of data. Apache Spark processes data in memory and enables batch, real-time and advanced analytics on top of Hadoop. With a combination of Hadoop and Spark, organizations can store data of any structure, build data pipelines and analyze data at scale.

Amazon Redshift

Redshift, being a fully managed data warehouse solution, allows users to run queries with sub-second to seconds of latency on their big data. Modern analytics tools can connect directly to Redshift.

Google BigQuery

The amount of data organizations are capturing and processing will continue to grow in the coming years. Choosing the right big data strategy to get the maximum benefit from your data is now critical.

Self Service Business Intelligence and Data Preparation tool for big data on AWS

 

Enable interactive data exploration and visual analytics of disparate sources on AWS

We are happy to announce that Ideata Analytics, a big data intelligence platform, is now available on the AWS Marketplace. It provides a turn-key, preconfigured analytics solution on the cloud. Connecting to various data sources, cleaning them and doing ad-hoc analysis to create insights and visualizations has never been easier or faster, thanks to the Ideata Analytics cloud-based analytics platform.

You can get a free 15 day trial by visiting the AWS Marketplace or going directly to

https://aws.amazon.com/marketplace/pp/B01FR5QWSG

It provides out-of-the-box connectors to major AWS services. The extensive list of data connectors includes support for Amazon Redshift, AWS S3, AWS EMR and Amazon Aurora. Apart from that, it also supports connectors to various big data sources like Hadoop, Spark and MongoDB, traditional RDBMSs like MySQL, Oracle and MS SQL, and files like CSV, Excel, JSON, XML and more.

“This will really be useful for users who want to leverage their existing AWS cloud infrastructure and want to combine it with the performance and scalability of a modern big data solution. It will be very simple to use, fully managed and invoiced through a single AWS account,” said Pranjal Jain, founder of Ideata Analytics. “One can easily spin up an Ideata Analytics machine from the marketplace and connect, clean and analyze their data kept in AWS sources or other external databases and files. It will help them jumpstart their analytics cycle within minutes on the cloud.”

Ideata Analytics is built from the ground up using the latest big data technologies like Apache Spark, which it uses as its core processing engine. It empowers users to work with billions of data points and apply live transformations on them. The interactive and intuitive platform is designed for business users, so that they can drag and drop data columns to build visualizations and drill down and drill across datasets to reach the point of interest.

The platform also speeds access, processing, and analysis of data on Amazon Redshift. It provides fast interactive analysis and data discovery capabilities to uncover hidden insights and a way for users to blend disparate data on the fly.


About Ideata, Inc.

Ideata Inc is committed to working with businesses to enable faster, better decision making using its end-to-end data analytics platform. The Ideata Inc team includes veterans from the banking and telecommunication industries with years of experience in big data, BI and analytics. The company has partnered with companies including Cloudera, Hortonworks, MapR, HP, IBM and AWS.

Test drive Ideata Analytics today – free 15-day trial!

The Modern Analytics Approach Marketers Should Use

In the digital marketing era, every marketer has access to systems and processes that provide them with all the data related to their marketing spend and efforts. The challenge, however, is to derive actionable insights from this enormous amount of data.

Challenges in getting to insights

 

The usual approach is to hire more and more data scientists and engineers who can provide the insights they are looking for. But, due to the shortage of time and skilled workforce, the overall cost, time and associated effort balloon up.

To get a holistic picture, marketers should look into all available data sets like weblogs, call center transactions, online visitor clickstreams, etc. to better understand their customers and plan an appropriate strategy. However, integrating and leveraging these datasets in their analysis is a complex and time-consuming task and often requires technical expertise.

 

The modern approach

 

 

To make the process scalable and cost-effective, organizations need to empower the existing marketing workforce with a self service analytics tool.

These tools are designed so that even non-technical users can perform self service data analytics and discovery on their data themselves. These user-friendly and interactive tools provide an easy-to-use drag and drop interface which enables marketers to reach insights without writing any code. They can access the entire data set and use it to make decisions. Self service analytical tools give marketers the much-needed flexibility and ability to define rules on the fly.

These tools help them move away from rigid processes, adapt to changing market conditions and react faster to customer interests. They can spot trends, build correlations and identify anomalies.

As the owners of the overall marketing strategy, marketers are the best people to ask the right questions. They know the processes, the domain and the data, and they know which KPIs will work. Using these tools, they can quickly test their hypotheses and optimize their campaigns with data-driven facts in no time.

Conclusion

With a smart and well-designed analytics process, non-technical marketers can also quickly integrate and merge data from different data silos to build converged insights. This new breed of analytics tools will empower marketers to question data and detect patterns whenever they want at their fingertips.

How to visualize your sales data on a world map

 

In this tutorial, we will see how we can use Ideata Analytics to quickly plot sales numbers on an interactive world map and get a detailed perspective of your sales by geography.

Step 1- Connect to data

We have sample sales data in our local MySQL database. In order to start visualizing our data, we will first connect to MySQL and import relevant tables.

 

Step 2- Enrich data with location specific details

 

Once the data is loaded, the application will display a preview of the data. We can see that there are columns in the data which show us the store address. We will split the address column to extract the city name and state.

In order to plot these locations on the world map, we will need actual latitude and longitude of the store. We can use the inbuilt geo co-ordinate lookup to perform this operation. From the drop-down menu in the newly created city column we will select “Lookup geo coordinates for location”.

 

 

This will generate latitude and longitude against all the city names.

We will go ahead and click on the Finish button in the bottom right corner and import our data.

Step 3 - Plot your sales numbers on the world map

 

Once the data is imported, we will click on the newly created data source, which will take us to the analysis page. We will select "map" as the chart type from the chart dropdown. From the chart options panel on the right, we will select "pie chart" as the map type.

From the left panel, we will drag and drop latitude and longitude columns in the top bar. This will plot the values on the map. We will go ahead and drag sales amount and product category on to “SIZE” and “GROUP BY” fields on the bottom left panel.

 

 

This will show us a detailed map marked with store locations. The size of the circles and the numbers in them show the total sales, and the different colors of the pie depict the sales for each product category in the given region. You can zoom into any of these locations to get a more granular view of the data.

Try it out on your own data by registering for a trial. Click here:  http://www.ideata-analytics.com/trial

Visualizing Modi Foreign Visits Data


Visualization of Prime Minister Narendra Modi’s foreign visits from the time he took charge as the Prime Minister of India.


A simple, quick activity of visualizing the data to come up with a dashboard summarizing all of Narendra Modi’s 35 foreign tours till Jan 2015 on the Ideata Analytics Platform.

We collected the information from documents and facts available on different websites and reconciled it into an Excel sheet.

Uploading the Excel sheet was an easy task. After choosing Excel as my datasource, I just had to choose my Excel file and click on upload, which gave me a preview of the data present in the sheet. I did some quick fixes using the wrangling interface and then created some wonderful visualizations, as below.

  • Number of Modi’s country visits per month 

This chart shows the number of countries visited by Modi per month. It shows that on average he visited 2–3 countries per month in his tenure. The highest number of visits was in July 2015, when he visited around 6 nations in a single month.

  • Country Visited

He visited countries in almost all of the 5 major continents, with a major focus on countries in Asia.

  • Days spent in each country

To go into more depth, we also calculated and plotted the number of days Modi spent on each visit to each country. His longest stay was in the USA, a 7-day trip. He also spent 4 days in the USA on one of his earlier trips, which makes a total of 11 days in the USA.

 

  • Reason of Modi’s Visit

From all his trips, around 67.6% were state visits. 18.9% of the time he visited a country to attend conferences like BRICS, G20, SAARC etc.


  • Total Budget of the trips

We do not have all the data, so we plotted the amount of money spent only on the visits for which we have the information. The actual value of these 16 foreign trips has been calculated at Rs 37.22 crore, among which the US and Australia trips have been the most expensive, amounting to about 40 per cent of the total expenditure.

  • Agreements signed and speeches given

To take a more complete, 360-degree view of things, we added information on speeches given by Mr. Modi (excluding media interactions) and agreements signed (not including joint statements) on each visit.

We can indeed call Modi a frequent flyer.

Ideata Analytics Achieves “Yarn Ready” Certification on Hortonworks Data Platform

 

 

Ideata Analytics is now certified on HDP

 

We are excited to announce that Ideata Analytics is now certified on the Hortonworks Data Platform (HDP). Users can now use the Ideata Analytics fast-cycle analytics application on HDP with Hadoop and Spark.

 

Ideata Analytics provides an easy to use analytics platform for users to source, prepare and analyze data from various sources. With its extensive connector library, it gives users direct access to their data to perform deep-dive exploration. It provides a self-service data preparation engine for users to perform data cleaning and enrichment on the fly. They can build advanced machine learning models to drive predictions and segmentation on their data. With inbuilt drag and drop functionality, users can quickly visualize their data and interactively slice and dice it to find hidden insights.

 

 

 

With the newly achieved certification, Ideata will further simplify and accelerate the deployment of its advanced analytics application on the HDP platform. HDP enables high-intensity workloads on its YARN-based platform with optimal efficiency. HDP’s secure, scalable and reliable enterprise data platform is an important component of the modern data architecture, and users can now run Ideata Analytics on it and utilize it to process and analyze large amounts of structured and unstructured data sets.

“We’re hoping to enable HDP users to derive insights more rapidly from their data lakes without having to write a single line of code,” said Pranjal Jain, founder, Ideata Analytics. “We’re very pleased to be able to work with Hortonworks and take our product to customers with an enterprise-ready, production-able Spark and Hadoop distribution.”

 

Learn More:

For more information about Ideata Analytics and Hortonworks, please visit:

http://hortonworks.com/partner/ideata-analytics/

To try out Ideata analytics for your analytics initiatives, please visit:

http://www.ideata-analytics.com/trial/

Visualizing YouTube Data in 5 easy steps

In this blog post, we will quickly upload, analyze and visualize YouTube data. The full analysis cycle will be performed using the Ideata Analytics interface and will only take 10 minutes.

This exercise will help you understand how easy and fast it is to analyze your datasets in Apache Spark using Ideata Analytics.

We have a sample YouTube data file with us, and we will try to bring out some insights like top rated videos, the highest rated video, top users, etc.

 

Step 1 : Upload your data

We have downloaded a sample YouTube file in tab-delimited format. Ideata Analytics provides an inbuilt connector to upload delimited files. The step is straightforward: we will create a new connection, click on the delimited file option, provide the necessary details like "tab" as the separator, and click on upload to load the file into the system.

 

Step 2 : Clean your data

Once the data is uploaded successfully, we can see a preview of the data. It appears well structured and does not need any formatting.

Column Rename: The only thing missing in the file is the column names, which we can quickly provide on the preview screen. We will rename all the columns to reflect what they represent, like video id, etc., so that the data is easy to understand on the analysis screen. Once done, we will click on finish to create the dataset and make it available in the system for analysis.

 

Step 3 : Find the answer to Question 1

We will now try to figure out which are the top video categories with the most videos uploaded.

In order to answer the question, we will go to the analysis screen and drag the category column from the left panel into the dimension (x-axis). This will show us a quick bar chart with all the categories along with the number of times each is present. As we are only interested in the top 10, we can specify that in the right panel by selecting the "limit" checkbox and choosing top 10.

 

 

Callouts from the screenshot above:

1. Drag the column "category" from the left panel here
2. Limit the results to top 10 by selecting this checkbox

 

We will see the above chart, which shows us the top categories of videos. We can see the actual numbers by clicking on the underlying data checkbox, which will show us the data table.

 


Acquisition to Retention

“Who cares if we find out we lost a customer after he/she left?”

Using analytics to compete and innovate is a multi-dimensional issue. It ranges from simple (reporting) to complex (prediction). We are now in an era of man + machine interactions.
There is a drift from “serving customers with a few channels” to “integrating multiple channels and devices”, and from “lone demographic segmentation” to “complex behaviour segmentation”.

You have acquired a customer. Now what? Retain the customer, as the chances of getting existing customers to do more business with you are much higher than what it takes to go out and get somebody new. An organisation needs passion and precision when it comes to customer retention. The objective of performing analytics for ‘Retention’ is not just to understand why you lost a customer, but how to prevent losing one before it happens. In this blog we want to tell you how important customer retention is and the ways to improve it.

Here are a few important ways to improve customer retention:

1. Customer Segmentation – Know The Customer

Customer segmentation is the practice of dividing a customer base into groups of individuals that are similar in specific ways relevant to marketing, such as age, gender, interests, spending habits, and so on. Customer segmentation allows a company to target specific groups of customers effectively and allocate marketing resources to best effect, enabling better-customized services for each segment.
(Source: wisegeek.org)

Customer segmentation includes:
I. Collection of data
II. Integrating data from various sources (Our product ‘Connect’)
III. Data analysis for segmentation (‘Analyze’)
IV. Effective communication among business units (‘Share’)

2. Customer funnel optimization and improvement

Identify the blockage points –
· Are you getting enough leads at the top of funnel?
· How many visitors have converted to registered users?
· Are you able to catch the decision trigger?
· Are you able to bring back the customer?

Ideate on the causes of the blockage points and solve them. Removing the blockage points will increase the conversion rates in your funnel.

3. Suggest products

You have now got a customer through the funnel. What next? Suggest products which customers might like based on their previous buying history. Companies are now interested in understanding every aspect of customer interaction: website purchasing patterns, social media, support calls, transactions, etc. The combination of customers’ transactional history across channels with their online and social behaviour is the ‘Holy Grail’ here.
(Source: pascal-network.org)

Machine intelligence is getting increasingly married to human insight. Let us look at the following interesting applications:

  • Recommendation based on social media relationships: “Several of your Facebook friends have recently enjoyed visits to our restaurant, so we’re offering you 15% off to try it yourself”
  • Recommendation with regard to cross-sell sales: “We hope you’ve liked the 55 inch LED TV purchased with us, as a token of loyalty we want to extend the gratitude and here’s a coupon for 20% off valid across all Home theatres”
  • Recommendation based on customer behaviours: “We are sorry we missed you this Sunday at Baskin Robbins after nine straight weeks of enjoying your company! Here is a free ‘Warm Brownie Sundae’ for you”
  • Recommendation based on location: “We see you have just landed in New York, and your final destination is The Plaza. Here is a $30 Uber voucher to get you there”

We identify the patterns of past purchases, browsing history and social media behaviours, build an algorithm using a training set to train the model, and apply the model thus developed to the desired set, which gives us “machine learning based recommendations”. Check out our product ‘Ideata Analytics’ for ‘self service analytics’ with an ‘easy to use’ interface.
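
As one concrete illustration of the kind of model described above, a collaborative-filtering recommender can be sketched with Spark MLlib’s ALS; the input file, its (userId, productId, rating) layout and the training parameters here are assumptions for illustration, not part of the original workflow.

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Hypothetical training set of (userId, productId, rating) rows built from past purchase history.
val ratings = sc.textFile("ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}

// Train the collaborative-filtering model on the training set.
val model = ALS.train(ratings, 10 /* rank */, 10 /* iterations */, 0.01 /* lambda */)

// Apply the model: recommend the top 5 products for a given customer id.
val recommendations = model.recommendProducts(42, 5)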

4. Customer Satisfaction

This is the most important factor in retaining a customer. If you are successful in making your customers happy on their first purchase, the chances of winning them back are high. An organization needs to focus on the value and support provided to the customer during their engagement.
(Source: color.co.uk)

The following helps in maintaining healthy CSAT score:

  • Improved NPS (Net Promoter Score)
  • Increased customer satisfaction and fewer calls to call centers
  • First call resolution
  • Social Media engagement

As they say, “Water, water everywhere, nor any drop to drink.” Most of the data is unstructured, which makes it hard to clean and bring into shape for analysis. Customer data analytics can unleash significant financial rewards for an organization’s sales, marketing and customer services. With so much data to contend with, companies often struggle to make sense of information from customers (segmentation), public records (social media data) and external databases (web history, buying patterns, etc.). Aggregating the data carelessly renders analysis at the individual customer level impossible, while profit and revenues are determined by a multitude of variables which are, in addition, highly correlated. Careful aggregation and correlation of data from these sources is crucial to finding hidden insights. Ideata can help you achieve these goals. Let’s ‘Ideata’.