Thursday, February 19, 2015

Understanding Structured vs Big Unstructured Data


Understanding Big Unstructured Data v/s Structured Relational Data


The buzzword in the business world today is Big Data and Analytics. So, what is big data? It is the massive amount of data generated every second of every day, driven largely by the rise and popularity of social media networks. To understand the hype around big data analytics, we need to understand the challenges involved and the kinds of data that matter to organizations today. To do this, let us first look at the difference between the two kinds of data relevant to companies: structured and unstructured data.

Structured Vs Unstructured Data in Enterprises




Structured Data

Structured data is any data that is neatly organized and available in a format users can readily understand, typically rows and columns in a relational database. It is easy for businesses to handle structured data and extract useful information from it. However, it may not provide as much insight as the other variant, unstructured data. Typical examples are data held in a relational database or in Microsoft Access.

Unstructured Data

On the other side of the spectrum, we have the messy, unorganized data called unstructured data. It is very difficult to handle this type of data with traditional databases, as it does not adhere to a fixed, unambiguous format and takes up a lot of space. The explosion of unstructured data is a result of the growth of social media such as Facebook and Twitter, and it captures user interactions and behavior. Often, this unstructured data is like a treasure that is yet to be discovered. It can give game-changing insights that structured data cannot offer. Some examples of unstructured data are e-mails, videos, images and audio files, all of which lack a definite structure.
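To make the contrast concrete, here is a minimal sketch in Python. The orders table, its columns and the e-mail text are made-up examples, not data from any real system. The structured rows can be aggregated with a single SQL query, while the free text offers no columns to filter or group on.

```python
import sqlite3

# Structured data: a fixed schema, so a business question maps directly to SQL.
# The table and its rows are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 120.0), ("bob", 75.5), ("alice", 30.0)],
)
totals = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)

# Unstructured data: free text with no columns. Without further processing,
# only generic string operations (such as counting words) are available.
email_body = "Hi team, the delivery was late again but support was helpful."
word_count = len(email_body.split())

print(totals)
print(word_count)
```

The point is not the specific library: any relational store gives the same kind of direct access to structured data, whereas the e-mail has to be processed before it can answer a question.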

Present State and Volume of Data

Now, let us look at the volume of both structured and unstructured data available in organizations. As you can see in the image below, unstructured data occupied about 90% of the space in 2014, while only 10% of data was structured. It is also expected that, in enterprises, the volume of unstructured data will grow at a much faster rate than structured data. This rapid explosion of unstructured data is what is called big data.





Enterprises are looking for ways to leverage this massive amount of data and glean insights about their customer base in order to gain a competitive edge.

Can a data warehouse provide support for handling this much sought-after unstructured data? The answer is no. While a data warehouse is pretty good with structured data, it cannot handle unstructured data effectively. Tapping intelligence from an enormous and increasing volume of data with no specific format is difficult. Organizations are resorting to Hadoop-based tools to handle and analyze unstructured data, or big data.

Limitations of Data warehousing in analyzing data types

While a data warehouse is very good at storing structured data and giving users access to intelligence from this data, it cannot handle unstructured data directly due to the following limitations:
  • Lack of structure: Unstructured data cannot be directly placed in a data warehouse because it lacks organization. Even if we manage to store unstructured data, no intelligence can be derived from it as it is. For example, we cannot tell whether a tweet is positive or negative in its raw, unprocessed form, and storing the post in a data warehouse is useless if we cannot use it for analytics.
  • Volume of unstructured data: As we saw before, about 80-90% of data in organizations is unstructured, and this share is constantly on the rise. Data warehouses were not built to handle such a large, constantly increasing volume of data.
  • Real-time data: Some big data sources feed data in continuously in real time, and data warehouses do not have the capability to handle such real-time feeds.
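The tweet example in the first point can be sketched in a few lines of Python. This is only an illustration of the processing step that has to happen before a warehouse can use such data. The keyword lists and the tag_sentiment function are hypothetical, and a real system would need a trained model or a much richer lexicon.

```python
import re

# Hypothetical keyword lists, for illustration only.
POSITIVE = {"love", "great", "fast", "helpful"}
NEGATIVE = {"hate", "slow", "broken", "late"}


def tag_sentiment(tweet: str) -> dict:
    """Turn a raw tweet into a structured record with a sentiment label."""
    words = re.findall(r"[a-z']+", tweet.lower())
    # Count keyword hits; a positive total means more positive words.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return {"text": tweet, "score": score, "label": label}


row = tag_sentiment("Love the new app, support was great!")
print(row["label"])
```

Only after a step like this does the tweet have columns (text, score, label) that could be loaded into a table and queried.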

Future Data warehousing Trends

Many experts feel that the Hadoop platform and data lakes are the future, and that they will give enterprise data warehouses tough competition and might eventually replace them.

Others strongly disagree with this view and argue that a combination of Hadoop, the enterprise data warehouse and relational databases is the way forward. They feel that while the Hadoop platform is ideal for dealing with big data, transactional and other structured data are best handled in a data warehouse. They propose the following modern architecture, which handles the different data types using newer technologies like Hadoop and in-memory processing.

 Modern Architecture



I strongly feel that the combination of these powerful technologies is the future of business intelligence and analytics in organizations, and that the enterprise data warehouse is here to stay!








Monday, February 2, 2015

  Comparison of popular BI and analytics tools


Business intelligence and analytics products help businesses make informed, data-driven decisions. Hence, the analytics capabilities of these tools are an important factor when choosing a vendor in the BI market segment. In this post, I have analyzed a few products and ranked them on five key parameters: user-friendly interface and functionality, customer support and user experience, cost, product performance, and support for big data analysis.

Let us look at some of the top vendors featured in Gartner’s Magic Quadrant for Business Intelligence and Analytics Platforms, along with their strengths and weaknesses. I have chosen one or two products to represent each of the four quadrants: Visionary, Leader, Challenger and Niche Player.


#5 Alteryx

Alteryx features in the Visionary quadrant of this year’s Magic Quadrant because of its focus on innovation and product performance. The product also offers excellent support for big data sources. It provides sophisticated spatial and predictive tools and handles various sources of data such as Hadoop, Excel and Teradata.

However, Alteryx does not provide the good data visualization options that most traditional BI tools offer. It is ideal for customers who do not need good visual outputs and who have very specific needs related to merging data and analytics.
It is also considered a high-priced solution, and this is the main factor that deters customers from purchasing the product.

It comes in 3 packages and the prices are listed in the table below:


Package            Price
Alteryx Designer   $3,995 per user, per year
Alteryx Server     $85,000 per server, per year
Alteryx Gallery    $1,500 per user, per year


#4 Logi Analytics

Logi Analytics provides a BI tool called Logi that offers a way to create reports and dashboards on any platform without the need to code. Logi Analytics provides excellent customer support and improves the overall analytics experience while keeping cost to a minimum. Logi is widely known in the BI market for its ease of use and customer-friendly user interface.

Logi is more of a development environment, and it is usually found in small or medium-sized companies with fewer users than the other vendors serve.

#3 Pentaho

Pentaho is currently placed in the Niche Players quadrant, but it is slowly moving towards the Visionaries quadrant due to the vendor’s constant innovation and development efforts. Pentaho currently offers a full range of analytics and data integration capabilities, with a heavy emphasis on big data. Thus, Pentaho constantly tries to keep itself up to date with the expectations of its customer base.
It is widely used by end users for the following capabilities:
  • Low license cost
  • Ease of use
  • Data access and integration capabilities
Some of the drawbacks of this BI and analytics vendor are:
  • Customer experience
  • Some of the features are difficult to use and implement
  • Product performance is a primary concern for users

#2 Qlik



Qlik is another market leader in the BI and analytics domain, with capabilities in data discovery. This vendor is trying to bridge the gap between the data governance needs of enterprise IT and the expectations of business users. It is ranked among the top two vendors for its ease of use, especially for its interactive dashboard. Customers of Qlik have reported achieving higher business profits as a result of using the QlikView BI tool.

It is widely used for the following reasons:
  • Lower cost
  • Support for big data analytics
  • Ease of use
  • Breadth of functionality offered
However, Qlik is weaker in visual interactive exploration and customer experience, as it takes time for business users to become proficient in using the software. Qlik plans to address this issue in its next release, called QlikView.Next.

#1 Tableau



Tableau has featured as a leader in Gartner’s Magic Quadrant for BI and analytics for the fourth year in a row. The reasons for this continued trend are:
  • A visual, interactive data exploration experience that customers rate highly
  • A broad range of support for direct-query SQL and MDX data sources, as well as a number of Hadoop distributions, native support for Google BigQuery, and support for search-based data discovery platforms such as Attivio
  • A rich interactive UI at a price that is lower than the competitors in this market segment
  • Excellent product quality and performance that has earned customers’ trust
Now that we have had a chance to assess the pros and cons of each vendor, let’s look at their individual scores and ranking based on the criteria discussed above:


Criteria                          Weight  Alteryx  Logi Analytics  Pentaho  Qlik  Tableau
Friendly UI and functionalities   25%     5        7               6        8     8
Customer support and experience   25%     6        8               6        6     10
Cost                              20%     3        7               8        8     9
Performance                       15%     6        5               7        8     9
Big data analysis support         15%     7        5               9        8     9
Total                             100%    5.3      6.65            7        7.5   9
Rank                              -       5        4               3        2     1
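The totals above are weighted sums: each criterion score is multiplied by its weight and the results are added up. The following Python sketch reproduces them from the scores and weights in the table. The dictionary keys and the weighted_total function are names I made up for this illustration.

```python
# Weights from the table, written as fractions that sum to 1.0.
WEIGHTS = {
    "ui": 0.25,
    "support": 0.25,
    "cost": 0.20,
    "performance": 0.15,
    "big_data": 0.15,
}

# Per-vendor scores copied from the table.
SCORES = {
    "Alteryx": {"ui": 5, "support": 6, "cost": 3, "performance": 6, "big_data": 7},
    "Logi Analytics": {"ui": 7, "support": 8, "cost": 7, "performance": 5, "big_data": 5},
    "Pentaho": {"ui": 6, "support": 6, "cost": 8, "performance": 7, "big_data": 9},
    "Qlik": {"ui": 8, "support": 6, "cost": 8, "performance": 8, "big_data": 8},
    "Tableau": {"ui": 8, "support": 10, "cost": 9, "performance": 9, "big_data": 9},
}


def weighted_total(scores: dict) -> float:
    """Sum of weight * score over all criteria, rounded to two decimals."""
    return round(sum(WEIGHTS[c] * scores[c] for c in WEIGHTS), 2)


totals = {vendor: weighted_total(s) for vendor, s in SCORES.items()}
# Highest total first, which gives rank 1 to the first vendor in the list.
ranking = sorted(totals, key=totals.get, reverse=True)

for rank, vendor in enumerate(ranking, start=1):
    print(rank, vendor, totals[vendor])
```

Changing a weight in WEIGHTS shows how sensitive the ranking is to what a particular buyer cares about most.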



Based on my analysis of these vendors and the criteria that I used to rate them, Tableau wins hands down.