Despite being a huge buzzword these days, there is still a lot of confusion about what big data is. Common questions and misconceptions that I hear quite often include:

  • How large is big data? What is the limit between large data and big data?
  • I am using MongoDB, so does that mean I have big data?
  • Is big data the same as having a data warehouse? Will big data analytics replace data warehousing?
  • How do I know my problem is a big data problem?

Let's try to answer the above questions in simple terms.

It’s Not Just About Size

Traditional large transactional data needs a structured schema, and it needs to be consistent and reliable. This means that at any given point, if the same request comes from multiple sources (websites, mobile apps, reporting platforms), the data source will always return the same response to all the requests. Bank transactions and ERP systems are good examples of this type of data. Even though the size of the data continues to grow, day by day and year over year, the need for reliability remains the same. Traditional databases like Microsoft SQL Server, MySQL and Oracle are well equipped to handle this kind of data. While sometimes very large in size, this is not big data.

Volume (size) is just one of the 7 V’s of big data. While huge in volume, big data is also defined by its unstructured, multi-source nature (social media streams, email marketing logs, website traffic logs), which traditional databases are not well equipped to handle and analyze. One of the most important distinctions between big data and large data is the speed at which the data must be captured and made available for analysis. That speed matters more than perfect accuracy: as long as the majority of the data is accurate, a small subset of corrupted records doesn’t matter.
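To make the point about tolerating a few bad records concrete, here is a minimal Python sketch. It assumes events arrive as JSON lines with a `source` field. The field name, the sample lines, and the `summarize_events` helper are all hypothetical and are only meant to illustrate the idea: count what parses, skip what does not, and keep going.

```python
import json
from collections import Counter


def summarize_events(lines):
    """Count events per source, skipping records that cannot be used.

    Assumption: each line should be a JSON object with a "source" key.
    Anything else is counted as skipped instead of stopping the run.
    """
    counts = Counter()
    skipped = 0
    for line in lines:
        try:
            record = json.loads(line)
            counts[record["source"]] += 1
        except (json.JSONDecodeError, KeyError, TypeError):
            # Malformed JSON, a missing "source" key, or a non-object value.
            skipped += 1
    return counts, skipped


# Hypothetical sample: three usable events and two broken ones.
raw_lines = [
    '{"source": "web", "path": "/cart"}',
    '{"source": "social", "text": "love this"}',
    'not json at all',
    '{"source": "web", "path": "/checkout"}',
    '{"path": "/missing-source"}',
]

counts, skipped = summarize_events(raw_lines)
print(dict(counts), "skipped:", skipped)
```

A transactional system would more likely reject the whole batch or raise an error here. In this sketch the analysis simply proceeds on the records that are usable.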

Not Everything NoSQL Is Big Data

The NoSQL set of databases came about as a solution for various kinds of problems where structured, schema-oriented SQL databases were not an ideal fit. While NoSQL databases are increasingly being used in big data scenarios, there is no one-to-one mapping between the two. Not all NoSQL databases are ideal for big data problems, especially those based around analytics.

One of the most popular NoSQL databases in the market today is MongoDB. While Mongo does provide services like data aggregation, it is closer to MySQL in its approach than to Cassandra or a Hadoop-oriented solution. Although Mongo allows heavily unstructured data, its storage pattern is closer to traditional databases like MySQL. On the other hand, systems like Cassandra store data in a way that allows faster aggregation functions, which is useful in analytics.

Data Warehouses or Big Data Analytics

The common term data warehouse is based on a standardized software pattern called the integrated data warehouse. A data warehouse is an architecture for organizing data in a certain way. What it effectively tries to achieve is a consistent, generally denormalized (flat, like a spreadsheet), domain-oriented (sales, purchasing etc.), time-lapsed (15 minutes behind, 1 hour behind, a day behind), and read-only version of business data from various sources (CRM, POS, ERP etc.), organized to provide a particular set of business intelligence answers in an efficient manner. A data warehouse contains many subject areas, which enables cross-organizational analysis. One of the most important aspects of a data warehouse is that the data is “clean” (no garbage data). This is achieved by applying a standardized schema to the data.
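As a rough illustration of what "denormalized, time-lapsed, and read-only" can look like, here is a small Python sketch. The tables, the field names, and the `build_sales_snapshot` and `sales_by_region` helpers are hypothetical. A real warehouse would use an ETL tool and a database rather than in-memory dictionaries, so treat this only as a picture of the shape of the data.

```python
from datetime import datetime, timezone

# Hypothetical source data: customers from a CRM, orders from a POS system.
customers = {
    1: {"name": "Ada", "region": "EU"},
    2: {"name": "Grace", "region": "US"},
}
orders = [
    {"order_id": 100, "customer_id": 1, "amount": 25.0},
    {"order_id": 101, "customer_id": 2, "amount": 40.0},
    {"order_id": 102, "customer_id": 1, "amount": 15.0},
]


def build_sales_snapshot(customers, orders, as_of):
    """Flatten orders and customers into one wide row per order.

    Each row repeats the customer fields (denormalized) and carries the
    snapshot time, so readers know how far behind the source it may be.
    """
    rows = []
    for order in orders:
        customer = customers[order["customer_id"]]
        rows.append({
            "order_id": order["order_id"],
            "customer_name": customer["name"],
            "region": customer["region"],
            "amount": order["amount"],
            "as_of": as_of,
        })
    # A tuple signals that the snapshot is meant to be read, not edited.
    return tuple(rows)


def sales_by_region(rows):
    """Answer one business question directly from the flat rows."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


snapshot = build_sales_snapshot(customers, orders, datetime.now(timezone.utc))
print(sales_by_region(snapshot))
```

Because every row already contains the region, the question "sales by region" needs no join at query time. That is the trade-off a denormalized layout makes: some repeated data in exchange for simpler, faster reads.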

Big data is a technology used to store data that is inherently unstructured, non-standardized, and generally not consistent. The two serve different purposes, so it is wrong to tell someone that installing a big data system replaces a data warehouse.

Ecommerce Use Cases

Possibly the most obvious use cases for big data are around ecommerce. Insight into customer data and an understanding of the needs and buying patterns of customers are the holy grail of gaining competitive advantage. Amazon is the best example of the benefits of utilizing big data properly. By using massive amounts of consumer data to predict what customers are most likely to purchase next, it has increased its revenue manyfold.

The following are some of the key use case benefits big data brings to an ecommerce business:

  • Customer Service and Loyalty – Big data allows businesses to better understand their customers’ needs, and use that understanding to better serve them. Examples include the recommendations made by Amazon or Netflix.
  • Personalization – Data from multiple touch points can be used to provide the shopper a personalized experience, including content and promotions. Loyal users can be better rewarded and new users can be incentivized in real time.
  • Value Driven Pricing – By taking data from multiple sources (yay APIs!) like competitor pricing and regional trends, big data analytics can determine the price that is most likely to close the sale with the customer.
  • Predictive Analysis – By helping determine the changing trends in product purchases, big data allows a business to make decisions on inventory stocking, and potentially negotiate better rates with suppliers.
  • Reduce Shopping Cart Abandonment – Research shows that consumers use as many as three to five devices or platforms during the course of their buying journey. Mapping this journey allows businesses to better help their customers make the right decisions.

Is My Problem a Big Data Problem?

Ecommerce is not the only domain where big data can be helpful. More and more businesses are tapping into the data they already own, and finding new things about their business and clients. Here are some examples where big data is being used in creative ways:

  • Process Optimizations – Delivery companies are utilizing GPS data to track delivery routes, speed, performance, and scheduling. UPS used this kind of big data to optimize routes and save massive amounts of time and money. Uber is able to predict demand, dynamically price journeys, and send the closest driver to the customer.
  • Healthcare – Government agencies can now predict flu outbreaks and track them in real time, and pharmaceutical companies are able to use big data analytics to fast-track drug development.
  • Security – Law enforcement agencies use big data to foil terrorist attacks and detect cyber crime.
  • Sports and Wellness – With the new boom in wearable gadgets, consumers are catching up with their favorite sports stars in tracking daily workouts and other health-related data.

Is my problem a big data problem? The simple answer is yes; you just need to be more creative in finding the use. I hope my post has given you some insights on what big data is, and what it is not.