Prerequisites

Big

Ever wonder how Netflix recommends such strangely precise movie categories?

Or how Google Maps can predict an upcoming traffic jam in a small town?

Intro

Well, the answer usually has something to do with big data.

In this track, we'll learn what defines big data, along with its benefits and its drawbacks.

Here

First, let's take a look at some everyday devices and the data they might be generating.

Big

The amount of data generated by devices has vastly expanded in recent decades.

A typical maps application might generate a data point each time a device changes location.

{
 "place":"Vienna",
 "coordinates":[48.2082,16.3738]
}

Doesn't look like much yet, right?

Everyone

Every device that runs the same map application produces a similar data point.

{
 "Device":"234l",
 "place":"Vienna",
 "coordinates":[48.2083,16.3738]
}
{
 "Device":"27xS",
 "place":"Vienna",
 "coordinates":[48.2083,16.3738]
}

The amount of data generated begins to grow.

Lots

A maps application records more than just data about the device model and its location.

It's likely recording things like the time and orientation, among many others.

{
 "Device":"27xS",
 "place":"Vienna",
 "coordinates":[48.2083,16.3738],
 "time":"18:00",
 "orientation":"north"
}

We can see that the data generated begins to expand rapidly.

Thought

Just one application can produce a lot of data.

Imagine all the other devices, such as cars, cameras, wearables, and checkout scanners, producing similar data.

Record

Thanks to increasingly cheap storage, almost everything can be recorded. Cost is rarely a reason not to, even with terabytes or petabytes of data.

This new era of data is often described in terms of the three Vs: volume, velocity, and variety.

Volume

Volume refers to the amount of data produced.

{
 "location":[52.5200,13.4050],
 "name":"Berlin"
}
{
 "location":[52.5200,13.4050],
 "name":"Berlin"
}
{
 "location":[52.5200,13.4050],
 "name":"Berlin"
}

There's a continual increase of data stored at terabyte, petabyte, and even larger scales. 
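To get a feel for how small records add up, here is a rough back-of-the-envelope sketch. The device count and records per day are hypothetical numbers chosen only for illustration, and the helper name `daily_volume_bytes` is made up for this example.

```python
import json

# One record, shaped like the Berlin sample above.
record = {"location": [52.5200, 13.4050], "name": "Berlin"}

# Size of a single record once serialized to compact JSON text.
record_bytes = len(json.dumps(record, separators=(",", ":")).encode("utf-8"))

# Hypothetical workload, assumed purely for illustration.
DEVICES = 1_000_000
RECORDS_PER_DEVICE_PER_DAY = 1_000


def daily_volume_bytes(devices: int, records_per_device: int, size: int) -> int:
    """Total bytes produced per day if every device sends equally sized records."""
    return devices * records_per_device * size


total = daily_volume_bytes(DEVICES, RECORDS_PER_DEVICE_PER_DAY, record_bytes)
print(f"{record_bytes} bytes per record, about {total / 1e9:.1f} GB per day")
```

Even a record of a few dozen bytes reaches gigabytes per day under these assumed numbers, before adding any extra fields.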

Velocity

Velocity refers to the speed at which new data accumulates in databases.

{
 "photo":"b8463855",
 "liked":"true",
 "time":"12:00:00"
}

{
 "photo":"f46e",
 "liked":"true",
 "time":"12:00:01"
}

This application is generating a new data point every second.
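Velocity can be estimated directly from the timestamps. The sketch below assumes times are written in `HH:MM:SS` form and all fall on the same day; the function name `events_per_second` is hypothetical.

```python
from datetime import datetime
from typing import Optional

# Two events shaped like the photo samples, one second apart.
events = [
    {"photo": "b8463855", "liked": "true", "time": "12:00:00"},
    {"photo": "f46e", "liked": "true", "time": "12:00:01"},
]


def events_per_second(events: list) -> Optional[float]:
    """Average arrival rate: number of gaps between events divided by elapsed seconds."""
    if len(events) < 2:
        return None  # a rate needs at least two timestamps
    times = [datetime.strptime(e["time"], "%H:%M:%S") for e in events]
    span = (max(times) - min(times)).total_seconds()
    if span == 0:
        return None  # all events share one timestamp, so no rate can be estimated
    return (len(events) - 1) / span


rate = events_per_second(events)
print(rate)
```

A real system would more likely measure this over a sliding window, but the idea is the same: count events, divide by elapsed time.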

Variety

The final V, variety, refers to the many different sources and types of data being generated.

Let's assemble the most likely configuration of this data sample.

{
 "name":"Eiffel Tower",
 "location":[48.8584,2.2945]
}

{
 "photo":"b8463855",
 "liked":"true",
 "time":"12:00:00"
}

{
 "chargeLevel":"96",
 "DataConnection":"true"
}

It's likely that a typical cell phone is generating lots of data points just like these.
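One practical consequence of variety is that records from different sources don't share a common set of fields. As a minimal sketch, the three samples can be grouped by their "shape" (the set of field names they carry). The helper `group_by_shape` is a made-up name for this illustration.

```python
from collections import defaultdict

# The three differently shaped records from the samples above.
records = [
    {"name": "Eiffel Tower", "location": [48.8584, 2.2945]},
    {"photo": "b8463855", "liked": "true", "time": "12:00:00"},
    {"chargeLevel": "96", "DataConnection": "true"},
]


def group_by_shape(records: list) -> dict:
    """Group records by the sorted tuple of their field names."""
    groups = defaultdict(list)
    for rec in records:
        shape = tuple(sorted(rec))  # sorting makes field order irrelevant
        groups[shape].append(rec)
    return dict(groups)


shapes = group_by_shape(records)
for shape, members in shapes.items():
    print(shape, len(members))
```

Here every record ends up in its own group, which is exactly what variety looks like in practice.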

Decisions

Being able to store and analyze lots of data has opened up entirely new possibilities for making intelligent decisions.

Reliable

All of the available data helps improve processes like decision making and prediction.

What do you think benefits most from big data?

Exactly. Thanks to big data, it's now easier to back up decisions with facts instead of intuition.

Examples

Let's take a look at some simplified examples.

Video streaming

Based on this data, which is the best recommendation for the next movie on this video streaming platform?

category: Comedy liked: Yes
category: Comedy liked: Yes
category: Horror liked: No

Next up: Comedy

//Output Below

Hot Fuzz is up next!

Exactly! This type of recommendation would've been challenging to achieve without recording users' watching habits.
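The reasoning behind that recommendation can be sketched in a few lines: count how often each category was liked and pick the most frequent one. This is a deliberately simplified illustration, not how any real streaming platform works, and `recommend_category` is a hypothetical name.

```python
from collections import Counter

# The viewing history from the example, with "liked" as a boolean.
history = [
    {"category": "Comedy", "liked": True},
    {"category": "Comedy", "liked": True},
    {"category": "Horror", "liked": False},
]


def recommend_category(history: list):
    """Return the category with the most likes, or None if nothing was liked."""
    likes = Counter(item["category"] for item in history if item["liked"])
    if not likes:
        return None
    return likes.most_common(1)[0][0]


print("Next up:", recommend_category(history))  # prints "Next up: Comedy"
```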

Real time

In this ride-sharing app, what would be the optimal location to send more drivers to?

{
 "request": "6 passengers",
 "location": "City center"
}
{
 "request": "2 passengers",
 "location": "Harbour"
}
{
 "request": "2 passengers",
 "location": "City center"
}

Exactly! Without recording the location and number of passengers, it'd be difficult to meet demand.
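As a rough sketch of that decision, we can total the waiting passengers per location and pick the largest. For simplicity, passenger counts are stored here as integers rather than strings like "6 passengers"; that field layout and the name `busiest_location` are assumptions made for this example.

```python
from collections import Counter

# The three ride requests from the example, with numeric passenger counts.
requests = [
    {"passengers": 6, "location": "City center"},
    {"passengers": 2, "location": "Harbour"},
    {"passengers": 2, "location": "City center"},
]


def busiest_location(requests: list):
    """Return (location, waiting passengers) for the highest demand, or None."""
    demand = Counter()
    for req in requests:
        demand[req["location"]] += req["passengers"]
    if not demand:
        return None
    return demand.most_common(1)[0]


print(busiest_location(requests))  # prints ('City center', 8)
```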

Real predictions

Unfortunately, data doesn't always paint a clear and actionable picture.

category: Comedy liked: Yes
category: Thriller liked: Yes
category: Romance liked: Yes
category: Comedy liked: No
category: Thriller liked: No
category: Romance liked: Yes
 
Next up: ?

//Output Below

¯\_(ツ)_/¯

Isn't it much more difficult to recommend a category based on this data? 

Business intelligence

Different analytical methods can help unlock patterns and allow for predictions using big data.

Analytic types

Big data analytics can be broken down into three broad categories.

These are descriptive, predictive, and prescriptive analytics.

Descriptive

Descriptive big data analytics uses past data to help describe the current and past state of things.

It includes statistical calculations like averages, medians, and totals.

Lifetime value: 40.00
Lifetime value: 50.00
Lifetime value: 60.00

Average value: 50.00

Big data helps descriptive analytics paint a more accurate picture and is often used in the construction of analytics dashboards. 
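The calculation above is easy to reproduce. Here is a minimal sketch using Python's standard `statistics` module on the three lifetime values from the example.

```python
import statistics

# Customer lifetime values from the example above.
lifetime_values = [40.00, 50.00, 60.00]

# Typical descriptive statistics: they summarize what has already happened.
summary = {
    "total": sum(lifetime_values),
    "average": statistics.mean(lifetime_values),
    "median": statistics.median(lifetime_values),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A dashboard would typically compute the same kind of summaries, just over far more records.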

Predictive

Predictive big data analytics helps figure out what is likely to happen next.

This type of analytics uses algorithms and historical data to make predictions about probable behavior in the future.
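One of the simplest predictive techniques is fitting a straight line to past values and extending it one step forward. The sketch below does this with ordinary least squares. The weekly figures are hypothetical, and `predict_next` is a made-up helper, not a reference to any specific product's method.

```python
# Hypothetical historical data: rides requested in each of four past weeks.
weekly_rides = [10.0, 12.0, 14.0, 16.0]


def predict_next(values: list) -> float:
    """Fit y = slope * x + intercept by least squares and evaluate at the next x."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope * n + intercept  # x = n is the next, not yet observed, period


forecast = predict_next(weekly_rides)
print(f"Forecast for next week: {forecast:.1f}")
```

Real predictive models are usually far more elaborate, but they follow the same pattern: learn from historical data, then extrapolate.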

Prescriptive

Prescriptive analytics helps us understand what might happen in different scenarios, so we can choose the best course of action.

Let's outline the basic steps involved in prescriptive analytics.

  1. Analyze outcome of first scenario 
  2. Analyze outcome of second scenario
  3. Decide on best scenario to proceed with

Exactly! That's the basic concept behind prescriptive analytics.
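The three steps above can be sketched as a tiny simulation. Everything here is hypothetical: the pricing scenarios, the demand model, and the function name `simulate_profit` exist only to illustrate the "evaluate each scenario, then pick the best" pattern.

```python
def simulate_profit(price: float, base_demand: float = 100.0,
                    sensitivity: float = 4.0, unit_cost: float = 5.0) -> float:
    """Toy model (assumed): demand falls linearly as price rises."""
    demand = max(0.0, base_demand - sensitivity * price)
    return (price - unit_cost) * demand


# Two hypothetical scenarios to compare.
scenarios = {"low price": 10.0, "high price": 15.0}

# Steps 1 and 2: analyze the outcome of each scenario.
outcomes = {name: simulate_profit(price) for name, price in scenarios.items()}

# Step 3: decide on the best scenario to proceed with.
best = max(outcomes, key=outcomes.get)

for name, profit in outcomes.items():
    print(f"{name}: {profit:.2f}")
print("Proceed with:", best)
```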

Problems

Big data isn't without its problems, though.

Working with such large amounts of data might lead to wrong assumptions and biased decisions.

Confirmation bias

Confirmation bias is the tendency to interpret information in a way that confirms a pre-existing hypothesis.

What do you think is likely to contribute to confirmation bias?

  • Looking at a subset of the data

Exactly. Looking at only subsets of data may contribute to confirmation bias.
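A small sketch shows how a subset can mislead. It uses six viewing records like the ambiguous recommendation sample from earlier in this lesson and compares the like rate of the full history with that of a hand-picked slice. The helper `like_rate` is a hypothetical name.

```python
# Six viewing records, in the order they were recorded.
history = [
    ("Comedy", True),
    ("Thriller", True),
    ("Romance", True),
    ("Comedy", False),
    ("Thriller", False),
    ("Romance", True),
]


def like_rate(records: list) -> float:
    """Fraction of records that were liked."""
    return sum(1 for _, liked in records if liked) / len(records)


# Someone convinced that "this user likes everything" might stop after three records.
subset = history[:3]

print(f"Subset like rate: {like_rate(subset):.2f}")
print(f"Full like rate:   {like_rate(history):.2f}")
```

The subset appears to confirm the hypothesis perfectly, while the full data tells a more mixed story.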

Larger Bias

Relying only on data might also lead to more substantial biases.

Areas like healthcare, finance, and education can be negatively affected when decisions rest on biased data.

Availability

The availability heuristic is another bias, in which the most recent or most readily available data is believed to be the most important.

Can you guess what could likely contribute to this problem?

  • A recently read newspaper

Alright! Recently received data is more likely to be perceived as important when compared to older data.

Misuse

Securing big data is another concern. Unauthorized parties can exploit large datasets for personal gain.

What sources do you think might be susceptible to exploitation?

  • Social media data

Exactly. Social media sites record lots of personal information, which might be susceptible to exploitation.