Data Engineering

From Computing Study
Jump to: navigation, search


Big Data

What is Big Data?

Big Data refers to datasets that grow so large that it is difficult to capture, store, manage, share, analyse and visualise with the typical database software. There are 6 key characteristics that are required to understand and utilise Big Data. These are known as the 6 V's.


  • Huge Size
  • Created by machines and human interaction


  • Various formats , types and structures
  • Structured and unstructured
  • Text, numerical, images, audio, video, sequences, times series, social media data, multidimensional arrays
  • Static data vs streaming data
  • A single application can be collecting multiple types of data
  • To extract knowledge, all these types of data need to be linked together
  • A single view to the customer
    • Social media, entertainment, gaming, purchases, known history, banking and finance


  • Social media, real-time data
  • Data is being generated fast and needs to be processed fast
  • Online Data Analytics
  • Late decisions - missing opportunities
  • E-promotions - based on your location and purchasing history and what you like - send promotions right now for the store next to you
  • Healthcare monitoring - sensors monitoring your activities and body - any abnormal measurements require attention


  • Truthfulness, especially when consistent or habitual. Precision or accuracy.
  • Biases, noise and abnormality meaningful to the problem
  • Spelling mistakes
  • Audio recognition


  • Product recommendations that are relevant and compelling
  • Improving the marketing effectiveness of a promotion while it is still in play
  • Preventing fraud as it is occurring and preventing more proactively
  • Friend invitations to join a game or activity that expands business
  • Learning why customers switch to competitors and their offers; in time to counter


  • Correct and accurate for the intended use, key to making the right decisions

Utilising Big Data

Aggregation and statistics - Data Warehouse and OLAP

  • OLTP - Online Transaction Processing (DBMS)
  • OLAP - Online Analytical Processing (Data Warehouse)
  • RTAP - Real-Time Analytical Processing (Big Data Architecture and technology)

Indexing, Searching and Querying

  • Keyword based search
  • Pattern matching (XML, RDF)

Knowledge Discovery

  • Data Mining
  • Statistical Modelling

What Is Driving Big Data?

Predictive Analysis

  • Optimisations and predictive analysis
  • Complex statistical analysis
  • All types of data and many sources
  • Very large datasets
  • More of a real-time

Business Intelligence

  • Ad-hoc querying and reporting
  • Data mining techniques
  • Structured data, typical sources
  • Small to mid-sized datasets

Who Is Collecting What?

Who is generating Big Data?

  • Social media networks
  • Scientific instruments
  • Mobile devices
  • Sensor technology and networks

What is collecting all this data?

  • Web browsers
  • Search engines
  • Smartphones and apps
  • Tablet computers and apps
  • Games boxes and GPS systems
  • Internet service providers
  • HDTVs and Blu-ray players with built-in internet connectivity
  • Film rental sites
  • Hospitals and other medical systems
  • Banking and phone systems
  • Apps
    • Restaurant reservations (Open Table), weather e.g. for the next three days, side effects of medications (MediWatcher), 3-star hotels (PriceLine), Which PC should I buy and where? (PriceCheck)

Who is collecting all this data

  • Government agencies
  • Big pharmaceutical companies
  • Consumer products companies
  • Big box stores
  • Credit card companies

What data are they collecting?

  • Airline ticket
  • Restaurant bill
  • Location
  • Grocery bill
  • Hotel bill

Why Are They Collecting All This Data?

Target Marketing

  • To send you catalogues for exactly the merchandise you tend to purchase
  • To suggest medications that precisely match your medical history
  • To 'push' television channels to your set instead of pulling them in
  • To send adverts on those channels 'just for you'

Targeted Information

  • To know what you need before you even know you need it based on past purchasing habits
  • To notify you of your expiring drivers licence or credit cards
  • To give you turn-by-turn directions to a shelter in case of emergency
  • To personalise your experience

What Are The Impacts Of Big Data?

  • Decisions like your credit score and your insurance premiums may be based on the analysis of Big Data, for good or bad
  • After Haiti's earthquake, Columbia University tracked the movements of 2 million refugees by the SIM cards in their mobile phones and were able to determine where health risks were likely to develop

Challenges In Handling Big Data

  • The bottleneck is in technology: New architecture, algorithms, techniques are needed
  • Also in technical skills: experts in utilising the new technology and dealing with Big Data

Data Warehousing

Definition of a Data Warehouse

  • Subject Oriented
  • Data is categorised and stored by business subject rather than by application
  • Integrated
  • Data on a given subject is defined and stored once
  • Time variant
  • Data is stored as a series of snapshots, each representing a period of time
  • Non-volatile

A Data Warehouse is a database built for the purpose of analysis. ...enabling both long-term strategic planning and agile responses to new market conditions.

An enterprise structured repository of subject-oriented, time-variant, historical data used for information retrieval and decision support. The Data Warehouse stores atomic and summary data.

Facts, Dimensions and Granularity

Fact - data that changes value - how many units / account balance

Dimension - level of detail - smaller is better

Always define one grain smaller than the client asks for.

Analysis of potential benefits for all employee types within an organisation of performing a data warehouse pilot

  • It enables the designers to prove the model, the data and the access tools and evaluate tests against acceptance criteria.
  • It enables the users to see how easy the access tool is to use, to enhance their data requirements and to measure query performance
  • It enables the developers to determine whether the ETT process is adequate and modify it accordingly; to identify any issues with the metadata presented to the users or used by ETT; to determine the near future, and possibly even long-term user requirements, to identify and define the user's training needs, to test access levels and security of the systems data, to monitor performance
  • Ensure that acceptance criteria are documented and agreed upon. Identify volume and scalability tests and develop a test plan with test cycles. Once the tests are executed you can gather statistics on performance and optimise where necessary
  • Test the entire process of refreshing the data and produce a report that contains a complete and detailed evaluation of this proof and a plan for the subsequent one with any impact on current projects identified

Data Cleaning, Data Scrubbing, Data Anomalies

Data cleaning / scrubbing is a technique using pattern recognition and other AI techniques to upgrade the quality of raw data before transforming and moving the data to the warehouse. It is an essential stage in the construction of a data warehouse due to the fact that data in operational systems is often poor.

Missing values problem

Example: missing supplier ID

Supplier ID Supplier Name Address
102 Fred Bloggs 4 Toy Town
Jo Harper


  • Ignore
  • Wait
  • Mark rows
  • Extract when time-stamped
  • IF Supplier_ID = NULL THEN Supplier_ID = UNIQUE SEQUENCE gen num

Multiple Encoding

Example: 3 different databases in an organisation, each trying to store the sex of a person

DB1: M, F
DB2: 1, 0 → M, F

DB3: Male, Female

Transformation process should ensure that one encoding is adopted within the warehouse. Pick a format and transform to uniform the data

Referential integrity Problem

Example: Two tables have been taken from alternative sources within the organisation. Clearly there is only one set of departments, yet an employee has been assigned to Dept 60 (in the employee table) which does not exist. Also of you delete a department, does it orphan any employees?

Employee Name Department
1099 Smith 10
1289 Jones 20
1234 Doe 50
6786 Harris 60


  • SQL anti-join
  • Server constraints
  • Dedicated tools

Name and Address Problem

Example: Below shows two extracts from two tables in two different databases within the same organisation.


Name Address Location
Roy Rogers 6 State Road N100
Roy Rogers ^ State Road, NW, M22 N100
Clare A. Peterson , M33 N121
Bob and Jane Smiley 12 - 14 Range Warehouse N333


Name Address Location
Rogers, R 3244
Peterson, A. Claire  !, M33 1256
Smiley, Bob and Jane 12 to 14 Range Warehouse 4566
  • No unique key
  • Missing values
  • Personal and commercial names mixed
  • Different addresses for the same member
  • Different names and spelling for the same member
  • Many names on one line
  • One name on two lines
  • Single field format vs multiple field format
  • Different encoding of address ranges


  • Create atomic values
  • Verify data accuracy
  • Match with other records
  • Identify private and commercial addresses and residents
  • Check place matches postcode
  • Document in metadata

Multiple Files Problem

Multiple source files, e.g. Multiple Dept

Solution: Create logic rule to detect correct source

Outliers - values that are significantly removed from the values for the rest of the data

Noise - values that don't fit in with the data in some way, could be garbage text. If noise is useful it becomes an outlier

Multi-Dimensional Database, OLAP and OLTP

What are OLAP, and OLTP and the difference between them?

OLTP - Online Transactional Processing

Concerned with the day-to-day operations - the organisation's operational database.

  • Add product to order
  • Accept payment
  • Dispatch order

OLAP - Online Analytical Processing

Concerned with patterns, trends, summarised information - the organisation's data repository for data analysis: data warehouse.

  • Present the daily average unit sales per product and per month of the current year
  • What is the total sales of product X this year in contrast to that of last year
  • Which products are likely to be bought together with product X

Data from OLTP systems is fed into the OLAP

Cubes - Multi-Dimensional Data Modelling

Superior performance and usability than recreational modelling. Dimensions / categories used to summarise the data instead of grouping attributes in the relational world. Measures are consolidated values of each group instead of aggregation functions of the relational database.

  • Each row is a point in multi-dimensional space (facts)
  • Each column set is a dimension
  • Dimensional value & aggregate value = key & measure

Data Cube

  • 4D cube = 3D cube with multiple additional dimensions such as supplier

Roll Up - reduce the level of detail

  • second > minute > hour > day > week > month > year
  • city > region > country > all

Drill Down - increase the level of detail

  • week > day > hour > minute > second
  • all > country > region > city
  • year > quarter > month > week > day

Pivot - rotate the visualisation

Dice - isolate a section of all dimensions creating a mini cube

e.g. Location ('London' or 'Paris')

and Time (Q1 or Q2)

and Product ('video' or 'music')

Slice - for time e.g. Q1

Explanation of Roll Up and Drill Down

Where location is a dimension defined by city > region > country and sales are a measure in a multi-dimensional database R.

Roll-up reduces the level of detail and aggregates the data further. For example, a roll-up from city to region would aggregate the breakdown of sales from individual cities to whole regions. In turn, drill-down does the opposite, i.e. increases the level of detail. For example, drilling down from country to region would break down aggregate sales by country into individual regions in each country.

Aggregation of data in Oracle SQL using using grouping, windowing and partitioning

The partitioning stage takes the rows of a table (e.g. with information on employees of a company) and creates partitions according to the different values of the provided set of partitioning attributes (e.g. the department in which employees work, with a partition of each of {Finance, HR, ...})

    SELECT name, department, salary,
    Rank() OVER (PARTITION BY dept ORDER BY salary DESC) as rank
    FROM E) AS a
WHERE a.rank = < 3;

This example would show the top three salaries for each department, along with the name of the employee and their department.

Concept of a frequent itemset

An item is a positive attribute-value pair in a boolean dataset, and an itemset is a set of items. A frequent itemset is simply an itemset whose support, i.e. number of occurrences in the dataset is higher than a threshold.

The Apriori Property

The apriori property states that if a set fails a test, all its supersets will also fail the same test. Therefore if ABC is frequent, AB must also be frequent. However, not much can be said about ABCD. We would have to know whether all its supersets are frequent in order to become a candidate; only then its frequent status would be determined by obtaining its support.

Purity measures and their usefulness in the context of decision tree building

Purity measures such as Error Rate, Entropy and Gini Index are used to obtain objective indicators of the degree of variability (or uniformity) of a probability distribution. The more variable it is, the more confident one can be about predicting its outcome. This premise is used by decision tree algorithms in the selection of the best nodes so that they lead to more accurate and simpler trees.

Naive Bayes

What is a classification?

Given a collection of records (training set), each record contains a set of attributes and one of the attributes is the class. Find a model cor class attribute as a function of the values of the other attributes, with the goal being that previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually the given data is divided into training and test sets with the training set used to build the model and the test set used to validate it.

Why mine data?

The drivers for mining data are to help businesses respond more easily to changes in the business environment, with focusses on the customer, the competition and the data asset. It enables a revised approach to business decision making.

Overview of Naive Bayes

    P(B given A) = P(A ^ B) / P(A)

The conditional probability of event B in a relationship to even t A is the probability that event B occurs given that event A has already occurred.

Bayes Rule

    P( H given E ) = (P( E given H ) and P( H ))/P( E )

The conditional probability of a hypothesis given some evidence is the conditional probability of the evidence given the hypothesis multiplied by the unconditional probability of the hypothesis divided by the unconditional probability of the evidence.

Problem: A maths teacher gave her class two tests. 25% of the class passed both tests and 42% of the class passed the first test. What percent of those who passed the first test also passed the second test?

    P( second given first ) = P( first and second ) / P( first )

0.25 / 0.42 = 0.6 = 60%

Weaknesses of Naive Bayes

Naive Bayes assumes that variables are equally important and that they are independent, which is often not the case in practice. Nevertheless, practical results for Naive Bayes are usually good.

Naive Bayes is damaged by the inclusion of redundant (strongly dependent) attributes. E.g. if people have expensive houses then including both income and house price in the model would unfairly multiply the effect of having low income.

Sparse data: if some attribute values are not present in the data then a zero probability for P[E|H] might exist. This would lead P[H|E] to be zero no matter how high P[E|H] is for other attribute values (because multiplying by zero always gives zero). Small positive values which estimate the so-called 'prior probabilities' are often used to correct this.

Probability Of Play

Outlook temp Humidity Windy Play
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no

14 records

Play (yes) = 9 / 14 = 0.642857143

Play (no) = 5 / 14 = 0.357142857

Outlook Play (yes) Play (no)
sunny 2/9 0.22 3/5 0.6
overcast 4/9 0.44 0/5 0.0
rainy 3/9 0.33 2/5 0.4
Temp Play (yes) Play (no)
hot 2/9 0.22 2/5 0.4
mild 4/9 0.44 2/5 0.4
cool 3/9 0.33 1/5 0.2
Humidity Play (yes) Play (no)
high 3/9 0.33 4/5 0.8
normal 6/9 0.66 1/5 0.2
Windy Play (yes) Play (no)
true 3/9 0.33 3/5 0.6
false 6/9 0.66 2/5 0.4

Play (yes) = 0.44 x 0.44 x 0.66 x 0.66 = 0.08433216

Play (no) = 0.0 x 0.4 x 0.2 x 0.4 = 0.0

Divide by overall probability:

Play (yes) = 0.8433216 / 0.642857143 = 0.13118336

Play (no) = 0.0 / 0.357142857 = 0.0

Probability of Play (yes) is higher and is the expected outcome

Predicting the dataset for a given scenario

Outlook Temp Humidity Windy Play
overcast mild normal false  ?

p(no) = 5/14 = 0.357

p(yes) = 9/14 = 0.643

p(outlook = overcast | play = yes) = 4/9 = 0.444

p(temp = mild | play = yes) = 4/9 = 0.444

p(humidity = normal | play = yes) = 6/9 = 0.666

p(windy = false | play = yes) = 6/9 = 0.666

p(outlook = overcast | play = no) = 0/5 = 0

p(temp = mild | play = yes) = 2/5 = 0.4

p(humidity = normal | play = no) = 1/5 = 0.2

p(windy = false | play = yes) = 2/5 = 0.4

play = yes – 0.444 * 0.444 * 0.666 * 0.666 * 0.643 = 0.056

play = no – 0 * 0.4 * 0.2 * 0.4 * 0.357 = 0

Therefore for the unseen record, play is predicted to be yes

Predicting The Diabetes Dataset

Body Weight Body Height Blood Pressure Blood Sugar Level Habit Class
heavy short high 3 smoker P
heavy short high 1 non-smoker P
normal tall normal 3 non-smoker N
heavy tall normal 2 smoker N
low medium normal 2 non-smoker N
low tall normal 1 non-smoker P
normal medium high 4 smoker P
low short high 2 smoker P
heavy tall high 2 non-smoker P
low medium normal 3 smoker P
heavy medium normal 3 non-smoker N
Body Weight Body Height Blood Pressure Blood Sugar Level Habit Class
heavy medium high 1 smoker  ?


– P = 7/11 = 0.636

– N = 4/11 = 0.364

p(BodyWeight = Heavy | Class = P) = 3/7 = 0.429

p(BodyHeight = Medium | Class = P) = 2/7 = 0.286

p(BloodPressure = High | Class = P) = 5/7 = 0.714

p(BloodSugarLevel = 1 | Class = P) = 2/7 = 0.286

p(Habit = Smoker | Class = P) = 4/7 = 0.571

p(BodyWeight = Heavy | Class = N) = 2/4 = 0.5

p(BodyHeight = Medium | Class = N) = 2/4 = 0.5

p(BloodPressure = High | Class = N) = 0/4 = 0

p(BloodSugarLevel = 1 | Class = N) = 0/4 = 0

p(Habit = Smoker | Class = N) = 1/4 = 0.25

Class P = 0.429 * 0.286 * 0.714 * 0.286 * 0.571 * 0.636 = 0.0090

Class N = 0.5 * 0.5 * 0 * 0 * 0.25 * 0.364 = 0

Therefore the unseen record is predicted to be Class P

Data Mining and Decision Trees

Definition of a frequent itemset

An item is a positive attribute-value pair in a boolean dataset, and an itemset is a set of items. A frequent itemset is simply an itemset whose support (i.e. number of occurrences in the dataset) is higher than a threshold.

The Apriori Principle

If R(A,B,C,D) is a transaction database and ABC is a frequent itemset in R, what can be said about AB and ABCD in R regarding being frequent or not? The apriori principle states that if a set fails a test, all its supersets will also fail the same test. Therefore, if ABC is frequent, AB must also be frequent. However, not much can be said about ABCD. We would have to know whether all its supersets are frequent in order to become a candidate. Only then its frequent status would be determined by obtaining its support.

Purity Measures and their usefulness in the context of decision tree building

Purity measures, such as Error Rate, Entropy, and Gini Index are used to obtain objective indicators of the degree of variability (or uniformity for that matter) of a probability distribution. The more variable it is, the more confident one can be about predicting its outcome. The premise is used by decision tree algorithms in the selection of the best attributes for nodes so that they can lead to more accurate and simpler trees.

When setting the level of frequency, too high gives you not enough data, and too low gives you too much.

SQL Data Analysis

Why is Windowing more efficient than GROUP BY

Typical data analysis requires the computation and comparison of multiple aggregate levels. The use of Windowing allows these to be performed with a single pass, or fewer passes over the data.

Does this GROUP BY query work?

SELECT caller, receiver, max(duration)
FROM calls
GROUP BY caller

The query does not work because all attributes in the SELECT clause must either appear in the GROUP BY clause or appear as the parameter of an aggregate function. This is not the case for 'receiver'. Presumably the intention behind the query was to output for each caller, the longest call. But for that to work, the use of sub-queries or of a non-standard extended SQL would have been required.

Show the longest call made by each caller

    SELECT caller, receiver, dataAndTime, duration, pricePerMin,
    Rank() OVER (PARTITION BY caller ORDER BY duration DESC) as pos
    FROM calls
    ) AS q
WHERE pos = 1;

Compare the total duration of phone calls made in Dec 2014 with those made in Jan 2015 for each caller

SELECT caller, sum_dec14, sum_jan15, (sum_jan15 - sum_dec14)
    SELECT caller, SUM(duration) as sum_dec14
    FROM calls
    WHERE TO_CHAR(dateAndTime, 'MON-YY') = 'DEC-14'
    ) AS q1 JOIN (
    SELECT caller, SUM(duration) as sum_jan15
    FROM calls
    WHERE TO_CHAR(dateAndTime, 'MON-YY') = 'JAN-15'
    ) AS q2 USING (caller);

Let E be a (simplistic) table with schema E(name, department, salary), containing employees details of a company. Using the Windowing approach to grouping data, write a query that shows for each department, the employee receiving the highest salary

    SELECT name, department, salary,
    Rank() OVER (PARTITION BY dept ORDER BY salary DESC) AS rank
    FROM E) a
WHERE a.rank = 1;

Using the table PruductSales, write two queries using the Windowing approach that respectively determine

The highest value for 'Milk' for each year

    SELECT product, year, month, sales,
    Rank() OVER (PARTITION BY year ORDER BY sales DESC) AS ys
    FROM ProductSales) AS ps
WHERE product = 'Milk' AND ps.ys = 1;

The average sales of 'Honey' per month and year

SELECT DISTINCT product, year, month, AVG(sales)
    OVER (PARTITION BY year, month) AS "SalesByYearMonth"
FROM ProductSales
WHERE product = "Honey";

Rank() - Allocate and potentially order by a rank number based on determined criteria.

Row_Number() - Allocate a unique identifier to each row returned per partition