Data Warehouses are normally seen as big, expensive, lengthy, waterfall projects, using complex and costly tools. They don’t need to be. As someone who has helped several teams to build successful data warehouses, I know that an Agile approach is much better, and in this talk I will explain how we do this. We can deliver incremental, useful reports and applications within weeks of starting the project and can continue to consolidate an organisation’s data, improving its quality and the meaningful insights it can deliver. At the same time we develop wider understanding of the data, often leading to improved business processes.
What is a Data Warehouse? And why should you want one?
you want a single view of who the customer is -> application-neutral, organisation-specific information
vs application-specific information
in the past, we were not allowed to query the production databases because they were so slow -> this is how Data Warehouses were born
more Data Warehouse assumptions:
- temporal data: keep data over a period of time; the applications already have that information too
- aggregated data: summary data only, to make it smaller … nobody wants this any more, you want all the data
these days:
- contain all the data
- contain all the details
must be:
- as simple as possible
- very reliable
- fast
Big Project, Big Vendor: why and how to avoid them
Data Warehouse projects are big whatever the size of the organisation, because they span most, if not all, of its systems
you should avoid them:
- because they will drain all your money
- their solutions are more expensive, more complex and less efficient
how to avoid:
- trust your own developers, your own team, they know what they are doing
- if you can’t understand what the implementation partner is saying, don’t employ them
they will take the traditional approach:
armies of strategy consultants, data modellers, enterprise architects working for years to design the ultimate solution
How an agile approach works well for a Data Warehouse project
- first talk to your users
- find out what they use most
- find out something they want but cannot do now
- find out where this data is
- copy it into your “staging” area (see the sketch after this list)
- profile it
- improve it
- build something that uses it
- repeat
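
A minimal sketch of the “copy it into staging” step, assuming PostgreSQL and a nightly CSV extract from the source system; the schema, table, and file names are invented for illustration:

```sql
-- Load a nightly CSV export from a (hypothetical) CRM source system into a
-- staging table. Keep the source values as text; cleaning comes later.
DROP TABLE IF EXISTS staging.crm_customers;

CREATE TABLE staging.crm_customers (
    customer_id  text,
    full_name    text,
    email        text,
    created_at   text,
    loaded_at    timestamptz DEFAULT now()
);

COPY staging.crm_customers (customer_id, full_name, email, created_at)
FROM '/data/extracts/crm_customers.csv'
WITH (FORMAT csv, HEADER true);
```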
we have a point on the horizon where we want to go
-> now we start to build the small bits and pieces one at a time to get there
=> Incremental Data Model
it will be better than one that is designed up-front
… as long as we refactor ruthlessly
most things will not change, but when they do, we must change the related data and model; it is not as bad as you might think
you start with a small data model, then you add more, and you add more, and you add more, …
- common, “universal” data models do not work
- understanding your organisation’s data, and how it is used, is hard
- when you understand it, then the data model is easy
- and you can change it easily -> don’t worry about getting it right first
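
To show what “refactor ruthlessly” can look like in practice, here is a hypothetical schema refactoring in SQL; assume the model started with a single address column on a customer table, and we later learn that a customer can have several addresses:

```sql
-- Hypothetical refactoring: move addresses out of warehouse.customer into
-- their own table, backfill from the existing column, then drop it.
BEGIN;

CREATE TABLE warehouse.customer_address (
    customer_id bigint NOT NULL REFERENCES warehouse.customer (customer_id),
    address     text   NOT NULL
);

INSERT INTO warehouse.customer_address (customer_id, address)
SELECT customer_id, address
FROM   warehouse.customer
WHERE  address IS NOT NULL;

ALTER TABLE warehouse.customer DROP COLUMN address;

COMMIT;
```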
you need to understand how databases work
DHH has always argued that if you are writing an application backed by a relational database you should damn well know how a relational database works.
– Martin Fowler
books:
- Relational Databases for Agile Developers, Ron Ballard
- SQL Antipatterns, Bill Karwin
A simple architecture pattern that works
Source => SQL (extract data) => Staging (do joins, …) => SQL (restructuring and cleaning) => Warehouse
repeat every night
big organisations reckon it takes 4 months to do what we do in 4 hours!
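
As a sketch of the “restructuring and cleaning” step, one nightly SQL script per warehouse table might look something like this (the table and column names are invented; the details depend on your data):

```sql
-- Rebuild one warehouse table from staging: fix case and whitespace,
-- cast the key, and drop rows with junk keys.
TRUNCATE warehouse.customer;

INSERT INTO warehouse.customer (customer_id, full_name, email, signed_up_on)
SELECT DISTINCT
       c.customer_id::bigint,
       initcap(trim(c.full_name)),
       lower(trim(c.email)),
       c.created_at::date
FROM   staging.crm_customers AS c
WHERE  c.customer_id ~ '^[0-9]+$';
```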
Complete Refresh vs Changed-Data Capture
for most tables, it is feasible to drop and recreate every night
for very large tables:
- identify changes by timestamp, if reliable
- triggers in source to identify changes
- difference calculation -> flat files (see approval tests from Emily)
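
For the timestamp approach, a minimal sketch in PostgreSQL, assuming the source carries a reliable last_updated column and warehouse.orders has a unique key on order_id (all names illustrative):

```sql
-- Pull only rows changed since the last load, then insert-or-update them.
INSERT INTO warehouse.orders (order_id, customer_id, order_total, last_updated)
SELECT s.order_id, s.customer_id, s.order_total, s.last_updated
FROM   staging.orders AS s
WHERE  s.last_updated > (SELECT coalesce(max(last_updated),
                                         timestamp '1900-01-01')
                         FROM   warehouse.orders)
ON CONFLICT (order_id) DO UPDATE
SET    customer_id  = EXCLUDED.customer_id,
       order_total  = EXCLUDED.order_total,
       last_updated = EXCLUDED.last_updated;
```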
The ETL Tool Solution
- really complicated, inefficient
-> the equivalent SQL version
- far simpler
why would you buy a tool that would do all of that?
=> Short Simple Script to populate target tables from source tables + a controlling script
-> no views, no stored procedures
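
The controlling script can be very small too. One possible shape, assuming each target table has its own short SQL file and everything runs through psql (a guess at the wiring, not the speaker's exact setup):

```sql
-- load_warehouse.sql: run nightly with
--   psql -d warehouse -v ON_ERROR_STOP=1 -f load_warehouse.sql
\i sql/staging/crm_customers.sql
\i sql/staging/orders.sql
\i sql/warehouse/customer.sql
\i sql/warehouse/orders.sql
```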
Small Scripts
- 154 bytes to 440 bytes
- some scripts can get to 100 lines
Useful Tools
- data profiling (Ben wrote one; see the sketch after this list):
  - shows column names and data types
  - calculated data: minimum and maximum values; how many distinct values; maximum actual length
  - frequencies: most common values, common data patterns
- data quality monitoring: automated testing on all the data that has been loaded into the data warehouse -> dashboards, drill-down, charts
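
A column-profiling tool can be little more than generated SQL. A sketch of the queries it might run for one column, reusing the illustrative table and column names from the earlier examples:

```sql
-- Basic statistics for one column of one staged table.
SELECT count(*)                              AS row_count,
       count(DISTINCT email)                 AS distinct_values,
       count(*) FILTER (WHERE email IS NULL) AS null_count,
       min(email)                            AS min_value,
       max(email)                            AS max_value,
       max(length(email))                    AS max_actual_length
FROM   staging.crm_customers;

-- Frequencies: the most common values.
SELECT email, count(*) AS occurrences
FROM   staging.crm_customers
GROUP  BY email
ORDER  BY occurrences DESC
LIMIT  10;
```

Data quality monitoring can be built the same way: automated queries over the loaded warehouse tables whose results feed the dashboards and charts.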
Book: Small, Sharp Software Tools, Brian Hogan
about the Linux command line
the tool categories on the left are mandatory; the choices on the right are personal
| Tool | Choice |
|---|---|
| wiki | Trac, far better than most commercial wikis like Confluence |
| Linux | non-negotiable, I will not build a data warehouse on a Windows Server |
| database | PostgreSQL, because it is free, rock-solid, and it hosts one of the biggest datasets in the world |
| source-code control | SVN |
| plain text editor | |
| procedural language | Java |
Free Tools from thedatastudio.net: https://www.thedatastudio.net/download.htm