We never had more options to choose from as data sources neither more ways to express and digest the data, therefore comparing data is a silent and growing problem for the modern analytics, but first let’s understand the big picture with a fictitious and successful business to illustrate how this problem becomes more and more common as your data and company grows.
Startup level.
Size: < 5 people
Core business: Online perfume sales.
Databases:
1x Transactional DB from the web store (sales and products)
1x 3rd party vendor for online marketing.
Web tracking (commonly Google Analytics).
Spreadsheets (commonly MS-Excel): Inventory, providers and re-stock purchases.
The problem in this scenario is that usually most reports are created from spreadsheets where data is hard to trace back to the original source and losing control of different report versions as well as data manipulated happens so easily and with a lot of luck the following problems will not happen at the worst time possible, but “no big deal” at this stage no more than 5 people are making reports:
Row count match but totals don’t – Data updates or duplicates on sub-aggregated data (this is probably one of the most common and dangerous issues).
Row count match and totals match but details are different – Product management, descriptions, corrections, etc.
Totals and row counts don’t match, time to find missing data or duplicated information – This is very common and great for learning a lot of tricks that will be useful only 1 time.
Count of Distinct Keys (Row ID) match but totals and row count doesn’t – Very common mistake while joining data from different tables.
Same Company 4 year after:
Size: 60 people
Core business: Annual subscriptions to an exclusive catalog of small bottles of perfumes.
2nd layer: Perfume samples and 1 time perfume sales.
Databases:
1 Transactional DB from the web store.
1 Transactional DB from sales vendor transactional system.
Sales, Customers ,Product management, Affiliates & Funnels
2x 3rd Party marketing vendors
3x Social media data
3 party vendor data
1x New Data Warehouse is created
Let’s be real, at this point, all the data comparison problems in the 1st stage of the company prevails and come with the following new scenarios:
Creating the Data Warehouse – Creating a DW is no short of a titanic task and before everyone can create beautiful data visualizations sourced from fancy fact and dimension tables there will be a team spending a lot of time making sure the new reports can match the accuracy of those good old reports in spreadsheets.
New reports are created based on the DW – Now is when those common mistakes from SQL such as overlooking a 1-N relationship will make your totals explode.
Going forward as the DW and data marts grow and business process and flow changes, there will be a recurrent scenario of having a business user with a spreadsheet pointing that a particular report used to be different a month ago (which often times coincides with a deployment of a couple of columns added into one of the underlying tables for that report).
Starting this stage we just have to multiply all these problems X all the people consuming data.
Conclusion
Comparing data is an unavoidable common task in a data driven organization that can affect anyone from a legendary data rockstar to a new analyst, by not having the right tool for this task can cause low progress, frustration and affects multiple people as the problem persists, just like a flu.
This is the reason behind JuxtAPPose, compare data… simple.
Comments