--re-organized according to the InfoQ talk "ETL Is Dead, Long Live Streams"
Data and data systems have changed a lot over the past decade. In the past, databases and data warehouses were the main homes for our data, and most of them were relational.
Recent data trends include:
#1 Single-server databases are being replaced by a myriad of distributed data platforms that operate at company-wide scale. A medium to large company may have more than one data centre in different locations.
#2 There is much more non-transactional data: logs, images, sensor readings, etc. NoSQL databases have appeared, and blending data from different sources has become easier.
#3 Stream data is increasingly ubiquitous, and faster processing is needed.
Therefore, the traditional Extract, Transform and Load (ETL) approach has become a giant mess. Its shortcomings are:
#1 It needs a global schema.
#2 Data cleaning and curation are manual and fundamentally error-prone.
#3 The operational cost of ETL is high, and it is resource-intensive.
#4 ETL tools were normally built with a narrow focus: connecting databases and data warehouses in a batch fashion.
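To make shortcomings #1 and #4 concrete, here is a minimal sketch of a batch ETL job, not taken from the talk. All names (GLOBAL_SCHEMA, extract, transform, load, run_batch) and the sample rows are hypothetical. It assumes the job validates every row against one fixed schema, so a single row carrying a new field fails the whole batch and nothing reaches the warehouse.

```python
# Hypothetical global schema that every source must conform to.
GLOBAL_SCHEMA = ("user_id", "amount")


def extract():
    # Stand-in for reading a nightly dump from a source database.
    return [
        {"user_id": 1, "amount": "9.50"},
        {"user_id": 2, "amount": "3.00"},
        # A source started sending an extra field the schema does not know.
        {"user_id": 3, "amount": "1.25", "device": "sensor-7"},
    ]


def transform(rows):
    # Reject any row whose fields differ from the global schema.
    out = []
    for row in rows:
        if set(row) != set(GLOBAL_SCHEMA):
            raise ValueError(f"row does not match global schema: {row}")
        out.append((row["user_id"], float(row["amount"])))
    return out


def load(records, warehouse):
    # Stand-in for a bulk insert into the data warehouse.
    warehouse.extend(records)


def run_batch(warehouse):
    # The whole batch is extracted, transformed, then loaded in one go.
    load(transform(extract()), warehouse)


warehouse = []
try:
    run_batch(warehouse)
    batch_ok = True
except ValueError:
    batch_ok = False
```

In this sketch the two valid rows are lost along with the unexpected one, because loading only happens after the entire batch has been transformed.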
By comparison, a streaming platform has these strengths:
#1 It can process high-volume, highly diverse data.
#2 It provides real-time processing from the ground up.
#3 It enables a forward-compatible data architecture.
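The points above can be illustrated with a small sketch, again not taken from the talk. It uses plain Python generators as a stand-in for a real streaming platform; event_stream, running_totals, device_events and the sample events are all hypothetical. It assumes one reading of "forward-compatible": each consumer reads only the fields it needs, so a new field or a new consumer can be added without changing the existing ones.

```python
from collections import defaultdict


def event_stream():
    # Stand-in for an unbounded, replayable log of events (hypothetical data).
    yield {"type": "order", "user_id": 1, "amount": 9.5}
    yield {"type": "order", "user_id": 2, "amount": 3.0}
    # A later event carries a new field; existing consumers are unaffected.
    yield {"type": "order", "user_id": 1, "amount": 1.25, "device": "sensor-7"}


def running_totals(events):
    # Consumer 1: update per-user totals as each event arrives.
    totals = defaultdict(float)
    for event in events:
        # Read only the fields this consumer needs; ignore the rest.
        totals[event["user_id"]] += event.get("amount", 0.0)
        yield dict(totals)


def device_events(events):
    # Consumer 2, added later: processes the same events in a different way.
    for event in events:
        if "device" in event:
            yield event["device"]


# Each consumer reads its own pass over the log, one event at a time.
snapshots = list(running_totals(event_stream()))
devices = list(device_events(event_stream()))
```

Here a result is available after every event rather than at the end of a batch, and the second consumer was added without touching the first.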