Structure raw data

What about structuring raw data ?

When data is not well organized in a neat relational data model, you can’t easily access it with SQL. This might be due to poor design but it might also be the case if you receive some data from a non-relational source like a text file.
With Colbert you can give this data a well-structured relational appearance and access it in SQL as if it is a set of normalized tables.
Colbert achieves this by introducing a virtual table called Normalize which extracts data from the raw data without copying or moving the original data.
You don’t need additional software or skills. It is all seamlessly integrated in our SQL.

What are the key benefits ?

With Normalize you can access raw data as structured normalized data by SQL.

  • You can save a lot of time and money because you don’t have to write a program.
  • You don’t need additional software or skills. It is all seamlessly integrated in our SQL.
  • Normalize does not copy or move or change the original data.
  • When your original raw data changes, Normalize is changed too.
  • Normalize works very well with Haddock to define how data must be structured.
  • It allows you to quickly test ideas and iterate towards a solid solution
  • You can speed up discovery, profiling and documentation. 

How does it work ?

Normalize is a virtual table that can only be used as a Child Table in a join.
With Normalize you define how data is extracted from the Parent table which holds the raw data. You can normalize a series of consecutive fields or a series of values in a text field.

Simple example

Consider the table Sports. With Colbert you access it as normalized data by the next SQL statement:

Select EmpId, ASPort
from Sports join Normalize Sport as ASport *

* The default separator in normalizing text is a comma.
Sports table not normalized
Sports
Normalized table
Normalized Sports

Another example

Consider the real life problem where a text file Diag.txt holds records with a name, per name some sub-codes starting with ‘pws‘ and per sub-code some sub-sub-codes.

Select
Line :: F’pws’E*B’ ‘* as Name, PwsSeq:: F’ ‘E* as Pws, SubPws
from
diag text outer join
normalize Line as PwsSeq from ‘pws’ outer join
normalize PwsSeq as SubPws from ‘ ‘

  • The second Normalize is applied on the result of the first Normalize.
  • Note that the result is just an SQL statement that might be used as any normal SQL statement.
  • Haddock is used to extract relevant data.
Diag text file
Diag
Normalized diag
Normalized Diag