Pattern matching & Text analysis

Haddock

Within Colbert we extended SQL with a powerful domain specific language for analysing and mutating text.
With this language, called Haddock, any analysis or manipulation of text is possible.

What are the key benefits ?

By integrating Haddock into SQL you can solve complex text problems within your DBMS by simply writing a SQL statement.

  • You can save a lot of time and money because you don’t have to write a program outside your database to solve text problems.
  • Perform complex updates on text fields.
  • Extract useful information from text fields.
  • Perform count, modify or test operations and combine them as required. 
  • Tackle known problems of regular expressions.
  • Structure unstructured text to a normalized data model.
  • Speed up discovery and profiling

How does it work ?

Haddock treats text as a string of characters in which you can freely move right and left.
Haddock keeps track of the position in the string by a certain register.
There are more registers for copying text, counting etc.
Haddock support a lot of one letter operators that perform test, find, substitute, move, delete or count operations.
With this you can perform almost any operation on text.
Combined with the simplicity of function definition in Colbert, you can now very easily create any string function.

Example

The challenge: Get the second number from a text:

         ‘Hank 12 Main road 9788 Married 23987′ :: {G’0-9’ M’0-9’*} 2 Q@ 

Explanation: 

::
G’0-9′
M’0-9′
*
{  …  } 2
Q@

Make the string a Haddock object
Go to the first character between 0 and 9
Move current character to the buffer register if it is between 0 and 9
Perform last operation as often as possible
Perform enclosed operations 2 times
Quit Haddock yielding the buffer register as a result.

You can also define this as a function and call it subsequently with some parameters or use it in a SQL statement.

Create a function

Call GetNum to get second number

Use the function in SQL 

Create func GetNum as Source :: {G’0-9′ M’0-9’*} Nth Q@

GetNum(‘Hank 12 Main road 9788 Married 23987’, 2)

Select GetNum(Info, 2) as Zip from UserInfo

Get a copy of our white paper on Haddock