Conceptual Foundations of Formulas stored in Database Fields 

13 may 2024

Examine the Underlying Strategies and Mechanisms of these Formulas, Prioritizing Robust Security and Data Integrity. Explore Defenses Including A Solid Type Checking Mechanisms and Prevention of SQL Injection.

In a former article, the concept of executing formulas stored in database fields was introduced.

To clarify, this functionality does not refer to the well-known calculated fields, but to their isomorphic counterpart of formula fields, where each record can contain a separate formula stored in a field.

This article does NOT cover:

  Select TpF, 5/9 * (TpF – 32) as TpC from  …

This article delves into the mechanisms behind formula fields that make it a safe and sound concept, alongside exploring other relevant aspects of how this functionality is implemented in the Colbert DBMS

  • Are there Security risks like SQL injection.
  • Separation of concerns.
  • Best practice for storing formulas in database fields.
  • Type checking formula fields stored in database fields.
  • How are formula fields implemented in the Colbert DBMS.
  • Performance of database queries involving formula fields.
  • Caching strategies for formulas.
  • Existing approaches for storing formulas in database fields.
 

A big Challenge and Risk?

People are often taken aback by capabilities like executing formulas stored in database fields, as they associate them with negative experiences involving scripting, SQL generation, stored procedures, SQL injections, and other vulnerabilities. Additionally, there are concerns that such functionalities might breach the principle of ‘separation of concerns’.

However, the point is that the approach presented in the article, and implemented in the Colbert DBMS,  is intended to address these vulnerabilities and provide a robust, secure environment.

We are accustomed to applying formulas to entire columns using calculated fields in SQL and have no problem with this. In fact, we gratefully use this option. The approach presented here offers the same, if not safer, opportunity, but now at a record level, as will be explained in this article.

Furthermore, it is not strange to include a formula in a database field. In databases, we register attributes of objects, and sometimes that attribute is an expression, such as the agreed bonus with your employer. Formula fields allow such an attribute to be registered and calculated within SQL. This provides us with a much more flexible, safer, and transparent method than writing a separate program outside the database.

The paradigm of separating code and data

The paradigm of separating code and data is a fundamental principle in software engineering and computer science. It revolves around the idea of keeping the logic (code) of a program distinct from the data it operates on. This separation brings several advantages. 

Storing formulas in database fields doesn’t inherently breach the principle of separating code and data; rather, it provides a structured approach to managing them together. Here’s why:

  1. Encapsulation: Storing formulas in database fields enforces, if implemented properly, a form of encapsulation. This entails bundling data (attributes) and methods (code) together within the stored objects(records). However, this does not violate the separation of concerns principle. Encapsulation ensures that calculated fields are accessed and manipulated only through methods without any potential side effects and therefore maintaining control over how calculated fields yield values in a certain context.

  2. Abstraction: While encapsulation binds data and methods within objects, abstraction allows data annalists to focus on the essential aspects of those objects without concerning themselves of writing some program outside the database to yield values defined by formulas per object.

  3. Separation of Concerns: Calculated fields facilitates organizing code into manageable units, each responsible for a specific aspect of functionality of the stored object. It provides mechanisms and best practices to maintain a clear separation of concerns and ensure modularity, reusability, and maintainability in software development.

Type checking formula fields

Crafting Clever Code, One Check at a Time.

By enforcing type constraints, type checking enhances the safety and reliability of programs. It prevents unexpected type conversions or operations that could lead to runtime errors or program crashes.

Type checking formula fields is not straightforward because a formula field is defined only by its result type; nothing is known about its parameters, including their number or types.

This means that, upon entering a formula, type checking only verifies whether the formula might yield the correct result type. During the formula’s execution, the actual parameters provided are checked for their types. Consequently, there is a twofold type-checking mechanism in place.

Twofold type checking 

type checking upon entering a formula

Consider the Employee table from the former article. This table is defined as:

Create table Employees
(EmpId int, Name string, Salary double, Bonus Expr<double>)

Employees

When entering a value for a formula field like: 

Insert into Employees set Bonus = EmpBen * 0.07

This SQL statement will execute successfully as EmpBen * 0.07 might yield a double. Although nothing is known about EmpBen yet, it could be a double or an int at the moment of executing.

Hoever, when entering a value for a formula field like: 

Insert into Employees set Bonus = EmpBen > 8000

This SQL statement will not execute successfully because the formula EmpBen > 8000 will never produce a result of data type double. The  > operator typically yields a boolean result, not a double.

type checking upon executing a formula

Referencing the former article, the computation of a formula field often involves a join with a related table, where the parameters become accessible within the formula field’s scope.
For instance, consider the Employee table and a separate table containing sales figures. The calculation of the bonus for each employee per year can then be achieved with the following straightforward SQL statement.

Employees
Sales

Select EmpId, Year, Name, Bonus 
from Employees natural join Sales

Annual bonus per employee calculated by the bonus formula

When executing the SQL statement, a formula field is parsed, compiled, and bound to its parameters. This process occurs once, and the formula is stored in cache. During this process, the following errors due to formula parsing might occur:

  ♦  ParamScope Error: A parameter might not be found.
  ♦  ParamType Error: A parameter has the wrong type.

And of course there might be run time errors like the ones that might occur in calculated fields like:

  ♦  ZeroDiv: A division by zero is done.
  ♦  Argument error: A function has a wrong actual parameter.

These error values function comparably to null values and can be assessed similar to the is null operator.
Instead of receiving error values as field values, the user can choose to let the execution of the SQL statement fail completely and raise an exception in case any error value occurs.
 
As an example, consider the following SQL statement. Although it lacks context with relevant parameters, it serves as an explanatory illustration.

Select DepId, ‘X’ as Some, code Bonus, Bonus from Employees

Formulas yielding error values.

Error values can be tested with postfix operators like:

  • is error:  any error
  • is param error: any parameter error
  • is paramscope error: parameter not found
  • is run error: any runtime error 
  • is zerodiv error: A division by zero is done.
  • etcetera.

Select DepId, ‘X’ as Some from Employees where Bonus is param error

What about SQL Injection 

Side Effects not Invited: The Drama Free Solution Zone

SQL injection is a type of security vulnerability that occurs when an attacker is able to manipulate or inject malicious SQL code into input fields of an application, such as login forms or search boxes, that interact with a database.

In Colbert’s implementation of formula fields, only mathematical expressions that yield values of basic field types such as int, double, string, datetime, or boolean, are allowed. Additionally, a formula field is strictly an expression. Furthermore, as functions might contribute in the expressions of a formula field, all functions in Colbert prohibit any side effects. This applies for the functions Colberts supplies as for the user defined functions.

Building upon the previously stated guidelines, it’s important to note that a formula field within Colbert’s implementation will never accept an SQL statement. This is because the result of an SQL statement is either a relation or an SQL message, neither of which qualify as basic field types. Even a statement like Select count(*) from Employees yields a relation in Colberts implementation which is not an integer.

Therefore, formula fields are inherently incapable of causing any side effects. Additionally, they cannot be utilized to compose SQL code with malicious injections, safeguarding the integrity and security of Colbert’s system.

Storing formulas in database fields.

Brainlessly effective

Colbert follows a simple concept for storing formulas in database fields. Upon entering an expression in a formula field, it is parsed and type-checked, but the expression is stored as a string in the underlying table. This is an effective approach, as much of the parsing, compiling, and parameter binding can only be done at the moment of execution when the scope is known. The compiled and bound formula field is stored in cache, which is only done once per formula field the first time it is accessed.
In real-life scenarios, this approach is almost never a performance bottleneck, as the number of formulas is typically limited compared to the likely large datasets they act upon.

Survey of Existing Approaches.

The examples provided here, as well as those in the previous article, are not far-fetched; they are questions that arise in real-life situations. It is therefore not surprising that these questions frequently appear in various internet forums where people seek solutions. Additionally, Database Management System vendors also address these types of problems.
Some approaches to addressing these problems are discussed here.

Oracle ™ Expression Filter

Oracle Expression Filter is a feature introduced in Oracle Database 10g, allowing application developers to store, index, and evaluate conditional expressions in one or more columns of a relational table.

It is, however, not easy to use. Users are required to specify an attribute set in advance, including all possible parameters and user defined functions, and then link that attribute set to the formula field. To execute the formula field, users must call an evaluate function and provide both the formula field and all parameters from the attribute set. Additionally, it’s important to note that expression filters can only produce a boolean result and cannot return any other database type.

Oracle declared Expression Filters obsolete in Oracle Database 12c.

SQL Case When

One way to solve the bonus problem without using formula fields in standard SQL is utilize a case when statement:

Select EmpId, Year, Name, 
case
  when Name = ‘Hank’ then 2200
  when Name = ‘Anna’ then (Revenue – Target) * 0.08
  when Name = ‘John’ then Year <= 2018 ? 1000 : 1200
  when Name = ‘Rosa’ then Salary * 0.07
  . . .
  . . .
  else 0
end as Bonus,

from Employees natural join Sales

As it represents a solution, it inherently presents certain drawbacks.

For instance, when dealing with a large number of employees, the case when statement can become challenging if not almost impossible to read and maintain. Additionally, inserting a new employee record necessitates updating the SQL statement.

SQL execute

Although not standard SQL, several Relational Database Management Systems support an execute or similar constructs for dynamic SQL execution.

This execute command is used to execute dynamically constructed SQL statements at runtime and is typically employed within procedural SQL code, such as stored procedures, functions, or triggers, rather than directly within a select statement.

However, you can indirectly utilize dynamic SQL within a select statement by embedding it within a stored procedure or a user-defined function and then invoking that procedure or function in a select statement.

In the previous example concerning bonus calculation, the declaration of such a user-defined function with dynamic SQL execution and its usage in an SQL select statement would be as follows:

create function CalcBonus
(Revenue double,
Target double,
Year integer,
Salary double,
Birthdate Date,
Tenure integer,
Formula varchar)

return varchar as Result boolean;
begin
  execute ‘begin : Result := ‘ || Formula || ‘; end;’
  using out Result;

return
  case when

     Result
then
       ‘TRUE’ else
       ‘FALSE’
  end;
end;

 

select *
from Buyers cross join Cars
where
CalcBonus(Revenue, Target, Year, Salary, Birthdate, Tenure, Formula) = ‘TRUE’

This solution also inherently presents serious drawbacks.

Firstly, since the formula field is just a string (varchar), anything can be entered and there is no type checking in place. This lack of validation poses a significant risk of SQL injection, where any code with malicious side effects could be entered.

Secondly, depending on the RDBMS in use, it might be difficult to determine the cause of a failure in case of an exception related to a specific formula.

Additionally, performance is a concern. In a cross join, the number of records is the product of the number of records from both tables. For each record resulting from the cross join, the formula must be parsed and compiled again, and all parameters must be passed.

Finally, if new parameters need to be addressed at a later stage, a new user-defined function should be written, and the select statement should be modified  accordingly.

Spreadsheet

When it comes to formulas, it’s common to hear the advice that this range of problems should be addressed using a spreadsheet, with the sentiment that ‘databases are for data, and calculations are for spreadsheets’.

Give it a try. Implementations that are both safe, simple, and maintainable are exceptionally rare, if not impossible.

Write a program

Another option would be to develop a program on top of the database. However, this approach is similar to the previous SQL execute solution and therefore shares the same drawbacks.

However, this approach could therefore potentially lead to a secure solution, provided that certain precautions are taken: 

  • An application must be developed to validate input formulas to prevent SQL injection and ensure the formulas yield the correct result type. 
  • Additionally, parsing, compiling, and parameter binding of the formulas must be implemented with the appropriate type check mechanisms and the support of default and user defined functions.
  • To optimize performance, compiled and parameter-bound formulas should be cached.
 

Despite the significant effort involved, even with these precautions, the approach results in a static application that necessitates rewriting whenever the context changes. Additionally, as the output is not a relation, it cannot be analyzed or enriched by incorporating it into any subsequent SQL statement.

A quick summary

In summary, Colbert proposes a straightforward approach to handling formulas stored in database fields with the following characteristics.

  1. Formula fields do not violate the principle of separation of concerns. Instead, they resemble Object-Oriented programming by encapsulating functionality.
  2. Type checking occurs twice: upon entering a formula and upon executing the formula when the context of parameters is known.
  3. There is no risk of SQL injection or other vulnerabilities since formula fields cannot be implemented in a way that causes any side effects while runtime errors are handled effectively.
  4. Formulas are stored as strings and parsed, compiled, and parameter-bound only once, upon the initial call. Once compiled, they are stored in cache.
  5. While there are existing approaches, they all come with significant drawbacks.

More articles

More articles can be found on the website of Colbert

Videos

Experience the Magic of Formulaic Concepts with Colbert

There are some videos available so you can see how easily formulas can be applied in Colbert’s versatile user interface.

Bonus calculation:

  • Every employee has a different bonus formula.

More examples

  • Criteria matching: Buyers define what second hand car they are looking for.
  • Risk analysis: A set of risk criteria is defined with corresponding risk scores.
  • Cohort definition: A population is defined into cohorts according to a set of rules.

Contact

© Bert Vegter 2024