# Data Scientist | The Significance of Relational Algebra in Data Science

June 21, 2015

We have been using relational databases for decades now, and database normalisation remains key. A relational database rests on three base components – structures, constraints and operations. Although this is only one of the formal sources of data a data scientist will encounter, a data scientist will also need to store values in one, for example to hold processed data for visualisation or for further manipulation.

As a data scientist, it is necessary to understand that programs manipulating tabular data exhibit an algebraic structure, which allows reasoning about them and manipulating them. This algebra of tables includes operations such as Select, Update, Insert and Delete, as well as projection over columns and joins with other tables. It also includes aggregation, union, difference, cross product and many others.
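The table operations above can be sketched with Python's built-in `sqlite3` module. The `employees` and `departments` tables, and all the values in them, are hypothetical examples made up for illustration:

```python
import sqlite3

# Hypothetical in-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, dept TEXT, salary REAL)")
conn.execute("CREATE TABLE departments (dept TEXT, location TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)",
                 [(1, "Ada", "eng", 95000.0),
                  (2, "Grace", "eng", 105000.0),
                  (3, "Edgar", "sales", 70000.0)])
conn.executemany("INSERT INTO departments VALUES (?, ?)",
                 [("eng", "Berlin"), ("sales", "Paris")])

# Selection: only the rows where salary > 80000.
rows = conn.execute("SELECT * FROM employees WHERE salary > 80000").fetchall()

# Projection: only the name column.
names = conn.execute("SELECT name FROM employees").fetchall()

# Join: each employee paired with their department's location.
joined = conn.execute(
    "SELECT e.name, d.location FROM employees e "
    "JOIN departments d ON e.dept = d.dept").fetchall()

# Aggregation: average salary per department.
avg = conn.execute(
    "SELECT dept, AVG(salary) FROM employees GROUP BY dept").fetchall()
```

Each query returns a new table (here, a list of row tuples), which is the algebraic property the text describes: operations on tables yield tables, so they compose.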

Hence, it is necessary to understand relational algebra along with relational databases: it helps a data scientist optimise queries and reason soundly about data while performing various operations. As an example, applying the algebraic laws of arithmetic operations – division, multiplication, addition and subtraction – can reduce the overhead of executing a query and deliver a quality response time. Data scientists often neglect these simple laws, which can impact the result sets. For example, in the expression p = ((z*5) + (z*8) + a) / b, the variables *a*, *z* and *b* are replaced with values during evaluation. If *a* is always 0 and *b* is always 1, unnecessary algebraic operations are still being evaluated on every row. Remember, we are talking about datasets of many gigabytes, not a few megabytes.

As a data scientist, you should think about the numbers as well as the tables and columns. In the earlier expression, suppose *z* is 2: adding 0 to any number, or dividing any number by 1, makes no difference to the output. Hence it pays to break the expression down, validate it, and rewrite it so that it performs fewer evaluations and cycles. This is *symbolic reasoning*, and it is important to understand that a computer simply follows the arithmetic instructions in a given expression; it will not suggest or perform any symbolic reasoning on its own. That omission becomes expensive when the objects you manipulate are not mere integers but terabyte-sized tables, where a redundant operation is repeated across every row.
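The simplification described above can be shown directly with the expression from the text. The function names are illustrative, not from any library:

```python
# The expression p = ((z*5) + (z*8) + a) / b from the text, evaluated naively.
def p_naive(z, a=0, b=1):
    return ((z * 5) + (z * 8) + a) / b

# Symbolically simplified form: z*5 + z*8 = z*13, and since a is always 0
# and b is always 1, adding a and dividing by b are identity operations.
def p_simplified(z):
    return z * 13

# Both forms agree for z = 2, but the simplified form performs one
# multiplication instead of two multiplications, two additions and a division.
assert p_naive(2) == p_simplified(2) == 26
```

Applied once, the saving is trivial; applied to a column of billions of values, removing two no-op operations per row is exactly the kind of win a query optimiser looks for.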

So, it is a question of cost-based optimization, since the shortest version of an expression does not always perform best. Relational database engines parse a SQL query into relational algebra before execution and associate a cost with each candidate plan, depending on data volume and the predicates in the WHERE/GROUP BY/HAVING clauses. It is good practice to evaluate the candidate expressions on sample data and choose the one with the lowest cost. Note that performing a query on a table (or tables) always returns a table, whether you call it a *result set*, *dataset* or *data view*.
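One concrete way to inspect the plan an engine's optimiser chose is SQLite's `EXPLAIN QUERY PLAN`, again via the stdlib `sqlite3` module. The `sales` table and index are hypothetical, and the exact wording of the plan text varies between SQLite versions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_region ON sales (region)")

# EXPLAIN QUERY PLAN reports the access path the cost-based optimiser picked;
# with an index on region, an equality predicate can use an index search
# rather than a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT SUM(amount) FROM sales WHERE region = 'EU'").fetchall()
for row in plan:
    print(row)  # the last column is a human-readable plan step
```

Comparing plans like this (ideally with realistic data volumes, since costs are volume-dependent) is how you validate which rewrite of a query is actually cheaper, rather than guessing.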

Hence, you must also consider the different ways tables can be associated in a join (left, right or cross). Furthermore, these are often combined with logical operations such as *AND* and *OR*, and with negating conditions using *NOT*.
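These join forms and logical operators can be sketched in the same stdlib `sqlite3` setting; the `orders` and `customers` tables and their contents are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.execute("CREATE TABLE customers (name TEXT, country TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "Ada", 120.0), (2, "Grace", 40.0), (3, "Linus", 75.0)])
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Ada", "UK"), ("Grace", "US")])

# LEFT JOIN keeps every order, filling NULL where no customer matches
# (order 3 has no row in customers).
left = conn.execute(
    "SELECT o.id, c.country FROM orders o "
    "LEFT JOIN customers c ON o.customer = c.name").fetchall()

# CROSS JOIN is the relational cross product: every order paired
# with every customer (3 orders x 2 customers = 6 rows).
cross = conn.execute(
    "SELECT o.id, c.name FROM orders o CROSS JOIN customers c").fetchall()

# AND, OR and NOT combine predicates in the WHERE clause.
filtered = conn.execute(
    "SELECT id FROM orders "
    "WHERE total > 50 AND NOT customer = 'Linus'").fetchall()
```

(RIGHT JOIN is only available in newer SQLite releases, so it is omitted here; the left-join form with the tables swapped is equivalent.)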