All Bits Considered: data to information to knowledge

13 Oct 2011

Java is losing ground (…yawn…)

The TIOBE programming languages index came up with the following October headline: Java is losing ground despite its new version 7 release.

I'd say it's about time; just following all the Java frameworks becomes a full-time job, not to mention mastering them...

Following the trend towards weakly-typed languages, both Lua and JavaScript posted healthy gains; it would be interesting to see how Google's Dart performs - "a class-based optionally typed programming language for building web applications"

The sudden popularity of Objective-C can only be explained by the continued craze for iPod/iPhone/iPad; I fully expect it to retreat as more Android contenders move into the niche and Apple's iron grip on the development market slips (see my post on how Amazon.com circumvented it with an HTML5 app)

C# is still climbing the stairs; I have a suspicion that it rules supreme in the Windows platform development world. The decision to discontinue support for the Mono project, which allows the .NET Framework to be used on the Linux platform, was, IMHO, a shortsighted one, and will come back to haunt Microsoft in the future; maybe Miguel de Icaza can pull it through with his Xamarin project.

I attribute the rising popularity of SQL procedural extensions such as PL/SQL and Transact-SQL to growing dissatisfaction with ORM frameworks (such as Hibernate and MS Entity Framework) among developers; in many cases the developers' productivity gained from the ability to work in a familiar environment is all but negated by the poor performance resulting from the inefficient query syntax that such frameworks tend to produce...

2 Sep 2011

When sum of parts is more than the whole: INDEX splitting

Not all indices are created equal; some are more useful than others.

In general, having a useless index should not hurt query performance – as long as your RDBMS query optimizer chooses the right one to use… Applying simple logic could help you speed up your queries without delving into the black art of execution plans and query hints.

Consider a fairly common piece of data such as an address, say, "100 NE Elm Street, Laurel, Nebraska, 68745".

The [city], [state] and [zip] are natural candidates to be placed into fields of their own; in some highly normalized data models each element of the street address would also be assigned its own field (and sometimes placed into a different table, too): [100], [NE], [Elm], [Street]. But it is far more likely that the entire character string "100 NE Elm Street" will be stuffed into a single [ADDRESS] field, like this:

address
-------------------
100 NE Elm Street
100 SE Oak Street
300 NE Elm Street
300 NW Aspen Street
700 SW Elm Street

An index on such a field would be fairly useless for pattern searches, because the numbers in front of the address are treated as part of the address character string, and a full table scan is likely to be performed for a query like this:

SELECT * FROM [table] WHERE address LIKE '%ELM%'

Creating a FULL TEXT index would help in this situation, but it is a fairly expensive solution for small text fields like this. One possible alternative would be to split the address into separate fields of more appropriate data types (e.g., INTEGER for the street number), or to add these fields to the table (yes, this will result in increased maintenance and/or data duplication, so the trade-offs must be carefully considered):

street_num   street_geo   address
100          NE           Elm Street
100          SE           Oak Street
300          NE           Elm Street
300          NW           Aspen Street
700          SW           Elm Street

Now indices created on the columns (fields) [STREET_NUM], [STREET_GEO] and [ADDRESS] – or combinations thereof – can be efficiently utilized by the query optimizer.
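To make this concrete, here is a minimal sketch of the split schema and its indices (table and index names are made up for illustration):

-- split schema: the street number and geo prefix get their own, properly typed, columns
CREATE TABLE dbo.addresses (
    street_num INT,           -- e.g. 100
    street_geo CHAR(2),       -- e.g. NE, SE, NW, SW
    address    VARCHAR(100)   -- street name only, e.g. 'Elm Street'
)

CREATE INDEX ix_addresses_address ON dbo.addresses (address)
CREATE INDEX ix_addresses_num_geo ON dbo.addresses (street_num, street_geo)

-- the pattern search no longer needs a leading wildcard,
-- so the optimizer can seek on ix_addresses_address
SELECT * FROM dbo.addresses WHERE address LIKE 'Elm%'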

Comparative tests run on Microsoft SQL Server 2005 against a table with half a million rows in it showed an approximately 10-fold performance increase for the query with the LIKE predicate.

As with every index, these have to be maintained, and statistics for the table must be updated regularly to keep performance optimal.

Again, the TANSTAAFL principle ("there ain't no such thing as a free lunch") fully applies here – the increase in speed is paid for with increased complexity and maintenance (more fields to populate and keep in sync, more indices to create and maintain, etc.)

NB: the original idea for the post was formulated by Andrew Tappert - a senior developer with Oregon Health Authority.

28 Apr 2011

Inconsistent behavior in Oracle 10g XE Web Interface

Oracle 10g Express Edition comes with a nice web interface - APEX (Oracle Application Express), a rapid development tool for Web applications on the Oracle database.

While using the tool for my Discovering SQL book, I noticed some inconsistent behavior, which I attribute to the DHTML implementation in this particular version (10g Express). The issues appear to be OS-dependent, since I have not encountered them in either Windows XP or Windows Vista installations.

 

The inconsistencies of the browser behavior can be grouped into two major categories - local and remote.

Accessing Oracle 10g XE with APEX on the local machine (e.g. http://127.0.0.1:8080/apex/)

The Opera (v. 11.10), Chrome (v. 10.0.648.205) and Firefox (v. 4.0) browsers running on Windows 2003 Server all fail to display uploaded scripts, as shown in the pictures below. This behavior does not depend on the length of the script or on the script file encoding (UTF-8, code pages, etc.). I list the exact versions used, but the same holds true for earlier versions as well. The only browser that correctly displayed the script on the local machine was Internet Explorer (v. 8.0.6001.18702).

[Gallery: screenshots from Chrome, Firefox, IE and Opera locally, followed by remote IE, Opera and Firefox, respectively.]

Accessing Oracle 10g XE with APEX on a remote machine (e.g. http://remotemachine.net:8080/apex/)

All browsers break when the source of a script is accessed through the APEX interface remotely (only Internet Explorer gives a meaningful error). The fifth picture in the above gallery shows the Oracle PL/SQL error along with DHTML tags.

15 Apr 2011

Oracle® 11g Database Express Edition

Oracle 11g was released in 2007, but without a corresponding Express Edition; developers were encouraged to use Oracle 10g Express instead. This gap is about to close with a beta release: Oracle Express Edition for Oracle 11g.

It's too late to be added to my Discovering SQL book now (I used Oracle 10g Express), but I will be adding scripts and slide presentations reflecting this development to my support site.

18 Feb 2011

Elementary, my dear Watson!

A new era was officially introduced on February 14, 2011, when the IBM Watson computer took on a "uniquely human" activity - playing Jeopardy. The machine was named after IBM founder Thomas J. Watson (in case anyone was wondering why it was not named after Sherlock Holmes), and it represents the next giant step towards something that was dubbed "artificial intelligence" in 1956, and has remained almost exclusively in the domain of science fiction ever since.

For a long time it has been understood that simply possessing information does not equal the ability to answer questions, let alone intelligent ones. A search engine, even the most advanced one, relies on keywords to search for information; it is up to humans to come up with a clever string of keywords, and it is ultimately a human task to decide whether the information returned constitutes an answer to the question. Watson takes it a step further - it has to figure out the question, deduce the context, and come up with the statistically most probable answer. This is very different from the Deep Blue computer, which beat chess grandmaster Garry Kasparov in 1997. The chess game can be reduced to a set of well-defined mathematical problems in combinatorics - a very large set, to be sure, but ultimately susceptible to the number-crunching power of the computer: no ambiguity, no contextual variations. IBM Watson had to deal with the uncertainty of human language; it had to interpret metaphors, it had to understand nuances.

The tables have turned again - instead of humans learning the machine's language to query for answers, it is the machine that learned to understand questions posed with all the ambiguity of human language. With clever programming algorithms the computer was able to "understand" a natural language query, and come up with a correct answer - most of the time, that is.

Does Watson use SQL to come up with the answer? The details of the implementation are a closely guarded secret, at least for now. Given the limitations imposed by the Jeopardy rules, the narrowly focused purpose and the relatively modest computing power (around 2,000 CPUs, even though "connected in a very special way" according to Dr. Christopher Welty, a member of the IBM artificial intelligence group - a far cry from the 750,000 cores of the IBM Mira supercomputer being built for DOE's Argonne National Laboratory), it most probably did not use a relational database to store data, but rather relied on proprietary data structures and algorithms to search and retrieve the information. Eventually, these advances will make it into mainstream database technology, and the way we transform data into information into knowledge will change, again. The future is near.

Update: IBM will incorporate Nuance CLU speech-recognition applications into the Watson supercomputer to provide information that assists doctors as they make diagnoses.

2 Dec 2010

I call VIEW frozen SQL query

Johann Wolfgang von Goethe, a late eighteenth/early nineteenth century German writer and philosopher, "considered by many one of the most important thinkers of Western Civilization", once remarked: "I call architecture frozen music." He might have been paraphrasing his compatriot Friedrich Wilhelm Joseph von Schelling, who had expressed a similar idea thirty years earlier.

Regardless, I think this definition perfectly captures the nature of the SQL view construct. Hereby I claim authorship: a VIEW is a frozen query.
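To put some code behind the claim, here is a generic sketch (table and column names are made up for illustration):

-- the query, frozen under a name; selecting from the view re-executes it
CREATE VIEW v_customer_orders AS
SELECT c.customer_name, o.order_date, o.total
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id

-- and "thawed" on demand:
SELECT * FROM v_customer_orders WHERE total > 100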

Most books on SQL introduce views as part of a "Creating database objects" chapter, and I am as guilty as anyone, having done so in my SQL Bible. Now I believe the topic belongs in the multi-table queries chapter, with a secondary appearance in the chapter on SQL security.

P.S. should you find references to the phrase used in SQL context earlier than November 29, 2010, I'd be happy to relinquish it 🙂

1 Dec 2010

A CLAUSE or an OPERATOR?

Once in a while I come across internet articles explaining basic concepts of SQL to beginner and intermediate programmers, and every time I see confusion in terminology. This time it was a discussion of what the author called an "EXCEPT clause" in SQL Server.

Except that it is not a CLAUSE, it is an OPERATOR.

A definition of a clause is “a group of words containing a subject and predicate and functioning as a member of a complex or compound sentence”; therefore WHERE, GROUP BY and ORDER BY are rightfully classified as clauses.

An operator is defined as "something … that … performs a mathematical or logical operation" (Merriam-Webster). Therefore EXCEPT is an operator, performing a logical operation upon two sets of data returned by the queries.

There are clauses in SQL, to be sure: a WHERE clause, an ORDER BY clause, a HAVING clause; but neither EXCEPT nor its counterpart INTERSECT belongs in this category. They are operators.
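To illustrate with hypothetical tables: the operator combines the results of two complete queries, while a clause merely qualifies a single one:

-- customers who have never placed an order
SELECT customer_id FROM customers
EXCEPT
SELECT customer_id FROM orders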

There is a Chinese proverb stating that the beginning of wisdom is to call things by their right names. This could be a good start.

11 Jun 2010

Keeping up with database changes

Scenario: several developers are hard at work cranking out code. The application under development relies on an RDBMS back-end for persistent storage (in this particular case the database is Microsoft SQL Server 2005, but the technique described applies to any RDBMS supporting DDL triggers). Developers are making changes to the client application code, creating/altering/dropping database objects (stored procedures, tables, views, etc.) and, in the heat of the moment, forgetting to communicate the changes to their teammates, let alone the project manager...

Yes, I know - this is not how it is supposed to happen; and yet in the world out there, more often than not, it does... Here are some do-it-yourself ideas on how you could alleviate the pain and spare yourself some nasty surprises without buying more tools...

Enter DDL triggers. This is a relatively new feature in Microsoft SQL Server (though Oracle has had them for ages), and, among many other things (rolling back changes, for instance), it can be used to solve the problem stated above.

A DDL (Data Definition Language) trigger in MS SQL Server can have two scopes - server and database. Table 1 at the end of this post lists all the events for which a DDL trigger can be created, grouped by scope. For the full syntax of creating a DDL trigger please see the vendor's documentation; here I will only touch on the basics needed to illustrate a solution.

Here's the database-scope trigger we are going to use to monitor events:

CREATE TRIGGER [tr_DDL_ALERT] ON DATABASE ---- trigger is created in the context of a given database
FOR CREATE_TABLE, DROP_TABLE, ALTER_TABLE ---- which events to capture; see Table 1 for the full list
                                          ---- (use FOR DDL_DATABASE_LEVEL_EVENTS to capture all database events)
AS
SET NOCOUNT ON
DECLARE @xmlEventData XML       ---- the generated event data is in XML format
SET @xmlEventData = EVENTDATA() ---- get data from the EVENTDATA() function

Now, this trigger alone would not be of much use to anybody; you need to parse the information contained in the XML message passed into your trigger upon the event. You could parse it and send an email message, or you could save it into a database, or both.

The following code saves it into a table [tbDDLEventLog] - which, of course, has to be created beforehand:

INSERT INTO dbo.tbDDLEventLog
(
EventTime
,EventType
,ServerName
,DatabaseName
,ObjectType
,ObjectName
,UserName
,CommandText
)
SELECT REPLACE(CONVERT(VARCHAR(50), @xmlEventData.query('data(/EVENT_INSTANCE/PostTime)')),'T', ' ')
,CONVERT(VARCHAR(100), @xmlEventData.query('data(/EVENT_INSTANCE/EventType)'))
,CONVERT(VARCHAR(100), @xmlEventData.query('data(/EVENT_INSTANCE/ServerName)'))
,CONVERT(VARCHAR(100), @xmlEventData.query('data(/EVENT_INSTANCE/DatabaseName)'))
,CONVERT(VARCHAR(100), @xmlEventData.query('data(/EVENT_INSTANCE/ObjectType)'))
,CONVERT(VARCHAR(100), @xmlEventData.query('data(/EVENT_INSTANCE/ObjectName)'))
,CONVERT(VARCHAR(100), @xmlEventData.query('data(/EVENT_INSTANCE/UserName)'))
,CONVERT(VARCHAR(MAX), @xmlEventData.query('data(/EVENT_INSTANCE/TSQLCommand/CommandText)'))
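
For reference, here is a minimal sketch of the [tbDDLEventLog] table the INSERT above expects; the column sizes mirror the CONVERT calls, while the identity column is my own assumption:

CREATE TABLE dbo.tbDDLEventLog (
    EventLogID   INT IDENTITY(1,1) PRIMARY KEY, -- hypothetical surrogate key
    EventTime    VARCHAR(50),   -- PostTime as extracted from EVENTDATA()
    EventType    VARCHAR(100),
    ServerName   VARCHAR(100),
    DatabaseName VARCHAR(100),
    ObjectType   VARCHAR(100),
    ObjectName   VARCHAR(100),
    UserName     VARCHAR(100),
    CommandText  VARCHAR(MAX)   -- the full T-SQL statement that fired the trigger
)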

And this sends out email notifications using a potentially obsolete extended stored procedure (assemble the message, the @body variable, from the elements of the XML message as shown in the example above):

EXEC master..xp_smtp_sendmail
@TO = 'me@somewhere.com'
,@from = 'someone@somewhere.com'
,@message = @body
,@subject = 'database was modified'
,@server = 'smtp.mydomain.com'

The long-term solution would be, of course, to configure SQL Server Database Mail.

In my next post I will describe how database triggers could be integrated with Hudson - an open source Continuous Integration (CI) server.

Table 1. List of the values to use with server and database scope DDL triggers

Server scope:
ALTER_AUTHORIZATION_SERVER
CREATE_DATABASE, ALTER_DATABASE, DROP_DATABASE
CREATE_ENDPOINT, DROP_ENDPOINT
CREATE_LOGIN, ALTER_LOGIN, DROP_LOGIN
GRANT_SERVER, DENY_SERVER, REVOKE_SERVER

Database scope:
CREATE_APPLICATION_ROLE, ALTER_APPLICATION_ROLE, DROP_APPLICATION_ROLE
CREATE_ASSEMBLY, ALTER_ASSEMBLY, DROP_ASSEMBLY
ALTER_AUTHORIZATION_DATABASE
CREATE_CERTIFICATE, ALTER_CERTIFICATE, DROP_CERTIFICATE
CREATE_CONTRACT, DROP_CONTRACT
GRANT_DATABASE, DENY_DATABASE, REVOKE_DATABASE
CREATE_EVENT_NOTIFICATION, DROP_EVENT_NOTIFICATION
CREATE_FUNCTION, ALTER_FUNCTION, DROP_FUNCTION
CREATE_INDEX, ALTER_INDEX, DROP_INDEX
CREATE_MESSAGE_TYPE, ALTER_MESSAGE_TYPE, DROP_MESSAGE_TYPE
CREATE_PARTITION_FUNCTION, ALTER_PARTITION_FUNCTION, DROP_PARTITION_FUNCTION
CREATE_PARTITION_SCHEME, ALTER_PARTITION_SCHEME, DROP_PARTITION_SCHEME
CREATE_PROCEDURE, ALTER_PROCEDURE, DROP_PROCEDURE
CREATE_QUEUE, ALTER_QUEUE, DROP_QUEUE
CREATE_REMOTE_SERVICE_BINDING, ALTER_REMOTE_SERVICE_BINDING, DROP_REMOTE_SERVICE_BINDING
CREATE_ROLE, ALTER_ROLE, DROP_ROLE
CREATE_ROUTE, ALTER_ROUTE, DROP_ROUTE
CREATE_SCHEMA, ALTER_SCHEMA, DROP_SCHEMA
CREATE_SERVICE, ALTER_SERVICE, DROP_SERVICE
CREATE_STATISTICS, DROP_STATISTICS, UPDATE_STATISTICS
CREATE_SYNONYM, DROP_SYNONYM
CREATE_TABLE, ALTER_TABLE, DROP_TABLE
CREATE_TRIGGER, ALTER_TRIGGER, DROP_TRIGGER
CREATE_TYPE, DROP_TYPE
CREATE_USER, ALTER_USER, DROP_USER
CREATE_VIEW, ALTER_VIEW, DROP_VIEW
CREATE_XML_SCHEMA_COLLECTION, ALTER_XML_SCHEMA_COLLECTION, DROP_XML_SCHEMA_COLLECTION

3 Jun 2010

Continuous integration with SQLCMD and Hudson

If you are not doing continuous integration, you should; and if you are, then you ought to consider the database install an integral part of your build process.

Most CI servers out there allow you to execute batch or shell commands, and virtually every RDBMS provides a command line utility (and creating one of your own, if needed, is rather trivial).

Installing a database as part of your build process, and populating it with data, could play a role in your unit testing strategy, and should definitely be considered an integral part of functional and regression testing procedures.

The following is but one example of how to make an MS SQL Server database install part of your build process, using the Microsoft command line utility SQLCMD and the open source continuous integration server Hudson. It could be applied to any other RDBMS package - MySQL, PostgreSQL, Oracle, DB2 or Sybase - with minor adjustments.

The command line utility can be downloaded separately, or installed as part of a SQL Server 200X installation. If your unit tests require database support, it might be a good idea to install the free SQL Server Express Edition, which could be started as part of the build process and shut down afterwards.

"The sqlcmd utilitylets you enter Transact-SQL statements, system procedures, and script files at the command prompt, in Query Editor in SQLCMD mode, in a Windows script file or in an operating system (Cmd.exe) job step of a SQL Server Agent job. This utility uses OLE DB to execute Transact-SQL batches."

This provides an opportunity to make creation of a database and all dependent database objects a step in your continuous integration build process with Hudson, an open source continuous integration server, by executing scripts - either integrated with your build management utility, such as Maven, Ant or MSBuild (depending on your platform), or as plain batch or shell commands.

A very basic Windows batch command in Hudson installing a database through SQLCMD might look like this:

sqlcmd -S<IP address>,[port] -U<user> -P<password> -dmaster -i%WORKSPACE%\exec.sql
  • -S indicates the IP address (and, optionally, the port) of the SQL Server instance to connect to
  • -U and -P supply the user ID and password, respectively (this example uses SQL Server Authentication)
  • -d specifies the default database to connect to; [master] is the one you want if creating a database is part of your build process.

NB: for the complete list of command options see the documentation. Keep in mind that the user ID and password are in clear text, and will be sent over the network as such (unless you are using DAC). To minimize the amount of hard-coded credentials, use include files in your script.

Here is an example of how the SQL code could be organized, in order of execution (I will link the script files soon):

1. exec.sql - main controller of the database installation process (a sketch follows this list)
2. constants.config - contains declarations of all variables to be used in the scripts; note that the file extension is irrelevant for execution
3. backupDB.sql - backs up the existing database (if present); note that the backup directory must exist on the remote computer
4. createDB.sql - creates the new database; note that all the paths must exist on the remote computer
5. createTables.sql - creates all tables in the database; it might include creation of indices and constraints as part of the script, but I would advise against it because of potential dependency conflicts
6. createFunctions.sql - creates all the user-defined functions for the database; the order in which objects are created is important: placing functions before [views] and [stored procedures] reflects a common dependency pattern, as both could use the functions
7. createViews.sql - creates all views
8. createProcedures.sql - creates all procedures
9. createConstraints.sql - adds constraints to the objects: primary keys, foreign keys, indices, etc.
10. importData.sql - if your database has static data, this could be used to add it at creation time; you may want to switch 9 and 10, as your data might potentially violate constraints (e.g., orphaned records); this also could be used in unit testing strategies
11. createUsers.sql - adds all users; this script assumes that logins are already created (if not, add a script to create logins first)
12. grantPrivileges.sql - grants privileges on the objects (e.g., EXECUTE)
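
A minimal sketch of what [exec.sql] might look like, pulling the other files in with the SQLCMD :r (include) directive; the :on error directive is optional but useful under CI:

:on error exit           -- abort the whole run on the first error
:r constants.config      -- variable declarations (but see the GO gotcha below)
:r backupDB.sql
:r createDB.sql
:r createTables.sql
:r createFunctions.sql
:r createViews.sql
:r createProcedures.sql
:r createConstraints.sql
:r importData.sql
:r createUsers.sql
:r grantPrivileges.sql
PRINT 'Database installation completed'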

Gotchas:

It is important to understand that the GO command completes batch execution and flushes the buffer; it makes SQLCMD "forget" everything you might have declared prior to executing the command. In the above example, all variables declared in [constants.config] are no longer part of the script once a GO command has been issued.
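A two-line illustration (the variable name is arbitrary):

DECLARE @build_label VARCHAR(20)
SET @build_label = 'nightly'
PRINT @build_label  -- works: same batch
GO
PRINT @build_label  -- fails with "Must declare the scalar variable" - GO started a new batch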

When creating scripts, keep in mind the difference between local (Hudson) directories and remote (SQL Server) ones. The former refer to the location of the SQL script files checked out by Hudson from your source control, understood by SQLCMD and Hudson only; the latter specify directories that SQL Server understands - backup and database locations.

SQLCMD takes its arguments in clear text, which constitutes a potential security breach; use it in a fully trusted environment. An alternative would be to implement a workaround such as local batch files in secure directories with hard-coded user IDs/passwords, relying on the Hudson security matrix: only users with access to the server would be able to see them. This does increase maintenance but is relatively easy to implement.

If you want SQLCMD-generated messages to be displayed in the Hudson console output, do not specify an output file. Alternatively, I could envision a plugin that would parse the output file and present it nicely in the Hudson environment; I might take a stab at it, time permitting.

The successful execution of the scripts relies on the correct order of creation - you must figure out the object dependencies and factor them into your scripts. Unfortunately, this is a classic Catch-22: the reliable way to determine dependencies is to query SQL Server after the objects have been created… which means that you'd have to run all the scripts manually first, and adjust your scripts accordingly.
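For SQL Server 2005, a query along these lines could extract the dependencies after that first manual run (a sketch only - sys.sql_dependencies will not catch objects referenced through dynamic SQL):

SELECT OBJECT_NAME(d.object_id)           AS dependent_object
      ,OBJECT_NAME(d.referenced_major_id) AS referenced_object
FROM sys.sql_dependencies d
ORDER BY referenced_object, dependent_object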

The utility also allows you to perform many administrative tasks. For example, the ability to re-create a test environment on demand can save many hours of developers' time, and being able to back up and/or restore a database could be just such a time saver. Here is an example that restores a database from a backup to a local SQL Express instance:

sqlcmd -S .\SQLEXPRESS -i restoreDB.sql -v database="%1" -v root="D:\backups"

The [restoreDB.sql] script might contain something like this:

USE master
GO
-- if the target database already exists, drop other connections before restoring over it
IF EXISTS (SELECT * FROM sys.databases WHERE name = N'$(database)')
  ALTER DATABASE $(database) SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

RESTORE DATABASE $(database)
FROM DISK = '$(root)\$(database).bak'
WITH REPLACE;

GO

Caveat: the above script accepts all the default options contained in the backup, such as the location of the data and log files; if, for some reason, those restore paths do not exist on the target machine, the restore operation will fail. You may want to query the backup for its metadata (e.g., the logical names for data and logs), and then use the MOVE option to restore to different locations.
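A sketch of that approach; the logical file names used here ($(database)_Data and $(database)_Log) and the target paths are assumptions - RESTORE FILELISTONLY shows the real names recorded in the backup:

-- inspect the logical file names stored in the backup
RESTORE FILELISTONLY FROM DISK = '$(root)\$(database).bak'

-- restore, relocating data and log files to paths that exist on the target machine
RESTORE DATABASE $(database)
FROM DISK = '$(root)\$(database).bak'
WITH REPLACE,
     MOVE '$(database)_Data' TO 'C:\SQLData\$(database).mdf',
     MOVE '$(database)_Log'  TO 'C:\SQLData\$(database)_log.ldf'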

While SQLCMD does return errors to the calling process, which you could redirect to the Hudson console, you might want to check the status of execution by querying MSDB - either as part of the same [restoreDB.sql] script that handles the restoration, or in a separate Hudson build step (a new database session):

SELECT destination_database_name, MAX(restore_date) AS restored_on
FROM msdb..restorehistory
GROUP BY destination_database_name

One can spend time polishing the scripts, adding error handling and safeguards (e.g., wrapping them in stored procedures, parameterizing inputs, etc.). Ultimately, there is a need for a Hudson plugin encapsulating SQLCMD functionality (I am tempted to take a stab at it myself, time permitting :).

21 Jan 2010

SQL Server: passing data between procedures

The common programming task of passing parameters between functions is far from simple in Transact-SQL. One has to pay close attention to the particular version of the RDBMS that implements the language. To add to the confusion, ever since Microsoft SQL Server and Sybase parted ways (at versions 7.0 and 11.5, respectively), there have been two ever-diverging dialects of Transact-SQL.

The article How to Share Data Between Stored Procedures by Erland Sommarskog goes into excruciating detail explaining the different options a programmer has when there is a need to pass data between stored procedures. It saved my team some time, and provided an opportunity to learn. Thank you!

The following table is taken verbatim from the original post by Mr. Sommarskog, and links back to his site:

- Using OUTPUT Parameters (Output; all versions) - Not generally applicable, but sometimes overlooked.
- Table-valued Functions (Output; SQL 2000) - Probably the best method for output, but has some restrictions.
  - Inline Functions - Use this when you want to reuse a single SELECT.
  - Multi-statement Functions - When you need to encapsulate more complex logic.
- Using a Table (In/Out; all versions) - The most general method, with no restrictions, but a little more complex to use.
  - Sharing a Temp Table - Mainly for a single pair of caller/callee.
  - Process-keyed Table - Best choice for many callers to the same callee.
  - Global Temp Tables - A variation of Process-Keyed.
- INSERT-EXEC (Output; SQL 6.5) - Does not require a rewrite. Has some gotchas.
- Table Parameters and Table Types (In/(Out); SQL 2008) - Could have been the final answer, but due to a restriction it is only mildly useful in this context.
- Using the CLR (Output; SQL 2005) - Does not require a rewrite. Clunky, but useful as a last resort when INSERT-EXEC does not work.
- OPENQUERY (Output; SQL 7) - Does not require a rewrite. Tricky, with many pitfalls.
- Using XML (In/Out; SQL 2005) - A roundabout way that requires you to make a rewrite, but it has some advantages over the other methods.
- Using Cursor Variables (Output; SQL 7) - Not recommendable.
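
As a small taste of the options above, here is a minimal INSERT-EXEC sketch (procedure, table and column names are made up for illustration):

-- callee: any procedure returning a result set
CREATE PROCEDURE dbo.usp_GetActiveUsers
AS
SELECT user_id, user_name FROM dbo.users WHERE is_active = 1
GO

-- caller: materialize the callee's result set into a temp table
CREATE TABLE #active_users (user_id INT, user_name VARCHAR(100))

INSERT INTO #active_users (user_id, user_name)
EXEC dbo.usp_GetActiveUsers

SELECT COUNT(*) AS active_user_count FROM #active_users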