Using LINQ (Language Integrated Queries) in Drupal or how to write queries x5 faster

Using LINQ (Language Integrated Queries) in Drupal or how to write queries x5 faster

Difficulty: 
Come Get Some

Introduction

In this article I will show you how to use an enterprise ready technology (LINQ) to write your Drupal queries x5 faster, with improved readabilty, usability and maintainability.

LINQ is basically Queries in Code - not queries in strings in code - The advantages of having queries integrated in the language are, among others:

  • Strongly typed database schema. That means no more mistakes while writing plain SQL. If you improperly write the name of a table or a field the query simply won't compile at design time. 
  • No magic strings.
  • Intellisense (AKA Autocomplete). Explore the database tables and fields and get hints while writing the query.
  • Faster development.
  • Reusable abstract expressions (lambdas and functions)

One of the big problems with Drupal is that as your requirements and complexity grow both the database abstraction layer and the query builder start to shine for their lack of productivity features (it's not just Drupal but the whole PHP ecosystem). But in Drupal this problem is worse because the data model is extremely decoupled, loose and above all complex. That was the price we had to pay to get revisions, translations and multi value fields all over an abstracted layer that can provide compatibility for multiple database engines.

Before we go any further here is what our sample use case for this article looks like in Drupal VS LINQ. This is a very simple query, but it already looks a mess in Drupal (imagine what a really complex query looks like, you are better off writing plain SQL).

// Nodos del tipo inscripción.
$query = db_select('node','inscription');
$query->condition('inscription.type','inscripcion','=');
// Nos traemos el campo congreso
$query->join('field_data_field_congreso','field_congreso','field_congreso.entity_id = inscription.nid');
// Nos traemos la actividad congreso
$query->join('node','actividad','field_congreso.field_congreso_nid = actividad.nid AND actividad.type = :node_type_congreso', array(':node_type_congreso' => 'congreso'));
// Nos traemos el campo mail 
$query->join('field_data_field_email','field_email','field_email.entity_id = inscription.nid');
// Nos traemos al propietario
$query->join('users', 'inscription_owner', 'inscription_owner.uid = inscription.uid');
// Aplicamos criterio autor y actividad
$or = db_or();
$or->condition('inscription.uid', '', 'IS NULL');
$or->condition('inscription.uid', 1, '=');
$query->condition($or);
// El autor debe tener el mail indicado
$query->condition('field_email.field_email_email', $mail, 'LIKE');
// Nos traemos el ID de inscripción
$query->addField('inscription', 'nid');
$result = $query->execute()->fetchAll();

And this is what that same query looks like in LINQ (it's a GIF animation to show off the power of autocomplete):

No more typing errors, runtime malformed queries or having to look up every 5 seconds what the over engineered Drupal database schema looks like.

This is all cross platform compatible:

  • The queries are run on EF (Entity Framework) that can plug on top of SQL Server, MySQL, Oracle and others.
  • This all runs on top of the .Net framework, that is already working on Linux thanks to Mono and will be natively supported soon thanks to Microsoft.
  • You can interoperate with the LINQ queries from PHP either by means of the NetPHP libraries, or with Phalanger (both of those links point to Github repositories).

Setting up the project and folder structure

To support this from Drupal we will assume that any additional .Net code is going to be customer specific, so we are going to place it into the customer's customization module.

We are going to create a new "net" folder with two subfolders "bin" and "src". The bin is going to be the place where we will be deploying the compiled binaries, and the "src" will contain the Visual Studio project's and source code.

Because we want to keep the data model binaries separate from the actual code we will be creating 3 projects:

  • One containing the Entity Framework / Linq-To-Sql data clases and any custom database acces related logic
  • One containing all our custom code and queries
  • One containing .Net tests

Your solution should end up looking something like this:

Communicating .Net and PHP

The first thing we want to do is to bring the NetPhp technology into our project. This is a bridge that uses PHP's com_dotnet extension so bring into PHP the best of .Net.

We are using the great Composer Manager module to manage module dependencies on external libraries.

Add a composer.json to your module's directory that brings in the NetPhp php libraries:

{
    "require": {
        "drupalonwindows/netphp": "dev-master"
    }
}

Regenerate the global composer.json file either from the UI or running the drush command "composer-json-rebuild".

Navigate into sites/default/composer (or the location you have configured in composer manager) and run a composer install command.

Download the NetPhp binary library, and drop it inside your PHP installation directory (where the php.exe and php-fastcgi.exe files are located).

For this demo project we are going to put all our queries into the DrupalCustom.DrupalQueries class library project in C#. We need a way to retrieve an instance of that class, and make it aware of the database connection settings.

So we came up with this manager to alows us to retrieve the DrupalCustom.DrupalQueries instance whenever we need it and populated with the database connection options.

namespace Drupal\mymodule;

use \NetPhp\Core\NetManager;

class QueryManager {
  // @var NetManager $manager
  private static $manager = NULL;
  private static $DrupalQueries = NULL;

  public static function GetManager() {
    global $databases;
    if (static::$DrupalQueries== null) {
      static::$manager = new NetManager();
      $dll = DRUPAL_ROOT . '\\' . str_replace('/', '\\', drupal_get_path('module', 'mymodule') . '\\net\\src\\DrupalCustom\\bin\\Debug\\DrupalCustom.dll');
      static::$manager->RegisterAssembly($dll, 'DrupalCustom');
      static::$DrupalQueries = static::$manager->Create('DrupalCustom', 'DrupalCustom.DrupalQueries')->Instantiate();
      $conn = $databases['default']['default'];
      static::$DrupalQueries->InitDataContext('6470B', $conn['database'], $conn['username'], $conn['password']);
    }
    return static::$DrupalQueries;
  }
}

This is just a singleton like pattern that will retrieve an instantiated and database aware COM instance of our DrupalCustom.DrupalQueries class.

Let's see how it works with more detail.

To instantiate a .Net class from PHP you need to use a copy of the NetManager class. NetManager takes care of managing assembly and class namespaces:

static::$manager = new NetManager();

Then we must register the location of our binary file in the NetManager and give it an alias. Because we are on a development environment we will be using the /bin/Debug/DrupalCustom.dll file, but on production you might want to change this to another location. Using this location allows us to simply rebuild the .Net project and have everything updated real time.

$dll = DRUPAL_ROOT . '\\' . str_replace('/', '\\', drupal_get_path('module', 'mymodule') . '\\net\\src\\DrupalCustom\\bin\\Debug\\DrupalCustom.dll');
// Register the binary with DrupalCustom alias
static::$manager->RegisterAssembly($dll, 'DrupalCustom');

We are now ready to make an instance of our .Net class by specifying the assembly alias we registered before, and the Full Qualified Name of the class:

static::$DrupalQueries = static::$manager->Create('DrupalCustom', 'DrupalCustom.DrupalQueries')->Instantiate();

The next step is to make this query manager aware of what database is it going to use to run the queries on. We've implemented an InitDataContext that stores internally the database information into the query management class:

$conn = $databases['default']['default'];
static::$DrupalQueries->InitDataContext($conn['host'], $conn['database'], $conn['username'], $conn['password']);

That's it. From now on whenever you want to call a query that is stored in our .Net DrupalQueries class you simply need to call GetManager():

$result = QueryManager::GetManager()->MyCustomQueryOrMethod(param1, param2, param3);

The productivity edge

You might be thinking that the 5 fold claim in improved query development performance is an exaggeration. That is far from true.

Let's see what kind of Drupal specific magic we can leverage from LINQ besides having strongly typed queries, type hinting, autocomplete and design time validation of the query.

One of LINQ's most powerful feature is the ability to reuse expressions.

Let's say for example that we have the node entity, with a bundle named "inscripcion" and that this bundle has a text based field called "field_cargo". Writing the query in C# looks like this:

var query = (from inscripton in dcs.CurrentContext.node.AsExpandable()
             join cargo in dcs.CurrentContext.field_data_field_cargo on inscripton.nid equals cargo.entity_id
             select cargo.field_cargo_value);

But we know that the data structure of field cargo and the join needed to bring it into the query is going to be always the same. So we are going to encapsulate this into its own reusable expression:

public static Expression<Func<node, IQueryable<field_data_field_cargo>>> load_cargo = (n) =>
    DCScope.Current.field_data_field_cargo.Where((i) => i.entity_id == n.nid).DefaultIfEmpty();

Then the original query can reuse this expression for improved readability and maintainablity:

var query = (from inscripton in dcs.CurrentContext.node.AsExpandable()
             from cargo in NodeFields.load_cargo.Invoke(inscripton)
             select cargo.field_cargo_value);

Even better, you can use this same expression to bring into the query the "field_cargo" as many times as you want and with as many bundles of the same entity you would like to. This was just an example, you can bundle into a single expression as much logic as you wish, such as bringing in a field for an entity reference field or any custom conditions you can come up with.

There are some approaches in Drupal to facilitate querying data, such as the EntityFieldQuery, but it is so limited that it is barely useful for anything but for making your code a little bit more readable.

Performance

We are already leveraging deployment and development gains (design-time) but what are the run-time implications of this change. Is there a performance penalty (or gain) for having to use an additional layer of COM + LINQ instead of drupal's query builder?

We are going to benchmark a sample query under several different scenarios.

This is the query in Drupal:

//nodos del tipo inscripción.
$query = db_select('node','inscription');
$query->condition('inscription.type','inscripcion','=');
//unimos con las actividades (congresos)
$query->join('field_data_field_congreso','field_congreso','field_congreso.entity_id = inscription.nid');
$query->join('node','actividad','field_congreso.field_congreso_nid = actividad.nid AND actividad.type = :node_type_congreso', array(':node_type_congreso' => 'congreso'));
//aplicamos criterio autor y actividad
$query->condition('inscription.uid', $uid, '=');
$query->condition('actividad.nid', $nid, '=');
$query->fields('inscription', array('nid'));
$result = $query->execute()->fetchAll();

And this is the query in C#:

var result = (from inscription in dcs.CurrentContext.node.AsExpandable()
              join field_congreso in dcs.CurrentContext.field_data_field_congreso on inscription.nid equals field_congreso.entity_id
              join congreso in dcs.CurrentContext.node on field_congreso.field_congreso_nid equals congreso.nid
              where inscription.type == "inscripcion"
              && inscription.uid == uid
              && congreso.nid == nid
              select inscription.nid);

return result.FirstOrDefault();

We have verified that both queries output a similar SQL statement and that they are 100% equivalent.

We tested running the query 200 times in a tight loop in 3 different scenarios with the following results:

  Maximum (s) Average(s)
Drupal Query Builder 0.61 0.57
LINQ over Entity Framework 6 1.23 1.12
LINQ over Linq-To-Sql 0.57 0.499

Nothing really surprising, the entity framework is a big bloated piece of an ORM, and even with compiled queries (that come by default with EF6) they are 100% slower than Drupal's query builder or Linq-To-Sql. This being Microsoft Flagship ORM at the moment, we don't understand this results and there is a chance that there is some sort of misconfiguration, but we were not able to find it.

We are also aware that EF6 first compilation is slower (it then stores statically the compiled query in the process), and being PHP a state-less architecture looping 200 times over a query is not a real world scenario. EF6 is already out of this game.

We are going to do an ApacheBench test over a page that bootstraps drupal and runs a single query without concurrency and 1000 times. This is a more real life friendly scenario because the COM object will be instantiated on every request, instead of being reused in a loop.

define('DRUPAL_ROOT', getcwd());
require_once DRUPAL_ROOT . '/includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);
// ----------- TEST 1 LINQ
$queries = Drupal\prevencionintegral\QueryManager::GetManager();
$uid1 = $queries->InscripcionDeUsuario(48577, 24778);
// ----------- TEST 2 DRUPAL QUERY BUILDER
$uid1 = Drupal\prevencionintegral\Managers\ManagerCongresoInscriptions::InscripcionDeUsuario(48577, 24778);

The results showed that there was no appreciable difference in throughput, even taking into account that the LINQ version has much more overhead (instantiating a COM object, communicating  through COM, etc.) as compared to Drupal's query builder.

Our conclusion is that from a performance standpoint there is no penalty when moving queries out of Drupal and into LINQ, actually you get a slight performance edge.

 

Comments

Great article as a proof of concept but practically I really don't see the point. I agree that whilst EntityFieldQuery is not a full ORM I fail to agree with the statement:

"There are some approaches in Drupal to facilitate querying data, such as the EntityFieldQuery, but it is so limited that it is barely useful for anything but for making your code a little bit more readable."

EntityFieldQuery has proven quite adequate in my time with Drupal. I would never go to the extreme of introducing Entity Framework in the mix. Especially knowing how slow and tedious .NET development is. (I've been stuck with doing .NET for the past year now) and the extra points of failure in this mix are certainly worth noting.

I would suggest looking into tying something in like Laravel's Eloqent or Doctrine for a more native and sane solution.

Overall though a great proof of concept.

EntityFieldQuery has proven quite adequate in my time with Drupal.

I've even heard of people that don't use a debugger (such as XDebug) and that echo or print have proven adequate for their projects - no ofense - I'm just trying to say it depends on the type of project.

The problem with the EFQ is that it is mainly designed to query 1 entity and its fields and does not manage complex relationships between entities (in D8 FieldQuery it got join support), nor it allows to retrieve specific property or field values from the entity itself or from any of the related entities. To add up, writing complex conditions (multiple nested ands/ors) is not "easy". So EFQ is really only usable on simple scenarios and is more than sufficient for regular websites.

There has been some work on that though, such as the EntityFieldQueryExtra but still far from being a complete solution.

As you say it doesn't look very sane to have all this technology mixture (like it is not quite sane either to use Drupal for complex applications). We are already integrating .Net to offload things from PHP so that mix is already stable and natural to us, and here we are trying to push the integration a bit further because at the moment we are offloading very isolated functionality.

The discussion is not what you can or cannot do with one approach or the other. On the Drupal side you could write your plain SQL statement and be happy querying and retrievig whatever you want, in the same way you can be happy without a debugger, without QA, without error logging, etc.

The point is productivity, maintainability and overal quality. We don't use PHP or Drupal for the really big projects, but I cannot imagine and would not consider for a really "serious" project anything that was not strong typed, with autocomplete and had compile time query validation. Having a data layer with autcomplete, design time query validation and reusable expressions such as LINQ reduce development and maintenance efforts as much as 10%. Now take a team of 20 people working on the same project full time and on an ongoing basis and that 10% is a big load of money. Reusable expressions make the data layer more robust, newcomers quickly see what the whole data model is like while typing, query errors are completely removed from the equation. And its not just about money, but developer sanity and project quality. None of this will make any sense if in your projects you have 50 or maybe 100 custom crafted queries.

[...] practically I really don't see the point

I haven't seen yet a "big-complex" php project where they were not offloading stuff into C, C++ or Java (because of the slowness and shortcomings of PHP) and the strambotic ways they were using to integrate them - even at a per request basis and using command line approaches (proc_open). Lack of proper threading, slowness, etc. are reasons to move workload to other technology/stacks. Having said that, using COM to interact with .Net binaries is a much more smooth and integrated approach than other things I've seen.

[...] how slow and tedious .NET development is

I don't want to go into the .Net vs PHP argument because, besides not even being the same thing and so cannot be compared (are you comparing the language, the ecosystem, the frameworks, the development tools?) you can have fun for hours browsing the internet reading about it and not finding a single serious and properly argumented analysis of it. Some of our software projects have more than 300,000 LOC writen from scratch and something between 4,000 and 6,000 custom queries and I cannot imagine them being done in anything else than .Net and with the amazing development and maintenance costs the productivity oriented .Net ecosystem gives us.

When the weight of the paperwork equals the weight of the aircraft, the plane is ready to fly.

- standard Boeing engineering rule

I just quoted this because in mission critical or big serious projects, it's not just about getting the thing to work. It's about doing it the best way possible.

Add new comment

By: david_garcia Sunday, April 19, 2015 - 00:00