Writing Efficient Queries in Drupal

Authored by Steve Williams


Are your Drupal views slow? Do results take so long that your clients forget what they were searching for? Have your clients stopped using a feature, dashboard, or report because of performance? After reading this post, you will know how to convert your slow queries into efficient queries, dramatically improving render time while satisfying your clients.

The Landscape: Current State of Data Structure in Drupal

Drupal stores its data in a relational database such as MySQL, MariaDB, or PostgreSQL. Drupal 7 introduced a new data schema that separates each field into its own table, creating a normalized storage system. The advantage of giving each field its own table is the flexibility to relate a single field, and its schema, to several content types. The disadvantage of a normalized, relational approach is having to join the tables back together for analysis and rendering.

Use Drupal Views as Intended

Views is a highly popular module in the Drupal community: so popular that most people will not start development on Drupal without it, and so popular that it is being integrated into Drupal 8 core. Views is amazing, but it is a query framework, and it limits your ability to optimize and create efficient queries. Views is intended for quick development and easy deployment; for large, specialized data sets with complicated relationships, you will find you need to move your views into code for more control. Before abandoning Views, consider using hooks like hook_views_query_alter() and hook_views_query_substitutions() to alter and substitute queries before execution.
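As a minimal sketch, a module can trim a view’s query before it executes. The module name, view name, and conditions below are assumptions for illustration:

/**
 * Implements hook_views_query_alter().
 */
function mymodule_views_query_alter(&$view, &$query) {
  // Only touch a hypothetical, known-slow view.
  if ($view->name == 'school_directory') {
    // Drop a DISTINCT the view does not actually need.
    $query->distinct = FALSE;
    // Add a condition directly to the query before it runs.
    $query->add_where(0, 'node.status', 1, '=');
  }
}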

Be Efficient, Precise and Direct

Define efficient fields by limiting their size to only what is required. This reduces table size, index size, and scanning time. Never request all fields with “*” when you don’t need them all; always be precise about which fields you want returned. Avoid joining through several tables when you can directly join the table containing the needed value. This requires thinking through your data structure and join options.
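For example, using Drupal’s query builder (the school content type and its fields mirror the example used later in this post):

// Imprecise: pulls every column of every row.
$result = db_query('SELECT * FROM {node}');

// Precise: request only the columns you will render.
$query = db_select('node', 'n');
$query->fields('n', array('nid', 'title'))
  ->condition('n.type', 'school', '=');
$result = $query->execute();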

Use APIs as Intended

When dealing with APIs that load complete objects, never call them during iteration. Examples of Drupal APIs that load complete objects are node_load(), user_load(), and entity_load(). These APIs are intended for single-item returns and perform extra queries to build the complete object. Especially avoid using these APIs when looping through SQL results just to attach a field during iteration. These inefficient shortcuts are usually written when a developer is short on time.
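A sketch of the anti-pattern, along with a batched alternative using node_load_multiple() for the cases where you genuinely need the full objects:

// Inefficient: one full object load, and its extra queries, per row.
foreach ($result as $record) {
  $node = node_load($record->nid);
  $titles[] = $node->title;
}

// Better: collect the IDs first, then load the objects in one batch.
$nids = array();
foreach ($result as $record) {
  $nids[] = $record->nid;
}
$nodes = node_load_multiple($nids);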

Avoid Joining on Different Column Types

Joining on columns of different types causes indexes to be ignored. If you are trying to join a string column to an integer column, ask yourself why. Not only is the query unable to use the available indexes, but you may have a schema issue that needs addressing. Use EXPLAIN to identify issues with your query.
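For example, in MySQL you can prefix the query with EXPLAIN; a type of ALL or a key of NULL on the joined table signals a full scan instead of an index lookup:

EXPLAIN SELECT n.nid, si.field_school_id_value
FROM node n
LEFT JOIN field_data_field_school_id si ON n.nid = si.entity_id;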

Evaluate Joins

With Drupal designed to store each entity field in its own table, the join count grows in direct proportion to the field count. If you have a content type with 10 fields, you end up with 10 joins, counting a join on the user table for authorship. If you have a query that relates two content types, the join count compounds further. As your join count grows, how will it impact query response time? Let’s take a look at a simple query on the node table in Drupal and add one join at a time.

No Join

SELECT nid FROM node;
200005 rows in set (0.08 sec)

One Join

SELECT n.nid, s.field_school_name_format
FROM node n
LEFT JOIN field_data_field_school_name s ON n.nid = s.entity_id;
200005 rows in set (1.09 sec)

Two Joins

SELECT n.nid, s.field_school_name_format, si.field_school_id_value
FROM node n
LEFT JOIN field_data_field_school_name s ON n.nid = s.entity_id
LEFT JOIN field_data_field_school_id si ON n.nid = si.entity_id;
200005 rows in set (2.03 sec)

Three Joins

SELECT n.nid, s.field_school_name_format, si.field_school_id_value,
  sp.field_school_phone_value
FROM node n
LEFT JOIN field_data_field_school_name s ON n.nid = s.entity_id
LEFT JOIN field_data_field_school_id si ON n.nid = si.entity_id
LEFT JOIN field_data_field_school_phone sp ON n.nid = sp.entity_id;
200005 rows in set (5.03 sec)

The difference between our first query without any joins (0.08 seconds) and our last query with three joins (5.03 seconds) is significant! Our problem is actually larger than it seems, because our school content type has 10 related fields. How can we make this query more efficient?

Note: The ANSI-89 join syntax is optimized the same as the ANSI-92 syntax and will not result in a performance gain. Use the ANSI-92 syntax to separate your WHERE conditions from your join conditions and to make your order of joins explicit.

ANSI-89 example:

SELECT nid, field_school_name_format
FROM node n, field_data_field_school_name s
WHERE n.nid = s.entity_id;

Move Joins from Main Query to Manual Joins on Iteration

Consider moving joins from your main query into separate functions called during iteration. Not only does this reduce the work (index scanning) in your main query, it also limits the data you fetch to your current limit set. If you are limiting your result to a range of 10 out of a total return of 2,000,000, you are being far more precise about what data you need for each range (page). Remember to be Efficient, Precise, and Direct.

When viewing the rendered query of all fields on the school content type, we see a total render time of 7 seconds: 7 seconds to display 24 items out of 200,000. What’s worse, when the user clicks the pagination to view the second page, it takes another 7 seconds to query and render.

Original rendition timed at 7.731 seconds

Note: The table structure is similar to a Praxent state-funded education project, but the data was randomly generated for testing.

Second rendition timed at 0.4192 seconds

In this situation, removing joins from the main query and performing manual joins during iteration reduces the query and render time from 7.731 seconds to 0.4192 seconds, a reduction of roughly 95%, or more than 18 times faster.

In more complex joins, between tables whose relationships are not one-to-one, you will see even greater improvements. Now is a good time for your own thought experiment. Consider writing queries that join across two or three content types, or a query that scans taxonomies and joins content to taxonomy terms. Then consider joining that data across all 200,000 results versus across 24 results.

First rendition, with all table joins and fields defined.

$query = db_select('node', 'n');
$query->leftJoin('field_data_field_school_name', 's', 'n.nid = s.entity_id');
$query->leftJoin('field_data_field_school_address', 'sa', 'n.nid = sa.entity_id');
$query->leftJoin('field_data_field_school_city', 'sc', 'n.nid = sc.entity_id');
$query->leftJoin('field_data_field_school_phone', 'sp', 'n.nid = sp.entity_id');
$query->leftJoin('field_data_field_school_fax', 'sf', 'n.nid = sf.entity_id');
$query->leftJoin('field_data_field_school_id', 'sid', 'n.nid = sid.entity_id');
$query->leftJoin('field_data_field_school_ranking', 'sr', 'n.nid = sr.entity_id');
$query->leftJoin('field_data_field_school_size', 'ss', 'n.nid = ss.entity_id');
$query->leftJoin('field_data_field_school_zip', 'sz', 'n.nid = sz.entity_id');
$query->fields('n', array('nid', 'title'))
  ->fields('sid', array('field_school_id_value'))
  ->fields('s', array('field_school_name_value'))
  ->fields('sa', array('field_school_address_value'))
  ->fields('sc', array('field_school_city_value'))
  ->fields('sp', array('field_school_phone_value'))
  ->fields('sf', array('field_school_fax_value'))
  ->fields('sr', array('field_school_ranking_tid'))
  ->fields('ss', array('field_school_size_tid'))
  ->fields('sz', array('field_school_zip_value'))
  ->condition('n.type', 'school', '=');
$query = $query->extend('PagerDefault')
  ->limit(24);
$result = $query->execute();

Second rendition, with no joins in the main query and manual joins called during iteration.

$query = db_select('node', 'n');
$query->fields('n', array('nid', 'title'))
  ->condition('n.type', 'school', '=');
$query = $query->extend('PagerDefault')
  ->limit(24);
$result = $query->execute();

// Iterate through results.
foreach ($result as $record) {
  $output_rows[] = array(
    $record->nid,
    $record->title,
    example_school_id($record->nid),
    example_school_name($record->nid),
    example_school_address($record->nid),
    example_school_city($record->nid),
    example_school_phone($record->nid),
    example_school_fax($record->nid),
    example_school_rank($record->nid),
    example_school_size($record->nid),
    example_school_zip($record->nid),
  );
}
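A sketch of one of the example_school_* helpers, assuming each performs a single, indexed lookup per row:

function example_school_name($nid) {
  // One small, indexed lookup instead of a join in the main query.
  return db_query(
    'SELECT field_school_name_value FROM {field_data_field_school_name} WHERE entity_id = :nid',
    array(':nid' => $nid)
  )->fetchField();
}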

Write Queries with Dynamic Filters

Don’t let exposed filters force you into writing large queries with several joins. Instead, write dynamic queries that join tables and fields only when the corresponding filter is submitted. Thanks to Drupal’s object-oriented query builder, built on PHP’s PDO, you can write clean, maintainable, dynamic queries.

$query = db_select('node', 'n');
$query->fields('n', array('nid', 'title'));
if (!empty($filter->school_name)) {
  $query->leftJoin('field_data_field_school_name', 's', 'n.nid = s.entity_id');
  $query->fields('s', array('field_school_name_value'));
}
$query->condition('n.type', 'school', '=');
$query = $query->extend('PagerDefault')
  ->limit(24);
$result = $query->execute();

Additional Options

Beyond writing more efficient queries, you can also write queries that connect to databases in parallel, or point queries at secondary databases in a master/slave setup. Separating your joins gives you even more options for parallel and master/slave connections.
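For example, Drupal 7’s database layer lets you request the 'slave' target for a read-only query; it falls back to the default connection when no slave is defined in settings.php:

// Route this read-only query to a slave connection if one exists.
$query = db_select('node', 'n', array('target' => 'slave'));
$query->fields('n', array('nid', 'title'))
  ->condition('n.type', 'school', '=');
$result = $query->execute();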

Denormalized Data in Custom Tables

If you’re still not able to produce the needed results, you may consider storing a version of your data in a denormalized table. Add, update, and delete the denormalized data within a transaction or with update hooks such as hook_node_insert() and hook_node_update(). The downside to this approach is the additional processing and time taken to perform writes. If write performance is a concern, you may consider NoSQL solutions.
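A minimal sketch of the hook approach; the school_denormalized table and its columns are assumptions for illustration:

/**
 * Implements hook_node_update().
 */
function mymodule_node_update($node) {
  if ($node->type == 'school') {
    // Insert or update the flattened row in a single statement.
    db_merge('school_denormalized')
      ->key(array('nid' => $node->nid))
      ->fields(array(
        'title' => $node->title,
        'school_id' => $node->field_school_id[LANGUAGE_NONE][0]['value'],
      ))
      ->execute();
  }
}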

Conclusion

By moving your views into code; writing efficient, precise, and direct queries; avoiding joins that cannot use indexes; using APIs as intended; and performing manual joins, you can greatly improve performance and provide features your clients will actually use. It takes time to write efficient, precise, and direct queries, but the results are the reward.