Implementing Search with django CMS 4

Aug. 30, 2024

In this blog post I'd like to walk through the necessary steps to get started with a custom search index for a fairly typical django CMS website project.

Introduction

Having search (or: being able to search through a website's content) is a common requirement in most, if not all, modern websites. At least that's what I regularly experience when building websites for customers.

However, the type of search is key here. If you are building, say, a customer portal and need search to filter some lists or to narrow down the options of a select box as the user types, that calls for quite a different approach than building a search index for dynamic content on django CMS sites.

Let me explain…

The Easy Path

Let's say you're building a typical CRUD-like app. Maybe you have a Customer model and want to allow the user to filter the list by typing the name, city, category or other field values of this model.

Doing that is straightforward. All you need are Q objects, .filter() and, if you feel super fancy and use PostgreSQL, __search lookups, SearchVector, SearchRank and possibly even lookups such as __trigram_similar. If you don't already know those, the Django documentation covers all of them.
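To make that concrete, here is a minimal sketch of such a filter. The Customer fields (name, city, category) and both helper names are assumptions for illustration only. Django is imported inside the function so that the lookup helper can be tried out without a configured project.

```python
from functools import reduce
from operator import or_

# Hypothetical fields of the Customer model from the example above.
SEARCH_FIELDS = ("name", "city", "category")


def build_lookups(term, fields=SEARCH_FIELDS):
    """Return one case-insensitive 'contains' lookup per field."""
    return [{f"{field}__icontains": term} for field in fields]


def filter_customers(queryset, term):
    """Keep the rows where any of the search fields contains the term."""
    # Imported here so build_lookups() works without Django being set up.
    from django.db.models import Q

    term = term.strip()
    if not term:
        return queryset
    # OR all lookups together: name matches OR city matches OR category matches.
    query = reduce(or_, (Q(**lookup) for lookup in build_lookups(term)))
    return queryset.filter(query)
```

In a view this would be called as filter_customers(Customer.objects.all(), request.GET.get("q", "")).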

This is probably sufficient for >90% of use cases in CRUD apps. However, if we talk about django CMS, things can get a bit more complex…

The Issue

A django CMS site is not just a simple and static collection of lists, forms and some actions between them. django CMS not only allows but encourages us to build a website just like we write our code: decoupled, using components, and allowing reuse with insane amounts of customization for the content creators of your website.

This is actually a good thing: it allows your customers to shape their website in any way they want. You just set up the correct building blocks, style them inside their own little componentized scope, and they drop them into sites wherever they like, using content that fits the use case and context, without any need to call you!

However, this also means that allowing users to search for content is a bit more difficult than querying single model instances and using .filter() calls.

This is simply because the content can be in any shape or form. We can't know what to query for. Even worse, every single plugin can have drastically different structures. We just can't use the builtins of Django like we could for our models in the example above.

The Solution

What we need to do is render the page content. But we can't really do that just in time when the website visitor is searching for stuff, since that would take too long.

We could somehow pre-render the page ahead of time, store the text somewhere and query against it instead. We need something called a search index!
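If the term is new to you, the idea can be sketched in a few lines of plain Python. This is a toy inverted index to illustrate the concept, not how Haystack or its backends work internally; the page texts and function names are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map every token to the set of page URLs whose text contains it."""
    index: dict[str, set[str]] = defaultdict(set)
    for url, text in pages.items():
        for token in tokenize(text):
            index[token].add(url)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the URLs of pages that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical pre-rendered page texts, keyed by URL.
pages = {
    "/about/": "We build websites with django CMS",
    "/blog/search/": "Implementing search with django CMS and Haystack",
}
index = build_index(pages)
```

The expensive part (rendering and tokenizing) happens once, ahead of time. Answering a query is then just a few dictionary lookups.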

Luckily, the Django ecosystem has a package called django-haystack that handles all the nitty-gritty details of maintaining search indexes for us. It allows us to hook into every step along the way, provides management commands and more. Let's see how djangocms-haystack uses it!

Before we get started: have a look at the Django Haystack documentation and maybe even at a simple search index that was already built.

Implementation

By default, we need to define our indexes by extending the provided base class and placing our class inside a file called search_indexes.py. That file also needs to live inside an app that is listed in your INSTALLED_APPS, or it won't get discovered.

If we do that, our class could look something like this:

`search_indexes.py`

from cms.models import PageContent
from django.db.models import Q
from haystack import indexes
from djangocms_haystack.base import BaseIndex


class PageContentIndex(BaseIndex, indexes.Indexable):
    def get_url(self, instance):
        return instance.page.get_absolute_url()

    def get_title(self, instance):
        return instance.title

    def get_description(self, instance):
        return instance.meta_description or ""

    def get_model(self):
        return PageContent


    def get_index_queryset(self, language):
        return PageContent.objects.filter(
            Q(redirect__exact="") | Q(redirect__isnull=True),
            language=language,
        )

As you can see, this only works because we inherit from a custom-built base class in djangocms-haystack. This one is where most of the magic happens and looks something like this:

`base.py`

from typing import Optional

from cms.models import CMSPlugin, Placeholder
from django.conf import settings
from django.contrib.contenttypes.models import ContentType
from django.core.handlers.wsgi import WSGIRequest
from django.db import models
from django.utils import translation
from haystack import indexes

from djangocms_haystack.helpers import get_plugin_index_data, get_request


class BaseIndex(indexes.SearchIndex):
    text = indexes.CharField(document=True, use_template=False)
    language = indexes.CharField()
    description = indexes.CharField(indexed=False, stored=True, null=True)
    url = indexes.CharField(stored=True, indexed=False)
    title = indexes.CharField(stored=True, indexed=False)

    def get_request_instance(self, language: str) -> WSGIRequest:
        return get_request(language)

    def get_language(self) -> Optional[str]:
        index_connection = self._haystack_connection_alias

        # The "default" haystack connection alias
        # is used for the configured default language
        if index_connection == "default":
            return settings.LANGUAGE_CODE

        # If connection_alias is also a key inside
        # the configured languages, use it instead
        if index_connection in [lang[0] for lang in settings.LANGUAGES]:
            return index_connection

        # Block indexing of content to not pollute index
        return None

    def index_queryset(self, using: Optional[str] = None) -> models.QuerySet:
        self._haystack_connection_alias = using
        language = self.get_language()
        if not language:
            # A model class has no .none(); ask its default manager instead
            return self.get_model()._default_manager.none()
        return self.get_index_queryset(language)

    def get_model(self) -> models.Model:
        raise NotImplementedError

    def get_url(self, instance: models.Model) -> str:
        raise NotImplementedError

    def get_title(self, instance: models.Model) -> str:
        raise NotImplementedError

    def get_description(self, instance: models.Model) -> str:
        raise NotImplementedError

    def get_index_queryset(self, language: str) -> models.QuerySet:
        raise NotImplementedError

    def get_plugin_queryset(self, language: str) -> models.QuerySet[CMSPlugin]:
        return CMSPlugin.objects.filter(language=language)

    def get_plugin_search_text(
        self, base_plugin: CMSPlugin, request: WSGIRequest
    ) -> str:
        plugin_content = get_plugin_index_data(base_plugin, request)

        # filter empty items
        filtered_plugin_content = filter(None, plugin_content)

        # concatenate the final index string for the plugin
        return " ".join(filtered_plugin_content)

    def get_placeholders(
        self, instance: models.Model, *args: list, **kwargs: dict
    ) -> models.QuerySet[Placeholder]:
        content_type = ContentType.objects.get_for_model(instance)
        return Placeholder.objects.filter(
            object_id=instance.pk, content_type=content_type
        )

    def get_search_data(
        self, instance: models.Model, language: str, request: WSGIRequest
    ) -> str:
        placeholders = self.get_placeholders(instance)
        if not placeholders:
            return ""

        plugins = self.get_plugin_queryset(language).filter(
            placeholder__in=placeholders
        )
        content = []

        for base_plugin in plugins:
            plugin_text_content = self.get_plugin_search_text(base_plugin, request)
            content.append(plugin_text_content)

        if getattr(instance, "page", None):
            page_meta_description = instance.page.get_meta_description(
                fallback=False, language=language
            )

            if page_meta_description:
                content.append(page_meta_description)

            page_meta_keywords = getattr(instance.page, "get_meta_keywords", None)

            if callable(page_meta_keywords):
                content.append(page_meta_keywords())

        return " ".join(content)

    def prepare_fields(
        self, instance: models.Model, language: str, request: WSGIRequest
    ) -> None:
        self.prepared_data["language"] = language
        self.prepared_data["url"] = self.get_url(instance)
        self.prepared_data["title"] = self.get_title(instance)
        self.prepared_data["description"] = self.get_description(instance)
        self.prepared_data["text"] = (
            f"{self.prepared_data['title']} {self.prepared_data['text']}"
        )

    def prepare(self, instance: models.Model) -> dict:
        current_language = self.get_language()
        if not current_language:
            return super().prepare(instance)
        with translation.override(current_language):
            request = self.get_request_instance(current_language)
            self.prepared_data = super().prepare(instance)
            self.prepared_data["text"] = self.get_search_data(
                instance, current_language, request
            )
            self.prepare_fields(instance, current_language, request)
            return self.prepared_data

As you can see, this class looks quite a bit more complex than the first one. That is so we can easily extend it and reuse it for many other models in our app as well. I'll explain this shortly, but for now I'd quickly like to say that the helper functions are not included here, to avoid bloating the blog post. The functions that actually render and sanitize the plugin content can be found in the djangocms-haystack repository on GitHub.

Now, let me show you the necessary settings that each project using django-haystack needs to provide:

`settings.py`

...

HAYSTACK_CONNECTIONS = {
    "default": {
        "ENGINE": "haystack.backends.whoosh_backend.WhooshEngine",
        "PATH": "whoosh_index/de",
        "TIMEOUT": 60 * 5,
        "INCLUDE_SPELLING": True,
    },
    "en": {
        "ENGINE": "haystack.backends.whoosh_backend.WhooshEngine",
        "PATH": "whoosh_index/en",
        "TIMEOUT": 60 * 5,
        "INCLUDE_SPELLING": True,
    },
}

CMS_LANGUAGES = {
    1: [
        {"code": "de", "name": "German"},
        {"code": "en", "name": "English"},
    ]
}
LANGUAGES = (
    ("de", "German"),
    ("en", "English"),
)
LANGUAGE_CODE = "de"

...

Explanation

I encourage you to set up a simple Django app, install django-haystack and copy the code above to tinker with it on your own. Trying things on your own is always way more powerful than any words I could write.

That being said, let's get some feeling for the code anyway: django-haystack gives us a pretty sane structure that we can implement when inheriting from the indexes.SearchIndex base class. As mentioned above, this is done through another layer of base classes that we custom-built so that we can gain even more control and abstract away more things later if we want to reuse the index for other models.

One other key aspect is the so-called preparation stage of Haystack.

The Haystack docs explain it in great detail, but this is where the prepare method in our custom base class comes into play. If we only indexed a basic model and its fields, a simple index with a model_attr attribute on each field would do. However, we need more sophisticated processing, and this is precisely why we hook into the preparation stage and alter the final prepared_data exactly how we like.

As you can see, we render all our plugins in there (using a helper function that you can look at inside the GitHub repo) and assign the returned values to our text field, which will later get indexed and serves as the primary field that gets searched through. One important thing that I want to emphasize: the rendering itself is not enough. One also needs to sanitize the resulting HTML string properly, not only to have clean text and get better search results, but also to prevent injection attacks when displaying the result set to your users. How much of this you need obviously depends on your use case, but it is important enough to point out.
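To give you an idea of what such a sanitizing step could look like, here is a minimal sketch using only Python's standard library. It is not the helper from djangocms-haystack; the class and function names are assumptions for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML fragment, dropping all tags."""

    # The content of these elements is never meant to be read by visitors.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def get_text(self) -> str:
        # Join the chunks and collapse any run of whitespace into one space.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Turn rendered plugin HTML into plain text for the index."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()
```

Note that text chunks are joined with a space, so inline markup inside a word would split that word in two. The extracted text is plain text, not safe HTML, so it still has to be escaped when you output it (Django templates do this by default).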

Another important bit is the use of our HAYSTACK_CONNECTIONS. As you can see, we define two of them: default and en. This may sound counterintuitive (and it is, really), but we need to have one index with a key name of default. Django Haystack asserts this during application startup. So, to prevent an empty index, we use the default index to store the content in our site's default language. The default language is the one defined in your settings' LANGUAGE_CODE variable.
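If you have more than two languages, writing every connection by hand gets repetitive. One possible way around that is to derive the connections from LANGUAGES. The helper name and the index_root parameter below are my own assumptions; the resulting dictionary is the same as the hand-written settings shown earlier.

```python
LANGUAGE_CODE = "de"
LANGUAGES = (
    ("de", "German"),
    ("en", "English"),
)


def build_haystack_connections(languages, default_language, index_root="whoosh_index"):
    """Create one Whoosh connection per language.

    The default language is stored under the mandatory "default" alias,
    every other language under its own language code.
    """
    connections = {}
    for code, _name in languages:
        alias = "default" if code == default_language else code
        connections[alias] = {
            "ENGINE": "haystack.backends.whoosh_backend.WhooshEngine",
            "PATH": f"{index_root}/{code}",
            "TIMEOUT": 60 * 5,
            "INCLUDE_SPELLING": True,
        }
    return connections


HAYSTACK_CONNECTIONS = build_haystack_connections(LANGUAGES, LANGUAGE_CODE)
```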

The current active index is either passed implicitly when using management commands to maintain our index (more on that in the next section) or can be passed explicitly by us when constructing a SearchQuerySet – again, more on that later.

We can then use the key name of our index as the language code and use it to index content in different languages, depending on which index we are currently updating or rebuilding, or which index we want to search through. You can see that in the get_language method of our base class. We know which index is in use because its alias gets passed to our index_queryset method, where we persist it on the instance as _haystack_connection_alias. get_language then reads self._haystack_connection_alias; it is called in prepare, and the resulting language is passed on to the methods that actually extract the content.

So, with all of that said, let's keep going...

Building the index

There are some options to build the index, but the simplest one is the built-in management command you can invoke with python manage.py rebuild_index.

For all your periodic updates, you can use python manage.py update_index --remove – the --remove flag removes all the objects that are no longer present in the queryset, so this will actually delete pages that were once indexed but are no longer present in your CMS. Like built-in housekeeping, very handy & useful.
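If you want to trigger those periodic updates from a small script (for a cron job, say), a sketch could look like this. It relies on the --using option of Haystack's update_index command, which limits the run to a single connection; the function names and the manage.py path are assumptions.

```python
import subprocess
import sys


def update_index_command(alias=None, remove=True, manage_py="manage.py"):
    """Build the argument list for one update_index run."""
    command = [sys.executable, manage_py, "update_index"]
    if remove:
        # Also delete index entries whose objects no longer exist.
        command.append("--remove")
    if alias:
        # Only touch the given connection, e.g. "default" or "en".
        command.append(f"--using={alias}")
    return command


def run_update(alias=None):
    """Run the command and raise if it exits with an error."""
    return subprocess.run(update_index_command(alias), check=True)
```

Calling run_update() without an alias updates all connections, which is what the plain management command does as well.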

Depending on how you set up your HAYSTACK_CONNECTIONS, you will then find an index created for each connection. If you are using Whoosh (which is used in the example), you will see that it creates one index directory for each connection, at the configured PATH. In our case that means one directory for each language, since we use the connections as an abstraction over our languages.

Querying the index

Once the index is ready, it's time to query the data! This can be done by using Haystack's SearchQuerySet API. It is pretty straightforward and won't need much explanation. You can read about it in the docs.

Displaying the results

This is by far the easiest part. Write a view that accepts a search query, and maybe some optional filters or other search criteria depending on how your index is built, and then construct the necessary SearchQuerySet. This is also the place where you can pick the index to search by passing your language code as the using parameter to the SearchQuerySet, like this: qs = SearchQuerySet(using="en"). If you don't want to build the queries against the index yourself, feel free to use the ModelSearchForm that is included in Haystack. Then pass the results into your template, build the HTML layout and style it accordingly. That's it. How your search looks and feels is entirely up to you and depends on the project you're working on.
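One detail is easy to miss with the setup from this post: content in the default language lives in the default connection, not in one named after its language code. Here is a minimal sketch of a view that takes this into account. The helper name, the q parameter, the limit of 20 results and the template path are all assumptions. The Django and Haystack imports sit inside the view so the helper can be tried on its own.

```python
def connection_alias_for(language, default_language, configured_aliases):
    """Map a language code to the Haystack connection that stores it."""
    if language == default_language:
        return "default"
    if language in configured_aliases:
        return language
    # No index exists for this language.
    return None


def search_view(request):
    # Imported here so connection_alias_for() works without Django set up.
    from django.conf import settings
    from django.shortcuts import render
    from django.utils.translation import get_language
    from haystack.query import SearchQuerySet

    query = request.GET.get("q", "").strip()
    alias = connection_alias_for(
        get_language(), settings.LANGUAGE_CODE, settings.HAYSTACK_CONNECTIONS
    )
    results = []
    if query and alias:
        # auto_query() searches the document field ("text" in our index).
        results = SearchQuerySet(using=alias).auto_query(query)[:20]
    context = {"query": query, "results": results}
    # Hypothetical template path.
    return render(request, "search/results.html", context)
```

In the template, each result exposes the stored fields of our index, so result.title, result.url and result.description are available for rendering.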

What's next?

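So, now you should have all the necessary building blocks for a basic search. Now the real fun begins, and you can tweak it however you like. One of my favourite features is the built-in highlighting. You can add a .highlight() call to your SearchQuerySet and, if the backend supports it, each result will carry a highlighted version of the matched text, with the matches wrapped in HTML markup. Haystack's {% highlight %} template tag does something similar at render time and wraps each match in a span with a customizable class name.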

You can then style the found words differently and add some focus to the search results.
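To show what that kind of output looks like, here is a small stand-alone sketch of a highlighter. It is not Haystack's implementation, just an illustration of the idea; the function name and the default class name are assumptions. It escapes everything it outputs, which ties in with the sanitizing advice from earlier.

```python
import html
import re


def highlight(text: str, query: str, css_class: str = "highlighted") -> str:
    """Wrap every occurrence of a query word in a span and escape the rest."""
    words = [re.escape(word) for word in query.split()]
    if not words:
        return html.escape(text)

    # One capturing group: re.split() then returns the matches at odd positions.
    pattern = re.compile("(" + "|".join(words) + ")", re.IGNORECASE)
    parts = pattern.split(text)

    output = []
    for position, part in enumerate(parts):
        escaped = html.escape(part)
        if position % 2 == 1:
            output.append(f'<span class="{html.escape(css_class)}">{escaped}</span>')
        else:
            output.append(escaped)
    return "".join(output)
```

The returned string is already escaped HTML, so in a Django template it would have to be marked as safe before output.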

Another great feature is autocomplete. This makes your search more lenient, since you don't need full-text hits to return results that match the query. It uses ngrams to match your query against the content and should, in theory, catch small typos and still return the correct results. If you've ever worked with something like the PostgreSQL trigram_similar lookup, you may know that this approach has its downsides: it works in many cases and fails in others. So, at least in my experience, getting it to yield reliable results can be quite frustrating. I may have done it wrong, though, so have a look at it!
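To illustrate why ngram matching can feel unreliable, here is a tiny sketch of trigram similarity in plain Python. It mimics the general idea behind PostgreSQL's trigram matching (including the padding with spaces), but it is neither that implementation nor Haystack's autocomplete; the function names are made up.

```python
def trigrams(word: str) -> set[str]:
    """Split a word into overlapping three-character chunks."""
    # Pad with two leading spaces and one trailing space, so that the
    # beginning and the end of the word produce trigrams of their own.
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(first: str, second: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both words."""
    a, b = trigrams(first), trigrams(second)
    return len(a & b) / len(a | b)


# A dropped letter still scores reasonably well ...
close_typo = similarity("search", "serch")
# ... but two swapped letters destroy most of the shared trigrams.
swapped_typo = similarity("search", "serach")
```

With a cut-off of 0.3 (the default similarity threshold of PostgreSQL's pg_trgm extension), "serch" would still match "search", while "serach" would not, even though both are typos a human reads past without noticing.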

If you really want to step up the game, you can even boost single words or whole documents. Maybe you're building something for content creators, or something like Etsy or eBay, and want them to be able to highlight their work in exchange for a small fee. You could boost their products using this feature and give their items a bit more relevance when someone searches for something similar.

Keep building, have fun & thank you so much for reading.

About Filip Weidemann

Filip Weidemann is the author of djangocms-haystack.

djangocms-haystack is a Django application that provides a ready-to-use Haystack search index for django CMS very similar to the one discussed here. It simplifies the integration of search functionality within django CMS projects by providing the base class that can also be used to add search to other django CMS apps, their placeholders and plugins.

PyPI: https://pypi.org/project/djangocms-haystack/
GitHub: https://github.com/Lfd4/djangocms-haystack