Journey of Building a Code Search Engine

I remember, two years ago, reading a Hacker News post about grep.app, an awesome code search engine that I use a lot in my daily work.

I was so curious about how it works behind the scenes, but there is not enough information about it; the only thing I know comes from the author's post here.

So all I know is that it uses Solr for indexing the source code; the rest is unknown. But that was enough for me to start researching how to make it work. I have experience with Elasticsearch, and since Solr is built on Lucene just like Elasticsearch, it should be easy to learn.

Installing Apache Solr

Let's start by installing Solr. After a few days I came up with this script for installing Apache Solr.

#!/bin/bash

set -e

BUILD_FOLDER="$(pwd)/build"
SOLR_BUILD_FOLDER="$BUILD_FOLDER/solr"
SOLR_VERSION="8.11.0"
SOLR_DOWNLOAD_URL="https://dlcdn.apache.org/lucene/solr/$SOLR_VERSION/solr-$SOLR_VERSION.tgz"

# Create the build folders on the first run
if ! test -d "$SOLR_BUILD_FOLDER"; then
  mkdir -p "$SOLR_BUILD_FOLDER"
fi

curl -L "$SOLR_DOWNLOAD_URL" -o "$SOLR_BUILD_FOLDER/$SOLR_VERSION.tgz"
tar zxvf "$SOLR_BUILD_FOLDER/$SOLR_VERSION.tgz"
mv "./solr-$SOLR_VERSION" "$SOLR_BUILD_FOLDER"

# Install Java if it is not there yet; Solr needs a JDK to run
if ! command -v java > /dev/null; then
  sudo add-apt-repository ppa:openjdk-r/ppa
  sudo apt-get update
  sudo apt install openjdk-11-jdk -y
fi

And starting the Solr server is just as easy; run this script (with $SOLR_PORT set to whichever port you want Solr to listen on, e.g. 8983).

sudo bash "$SOLR_BUILD_FOLDER/solr-$SOLR_VERSION/bin/solr" start -p $SOLR_PORT -force
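
One thing the scripts above don't cover: Solr also needs a core to hold the documents before we can index anything. Assuming a default single-node setup, creating a core named heline (the name I'll use in all the API calls below) can look like this.

sudo bash "$SOLR_BUILD_FOLDER/solr-$SOLR_VERSION/bin/solr" create -c heline -p $SOLR_PORT -force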

At this point we are ready to start hacking on Solr and indexing the source code. But after reading the documentation, setting up the indexing schema using XML looked so frustrating.

So I took a break for about a week.

Defining the Schema

After reading the documentation again, I found that Apache Solr (I was on 8.11) lets us use the Schema API to create the schema instead of editing XML. Here is an example of creating the schema using the REST API.

curl --request POST \
    --url "$SOLR_BASE_URL/solr/heline/schema" \
    --header 'Content-type: application/json' \
    --data '{
    "add-field": [
      {
        "name": "branch",
        "type": "string",
        "stored": true
      },
      {
        "name": "path",
        "type": "text_general",
        "stored": true
      },
      {
        "name": "file_id",
        "type": "string",
        "stored": true
      },
      {
        "name": "owner_id",
        "type": "string",
        "stored": true
      },
      {
        "name": "lang",
        "type": "string",
        "stored": true
      },
      {
        "name": "repo",
        "type": "string",
        "stored": true
      },
      {
        "name": "content",
        "type": "text_general",
        "stored": true,
        "indexed": true
      }
    ]
  }'
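
To double-check that the fields actually landed, the same Schema API can read the definitions back; for example:

curl "$SOLR_BASE_URL/solr/heline/schema/fields"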

But there is a catch: I want the search results displayed as an HTML component with nice lines, and storing plain text in Solr doesn't get us there. If we store code that is already highlighted as HTML, though, we can get nicely highlighted results with line numbers.

Scraping GitHub

My first attempt was to scrape the GitHub page, where the code is already highlighted. Luckily I have some experience scraping with Go. I worked on the scraper for about a week.
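
The scraper itself is Go code that I won't reproduce here, but the idea can be sketched in shell. Everything below is illustrative: the file URL is a placeholder, and the blob-code CSS class is what GitHub used for highlighted lines at the time, which can change without notice.

# Fetch the rendered (already highlighted) page for one file
curl -s "https://github.com/golang/go/blob/master/README.md" -o page.html

# Each line of highlighted source lived in a <td class="blob-code ..."> cell
grep -oP '<td[^>]*class="blob-code[^"]*".*?</td>' page.html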

I hit an issue here: scraping source code from GitHub works fine when the file is under 5k characters, but beyond that, storing the text in Solr slows down noticeably.

So to handle this scenario I tweaked the schema, changing the content field to be multi-valued (an array), so each document stores the source as a list of chunks. The field definition is updated like this.

...

    "add-field": {
      "name": "content",
      "type": "text_general",
      "multiValued": true,
      "stored": true,
      "indexed": true
    }
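
With content multi-valued, indexing a file is just a matter of posting the chunks as a JSON array to Solr's standard /update endpoint. A minimal sketch (the id scheme and field values here are made up for illustration):

curl --request POST \
    --url "$SOLR_BASE_URL/solr/heline/update?commit=true" \
    --header 'Content-type: application/json' \
    --data '[
    {
      "id": "github.com/example/repo/main/main.go",
      "repo": "example/repo",
      "branch": "main",
      "path": "main.go",
      "lang": "go",
      "content": [
        "<first chunk of highlighted HTML, under 5k characters>",
        "<second chunk>",
        "<and so on>"
      ]
    }
  ]'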

So, problem solved: now we can store the source code as chunks, each under 5k characters. But we have another problem: the text stored in Solr is an HTML string, which becomes an issue when the source code we are storing is itself HTML, and the search results turn into a mess.

Creating a Type

The search results are a mess right now because we use the text_general type on the content field. That type is meant for indexing normal text, not HTML strings. To fix the issue we need to strip the HTML tags at indexing time and index only the source code itself, while still getting the original HTML string back in the search results.

Luckily we can easily create our own field type, which we will name text_html. Fun fact: I spent about a week finding the best combination of char filter, tokenizer, and filters for this type. It is not perfect, but it is good enough to index source code.

curl --request POST \
    --url "$SOLR_BASE_URL/solr/heline/schema" \
    --header 'Accept: application/json' \
    --header 'Content-type: application/json' \
    --data '{
    "add-field-type": {
      "name": "text_html",
      "class": "solr.TextField",
      "positionIncrementGap": "100",
      "autoGeneratePhraseQueries": "true",
      "indexAnalyzer": {
        "charFilters": [
          {
            "class": "solr.HTMLStripCharFilterFactory"
          }
        ],
        "tokenizer": {
          "class": "solr.NGramTokenizerFactory"
        },
        "filters": [
          {
            "class": "solr.WordDelimiterFilterFactory"
          },
          {
            "class": "solr.LowerCaseFilterFactory"
          },
          {
            "class": "solr.ASCIIFoldingFilterFactory"
          }
        ]
      },
      "queryAnalyzer": {
        "tokenizer": {
          "class": "solr.WhitespaceTokenizerFactory",
          "rule": "java"
        },
        "filters": [
          {
            "class": "solr.WordDelimiterFilterFactory"
          },
          {
            "class": "solr.LowerCaseFilterFactory"
          },
          {
            "class": "solr.ASCIIFoldingFilterFactory"
          }
        ]
      }
    }
  }'
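
Finally, the content field has to be pointed at the new type. Since the field already exists, the Schema API wants replace-field rather than add-field; a sketch of that call:

curl --request POST \
    --url "$SOLR_BASE_URL/solr/heline/schema" \
    --header 'Content-type: application/json' \
    --data '{
    "replace-field": {
      "name": "content",
      "type": "text_html",
      "multiValued": true,
      "stored": true,
      "indexed": true
    }
  }'

With that, queries against content match the stripped source text, while the stored value is still the original HTML string, ready to render with its highlighting and line numbers.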