Indexing

Estimated reading: 4 minutes

You often want to search for specific words or a sentence in potentially large documents or data content. As a normal search of the entire text becomes increasingly slower with the size/number of texts, the texts can be converted into a search-friendly form in advance if indexing is activated. This calculation can be carried out every night, for example, to achieve a considerable speed advantage in the search the following day.

Activation

Indexing is only possible for documents and long fields. For a long field, the “Indexing” option must also be activated in the table. A one-off or periodic execution can be started in the system jobs.

Functionality

Conversion to text

Text is first extracted from files. The pdftotext tool is used for Pdf files. For text files, the html2text tool is used if the use_html2text environment variable is activated. The Apache Tika tool can optionally be used for other files. To do this, indize_tika must be activated in the environment variables, Java must be installed and the file /opt/tika/tika.jar must exist.

Limbas indexing

From the resulting text, each word is inserted once into the lmb_indize_w table. If the word originates from a file, this context is stored in the lmb_indize_fs table. If, on the other hand, a long field is currently being indexed, the word is linked to the corresponding data record in the lmb_indize_ds table.

If you now search for a word, you can easily find out where it occurs using the tables.

Postgres indexing

Many databases now offer self-indexing. These are usually well optimized so that the full performance of the database can be used. This also supports additional functionality, such as searching for an entire record. Indexing of the PostgreSQL database has already been integrated into Limbas.

Indexing

Postgres indexing can be configured in the environment variables. To activate Postgres indexing (and deactivate Limbas indexing), the option postgres_use_fulltextsearch must be set.

To keep the search effort to a minimum, texts are broken down into an alternative form. For example, stop words are removed or nouns are reduced to their word stem. The position of the words is also saved. The sentence “The mouse is in the house” becomes “‘house’:5 ‘mouse’:2”. This procedure depends on the language of the text, which can be set in postgres_indize_lang. If all texts are in the same language, this can be entered directly (e.g. “german”). If you want the database to decide for itself which language it is, you can either enter several languages (e.g. “german,english,french”) or “all” (for all supported languages). The more languages you enter, the longer the indexing will take.

Saving the word positions is important if you want to search for a whole sentence, but it also takes up more memory in the database. If you only want to search for words anyway, you can prevent the positions from being saved with the postgres_strip_tsvector option.

Search performance

If the search is too slow, it can be accelerated with a database index.

For files:

CREATE INDEX <index-name> ON LDMS_FILES USING GIN (FTS_VECTOR);

For long fields:

CREATE INDEX <index-name> ON <table-name> USING GIN (<field-name>_FTS_VECTOR);

Result

If you search in several data records, the result can be sorted according to how well the data records found match the search query by setting the postgres_rank_func setting to “TS_RANK” or “TS_RANK_CD”. The difference is described in the Postgres documentation and is not explained further here. The setting can be left blank to avoid sorting the result and thus speed up the query.

If the search term was found, you want to see a section of the text around the term found. If the postgres_headline option is activated, you will get better results with a loss of speed.

Indexing

Activation

Functionality

Conversion to text

Limbas indexing

Postgres indexing

Indexing

Search performance

Result

Indexing

CONTENTS