Searching and Indexing

Full Text Searching (or just text search) provides the capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query. The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query. Notions of query and similarity are very flexible and depend on the specific application. The simplest search considers query as a set of words and similarity as the frequency of query words in the document.

Summary

Things to Remember

Indexes often lead to the desired information more quickly. With searching, a results set is presented and each result must be viewed to see if it is the information or a red herring or false positive.

MCQs

No MCQs found.

Subjective Questions

No subjective questions found.

Videos

No videos found.

Searching and Indexing

Full-Text Search

Full-text search refers to the searching techniques for a single computer - stored document or collection in a full-text database. It is a searching for a group of keywords in a pile of texts like document or query.

In the database, data used for fulltext search can be huge. Full text search is searching for

a set of keywords in test field of a database table
indexing words and associating indexed words with documents

Lets take a example, with steps in full text search in PostgreSQL

Creating Tokens – Parsing document into set of tokens like numbers, words, complex words, email addresses.
Creating Lexemes – Normalization: Dictionary controls this.
- Removal of suffixes – converts variants into a single form (worry, worries, worried, etc.)
- Conversion to lower case
- Remove stop words – common words useless for searching (the, at etc.)
Storing preprocessed documents – Storing documents and creating indexes over them for faster search
Relevance ranking

Features of full text search:

Search similar words(No linguistic support)
Ranking of search results
Searches substrings
- Indexes does not support substring search
- LIKE operator doesn’t use INDEX when preceded by %.
- Low performance when compared to full text search using GIN (Generalized Inverted Index) and GiST (Generalized Search Tree)
Accuracy issue Eg. LIKE %one% matches prone, money, lonely

Indexing

Indexing is the initial part of all search applications.

Its goal is to process the original data into a highly efficient cross-reference lookup in order to facilitate rapid searching.

The job is simple when the content is already textual in nature and its location is known.

Steps

acquiring the content.
- This process gathers and scopes the content that needs to be indexed.
build documents
- The raw content that needs to be indexed has to be translated into the units (usually called documents) used by the search application.
document analysis
- The textual fields in a document cannot be indexed directly. Rather, the text has to be broken into a series of individual atomic elements called tokens.
- This happens during the document analysis step. Each token corresponds roughly to a word in the language, and the analyzer determines how the textual fields in the document are divided into a series of tokens.
index the document
- The final step is to index the document. During the indexing step, the document is added to the index.

Indexing methods:

Inverted index

It is the most used type. This is where words are mapped to their location in a document. It's a “sparse matrix” because it doesn't list all of the words in each document.A “ fully inverted index ” is when words are not only mapped to their location in the documents, but also the location of each occurrence of the word is also mapped.

Term document matrix

Latent semantic indexing

It is the analysis of the hidden meaning of words and how often they occur in a document. It can infer meaning from words which isn't obvious. It can put together documents that are not obviously created. It can do this because it creates a “latent semantic space".

It uses lots of vectors and creates a “term document matrix” from all the documents it has. Then 3 matrices are created using SVD (“singular value decomposition”) Of these 3 vectors, the 2 nd contains the singular values of the original matrix in a diagonal matrix Sets of documents are represented as d-dimensional vectors Using the cosine of the angle between these vectors, there is now an easy-to-calculate similarity measure between any two sets of terms and/or documents.

Probabilistic LSI (PLSI)

More rubost
Genarative data model
Use EM algo(Expectation maximization to avoid overfitting)
Far more flexible

Searching

Searching provides the result that best match the criteria (specified by the user).

Steps:

clean the user input
create a query
qurey the index
return the result

False Positive Problem

Free text searching could retrive documents which are no relevent to intended search query. Such documents are called false positives. This irrelevent retrival is mainly caused by inherent ambiguity of natural language. Clustering techniques based on Bayesian algo can reduce false positive.

References:

"Full Text Search", chapter 12 at www.postgresql.org/docs/9.5/static
"Pro Full-Text Search in SQL Server", Coles, Michael . 2008 (Version 1 ed.)

Lesson

Searching and Indexing Big Data

Subject

Computer Engineering

Grade

Engineering

Recent Notes

No recent notes.

Related Notes

No related notes.