PicoSearch - TF-IDF in 50 lines
For a long time I wanted to implement simple yet effective search using the TF-IDF1 algorithm. I finally got around to it, and it's a lot simpler than I thought. The algorithm is simple and the implementation is straightforward.
To get the line count to 50 I cheated a little bit by adding intermediate variables, comments and whitespace. This doubled the line count exactly from 25 to 50 😄
Source code
var documents = new[]
{
"i like apples",
"i like pears",
"i like fruit like oranges",
"i hate bananas"
};
// build index
var index =
from document in documents
// term frequency - number of times a term appears in a document
let terms = System.Text.RegularExpressions.Regex.Split(document, @"\W+")
let termFrequencies = terms
.GroupBy(t => t)
.Select(g => (term: g.Key, frequency: (double)g.Count() / terms.Length))
// inverse document frequency - log of the ratio of the *number of documents* to the *number of documents containing the term*
let uniqueTerms = terms.Distinct()
let inverseDocumentFrequencies =
from term in uniqueTerms
let documentCount = documents.Count(d => d.Contains(term))
select (term, score: Math.Log((double)documents.Length / documentCount))
// calculate tf*idf
let scores =
from tf in termFrequencies
join idf in inverseDocumentFrequencies
on tf.term equals idf.term
select (tf.term, score: tf.frequency * idf.score)
// document scores per term
select (document, scores);
// search index
var results =
from result in index
from scores in result.scores
where scores.term == "like"
orderby scores.score descending
select (result.document, scores.score);
results.Select((r, i) => $"{1 + i}. {r.document,-25} {r.score:F5}")
.ToList()
.ForEach(Console.WriteLine);
Output for term ‘like’:
Rank | Document | Score |
---|---|---|
#1 | i like fruit like oranges | 0.11507 |
#2 | i like apples | 0.09589 |
#3 | i like pears | 0.09589 |
Performance
Performance is good enough for small corpora, but it's not the most efficient implementation. It's a good starting point for understanding the algorithm though.