PicoSearch - TF-IDF in 50 lines
Introduction
For a long time I wanted to implement simple yet effective search using the TF-IDF1 algorithm. I finally got around to it, and it's a lot simpler than I thought. The algorithm is simple and the implementation is straightforward.
To get the line count to 50 I cheated a little bit by adding intermediate variables, comments and whitespace. This doubled the line count exactly from 25 to 50 😄
Source code
var documents = new[]{"i like apples","i like pears","i like fruit like oranges","i hate bananas"};// build indexvar index =from document in documents// term frequency - number of times a term appears in a documentlet terms = System.Text.RegularExpressions.Regex.Split(document, @"\W+")let termFrequencies = terms.GroupBy(t => t).Select(g => (term: g.Key, frequency: (double)g.Count() / terms.Length))// inverse document frequency - log of the ratio of the *number of documents* to the *number of documents containing the term*let uniqueTerms = terms.Distinct()let inverseDocumentFrequencies =from term in uniqueTermslet documentCount = documents.Count(d => d.Contains(term))select (term, score: Math.Log((double)documents.Length / documentCount))// calculate tf*idflet scores =from tf in termFrequenciesjoin idf in inverseDocumentFrequencieson tf.term equals idf.termselect (tf.term, score: tf.frequency * idf.score)// document scores per termselect (document, scores);// search indexvar results =from result in indexfrom scores in result.scoreswhere scores.term == "like"orderby scores.score descendingselect (result.document, scores.score);results.Select((r, i) => $"{1 + i}. {r.document,-25} {r.score:F5}").ToList().ForEach(Console.WriteLine);
Output for term ‘like’:
| Rank | Document | Score |
|---|---|---|
| #1 | i like fruit like oranges | 0.11507 |
| #2 | i like apples | 0.09589 |
| #3 | i like pears | 0.09589 |
Performance
Performance is good enough for small corpora, but it's not the most efficient implementation. It's a good starting point for understanding the algorithm though.

