PicoSearch - TF-IDF in 50 lines

Tuesday, December 17, 2024 5:43 PM

Introduction

For a long time I wanted to implement simple yet effective search using the TF-IDF¹ algorithm. I finally got around to it, and it's a lot simpler than I thought. The algorithm is simple and the implementation is straightforward.

To get the line count to 50 I cheated a little bit by adding intermediate variables, comments and whitespace. This doubled the line count exactly from 25 to 50 😄

Source code

var documents = new[]
{
    "i like apples",
    "i like pears",
    "i like fruit like oranges",
    "i hate bananas"
};


// build index

var index =
    from document in documents

    // term frequency - number of times a term appears in a document
    let terms = System.Text.RegularExpressions.Regex.Split(document, @"\W+")
    let termFrequencies = terms
        .GroupBy(t => t)
        .Select(g => (term: g.Key, frequency: (double)g.Count() / terms.Length))

    // inverse document frequency - log of the ratio of the *number of documents* to the *number of documents containing the term*
    let uniqueTerms = terms.Distinct()
    let inverseDocumentFrequencies =
        from term in uniqueTerms
        let documentCount = documents.Count(d => d.Contains(term))
        select (term, score: Math.Log((double)documents.Length / documentCount))

    // calculate tf*idf
    let scores =
        from tf in termFrequencies
        join idf in inverseDocumentFrequencies
            on tf.term equals idf.term
        select (tf.term, score: tf.frequency * idf.score)

    // document scores per term
    select (document, scores);


// search index

var results =
    from result in index
    from scores in result.scores
    where scores.term == "like"
    orderby scores.score descending
    select (result.document, scores.score);

results.Select((r, i) => $"{1 + i}. {r.document,-25} {r.score:F5}")
    .ToList()
    .ForEach(Console.WriteLine);

Output for term ‘like’:

Rank	Document	Score
#1	i like fruit like oranges	0.11507
#2	i like apples	0.09589
#3	i like pears	0.09589

Performance

Performance is good enough for small corpora, but it's not the most efficient implementation. It's a good starting point for understanding the algorithm though.

Files

Sources

Wikipedia: TF-IDF ↩

Introduction

Source code

Performance

Files

Sources

Related