python data structures

Dictionaries and Web Indexer Part 1

Dictionaries Overview

Lists vs. Dictionaries

Web Indexing Problem Why Build an Index?

The goal of an index is to quickly find which sites contain specific words.
What do we know? → The words we care about.
What do we want to know? → The sites that contain those words.

Dictionary Structure for Indexing

The keys should be words.
The values should be a list of websites that contain those words.

index = {
    "python": ["site1.com", "site2.com"],
    "coding": ["site3.com"]
}

Textbook Analogy

Think of a textbook index: it helps locate which pages contain specific words.
Similarly, a web index helps locate which websites contain certain words.

Implementation Details Function: index_site

Arguments:
    site: A string (URL of the site).
    text: A single string containing all the text from the site.
This function indexes only one site at a time (like adding a new page to a textbook).
Use .split() to extract words from text.
Convert text to lowercase to avoid duplicate entries due to case differences.
Do not print or request input inside this function!

Function: search_single_word

Should return a list of sites that contain the given word.
If no sites are found, return an empty list, not a printed message.

Key Considerations

Avoid duplicate sites in results.
Ensure case insensitivity by lowercasing the text.
Test your search functionality and refine details.

Web Indexer - Part 2

This project enhances a basic web indexer by tracking word frequencies and improving search efficiency.

Data Structure for Indexing

Instead of storing words as keys mapping to a list of sites, we modify the structure to store word → {site: frequency} mappings:

index = {
    "mit": {"web.mit.edu": 3, "csail.mit.edu": 2},
    "google": {"web.mit.edu": 1, "google.com": 3}
}

Step 1: Implementing index_site

This function processes the text of a given site and updates the index accordingly.

import re

def index_site(index, site, content):
    words = re.findall(r'\w+', content.lower())  # Extract words and convert to lowercase
    for word in words:
        if word not in index:
            index[word] = {site: 1}  # Case 1: New word
        else:
            if site in index[word]:
                index[word][site] += 1  # Case 3: Word exists, site exists → Increment count
            else:
                index[word][site] = 1  # Case 2: Word exists, site doesn't → Add site with count 1

Step 2: Implementing search_single_word

This function retrieves a sorted list of (site, frequency) tuples for a given word.

def search_single_word(index, word):
    if word in index:
        result = list(index[word].items())  # Convert site-frequency dictionary to a list
        result.sort(key=lambda x: x[1], reverse=True)  # Sort by frequency (highest first)
        return result
    return []

Example Usage:

index = {
    "mit": {"web.mit.edu": 3, "csail.mit.edu": 2},
    "google": {"web.mit.edu": 1, "google.com": 3}
}

print(search_single_word(index, "mit"))
# Output: [('web.mit.edu', 3), ('csail.mit.edu', 2)]

Step 3: Implementing search_multiple_words

This function searches for multiple words and aggregates their frequencies across sites.

def search_multiple_words(index, words):
    site_freq = {}

    for word in words:
        for site, freq in search_single_word(index, word):
            if site in site_freq:
                site_freq[site] += freq  # Combine frequencies across words
            else:
                site_freq[site] = freq

    result = list(site_freq.items())
    result.sort(key=lambda x: x[1], reverse=True)  # Sort by frequency
    return result

Example Usage:

print(search_multiple_words(index, ["mit", "google"]))
# Output: [('web.mit.edu', 4), ('google.com', 3), ('csail.mit.edu', 2)]

Quiz

To mark this module as complete, you must finish this quiz. Once submitted, you'll need to wait 2 hours before attempting it again.