Dictionaries and Web Indexer Part 1

Dictionaries Overview

Dictionaries are a table or map of key-value pairs. Each key maps to its corresponding value.
Lists are indexed by positions: L[0], L[1], L[2], etc.
Dictionaries are indexed by keys: D[key] returns the value mapped to key.

Lists vs. Dictionaries

If you have a sequence of 100 elements (e.g., the top 100 baseball players), a list is useful because you can access an element by position (players[i]).

Dictionaries are better when order doesn’t matter. For example, storing which players belong to which teams would be easier with a dictionary:

teams = {
    "Yankees": ["Player A", "Player B"],
    "Red Sox": ["Player C", "Player D"]
}

  The general approach to using dictionaries when solving a problem:
      What do I know? (keys)
      What do I want to know? (values)
  Example: If you have ages and want to find who has that age, the dictionary would be structured as:

  ages = {
      20: ["Alice", "Bob"],
      25: ["Charlie"]
  }

  Lists as values: Since dictionary keys must be unique, we use lists when multiple values are needed.

Web Indexing Problem Why Build an Index?

The goal of an index is to quickly find which sites contain specific words.
What do we know? → The words we care about.
What do we want to know? → The sites that contain those words.

Dictionary Structure for Indexing

The keys should be words.
The values should be a list of websites that contain those words.

index = {
    "python": ["site1.com", "site2.com"],
    "coding": ["site3.com"]
}

Textbook Analogy

Think of a textbook index: it helps locate which pages contain specific words.
Similarly, a web index helps locate which websites contain certain words.

Implementation Details Function: index_site

Arguments:
    site: A string (URL of the site).
    text: A single string containing all the text from the site.
This function indexes only one site at a time (like adding a new page to a textbook).
Use .split() to extract words from text.
Convert text to lowercase to avoid duplicate entries due to case differences.
Do not print or request input inside this function!

Function: search_single_word

Should return a list of sites that contain the given word.
If no sites are found, return an empty list, not a printed message.

Key Considerations

Avoid duplicate sites in results.
Ensure case insensitivity by lowercasing the text.
Test your search functionality and refine details.

Web Indexer - Part 2

This project enhances a basic web indexer by tracking word frequencies and improving search efficiency.

Data Structure for Indexing

Instead of storing words as keys mapping to a list of sites, we modify the structure to store word → {site: frequency} mappings:

index = {
    "mit": {"web.mit.edu": 3, "csail.mit.edu": 2},
    "google": {"web.mit.edu": 1, "google.com": 3}
}

Step 1: Implementing index_site

This function processes the text of a given site and updates the index accordingly.

import re

def index_site(index, site, content):
    words = re.findall(r'\w+', content.lower())  # Extract words and convert to lowercase
    for word in words:
        if word not in index:
            index[word] = {site: 1}  # Case 1: New word
        else:
            if site in index[word]:
                index[word][site] += 1  # Case 3: Word exists, site exists → Increment count
            else:
                index[word][site] = 1  # Case 2: Word exists, site doesn't → Add site with count 1

Step 2: Implementing search_single_word

This function retrieves a sorted list of (site, frequency) tuples for a given word.

def search_single_word(index, word):
    if word in index:
        result = list(index[word].items())  # Convert site-frequency dictionary to a list
        result.sort(key=lambda x: x[1], reverse=True)  # Sort by frequency (highest first)
        return result
    return []

Example Usage:

index = {
    "mit": {"web.mit.edu": 3, "csail.mit.edu": 2},
    "google": {"web.mit.edu": 1, "google.com": 3}
}

print(search_single_word(index, "mit"))
# Output: [('web.mit.edu', 3), ('csail.mit.edu', 2)]

Step 3: Implementing search_multiple_words

This function searches for multiple words and aggregates their frequencies across sites.

def search_multiple_words(index, words):
    site_freq = {}

    for word in words:
        for site, freq in search_single_word(index, word):
            if site in site_freq:
                site_freq[site] += freq  # Combine frequencies across words
            else:
                site_freq[site] = freq

    result = list(site_freq.items())
    result.sort(key=lambda x: x[1], reverse=True)  # Sort by frequency
    return result

Example Usage:

print(search_multiple_words(index, ["mit", "google"]))
# Output: [('web.mit.edu', 4), ('google.com', 3), ('csail.mit.edu', 2)]

python data structures

Dictionaries and Web Indexer Part 1

Dictionaries Overview

Lists vs. Dictionaries

Web Indexer - Part 2

Data Structure for Indexing

Graded Quiz