Tags:
create new tag
view all tags

Project 1

Shakespeare Lookup with Inverted Index
Due: Feb 3 - 11:55 pm

Moodle Link

Invertedindex.png shake.gif

Goal

The goal of this program is to write a program to build an inverted index (using the c++ map STL) of every word in the complete works of Shakespeare. This will allow words to be searched for, and then to print out all of the lines that have that word.

A web based example of what your program will produce is here: Shakespeare Lookup Example

Your program for this assignment will not be on a web page, only a console program. The web version will be part of Project 2. The program must run on the cslab.kenyon.edu server.

Methods

The file Shakespeare.txt is located on cslab.kenyon.edu in the directory /home/class/SoftDev/Shakespeare/ .You will need to write a program to read through ALL the words in the and create in inverted index of the locations of each of the words in a map structure. Then when someone searches for a word, the index for this word can be looked up, and ALL the lines with the word can be printed.

You can download these file to do development on your own coputer, but for credit you must upload the program and make sure it works in the server.

Iterators

"The concept of an iterator is fundamental to understanding the C++ Standard Template Library (STL) because iterators provide a means for accessing data stored in container classes such a vector, map, list, etc."

Learn about iterators here: STL Iterators

One way to iterate through a vector is:

using namespace std;

vector<int> myIntVector;
// Add some elements to myIntVector
myIntVector.push_back(1);
myIntVector.push_back(4);
myIntVector.push_back(8);

for(int y=0; y<myIntVector.size(); y++)
{
    cout<<myIntVector[y]<<" ";
    //Should output 1 4 8
}

However, this can only work if you know the indexes, and can specify them. What if you don't?

using namespace std;

vector<int> myIntVector;
vector<int>::iterator myIntVectorIterator;

// Add some elements to myIntVector
myIntVector.push_back(1);
myIntVector.push_back(4);
myIntVector.push_back(8);

for(myIntVectorIterator = myIntVector.begin(); 
        myIntVectorIterator != myIntVector.end();
        myIntVectorIterator++)
{
    cout<<*myIntVectorIterator<<" ";
    //Should output 1 4 8
}

MAP template

A c++ map is a template in the STL the allows the association of keys and values. There is a documentation on map here. Map allow you to look up any kind of data using any kind of key, and can be any size while being very fast (using a hash lookup strategy). It is like a vector , but one where the index can be any type (not just an integer). An example of mapping from letters to strings is below:

// accessing mapped values
#include <iostream>
#include <map>
#include <string>

int main ()
{
  std::map<char,std::string> mymap;

  mymap['a']="an element";
  mymap['b']="another element";
  mymap['c']=mymap['b'];

  std::cout << "mymap['a'] is " << mymap['a'] << '\n';
  std::cout << "mymap['b'] is " << mymap['b'] << '\n';
  std::cout << "mymap['c'] is " << mymap['c'] << '\n';
  std::cout << "mymap['d'] is " << mymap['d'] << '\n';

  std::cout << "mymap now contains " << mymap.size() << " elements.\n";

  return 0;
}

Map demo to look up Name data

There is a good example of using map on the server in the program that allows the looking of up US Census data on how common names are in the US on the cslab server at: https://cslab.kenyon.edu/class/softdev/skon/name_stats_ajax.html.

The files for this project can be found here on the cslab server: /home/class/SoftDev/namedata.

The source code for a terminal version of the source code of the names data program is here: namesdemo.cpp

The program reads in lines of the US Census name data, as seens below:

MARY           2.629  2.629      1
PATRICIA       1.073  3.702      2
LINDA          1.035  4.736      3
BARBARA        0.980  5.716      4
ELIZABETH      0.937  6.653      5
JENNIFER       0.932  7.586      6
MARIA          0.828  8.414      7
SUSAN          0.794  9.209      8
MARGARET       0.768  9.976      9
DOROTHY        0.727 10.703     10
...

The following code reads a line of US Census data, and puts it in the map:

infile >> namedata.name;
            infile >> namedata.percent;
            infile >> namedata.cumulative;
            infile >> namedata.rank;
            if (infile.fail()) break;
            name_map[namedata.name] = namedata;

The following code looks up the closest match in the Map, and points to it with an iterator:

it = lname_map.lower_bound(name);

The following code then iterates through the closest 10 matches (5 before, 5 after).

// back up 5 places
            for (int i = 0; i < 5 && (it != curMap->begin()); i++) {
                it--;
            }

            // Get 10 results
            for (int i = 0; i < 10 && (it != curMap->end()); i++) {
                result = (*it).second;
                name = result.name;
                percent = result.percent;
                rank = result.rank;
                cout << name << "\t" << percent << "\t" << rank << endl;
                ;
                it++;

            }

The complete code can be found here.

Inverted Index

An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents.

Consider a book index: index_sample.jpg

Now consider a index of Shakespeare that has a map of words, and that the value for each is a vector of offsets in the file to where the word can be found.

Invertedindex.png

You will need to create a routine to read through each word in the file Shakspeare.txt, and place it in the MAP, along with the file position of the line containing the word in the vector int's that the word maps to.

The following simplified code sample can insert the each word into the map, and push the line position at the end of the vector of positions. Note that is it is the first time the word was since, the vector will be empty, and the position will be the first entry. If the word was already seen, the position will be added on the end.

#include <sstream> // for istringstream

map<string, vector<int> > refs; // The map of words/references
string word, line;
ifstream infile(filename); // open the file
int position = 0;
while (!infile.fail()) {
    getline(infile, line); // get the next line of code
    istringstream lineStream(line); // Create a string stream of the line
    while (lineStream >> word) { // get the next word
        refs[word].push_back(position); // push the word and the line position on the vector for this word
    }
    position = infile.tellg(); // get the poistion of the next line
}

Next we need a way to look up the words positions in the file, given a word. Following is a function that when passed a word, returns a vector of references.

map<string, vector<int> > refs; // The map of words/references
vector <int> indexSearch(string word) {
     map<string, vector<int> >::iterator it;  // iterator for find                                                                
     vector<int> blank; // return for no matches                                                                                  
    /* find the word and get the vector of references */
    /* First use find, so as to NOT create a new entry */
    it = refs.find(word);
    if (it == refs.end()) {
         return(blank);
    } else {
        return (refs[word]);
    }
}

The caller of indexSearch can take the vector returned, and iterate through it to get the positions of all the words.

Note that the indexSearch uses find method on the map. Why? The Map template has a strange operation feature in the if you use the indexdex form to look up, eg:

match = refs[word];

It will create a new entry in the Map if there isn't one there. This is undesirable, as the indexSearch will thereafter find a match for that word, with an empty vector, even though it is not in the file.

The find returns an iterator to the Map, which we don't really need. So after we find it in the Map, we used the refs[word] to get the vector of offsets.

Stemming

When we search for a work, we typical want to find every version of the word. For example, it would be nice if group, groups, grouping, grouped, grouper all mapped the same index. How can we do with? With an English stemmer. a stemmer is a routine the attempts to remove the suffixes of words to find its "stem". This is an imperfect process, as English has many exceptions to simple stemming rules (such as run and ran, rather than run and runned), but we can greatly improve the functions with a simple stemmer.

A stemmer can be found in /home/class/SoftDev/wordStem. The code can be seen here.

File Management

As mentioned above, twe will be storing the positions of each line in the vector associated with each word. We need to understand the file functions that allow the program to determine the current position of a file, and set the position of a file.

How can this be done? By using the iostream functions tellf() and seekg(). These allow you to retrieve the current position in an open file (tellf() ), and also to set the position of a file (seekg() ). When you do a >> or getline() input after a seekg(), the next data read will always be from the point in the file starting from where the seekg()

Function: tellg()
Description: Returns the position of the current character in the input stream.

ifstream is ("Shakespeare.txt");
   int pos = is.tellg(); 
   string line; 
   getline (cin,line); 

Documentation: Link

The position (pos) will be set to the byte at the beginning of the read.

Function: seekg()
Description: Sets the position of the next character to be extracted from the input stream.

is.seekg (0, is.end); // Go to the end of the file
    is.seekg (0, is.beg); // Go to the beginning of the file
    is.seekg (pos, is.beg); // Go to pos from the beginning
    is.seekg (-1, is.cur); // Go to back one character

Documentation: Link

It is recommended that you read in one line at a time using getline(), and then process the words on the line, as shown above.

Also - once you read to the end of file (EOF), these functions will not work. Thus you will need to close and reopen the file after you build the index, but before you use these functions.

Assignment

Below is a list of features you can put in your program, along with the points. Notice that people with Data Structures must do more than studnets with only the introduction course to get 100 points (the perfect score) on the program. If you do more than 100 points, you get extra credit.

Requirement Data Structures Intro Prog Comment Grade
Program displays matching lines for each word entered* 30 50    
Input and output is well designed and formatted* 20 20    
Program broken up into functionally cohesive functions* 10 20    
Program has meaningful comments, good variable names, and good formatting* 10 10    
Program must work on cslab.kenyon.edu server.* 5 5    
Program uses appropriate classes 10 10    
Program uses stemming 10 10    
Program shows book that each match is found in 20 20    
Program highlights the matching words with bold or color. 10 10    
Total        
*These are required in order to get extra credit.

Turn in

  1. All source files you authored
  2. A writeup with the table above, and an indicated of which was included. Paste table into Moodle text field, and put X's in to options you attempted. Also include a statement of the highest computer science course you have taken.
  3. Several runs showing all features.

  • shake.gif:
    shake.gif
Topic attachments
I Attachment History Action Size Date Who Comment
PNGpng Invertedindex.png r1 manage 15.0 K 2017-01-23 - 02:25 JimSkon  
JPEGjpg index_sample.jpg r1 manage 55.0 K 2017-01-23 - 02:26 JimSkon  
GIFgif shake.gif r1 manage 36.6 K 2017-09-01 - 19:37 JimSkon  
Edit | Attach | Watch | Print version | History: r16 < r15 < r14 < r13 < r12 | Backlinks | Raw View | Raw edit | More topic actions
Topic revision: r16 - 2017-09-01 - JimSkon
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2018 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback