Project 1, Part 1

Marvel Character Lookup with Inverted Index
Due: Sept 20 - 11:55 pm

Moodle Link

Janet_van_Dyne_25252528Earth-61625252529_from_Uncanny_Avengers_Vol_3_10_001.jpg Gamora_AIW_Profile.jpg

Goal

The goal of this program is to write a python or C++ program to build an smart index (using the c++ STL or Python modules) of every name in the Marvel character database. This will allow any character to be looked up by name (e.g any word in the name, or by the year of the first appearance. After the search the program will a list of of the matching characters. The program will then allow the user to look at all the information on that character. The trick is to do all this WITHOUT an actual search, but rather with indexes, as done in the name demo program. So you will look up the data in an index, and get a index to a list of the actual data for the matching avenger.

Your program for this assignment will not be on a web page, only a console program. The web version will be part of Project 2. The program must run on the cslab.kenyon.edu server.

Note: Only the students having had Data Structure MUST do two indexes, Other students must only do one, but can do both for extra credit.

Methods

The file marvel-wikia-data.csv is located on cslab.kenyon.edu in the directory /home/class/SoftDev/marvel/ . It may also be found here: https://github.com/fivethirtyeight/data/blob/master/comic-characters/marvel-wikia-data.csv. You will need to write a program to read through ALL the lines, parsing and saving the data to list a of tuples, and also build a set of inverted indexes of the index of character in a map structure or a Python. You will thus have two indexes, on for the names, and one for the years. Then when someone searches for a name or year, the index for this name or year can be looked up, and ALL the Marvel characters with the name or year can be listed. The user can then select one of the from the returned names, and get information on that avenger (full name, year introduced, whether they are alive, or when they died, and their notes.)

There are two main ways to create an index:

  1. Load the characters into a list of character objects, then create an inverted index that maps from name (or years) to a list of character indexes
  2. Build an inverted index that maps from character names or years to byte location in file.

Below - Index of characters in a list of objects.

Marvel2.png

Below - Index into file by byte index:

Marvel1.png

You can download this file to do development on your own computer, but for credit you must upload the program and make sure it works in the server. An on the server you should read the file from it's location at /home/class/SoftDev/marvel/.

Example

If the user selects that they want to search by name, the and the user enters "Quill", they get:

  1. Peter Quill
  2. Quill
  3. Meredith Quill

If they pick 3 they get:

Meredith Quill: Good Character, Brown Eyes, Brown Hair, Female, Deceased, 6 Apperances, 1976

If they enter 1972 they get:

  1. Dancing Water
  2. Doyle
  3. Helen Cobb
  4. Marin
  5. Omega Black
  6. Charlie Cluster-7

And once again they can select a character to get information on.

Iterators (C++)

"The concept of an iterator is fundamental to understanding the C++ Standard Template Library (STL) because iterators provide a means for accessing data stored in container classes such a vector, map, list, etc."

Learn about iterators here: STL Iterators

One way to iterate through a vector is:

#include <iostream>
#include <vector>
using namespace std;

int main(){
   vector<int> myIntVector;
   // Add some elements to myIntVector
   myIntVector.push_back(1);
   myIntVector.push_back(4);
   myIntVector.push_back(8);

   for(int y=0; y<myIntVector.size(); y++)
   {
       cout<<myIntVector[y]<<" ";
       //Should output 1 4 8
   }
}

However, this can only work if you know the indexes, and can specify them. What if you don't?

using namespace std;

vector<int> myIntVector;
vector<int>::iterator myIntVectorIterator;

// Add some elements to myIntVector
myIntVector.push_back(1);
myIntVector.push_back(4);
myIntVector.push_back(8);

for(myIntVectorIterator = myIntVector.begin(); 
        myIntVectorIterator != myIntVector.end();
        myIntVectorIterator++)
{
    cout<<*myIntVectorIterator<<" ";
    //Should output 1 4 8
}

MAP template

A c++ map is a template in the STL the allows the association of keys and values. There is a documentation on map here. Map allow you to look up any kind of data using any kind of key, and can be any size while being very fast (using a hash lookup strategy). It is like a vector , but one where the index can be any type (not just an integer). An example of mapping from letters to strings is below:

// accessing mapped values
#include <iostream>
#include <map>
#include <string>

int main ()
{
  std::map<char,std::string> mymap;

  mymap['a']="an element";
  mymap['b']="another element";
  mymap['c']=mymap['b'];

  std::cout << "mymap['a'] is " << mymap['a'] << '\n';
  std::cout << "mymap['b'] is " << mymap['b'] << '\n';
  std::cout << "mymap['c'] is " << mymap['c'] << '\n';
  std::cout << "mymap['d'] is " << mymap['d'] << '\n';

  std::cout << "mymap now contains " << mymap.size() << " elements.\n";

  return 0;
}

Map demo to look up Name data

There is a good example of using map on the server in the program that allows the looking of up US Census data on how common names are in the US on the cslab server at: https://cslab.kenyon.edu/class/softdev/skon/Names/namelookup.html

The source code for a terminal version of the source code of the names data program is here: namesdemo.cpp

Recall the github site for this project is here.

The program reads in lines of the US Census name data, as seens below:

MARY           2.629  2.629      1
PATRICIA       1.073  3.702      2
LINDA          1.035  4.736      3
BARBARA        0.980  5.716      4
ELIZABETH      0.937  6.653      5
JENNIFER       0.932  7.586      6
MARIA          0.828  8.414      7
SUSAN          0.794  9.209      8
MARGARET       0.768  9.976      9
DOROTHY        0.727 10.703     10
...

The following code reads a line of US Census data, and puts it in the map:

infile >> namedata.name;
            infile >> namedata.percent;
            infile >> namedata.cumulative;
            infile >> namedata.rank;
            if (infile.fail()) break;
            name_map[namedata.name] = namedata;

The following code looks up the closest match in the Map, and points to it with an iterator:

it = lname_map.lower_bound(name);

The following code then iterates through the closest 10 matches (5 before, 5 after).

// back up 5 places
            for (int i = 0; i < 5 && (it != curMap->begin()); i++) {
                it--;
            }

            // Get 10 results
            for (int i = 0; i < 10 && (it != curMap->end()); i++) {
                result = (*it).second;
                name = result.name;
                percent = result.percent;
                rank = result.rank;
                cout << name << "\t" << percent << "\t" << rank << endl;
                ;
                it++;

            }

The complete code can be found here.

Inverted Index

An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents.

Consider a book index: Index

Now consider a index of file that has a map of name, and that the value for each is a vector of offsets in the file to where the word can be found.

You will need to create a routine to read through each word in the file Shakspeare.txt, and place it in the MAP, along with the file position of the line containing the word in the vector int's that the word maps to.

The following simplified code sample can insert the each word into the map, and push the line position at the end of the vector of positions. Note that is it is the first time the word was since, the vector will be empty, and the position will be the first entry. If the word was already seen, the position will be added on the end.

#include <sstream> // for istringstream

map<string, vector<int> > refs; // The map of words/references
string word, line;
ifstream infile(filename); // open the file
int position = 0;
while (!infile.fail()) {
    getline(infile, line); // get the next line of code
    istringstream lineStream(line); // Create a string stream of the line
    while (lineStream >> word) { // get the next word
        refs[word].push_back(position); // push the word and the line position on the vector for this word
    }
    position = infile.tellg(); // get the poistion of the next line
}

You, of course, will need to normalize the capitalization (convert all to lower case), and remove special characters. You can see how to do the latter here: link.

Next we need a way to look up the words positions in the file, given a word. Following is a function that when passed a word, returns a vector of references.

map<string, vector<int> > refs; // The map of words/references
vector <int> indexSearch(string word) {
     map<string, vector<int> >::iterator it;  // iterator for find                                                                
     vector<int> blank; // return for no matches                                                                                  
    /* find the word and get the vector of references */
    /* First use find, so as to NOT create a new entry */
    it = refs.find(word);
    if (it == refs.end()) {
         return(blank);
    } else {
        return (refs[word]);
    }
}

The caller of indexSearch can take the vector returned, and iterate through it to get the positions of all the words.

Note that the indexSearch uses find method on the map. Why? The Map template has a strange operation feature in the if you use the indexdex form to look up, eg:

match = refs[word];

It will create a new entry in the Map if there isn't one there. This is undesirable, as the indexSearch will thereafter find a match for that word, with an empty vector, even though it is not in the file.

The find returns an iterator to the Map, which we don't really need. So after we find it in the Map, we used the refs[word] to get the vector of offsets.

File Random Access C++

As mentioned above, we will be storing the positions of each line in the vector associated with each word. We need to understand the file functions that allow the program to determine the current position of a file, and set the position of a file. Directly access a location in a file is called random access.

How can this be done? By using the iostream functions tellf() and seekg(). These allow you to retrieve the current position in an open file (tellf() ), and also to set the position of a file (seekg() ). When you do a >> or getline() input after a seekg(), the next data read will always be from the point in the file starting from where the seekg()

Function: tellg()
Description: Returns the position of the current character in the input stream.

ifstream is ("/home/class/SoftDev/marvel/marvel-wikia-data.csv");
   int pos = is.tellg(); 
   string line; 
   getline (cin,line); 

Documentation: Link

The position (pos) will be set to the byte at the beginning of the read.

Function: seekg()
Description: Sets the position of the next character to be extracted from the input stream.

is.seekg (0, is.end); // Go to the end of the file
    is.seekg (0, is.beg); // Go to the beginning of the file
    is.seekg (pos, is.beg); // Go to pos from the beginning  (This is what you will use)
    is.seekg (-1, is.cur); // Go to back one character

Documentation: Link

It is recommended that you read in one line at a time using getline(), and then process the words on the line, as shown above.

Also - once you read to the end of file (EOF), these functions will not work. Thus you will need to close and reopen the file after you build the index, but before you use these functions.

File Management Python

In Python you have the following functions for random file access:

seek(offset,from= SEEK_SET) Change the file position to offset bytes, in reference to from (start, current, end).
tell() Returns the current file location.
import os
    
    filename = '/tmp/data.txt'
    with open(filename, 'w') as fh:
    fh.write("Hello World!\nHow are you today?\nThank you!")
    
    print(os.path.getsize(filename)) # 42
    
    with open(filename) as fh:
     print(fh.tell()) # 0
     row = fh.readline()
     print(row) # Hello World!
     print(fh.tell()) # 13
     
    fh.seek(-7, os.SEEK_CUR)
     print(fh.tell()) # 6
     
     row = fh.readline()
     print(row) # World!
     print(fh.tell()) # 13
     
     fh.seek(0, os.SEEK_SET)
     print(fh.tell()) # 0
     print(fh.read(5)) # Hello
     
     fh.seek(-4, os.SEEK_END)
     print(fh.tell()) # 38
     print(fh.read()) # you!
     print(fh.tell()) # 42

Assignment

Below is a list of features you can put in your program, along with the points. Notice that students with Data Structures must do more than students with only the introduction course to get 100 points (the perfect score) on the program. If you do more than 100 points, you get extra credit.

Requirement Data Structures Intro Prog Comment Grade
Program displays matching characters for each name entered 15 25    
Program displays matching characters for each year entered 15 25    
Program allows search for more than one name, which may appear in any order, but all must match the character. 20 20    
Input and output is well designed and formatted* 20 20    
Program broken up into functionally cohesive functions* 15 25    
Program has meaningful comments, good variable names, and good formatting* 10 10    
Program must work on cslab.kenyon.edu server.* 5 5    
Program uses appropriate classes 15 15    
Total        
*These are required in order to get any extra credit.

Turn in

  1. All source files you authored
  2. A writeup with the table above, and an indicated of which was included. Paste table into Moodle text field, and put X's in to options you attempted. Also include a statement of the highest computer science course you have taken.
  3. Several runs showing all features.

  • Marvel2.png:

  • Marvel1.png:
Topic attachments
I Attachment History Action Size Date Who Comment
JPEGjpg Gamora_AIW_Profile.jpg r1 manage 858.9 K 2019-09-09 - 13:24 JimSkon  
HTMLhtml InvertedIndexExample.html r1 manage 38.0 K 2017-09-01 - 19:37 JimSkon  
PNGpng Invertedindex.png r1 manage 20.0 K 2017-09-01 - 19:39 JimSkon  
JPEGjpg Janet_van_Dyne_25252528Earth-61625252529_from_Uncanny_Avengers_Vol_3_10_001.jpg r1 manage 45.0 K 2019-09-08 - 23:53 JimSkon  
PNGpng Marvel1.png r1 manage 29.9 K 2019-09-20 - 14:47 JimSkon  
PNGpng Marvel2.png r1 manage 28.7 K 2019-09-20 - 14:47 JimSkon  
PNGpng MarvelIndex.png r1 manage 31.3 K 2019-09-09 - 13:55 JimSkon  
Unknown file formatmd README.md r1 manage 68.6 K 2019-09-08 - 23:44 JimSkon  
Unknown file formatcsv avengers.csv r1 manage 198.8 K 2019-09-08 - 23:44 JimSkon  
Unknown file formatcsv marvel-wikia-data.csv r1 manage 2350.2 K 2019-09-09 - 00:13 JimSkon  
GIFgif shake.gif r1 manage 36.6 K 2017-09-01 - 19:38 JimSkon  
Edit | Attach | Watch | Print version | History: r23 < r22 < r21 < r20 < r19 | Backlinks | Raw View | Raw edit | More topic actions
Topic revision: r23 - 2019-09-26 - JimSkon
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2019 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback