Difference: SdProject1 (1 vs. 16)

Revision 162017-09-01 - JimSkon

Line: 1 to 1
 
META TOPICPARENT name="Spring2017"

Line: 274 to 274
 
  1. A writeup with the table above, and an indicated of which was included. Paste table into Moodle text field, and put X's in to options you attempted. Also include a statement of the highest computer science course you have taken.
  2. Several runs showing all features.
Added:
>
>
  • shake.gif:
    shake.gif
 
META FILEATTACHMENT attachment="Invertedindex.png" attr="" comment="" date="1485138335" name="Invertedindex.png" path="Invertedindex.png" size="15336" user="JimSkon" version="1"
META FILEATTACHMENT attachment="index_sample.jpg" attr="" comment="" date="1485138402" name="index_sample.jpg" path="index_sample.jpg" size="56289" user="JimSkon" version="1"
Changed:
<
<
META FILEATTACHMENT attachment="shake.gif" attr="" comment="" date="1485149004" name="shake.gif" path="shake.gif" size="37508" user="JimSkon" version="1"
>
>
META FILEATTACHMENT attachment="shake.gif" attr="" comment="" date="1504294678" name="shake.gif" path="shake.gif" size="37508" user="JimSkon" version="1"

Revision 152017-02-04 - JimSkon

Line: 1 to 1
 
META TOPICPARENT name="Spring2017"

Line: 256 to 255
 Also - once you read to the end of file (EOF), these functions will not work. Thus you will need to close and reopen the file after you build the index, but before you use these functions.

Assignment

Changed:
<
<
Below is a list of features you can put in your program, along with the points. Notice that people with Data Structures must do more than studnets with only the introduction course to get 100 points (the perfect score) on the program. If you do more than 100 points, you getr extra credit.
>
>
Below is a list of features you can put in your program, along with the points. Notice that people with Data Structures must do more than studnets with only the introduction course to get 100 points (the perfect score) on the program. If you do more than 100 points, you get extra credit.
 
Requirement Data Structures Intro Prog Comment Grade
Program displays matching lines for each word entered* 30 50    

Revision 142017-02-04 - JimSkon

Line: 1 to 1
 
META TOPICPARENT name="Spring2017"

Line: 258 to 258
  Below is a list of features you can put in your program, along with the points. Notice that people with Data Structures must do more than studnets with only the introduction course to get 100 points (the perfect score) on the program. If you do more than 100 points, you getr extra credit.
Changed:
<
<
Requirement Data Structures Intro Prog Done
Program displays matching lines for each word entered* 30 50  
Input and output is well designed and formatted* 20 20  
Program broken up into functionally cohesive functions* 10 20  
Program has meaningful comments, good variable names, and good formatting* 10 10  
Program must work on cslab.kenyon.edu server.* 5 5  
Program uses appropriate classes 10 10  
Program uses stemming 10 10  
Program shows book that each match is found in 20 20  
Program highlights the matching words with bold or color. 10 10  
>
>
Requirement Data Structures Intro Prog Comment Grade
Program displays matching lines for each word entered* 30 50    
Input and output is well designed and formatted* 20 20    
Program broken up into functionally cohesive functions* 10 20    
Program has meaningful comments, good variable names, and good formatting* 10 10    
Program must work on cslab.kenyon.edu server.* 5 5    
Program uses appropriate classes 10 10    
Program uses stemming 10 10    
Program shows book that each match is found in 20 20    
Program highlights the matching words with bold or color. 10 10    
Total        
 *These are required in order to get extra credit.

Turn in

  1. All source files you authored

Revision 132017-02-02 - JimSkon

Line: 1 to 1
 
META TOPICPARENT name="Spring2017"

Project 1

Shakespeare Lookup with Inverted Index
Changed:
<
<
Due: Feb 1 - 11:55 pm
>
>
Due: Feb 3 - 11:55 pm
 Moodle Link
Invertedindex.png shake.gif

Goal

Revision 122017-02-02 - JimSkon

Line: 1 to 1
 
META TOPICPARENT name="Spring2017"

Line: 14 to 14
  A web based example of what your program will produce is here: Shakespeare Lookup Example
Changed:
<
<
Your program for this assignment will not be on a web page, only a console program. The web version will be part of Project 2.
>
>
Your program for this assignment will not be on a web page, only a console program. The web version will be part of Project 2. The program must run on the cslab.kenyon.edu server.
 

Methods

The file Shakespeare.txt is located on cslab.kenyon.edu in the directory /home/class/SoftDev/Shakespeare/ .You will need to write a program to read through ALL the words in the and create in inverted index of the locations of each of the words in a map structure. Then when someone searches for a word, the index for this word can be looked up, and ALL the lines with the word can be printed.

Added:
>
>
You can download these file to do development on your own coputer, but for credit you must upload the program and make sure it works in the server.
 
Iterators

"The concept of an iterator is fundamental to understanding the C++ Standard Template Library (STL) because iterators provide a means for accessing data stored in container classes such a vector, map, list, etc."

Line: 261 to 262
 
Input and output is well designed and formatted* 20 20  
Program broken up into functionally cohesive functions* 10 20  
Program has meaningful comments, good variable names, and good formatting* 10 10  
Added:
>
>
Program must work on cslab.kenyon.edu server.* 5 5  
 
Program uses appropriate classes 10 10  
Program uses stemming 10 10  
Program shows book that each match is found in 20 20  

Revision 112017-02-01 - JimSkon

Line: 1 to 1
 
META TOPICPARENT name="Spring2017"

Line: 170 to 171
 The following simplified code sample can insert the each word into the map, and push the line position at the end of the vector of positions. Note that is it is the first time the word was since, the vector will be empty, and the position will be the first entry. If the word was already seen, the position will be added on the end. %CODE{"c++"}%
Added:
>
>
#include // for istringstream
 map<string, vector > refs; // The map of words/references string word, line; ifstream infile(filename); // open the file

Revision 102017-02-01 - JimSkon

Line: 1 to 1
 
META TOPICPARENT name="Spring2017"

Revision 92017-01-30 - JimSkon

Line: 1 to 1
 
META TOPICPARENT name="Spring2017"

Line: 165 to 165
  Now consider a index of Shakespeare that has a map of words, and that the value for each is a vector of offsets in the file to where the word can be found.

Invertedindex.png
Changed:
<
<
You will need to create a routine to read through each word in the file Shakspeare.txt, and place it in the MAP, along with its position in the file (or string) in the vector that the word maps to.
>
>
You will need to create a routine to read through each word in the file Shakspeare.txt, and place it in the MAP, along with the file position of the line containing the word in the vector int's that the word maps to.
 
Changed:
<
<
Assuming you have a have a routine that gets the next word in the file, and returns the word and it's integer position, the following simplified code sample can insert the each word into the map, and push the position at the end of the vector of positions. Note that is it is the first time the word was since, the vector will be empty, and the position will be the first entry. If the word was already seen, the position will be added on the end.
>
>
The following simplified code sample can insert the each word into the map, and push the line position at the end of the vector of positions. Note that is it is the first time the word was since, the vector will be empty, and the position will be the first entry. If the word was already seen, the position will be added on the end.
  %CODE{"c++"}% map<string, vector > refs; // The map of words/references
Changed:
<
<
string word; int position;
>
>
string word,line; ifstream infile(filename); // open the file int position=0;
 while (infile.fail()) {
Changed:
<
<
getNextWord(word, position); refs[word].push_back(position);
>
>
getline(infile,line); // get the next line of code istringstream lineStream(line); // Create a string stream of the line while (lineStream >> word) { // get the next word refs[word].push_back(position); // push the word and the line position on the vector for this word } position = infile.tellg(); // get the poistion of the next line
 } %ENDCODE%
Deleted:
<
<
 Next we need a way to look up the words positions in the file, given a word. Following is a function that when passed a word, returns a vector of references. %CODE{"c++"}%
Line: 213 to 217
 When we search for a work, we typical want to find every version of the word. For example, it would be nice if group, groups, grouping, grouped, grouper all mapped the same index. How can we do with? With an English stemmer. a stemmer is a routine the attempts to remove the suffixes of words to find its "stem". This is an imperfect process, as English has many exceptions to simple stemming rules (such as run and ran, rather than run and runned), but we can greatly improve the functions with a simple stemmer.

A stemmer can be found in /home/class/SoftDev/wordStem. The code can be seen here.

Changed:
<
<

Data Management

>
>

File Management

 
Changed:
<
<
As mentioned above, there are two major methods of dealing with the text content. One is to read the entire text into a large string, and then work on that. While this is fast (no file I/O) it is not scable to larger document repositories. It also requires the permanent allocation of a large data strucuture for as long as the program is running. Storing the books in a single long string is simpler to implement, and is the expected solution for students who have only had Intro to Programming.

The other solution is to not keep the file on memory, but rather to look up the needed data in the file whenever a search is done.

>
>
As mentioned above, twe will be storing the positions of each line in the vector associated with each word. We need to understand the file functions that allow the program to determine the current position of a file, and set the position of a file.
  How can this be done? By using the iostream functions tellf() and seekg(). These allow you to retrieve the current position in an open file (tellf() ), and also to set the position of a file (seekg() ). When you do a >> or getline() input after a seekg(), the next data read will always be from the point in the file starting from where the seekg()
Line: 243 to 245
  Documentation: Link
Changed:
<
<
It is recommended that you read in one line at a time using getline(), and then process the words on the line.

Displaying the line of a match

One issue is that, based on the suggested design, the position in the index is the position of the word. How do you find the beginning and end of the line that the word in on? One solution is to scan backwards one character at a timeuntil you either find a '\n' or the beginning of the file. This option is very appropriate is the text is stored in a string If it is in the file, then you can use

<-- SyntaxHighlightingPlugin -->
is.seekg (-2, is.cur); 
<-- end SyntaxHighlightingPlugin -->

to move backwards one character as a time. (-2 because -1 gets you back to the character just read)

Once you find the beginning of the line you can find the end by either scanning forward for a '\n' or end of string, or (if using the file) just doing a getline() on the file.

Another cool option is to use a vector off pairs of integers in the index. The one int will be the location of the word, the other the beginning of the line (e.g. <2034,2012> for the words at 2034, and that start of the line at 2012.) In order to do this you will have to make a structure of two ints.

>
>
It is recommended that you read in one line at a time using getline(), and then process the words on the line, as shown above.
 
Added:
>
>
Also - once you read to the end of file (EOF), these functions will not work. Thus you will need to close and reopen the file after you build the index, but before you use these functions.
 

Assignment

Below is a list of features you can put in your program, along with the points. Notice that people with Data Structures must do more than studnets with only the introduction course to get 100 points (the perfect score) on the program. If you do more than 100 points, you getr extra credit.

Requirement Data Structures Intro Prog Done
Program displays matching lines for each word entered* 30 50  
Changed:
<
<
Input and output is well designed and formatted* 10 20  
>
>
Input and output is well designed and formatted* 20 20  
 
Program broken up into functionally cohesive functions* 10 20  
Program has meaningful comments, good variable names, and good formatting* 10 10  
Program uses appropriate classes 10 10  
Program uses stemming 10 10  
Deleted:
<
<
Program works directly on files using seekg() 20 20  
Program reads file into program, and works on an array (large string) (mutually exclusive with above) 5 10  
 
Program shows book that each match is found in 20 20  
Added:
>
>
Program highlights the matching words with bold or color. 10 10  
 *These are required in order to get extra credit.

Turn in

  1. All source files you authored

Revision 82017-01-30 - JimSkon

Line: 1 to 1
 
META TOPICPARENT name="Spring2017"

Line: 175 to 174
 string word; int position; while (infile.fail()) {
Changed:
<
<
getNextWord(infile, word, position);
>
>
getNextWord(word, position);
  refs[word].push_back(position); } %ENDCODE%

Revision 72017-01-27 - JimSkon

Line: 1 to 1
 
META TOPICPARENT name="Spring2017"

Project 1

Shakespeare Lookup with Inverted Index
Changed:
<
<
Due: Jan 30 - 11:55 pm
>
>
Due: Feb 1 - 11:55 pm
 Moodle Link
Invertedindex.png shake.gif

Goal

Revision 62017-01-24 - JimSkon

Line: 1 to 1
 
META TOPICPARENT name="Spring2017"

Line: 180 to 179
 } %ENDCODE%
Changed:
<
<
Next we need a way to look up the words positions in the file, given a word. Following is a function that when pasted a word, returns the vector of references.
>
>
Next we need a way to look up the words positions in the file, given a word. Following is a function that when passed a word, returns a vector of references.
  %CODE{"c++"}% map<string, vector > refs; // The map of words/references
Line: 216 to 215
 A stemmer can be found in /home/class/SoftDev/wordStem. The code can be seen here.

Data Management

Changed:
<
<
As mentioned above, there are two major methods of dealing with the text content. One is to read the entire text into a large string, and then work on that. While this is fast (no file I/O) it is not scable to larger document repositories. It also requires the perminate allocation of a larger data strucuture as long as the program is running. This methid is simpler to implement, and is the expected solution for student who have only had Intro to Programming.
>
>
As mentioned above, there are two major methods of dealing with the text content. One is to read the entire text into a large string, and then work on that. While this is fast (no file I/O) it is not scable to larger document repositories. It also requires the permanent allocation of a large data strucuture for as long as the program is running. Storing the books in a single long string is simpler to implement, and is the expected solution for students who have only had Intro to Programming.
 
Changed:
<
<
The other solution is to not keep the file on memory, but rather to look up the needed data whenever a search is done.
>
>
The other solution is to not keep the file on memory, but rather to look up the needed data in the file whenever a search is done.
 
Changed:
<
<
How can this be done? By using the iostream functions tellf() and seekg(). These allow you to retrieve the current position in an open file, and to set the position of a file.
>
>
How can this be done? By using the iostream functions tellf() and seekg(). These allow you to retrieve the current position in an open file (tellf() ), and also to set the position of a file (seekg() ). When you do a >> or getline() input after a seekg(), the next data read will always be from the point in the file starting from where the seekg()
  Function: tellg()
Description: Returns the position of the current character in the input stream. %CODE{"c++"}%
Line: 273 to 272
 *These are required in order to get extra credit.

Turn in

  1. All source files you authored
Changed:
<
<
  1. A writeup with the table above, and an indicated of which was included. Paste table into Moodle text field, and put X's in to options you attempted.
>
>
  1. A writeup with the table above, and an indicated of which was included. Paste table into Moodle text field, and put X's in to options you attempted. Also include a statement of the highest computer science course you have taken.
 
  1. Several runs showing all features.

META FILEATTACHMENT attachment="Invertedindex.png" attr="" comment="" date="1485138335" name="Invertedindex.png" path="Invertedindex.png" size="15336" user="JimSkon" version="1"

Revision 52017-01-23 - JimSkon

Line: 1 to 1
 
META TOPICPARENT name="Spring2017"

Line: 7 to 7
 
Shakespeare Lookup with Inverted Index
Due: Jan 30 - 11:55 pm
Changed:
<
<
Moodle Link
shake.gif
>
>
Moodle Link
Invertedindex.png shake.gif
 

Goal

The goal of this program is to write a program to build an inverted index (using the c++ map STL) of every word in the complete works of Shakespeare. This will allow words to be searched for, and then to print out all of the lines that have that word.

Added:
>
>
A web based example of what your program will produce is here: Shakespeare Lookup Example

Your program for this assignment will not be on a web page, only a console program. The web version will be part of Project 2.

 

Methods

The file Shakespeare.txt is located on cslab.kenyon.edu in the directory /home/class/SoftDev/Shakespeare/ .You will need to write a program to read through ALL the words in the and create in inverted index of the locations of each of the words in a map structure. Then when someone searches for a word, the index for this word can be looked up, and ALL the lines with the word can be printed.

Line: 264 to 269
 
Program uses stemming 10 10  
Program works directly on files using seekg() 20 20  
Program reads file into program, and works on an array (large string) (mutually exclusive with above) 5 10  
Changed:
<
<
Program shows book for each match is found in 20 20  
>
>
Program shows book that each match is found in 20 20  
 *These are required in order to get extra credit.

Turn in

  1. All source files you authored

Revision 42017-01-23 - JimSkon

Line: 1 to 1
 
META TOPICPARENT name="Spring2017"
Added:
>
>

 

Project 1

Shakespeare Lookup with Inverted Index
Due: Jan 30 - 11:55 pm
Changed:
<
<

Goal

>
>
Moodle Link
shake.gif

Goal

  The goal of this program is to write a program to build an inverted index (using the c++ map STL) of every word in the complete works of Shakespeare. This will allow words to be searched for, and then to print out all of the lines that have that word.

Methods

Changed:
<
<
The file Shakespeare.txt is located on cslab.kenyon.edu in the directory /home/class/SoftDev/Shakespeare/ .You will need to write a program to read through ALL the words in the and create in inverted list of the locations of each of the words in a map structure. Then when someone searches for a word, the inde for this word can be looked up, and ALL the lines with the word can be printed.
>
>
The file Shakespeare.txt is located on cslab.kenyon.edu in the directory /home/class/SoftDev/Shakespeare/ .You will need to write a program to read through ALL the words in the and create in inverted index of the locations of each of the words in a map structure. Then when someone searches for a word, the index for this word can be looked up, and ALL the lines with the word can be printed.
 
Iterators

"The concept of an iterator is fundamental to understanding the C++ Standard Template Library (STL) because iterators provide a means for accessing data stored in container classes such a vector, map, list, etc."

Changed:
<
<
Learn about itertors here: STL Iterators
>
>
Learn about iterators here: STL Iterators

One way to iterate through a vector is:

<-- SyntaxHighlightingPlugin -->
using namespace std;

vector<int> myIntVector;
// Add some elements to myIntVector
myIntVector.push_back(1);
myIntVector.push_back(4);
myIntVector.push_back(8);

for(int y=0; y<myIntVector.size(); y++)
{
    cout<<myIntVector[y]<<" ";
    //Should output 1 4 8
}
<-- end SyntaxHighlightingPlugin -->

However, this can only work if you know the indexes, and can specify them. What if you don't?

<-- SyntaxHighlightingPlugin -->
using namespace std;

vector<int> myIntVector;
vector<int>::iterator myIntVectorIterator;

// Add some elements to myIntVector
myIntVector.push_back(1);
myIntVector.push_back(4);
myIntVector.push_back(8);

for(myIntVectorIterator = myIntVector.begin(); 
        myIntVectorIterator != myIntVector.end();
        myIntVectorIterator++)
{
    cout<<*myIntVectorIterator<<" ";
    //Should output 1 4 8
}
<-- end SyntaxHighlightingPlugin -->
 
MAP template

A c++ map is a template in the STL the allows the association of keys and values. There is a documentation on map here. Map allow you to look up any kind of data using any kind of key, and can be any size while being very fast (using a hash lookup strategy). It is like a vector , but one where the index can be any type (not just an integer). An example of mapping from letters to strings is below:

Line: 106 to 151
  } %ENDCODE%
Added:
>
>
The complete code can be found here.
 
Inverted Index

An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. \ No newline at end of file

Added:
>
>
Consider a book index: index_sample.jpg

Now consider a index of Shakespeare that has a map of words, and that the value for each is a vector of offsets in the file to where the word can be found.

Invertedindex.png

You will need to create a routine to read through each word in the file Shakspeare.txt, and place it in the MAP, along with its position in the file (or string) in the vector that the word maps to.

Assuming you have a have a routine that gets the next word in the file, and returns the word and it's integer position, the following simplified code sample can insert the each word into the map, and push the position at the end of the vector of positions. Note that is it is the first time the word was since, the vector will be empty, and the position will be the first entry. If the word was already seen, the position will be added on the end.

<-- SyntaxHighlightingPlugin -->
map<string, vector<int> > refs;  // The map of words/references
string word;
int position;
while (!infile.fail()) {
     getNextWord(infile, word, position);
     refs[word].push_back(position);
}
<-- end SyntaxHighlightingPlugin -->

Next we need a way to look up the words positions in the file, given a word. Following is a function that when pasted a word, returns the vector of references.

<-- SyntaxHighlightingPlugin -->
map<string, vector<int> > refs; // The map of words/references
vector <int> indexSearch(string word) {
     map<string, vector<int> >::iterator it;  // iterator for find                                                                
     vector<int> blank; // return for no matches                                                                                  
    /* find the word and get the vector of references */
    /* First use find, so as to NOT create a new entry */
    it = refs.find(word);
    if (it == refs.end()) {
         return(blank);
    } else {
        return (refs[word]);
    }
}
<-- end SyntaxHighlightingPlugin -->

The caller of indexSearch can take the vector returned, and iterate through it to get the positions of all the words.

Note that the indexSearch uses find method on the map. Why? The Map template has a strange operation feature in the if you use the indexdex form to look up, eg:

<-- SyntaxHighlightingPlugin -->
match = refs[word];
<-- end SyntaxHighlightingPlugin -->

It will create a new entry in the Map if there isn't one there. This is undesirable, as the indexSearch will thereafter find a match for that word, with an empty vector, even though it is not in the file.

The find returns an iterator to the Map, which we don't really need. So after we find it in the Map, we used the refs[word] to get the vector of offsets.

Stemming

When we search for a work, we typical want to find every version of the word. For example, it would be nice if group, groups, grouping, grouped, grouper all mapped the same index. How can we do with? With an English stemmer. a stemmer is a routine the attempts to remove the suffixes of words to find its "stem". This is an imperfect process, as English has many exceptions to simple stemming rules (such as run and ran, rather than run and runned), but we can greatly improve the functions with a simple stemmer.

A stemmer can be found in /home/class/SoftDev/wordStem. The code can be seen here.

Data Management

As mentioned above, there are two major methods of dealing with the text content. One is to read the entire text into a large string, and then work on that. While this is fast (no file I/O) it is not scable to larger document repositories. It also requires the perminate allocation of a larger data strucuture as long as the program is running. This methid is simpler to implement, and is the expected solution for student who have only had Intro to Programming.

The other solution is to not keep the file on memory, but rather to look up the needed data whenever a search is done.

How can this be done? By using the iostream functions tellf() and seekg(). These allow you to retrieve the current position in an open file, and to set the position of a file.

Function: tellg()
Description: Returns the position of the current character in the input stream.

<-- SyntaxHighlightingPlugin -->
ifstream is ("Shakespeare.txt");
   int pos = is.tellg(); 
   string line; 
   getline (cin,line); 
<-- end SyntaxHighlightingPlugin -->

Documentation: Link

The position (pos) will be set to the byte at the beginning of the read.

Function: seekg()
Description: Sets the position of the next character to be extracted from the input stream.

<-- SyntaxHighlightingPlugin -->
is.seekg (0, is.end); // Go to the end of the file
    is.seekg (0, is.beg); // Go to the beginning of the file
    is.seekg (pos, is.beg); // Go to pos from the beginning
    is.seekg (-1, is.cur); // Go to back one character
<-- end SyntaxHighlightingPlugin -->

Documentation: Link

It is recommended that you read in one line at a time using getline(), and then process the words on the line.

Displaying the line of a match

One issue is that, based on the suggested design, the position in the index is the position of the word. How do you find the beginning and end of the line that the word in on? One solution is to scan backwards one character at a timeuntil you either find a '\n' or the beginning of the file. This option is very appropriate is the text is stored in a string If it is in the file, then you can use

<-- SyntaxHighlightingPlugin -->
is.seekg (-2, is.cur); 
<-- end SyntaxHighlightingPlugin -->

to move backwards one character as a time. (-2 because -1 gets you back to the character just read)

Once you find the beginning of the line you can find the end by either scanning forward for a '\n' or end of string, or (if using the file) just doing a getline() on the file.

Another cool option is to use a vector off pairs of integers in the index. The one int will be the location of the word, the other the beginning of the line (e.g. <2034,2012> for the words at 2034, and that start of the line at 2012.) In order to do this you will have to make a structure of two ints.

Assignment

Below is a list of features you can put in your program, along with the points. Notice that people with Data Structures must do more than studnets with only the introduction course to get 100 points (the perfect score) on the program. If you do more than 100 points, you getr extra credit.

Requirement Data Structures Intro Prog Done
Program displays matching lines for each word entered* 30 50  
Input and output is well designed and formatted* 10 20  
Program broken up into functionally cohesive functions* 10 20  
Program has meaningful comments, good variable names, and good formatting* 10 10  
Program uses appropriate classes 10 10  
Program uses stemming 10 10  
Program works directly on files using seekg() 20 20  
Program reads file into program, and works on an array (large string) (mutually exclusive with above) 5 10  
Program shows book for each match is found in 20 20  
*These are required in order to get extra credit.

Turn in

  1. All source files you authored
  2. A writeup with the table above, and an indicated of which was included. Paste table into Moodle text field, and put X's in to options you attempted.
  3. Several runs showing all features.

META FILEATTACHMENT attachment="Invertedindex.png" attr="" comment="" date="1485138335" name="Invertedindex.png" path="Invertedindex.png" size="15336" user="JimSkon" version="1"
META FILEATTACHMENT attachment="index_sample.jpg" attr="" comment="" date="1485138402" name="index_sample.jpg" path="index_sample.jpg" size="56289" user="JimSkon" version="1"
META FILEATTACHMENT attachment="shake.gif" attr="" comment="" date="1485149004" name="shake.gif" path="shake.gif" size="37508" user="JimSkon" version="1"

Revision 32017-01-22 - JimSkon

Line: 1 to 1
 
META TOPICPARENT name="Spring2017"

Project 1

Shakespeare Lookup with Inverted Index
Line: 105 to 106
  } %ENDCODE%
Changed:
<
<
Inverted list
>
>
Inverted Index

An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents.

 \ No newline at end of file

Revision 22017-01-22 - JimSkon

Line: 1 to 1
 
META TOPICPARENT name="Spring2017"

Project 1

Shakespeare Lookup with Inverted Index
Line: 9 to 9
 

Methods

The file Shakespeare.txt is located on cslab.kenyon.edu in the directory /home/class/SoftDev/Shakespeare/ .You will need to write a program to read through ALL the words in the and create in inverted list of the locations of each of the words in a map structure. Then when someone searches for a word, the inde for this word can be looked up, and ALL the lines with the word can be printed.

Added:
>
>
Iterators

"The concept of an iterator is fundamental to understanding the C++ Standard Template Library (STL) because iterators provide a means for accessing data stored in container classes such a vector, map, list, etc."

Learn about itertors here: STL Iterators

 
MAP template

A c++ map is a template in the STL the allows the association of keys and values. There is a documentation on map here. Map allow you to look up any kind of data using any kind of key, and can be any size while being very fast (using a hash lookup strategy). It is like a vector , but one where the index can be any type (not just an integer). An example of mapping from letters to strings is below:

Line: 38 to 43
 } %ENDCODE%
Added:
>
>

Map demo to look up Name data

There is a good example of using map on the server in the program that allows the looking of up US Census data on how common names are in the US on the cslab server at: https://cslab.kenyon.edu/class/softdev/skon/name_stats_ajax.html.

The files for this project can be found here on the cslab server: /home/class/SoftDev/namedata.

The source code for a terminal version of the source code of the names data program is here: namesdemo.cpp

The program reads in lines of the US Census name data, as seens below:

MARY           2.629  2.629      1
PATRICIA       1.073  3.702      2
LINDA          1.035  4.736      3
BARBARA        0.980  5.716      4
ELIZABETH      0.937  6.653      5
JENNIFER       0.932  7.586      6
MARIA          0.828  8.414      7
SUSAN          0.794  9.209      8
MARGARET       0.768  9.976      9
DOROTHY        0.727 10.703     10
...
 
Changed:
<
<
There is also a good example of using map on the server in the program that allows the looking of up US Census data on how common names are in the US on the cslab server at: /home/class/SoftDev/namedata.
>
>
The following code reads a line of US Census data, and puts it in the map:
<-- SyntaxHighlightingPlugin -->
infile >> namedata.name;
            infile >> namedata.percent;
            infile >> namedata.cumulative;
            infile >> namedata.rank;
            if (infile.fail()) break;
            name_map[namedata.name] = namedata;
<-- end SyntaxHighlightingPlugin -->

The following code looks up the closest match in the Map, and points to it with an iterator:

<-- SyntaxHighlightingPlugin -->
it = lname_map.lower_bound(name);
<-- end SyntaxHighlightingPlugin -->

The following code then iterates through the closest 10 matches (5 before, 5 after).

<-- SyntaxHighlightingPlugin -->
// back up 5 places
            for (int i = 0; i < 5 && (it != curMap->begin()); i++) {
                it--;
            }

            // Get 10 results
            for (int i = 0; i < 10 && (it != curMap->end()); i++) {
                result = (*it).second;
                name = result.name;
                percent = result.percent;
                rank = result.rank;
                cout << name << "\t" << percent << "\t" << rank << endl;
                ;
                it++;

            }
<-- end SyntaxHighlightingPlugin -->
 
Inverted list

Revision 12017-01-20 - JimSkon

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="Spring2017"

Project 1

Shakespeare Lookup with Inverted Index
Due: Jan 30 - 11:55 pm

Goal

The goal of this program is to write a program to build an inverted index (using the c++ map STL) of every word in the complete works of Shakespeare. This will allow words to be searched for, and then to print out all of the lines that have that word.

Methods

The file Shakespeare.txt is located on cslab.kenyon.edu in the directory /home/class/SoftDev/Shakespeare/ .You will need to write a program to read through ALL the words in the and create in inverted list of the locations of each of the words in a map structure. Then when someone searches for a word, the inde for this word can be looked up, and ALL the lines with the word can be printed.

MAP template

A c++ map is a template in the STL the allows the association of keys and values. There is a documentation on map here. Map allow you to look up any kind of data using any kind of key, and can be any size while being very fast (using a hash lookup strategy). It is like a vector , but one where the index can be any type (not just an integer). An example of mapping from letters to strings is below:

<-- SyntaxHighlightingPlugin -->
// accessing mapped values
#include <iostream>
#include <map>
#include <string>

int main ()
{
  std::map<char,std::string> mymap;

  mymap['a']="an element";
  mymap['b']="another element";
  mymap['c']=mymap['b'];

  std::cout << "mymap['a'] is " << mymap['a'] << '\n';
  std::cout << "mymap['b'] is " << mymap['b'] << '\n';
  std::cout << "mymap['c'] is " << mymap['c'] << '\n';
  std::cout << "mymap['d'] is " << mymap['d'] << '\n';

  std::cout << "mymap now contains " << mymap.size() << " elements.\n";

  return 0;
}
<-- end SyntaxHighlightingPlugin -->

There is also a good example of using map on the server in the program that allows the looking of up US Census data on how common names are in the US on the cslab server at: /home/class/SoftDev/namedata.

Inverted list
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2019 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback