English tokenizer

July 17th, 2010

When I build statistical language models (e.g., bigram and trigram models) trained on a particular corpus or set of documents, I first like to look at some statistical properties of the set, such as the total number of tokens and the average number of tokens per sentence or per utterance. Any set of documents, even one consisting of a tremendous number of newspaper articles, is statistically biased in some way, mainly because people collect the data from a particular domain or domains.

Suppose that I need a list of unique words from English sentences stored in a text file, sentences.txt. My initial step is often to use a crude shell command like this:


$ cat sentences.txt | tr ' ' '\n' | sed '/^$/ d' | sort | uniq > unique_words.txt

The list I obtain with this command is neither tokenized nor lemmatized, but it could be sufficient for a quick analysis where I try to get a handle on approximately how many unique words occur in the target document.
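
The same kind of quick look can be sketched in Python 3. This is only a rough illustration, not part of my original workflow: the function name corpus_stats is hypothetical, and it assumes that the input has one sentence per line and that words are separated by whitespace, so it is just as crude as the shell command.

```python
def corpus_stats(lines):
    """Crude statistics over whitespace-separated words.

    Assumes one sentence per line (an assumption of this sketch);
    blank lines are ignored.
    """
    # str.split() with no argument splits on any run of whitespace
    # and never yields empty strings.
    sentences = [line.split() for line in lines if line.strip()]
    total = sum(len(words) for words in sentences)
    vocabulary = {word for words in sentences for word in words}
    return {
        "sentences": len(sentences),
        "tokens": total,
        "unique_words": len(vocabulary),
        "tokens_per_sentence": total / len(sentences) if sentences else 0.0,
    }


sample = ["the cat saw the dog .", "", "the end"]
print(corpus_stats(sample))
```

Passing an open file object instead of the sample list should work as well, since iterating over a text file yields its lines.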

If I need a more complicated analysis, such as extracting a list of unique words with their frequencies, the next step is likely to involve tokenization. A range of tokenization algorithms have been proposed, each suited to a particular natural language. For English, the simplest way seems to be to build a tokenizer with regular expressions, as mentioned in Jurafsky and Martin (2000). For future reference, I am attaching my Java code for English tokenization to this post.


import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

/**
 * Tokenizes English text.
 *
 * @author Jun Araki
 */
public class EnglishTokenizer {
    /** A string to be tokenized. */
    private String str;

    /** Tokens. */
    private ArrayList<String> tokenList;

    /** A regular expression for letters and numbers. */
    private static final String regexLetterNumber = "[a-zA-Z0-9]";

    /** A regular expression for non-letters and non-numbers. */
    private static final String regexNotLetterNumber = "[^a-zA-Z0-9]";

    /** A regular expression for separators. */
    private static final String regexSeparator = "[\\?!()\";/\\|`]";

    /** A regular expression for clitics and word-internal punctuation. */
    private static final String regexClitics =
        "'|:|-|'S|'D|'M|'LL|'RE|'VE|N'T|'s|'d|'m|'ll|'re|'ve|n't";

    /** Abbreviations. */
    private static final List<String> abbrList =
        Arrays.asList("Co.", "Corp.", "vs.", "e.g.", "etc.", "ex.", "cf.",
            "eg.", "Jan.", "Feb.", "Mar.", "Apr.", "Jun.", "Jul.", "Aug.",
            "Sept.", "Oct.", "Nov.", "Dec.", "jan.", "feb.", "mar.",
            "apr.", "jun.", "jul.", "aug.", "sept.", "oct.", "nov.",
            "dec.", "ed.", "eds.", "repr.", "trans.", "vol.", "vols.",
            "rev.", "est.", "b.", "m.", "bur.", "d.", "r.", "M.", "Dept.",
            "MM.", "U.", "Mr.", "Jr.", "Ms.", "Mme.", "Mrs.", "Dr.",
            "Ph.D.");

    /**
     * Constructs a string to be tokenized and an empty list for tokens.
     *
     * @param  str  a string to be tokenized
     */
    public EnglishTokenizer(String str) {
        this.str = str;
        tokenList = new ArrayList<String>();
    }

    /**
     * Tokenizes a string using the algorithms by Grefenstette (1999) and
     * Palmer (2000).
     */
    public void tokenize() {
        // Changes tabs into spaces.
        str = str.replaceAll("\\t", " ");

        // Put blanks around unambiguous separators
        str = str.replaceAll("(" + regexSeparator + ")", " $1 ");

        // Put blanks around commas that are not inside numbers
        str = str.replaceAll("([^0-9]),", "$1 , ");
        str = str.replaceAll(",([^0-9])", " , $1");

        // Distinguishes single quotes from apostrophes by segmenting off
        // single quotes not preceded by letters
        str = str.replaceAll("^(')", "$1 ");
        str = str.replaceAll("(" + regexNotLetterNumber + ")'", "$1 '");

        // Segments off unambiguous word-final clitics and punctuations
        str = str.replaceAll("(" + regexClitics + ")$", " $1");
        str = str.replaceAll(
                "(" + regexClitics + ")(" + regexNotLetterNumber + ")",
                " $1 $2");

        // Deals with periods.
        String[] words = str.trim().split("\\s+");
        Pattern p1 = Pattern.compile(".*" + regexLetterNumber + "\\.");
        Pattern p2 = Pattern.compile(
            "^([A-Za-z]\\.([A-Za-z]\\.)+|[A-Z][bcdfghj-nptvxz]+\\.)$");
        for (String word : words) {
            Matcher m1 = p1.matcher(word);
            Matcher m2 = p2.matcher(word);
            if (m1.matches() && !abbrList.contains(word) && !m2.matches()) {
                // Segments off the period.
                tokenList.add(word.substring(0, word.length() - 1));
                tokenList.add(word.substring(word.length() - 1));
            } else {
                tokenList.add(word);
            }
        }
    }

    /**
     * Returns tokenized strings.
     *
     * @return  a list of tokenized strings
     */
    public String[] getTokens() {
        String[] tokens = new String[tokenList.size()];
        tokenList.toArray(tokens);
        return tokens;
    }
}
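
For the word-frequency analysis mentioned above, a much cruder regular-expression tokenizer combined with a counter may already be enough. The following Python 3 sketch is not a port of the Java class: the names TOKEN_PATTERN, tokenize and word_frequencies are hypothetical, it keeps contractions such as "it's" as single tokens instead of splitting off clitics, and it does not handle abbreviations or periods specially.

```python
import re
from collections import Counter

# Three alternatives, tried in order:
#   1. digit groups joined by commas, e.g. 3,000
#   2. a run of ASCII letters/digits with an optional clitic, e.g. isn't
#   3. any other single non-space character, i.e. punctuation
TOKEN_PATTERN = re.compile(
    r"[0-9]+(?:,[0-9]+)+|[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]"
)


def tokenize(text):
    """Split text into word, number and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def word_frequencies(text):
    """Count lowercased tokens."""
    return Counter(token.lower() for token in tokenize(text))


text = "It's 3,000 words, isn't it? It's fine."
print(tokenize(text))
for token, count in word_frequencies(text).most_common(3):
    print(token, count)
```

Lowercasing before counting merges "It's" and "it's", which is usually what I want for a vocabulary list, though it also loses the distinction between proper nouns and common words.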

References:

Daniel Jurafsky and James H. Martin. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition (Prentice Hall Series in Artificial Intelligence). Prentice Hall.

Gregory Grefenstette. 1999. Tokenization. In van Halteren, H. (Ed.), Syntactic Wordclass Tagging. Kluwer.

David D. Palmer. 2000. Tokenisation and sentence segmentation. In Dale, R., Moisl, H., and Somers, H. L. (Eds.), Handbook of Natural Language Processing. Marcel Dekker.

Japan Day

May 15th, 2010

Japan Day was an event held at the Bechtel International Center at Stanford University on Saturday, May 8, by the Stanford Japanese Association (SJA) and Stanford University Nikkei (SUN). I was interested in what was going on at the event but had little time to enjoy it that day, so I just dropped by after lunch and walked around to look at some demonstrations.

There were several corners presenting Japanese culture. Of them, the tea ceremony seemed to be the most popular. Two Japanese women wearing traditional garments, kimonos, demonstrated how to make and drink Japanese tea in the formal way. In addition, some Japanese people taught visitors Japanese calligraphy, shodo, and Japanese paper folding, origami. After being there for only about ten minutes, I was struck by a sense of nostalgia.


Two Japanese women demonstrating how to make and have Japanese tea.


Some ornaments for the Boys’ Festival in Japan.

Running in the U.S.

March 23rd, 2010

We finished final exams for the winter quarter last week, and now we have a one-week spring break. Yesterday I started running again. I ran around campus for just a while and felt great afterward. A few months ago, my father kindly sent me the sportswear and the iPod shuffle armband that I had used when running in Tokyo, so I can enjoy running here at Stanford just as I did there.

As I wrote in a previous post, some people around Stanford University are very active. I always see several people enjoying their exercise on campus. Some of them are such avid runners that they push a baby carriage with their baby or babies while running, though I think this is a little dangerous.

Once the spring quarter begins, I will probably be very busy again, but I’d like to enjoy exercising as much as possible. Incidentally, I am somewhat interested in the general relationship between physical activity and the brain. A suggestive (but informal) article on this topic is on PhDs.org: PE for grad students.

Reading a text file with Java

February 20th, 2010

Recently I have been using Java regularly in some classes and in my research. In particular, I often write similar code to read a text file for text processing, so I am attaching this trivial code to the post for future reference. I confirmed that it works with Java 1.6.0_16.


import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileNotFoundException;
import java.io.IOException;

class SomeClass {
    public static void readFromFile(String filename) {
        BufferedReader fin = null;

        try {
            fin = new BufferedReader(new FileReader(filename));
            String line = null;
            while ((line = fin.readLine()) != null) {
                // Do something
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (fin != null) fin.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String args[]) {
        readFromFile("sample.txt");
    }
}

Incidentally, I used BufferedReader rather than BufferedInputStream simply because BufferedReader, which reads characters and whole lines rather than raw bytes, is more suitable for the text processing that I need to do right now.

CSV to HTML converter

February 13th, 2010

I spent some time writing a small script in Python 2.6.2 to help with a trivial task: converting CSV to HTML (more precisely, a CSV file to an HTML table). The input format is whatever Python’s csv module accepts. The code is pretty straightforward:


#!/usr/bin/python

# csv2html.py
# CSV to HTML Converter

import csv
import sys

table_indent_num = 2
tr_indent_num = 4
td_indent_num = 6
white_space = " "

def main():
    csv_reader = csv.reader(open(sys.argv[1]))
    table_indent = white_space * table_indent_num
    tr_indent = white_space * tr_indent_num
    td_indent = white_space * td_indent_num

    print table_indent + "<table>"
    for i, row in enumerate(csv_reader):
        print tr_indent + "<tr>"

        # Comment out the following two lines if you don't
        # want a column for indexes
        if i == 0: print td_indent + "<th>#</th>"
        else: print td_indent + "<td>" + str(i) + "</td>"

        # Assume that the first line is a header
        for column in row:
            if i == 0: print td_indent + "<th>" + column + "</th>"
            else: print td_indent + "<td>" + column + "</td>"

        print tr_indent + "</tr>"

    print table_indent + "</table>"

if __name__ == "__main__":
    argn = len(sys.argv)
    if argn != 2:
        print "Usage: python csv2html.py <CSV file>"
        sys.exit(1)

    main()

An example of usage is as follows:


$ more sample.csv
title1,title2,title3
"test11",test12,test13
test21,"test,22",test23
$ python csv2html.py sample.csv > sample.html
$ more sample.html
  <table>
    <tr>
      <th>#</th>
      <th>title1</th>
      <th>title2</th>
      <th>title3</th>
    </tr>
    <tr>
      <td>1</td>
      <td>test11</td>
      <td>test12</td>
      <td>test13</td>
    </tr>
    <tr>
      <td>2</td>
      <td>test21</td>
      <td>test,22</td>
      <td>test23</td>
    </tr>
  </table>
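
One limitation of the script above is that cell contents are written into the HTML as they are, so a cell containing a character such as < or & would produce broken markup. A possible variation in Python 3 is sketched below. It is not the script I actually used: the function name csv_to_html and its indent parameter are hypothetical, it takes the CSV as a string instead of a file name, and it omits the index column to stay short.

```python
import csv
import html
import io


def csv_to_html(csv_text, indent=2):
    """Convert CSV text to an HTML table; the first row is the header."""
    pad = " " * indent
    lines = [pad + "<table>"]
    for i, row in enumerate(csv.reader(io.StringIO(csv_text))):
        tag = "th" if i == 0 else "td"
        lines.append(pad * 2 + "<tr>")
        for cell in row:
            # html.escape replaces &, < and > (and quotes) with entities
            # so that cell text cannot be read as markup.
            lines.append(
                pad * 3 + "<{0}>{1}</{0}>".format(tag, html.escape(cell))
            )
        lines.append(pad * 2 + "</tr>")
    lines.append(pad + "</table>")
    return "\n".join(lines)


print(csv_to_html('title1,title2\n"a<b","test,22"\n'))
```

With this input, the quoted comma should survive inside a single cell, and the < should come out as an entity.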

Research methods in computer science

December 27th, 2009

During this winter break, I have been reading two books on doing research and publishing papers, in addition to some textbooks for my classes next quarter. One book is The Craft of Research (3rd Edition) by Wayne C. Booth, Gregory G. Colomb, and Joseph M. Williams, and the other is How to Write and Publish a Scientific Paper (6th Edition) by Robert A. Day and Barbara Gastel. Though I have published a few papers so far, these books are helping me regain an appreciation of sound research practice.

While reading these books, I came to think about research methods in my own field, computer science. Probably the only way to truly acquire the methods is to go through several years of actual research in computer science, but it is good to know some methodologies that are systematized to some extent. I found useful information on Dr. Vasant Honavar’s website: Graduate Research, Writing, and Careers in Computer Science. This web page contains a whole bunch of helpful links on various topics for computer science graduate students like me. I would appreciate it if you would leave comments pointing to other information on these topics.

My first Thanksgiving Day

November 27th, 2009

Today was my first Thanksgiving Day. The Stanford Graduate Student Council provided a free Thanksgiving dinner, and I was happy to join the event. Of course, this was the first time I had eaten traditional Thanksgiving dishes such as turkey and pumpkin pie, but I liked them. A number of first-year international students seemed to have joined the event, just like me. I got to know some students around me and enjoyed a little chat with them. Since my last two months had been really hectic, this event was a good chance to relax.

My classes and homework

October 24th, 2009

My classes started late last month. It has been approximately seven years since I last took regular classes, at the University of Tokyo. There is a substantial difference in the amount of homework between the two universities, although I knew this before my classes began. The assignments are sometimes really hard, but they also offer valuable intellectual excitement.

Japanese cuisine in California

September 10th, 2009

The other day I finally began to miss Japanese cuisine because I hadn’t eaten any Japanese food since I entered the United States on July 26. So when I fortunately got some help getting to Japantown in San Jose last Saturday, I was happy to buy some Japanese food and enjoy tofu cuisine at a restaurant there.

On the following day, I cooked rice and Pacific saury myself. Pacific saury, called “sanma” in Japan, is a typical Japanese autumn fish. I was really astonished at the taste of the rice because it was exactly the same as rice in Japan. This is in part thanks to my Zojirushi rice cooker, which I bought online after arriving here, but it probably owes a great deal to improvements in Californian rice varieties. My roommate and his friends were also pleased with my cooking. Here is a picture of the meal:

I know that once my classes start on September 21, I will be so busy that I may not have much time to cook my own food. However, I would like to keep doing so as far as possible, because I basically like eating at home and believe that Japanese food is the secret to the longevity of Japanese people.

My life in Palo Alto

September 3rd, 2009

I have lived in graduate housing on campus at Stanford University since the end of last month and have enjoyed organizing my life here little by little. Usually I get up in the morning, study English and computer science, cook some simple dishes, and sometimes go to stores around campus by bike. Of course, I have also been doing a range of other things to settle in, such as taking care of administrative matters for the university and preparing for my studies. Last night I enjoyed chatting with my family in Tokyo on Skype, and I was amazed at the technology, which offered high speech quality and very little delay between California and Tokyo.

I recently noticed that some people around Stanford University are very active. While riding my bike on and off campus, I always see several people running or riding their bikes just for exercise. Stanford is teeming with natural treasures such as a lake, trees, birds, and even squirrels, and has a pedestrian- and bike-friendly campus on top of that, so all those people probably find pleasure in their daily exercise. When I have more free time someday, I would like to enjoy exercise just like them.