Archive for the ‘Research’ Category

NAACL-HLT 2013

Sunday, July 14th, 2013

I attended NAACL-HLT 2013 (the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies), which took place last month in Atlanta, GA, and presented a poster at the conference's workshop on Events: Definition, Detection, Coreference, and Representation. This was the first time I had ever participated in an international conference. Listening to interesting talks and speaking with researchers there was a refreshing experience that helped me cultivate my research skills. I am grateful to the conference committee for organizing this wonderful event and putting everything together.

Here are a couple of pictures of the conference:


A welcome board for the conference


The Westin Peachtree Plaza hotel where the conference was held

English tokenizer

Saturday, July 17th, 2010

When I build statistical language models (e.g., bigrams and trigrams) trained on a particular corpus or set of documents, my first step is usually to look at some statistical properties of the set, such as the total number of tokens and the average number of tokens per sentence or utterance. Any set of documents, even one consisting of a tremendous number of newspaper articles, is statistically biased in some way, mainly because the data is collected from a particular domain or domains.

Suppose that I need a list of unique words from the English sentences in a text file sentences.txt. My initial step is often a crude shell command like this:

$ cat sentences.txt | tr ' ' '\n' | sed '/^$/ d' | sort | uniq > unique_words.txt

The list I obtain with this command is neither tokenized nor lemmatized, but it is often sufficient for a quick analysis to get a handle on roughly how many unique words occur in the target documents.
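
If I also want rough frequency counts at this early stage, a small variation of the same pipeline works; word_counts.txt below is just an arbitrary output name, and since the input is still untokenized, the counts are only approximate:

$ cat sentences.txt | tr ' ' '\n' | sed '/^$/ d' | sort | uniq -c | sort -rn > word_counts.txt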

If I need a more complicated analysis, such as extracting a list of unique words together with their frequencies, the next step is likely to involve tokenization. A range of tokenization algorithms have been proposed, each tailored to a particular natural language. For English, it seems the simplest approach is to build a tokenizer with regular expressions, as described in Jurafsky and Martin (2000). For future reference, I attach my Java code for English tokenization below.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

/**
 * Tokenizes strings written in English.
 * 
 * @author Jun Araki
 */
public class EnglishTokenizer {
    /** A string to be tokenized. */
    private String str;

    /** Tokens. */
    private ArrayList<String> tokenList;

    /** A regular expression for letters and numbers. */
    private static final String regexLetterNumber = "[a-zA-Z0-9]";

    /** A regular expression for non-letters and non-numbers. */
    private static final String regexNotLetterNumber = "[^a-zA-Z0-9]";

    /** A regular expression for separators. */
    private static final String regexSeparator = "[\\?!()\";/\\|`]";

    /** A regular expression for clitics. */
    private static final String regexClitics =
        "'|:|-|'S|'D|'M|'LL|'RE|'VE|N'T|'s|'d|'m|'ll|'re|'ve|n't";

    /** Abbreviations. */
    private static final List<String> abbrList =
        Arrays.asList("Co.", "Corp.", "vs.", "e.g.", "etc.", "ex.", "cf.",
            "eg.", "Jan.", "Feb.", "Mar.", "Apr.", "Jun.", "Jul.", "Aug.",
            "Sept.", "Oct.", "Nov.", "Dec.", "jan.", "feb.", "mar.",
            "apr.", "jun.", "jul.", "aug.", "sept.", "oct.", "nov.",
            "dec.", "ed.", "eds.", "repr.", "trans.", "vol.", "vols.",
            "rev.", "est.", "b.", "m.", "bur.", "d.", "r.", "M.", "Dept.",
            "MM.", "U.", "Mr.", "Jr.", "Ms.", "Mme.", "Mrs.", "Dr.",
            "Ph.D.");

    /**
     * Constructs a tokenizer for the given string and an empty token list.
     * 
     * @param  str  a string to be tokenized
     */
    public EnglishTokenizer(String str) {
        this.str = str;
        tokenList = new ArrayList<String>();
    }

    /**
     * Tokenizes a string using the algorithms by Grefenstette (1999) and
     * Palmer (2000).
     */
    public void tokenize() {
        // Changes tabs into spaces.
        str = str.replaceAll("\\t", " ");

        // Puts blanks around unambiguous separators.
        str = str.replaceAll("(" + regexSeparator + ")", " $1 ");

        // Puts blanks around commas that are not inside numbers.
        str = str.replaceAll("([^0-9]),", "$1 , ");
        str = str.replaceAll(",([^0-9])", " , $1");

        // Distinguishes single quotes from apostrophes by segmenting off
        // single quotes not preceded by letters.
        str = str.replaceAll("^(')", "$1 ");
        str = str.replaceAll("(" + regexNotLetterNumber + ")'", "$1 '");

        // Segments off unambiguous word-final clitics and punctuation.
        str = str.replaceAll("(" + regexClitics + ")$", " $1");
        str = str.replaceAll(
                "(" + regexClitics + ")(" + regexNotLetterNumber + ")",
                " $1 $2");

        // Deals with periods.
        String[] words = str.trim().split("\\s+");
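        // p1 matches a word ending with a letter or digit followed by a
        // period; p2 matches likely abbreviations, i.e., sequences of
        // single letters each followed by a period (e.g., "U.S.") or a
        // capital letter followed by consonants and a period (e.g., "St.").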
        Pattern p1 = Pattern.compile(".*" + regexLetterNumber + "\\.");
        Pattern p2 = Pattern.compile(
            "^([A-Za-z]\\.([A-Za-z]\\.)+|[A-Z][bcdfghj-nptvxz]+\\.)$");
        for (String word : words) {
            Matcher m1 = p1.matcher(word);
            Matcher m2 = p2.matcher(word);
            if (m1.matches() && !abbrList.contains(word) && !m2.matches()) {
                // Segments off the period.
                tokenList.add(word.substring(0, word.length() - 1));
                tokenList.add(word.substring(word.length() - 1));
            } else {
                tokenList.add(word);
            }
        }
    }

    /**
     * Returns the tokens as an array of strings.
     * 
     * @return  an array of tokens
     */
    public String[] getTokens() {
        String[] tokens = new String[tokenList.size()];
        tokenList.toArray(tokens);
        return tokens;
    }
}
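
As a quick illustration, below is a minimal, hypothetical demo (the EnglishTokenizerDemo class and its sample sentence are mine, not part of the tokenizer itself) that tokenizes a string and then counts token frequencies, the unique-words-with-frequencies analysis mentioned earlier:

import java.util.HashMap;
import java.util.Map;

public class EnglishTokenizerDemo {
    public static void main(String[] args) {
        EnglishTokenizer tokenizer = new EnglishTokenizer(
            "Mr. Smith isn't selling the house, e.g., to his neighbors.");
        tokenizer.tokenize();

        // Counts the frequency of each token.
        Map<String, Integer> frequencies = new HashMap<String, Integer>();
        for (String token : tokenizer.getTokens()) {
            Integer count = frequencies.get(token);
            frequencies.put(token, (count == null) ? 1 : count + 1);
        }

        // Prints each unique token with its frequency.
        for (Map.Entry<String, Integer> entry : frequencies.entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}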

References:

Daniel Jurafsky and James H. Martin. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition (Prentice Hall Series in Artificial Intelligence). Prentice Hall.

Gregory Grefenstette. 1999. Tokenization. In van Halteren, H. (Ed.), Syntactic Wordclass Tagging. Kluwer.

David D. Palmer. 2000. Tokenisation and sentence segmentation. In Dale, R., Moisl, H., and Somers, H. L. (Eds.), Handbook of Natural Language Processing. Marcel Dekker.

Research methods in computer science

Sunday, December 27th, 2009

During this winter break, I have been reading two books on doing research and publishing papers, in addition to some textbooks for my classes next quarter. One is The Craft of Research (3rd Edition) by Wayne C. Booth, Gregory G. Colomb, and Joseph M. Williams, and the other is How to Write and Publish a Scientific Paper (6th Edition) by Robert A. Day and Barbara Gastel. Though I have published a few papers so far, these books benefit me by renewing my appreciation of proper ways of doing research.

While reading these books, I came to think about research methods, particularly in my own field, computer science. Probably the only way to truly acquire such methods is to go through several years of actual research, but it is still good to know methodologies that have been systematized to some extent. I found useful information on Dr. Vasant Honavar’s website: Graduate Research, Writing, and Careers in Computer Science. This page contains a wealth of helpful links on various topics for computer science graduate students like me. I would appreciate it if you would leave comments about other resources on these topics.

Academic Earth

Sunday, March 29th, 2009

I found a website that aggregates lecture videos in a variety of fields from top universities in the United States: Academic Earth. It currently has more than 1,500 videos on 17 subjects. For a few years I have known of websites that deliver lecture videos in specific fields, or podcasts of lectures at particular universities, but the kind of aggregation Academic Earth offers is new to me, and I am interested in its ratings of courses, lectures, and instructors across prestigious universities such as MIT and Stanford. I appreciate their mission statement advocating “the goal of giving everyone on earth access to a world-class education.”

Useful links on writing papers

Thursday, January 29th, 2009

In my last post, I described a book about writing in English. In connection with that, I have searched for websites that seem good for learning how to write a paper. Although Google returns millions of links for keywords like “how to write a paper,” I believe the websites of some university professors and researchers are truly useful, as they are filled with valuable advice based on many years of experience.

Here are some links:

Please let me know of other helpful websites that you can recommend.