NAACL-HLT 2013

July 14th, 2013

I attended NAACL-HLT 2013 (the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies), which took place last month in Atlanta, GA. I presented a poster at the conference's workshop on Events: Definition, Detection, Coreference, and Representation. This was the first time I had ever participated in an international conference. Listening to interesting talks and speaking with researchers at such a conference was a fresh experience that helped me cultivate my research skills. I am grateful to the conference committee for organizing this wonderful event and putting everything together.

Here are a couple of pictures of the conference:


A welcome board for the conference


The Westin Peachtree Plaza hotel where the conference was held

Java Code for Feature Selection

July 6th, 2012

Feature selection is the technique of selecting a subset of relevant features for building robust learning models, as described in a Wikipedia article. We basically need to go through this process whenever we apply a machine learning model to a task such as classification or regression. There are various approaches to feature selection in general, but mutual information, chi-square, and information gain are common metrics for measuring how effectively an individual feature characterizes a particular class in natural language processing tasks such as text classification.

Just for future reference, I will put my Java code for calculating those scores below.

[2013/11/13 Refactored code and fixed an error in information gain]

/**
 * This class counts the number of feature occurrences given a particular class.
 * Specifically, it assumes the following confusion matrix for feature selection.
 *
 *                           Feature occurrence
 *                           yes  no
 * Gold standard class  yes  tp   fn
 *                      no   fp   tn
 *
 * In this matrix, tp, fn, fp, and tn stand for the number of true positive,
 * false negative, false positive, and true negative, respectively.  The
 * variable n is the sum of tp, fn, fp, and tn.
 *
 * @author Jun Araki
 */
public class FeatureOccurrenceCounter {

  protected double tp;

  protected double fn;

  protected double fp;

  protected double tn;

  protected double n;

  /**
   * Constructor.
   */
  public FeatureOccurrenceCounter() {
    tp = 0.0;
    fn = 0.0;
    fp = 0.0;
    tn = 0.0;
  }

  /**
   * Constructor with respective counts.
   *
   * @param tp
   * @param fn
   * @param fp
   * @param tn
   */
  public FeatureOccurrenceCounter(double tp, double fn, double fp, double tn) {
    this.tp = tp;
    this.fn = fn;
    this.fp = fp;
    this.tn = tn;
  }

  public void calculateSum() {
    n = tp + fn + fp + tn;
  }

  public void incrementTruePositive() {
    incrementTruePositive(1);
  }

  public void incrementTruePositive(double delta) {
    tp += delta;
  }

  public void incrementFalseNegative() {
    incrementFalseNegative(1);
  }

  public void incrementFalseNegative(double delta) {
    fn += delta;
  }

  public void incrementFalsePositive() {
    incrementFalsePositive(1);
  }

  public void incrementFalsePositive(double delta) {
    fp += delta;
  }

  public void incrementTrueNegative() {
    incrementTrueNegative(1);
  }

  public void incrementTrueNegative(double delta) {
    tn += delta;
  }

  public double getTruePositive() {
    return tp;
  }

  public void setTruePositive(double tp) {
    this.tp = tp;
  }

  public double getFalseNegative() {
    return fn;
  }

  public void setFalseNegative(double fn) {
    this.fn = fn;
  }

  public double getFalsePositive() {
    return fp;
  }

  public void setFalsePositive(double fp) {
    this.fp = fp;
  }

  public double getTrueNegative() {
    return tn;
  }

  public void setTrueNegative(double tn) {
    this.tn = tn;
  }

  public double getGoldStandardPositives() {
    return (tp + fn);
  }

  public double getGoldStandardNegatives() {
    return (fp + tn);
  }

  public double getPredictedPositives() {
    return (tp + fp);
  }

  public double getPredictedNegatives() {
    return (fn + tn);
  }

  public double getAll() {
    calculateSum();
    return n;
  }

}
/**
 * This class gives the following popular metrics for feature selection.
 * <ul>
 *   <li>Mutual Information (MI)
 *   <li>Chi-square
 *   <li>Information Gain (IG)
 * </ul>
 *
 * In order to calculate the scores above, it needs to first count the number
 * of feature occurrences.  Specifically, it assumes the following confusion
 * matrix for feature selection.
 *
 *                           Feature occurrence
 *                           yes  no
 * Gold standard class  yes  tp   fn
 *                      no   fp   tn
 *
 * In this matrix, tp, fn, fp, and tn stand for the number of true positive,
 * false negative, false positive, and true negative, respectively.  The
 * variable n is the sum of tp, fn, fp, and tn.  For more information on
 * feature selection, see:
 *
 * Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008.
 * Introduction to Information Retrieval. Cambridge University Press.
 *
 * George Forman. 2003. An Extensive Empirical Study of Feature Selection
 * Metrics for Text Classification. Journal of Machine Learning Research,
 * 3:1289-1305.
 *
 * @author Jun Araki
 */
public class FeatureSelectionMetrics extends FeatureOccurrenceCounter {

  /** Mutual information score */
  private Double mi;

  /** Chi-square score */
  private Double chiSquare;

  /** Information gain score */
  private Double ig;

  /**
   * Constructor.
   */
  public FeatureSelectionMetrics() {
    super();
  }

  /**
   * Constructor taking respective counts.
   *
   * @param tp
   * @param fn
   * @param fp
   * @param tn
   */
  public FeatureSelectionMetrics(double tp, double fn, double fp, double tn) {
    super(tp, fn, fp, tn);
  }

  /**
   * Calculates and returns the mutual information score.
   *
   * @return the mutual information score
   */
  public Double getMI() {
    calculateMutualInformation();
    return mi;
  }

  /**
   * Calculates and returns the chi-square score.
   *
   * @return the chi-square score
   */
  public Double getChiSquare() {
    calculateChiSquare();
    return chiSquare;
  }

  /**
   * Calculates and returns the information gain score.
   *
   * @return the information gain score
   */
  public Double getIG() {
    calculateInformationGain();
    return ig;
  }

  /**
   * Calculates mutual information given the counts from tp to tn.  For more
   * information, see (Manning et al., 2008).
   */
  private void calculateMutualInformation() {
    if (tp == 0 || fn == 0 || fp == 0 || tn == 0) {
      // Boundary cases
      mi = null;
      return;
    }

    calculateSum();
    double gPos = getGoldStandardPositives();
    double gNeg = getGoldStandardNegatives();
    double fPos = getPredictedPositives();
    double fNeg = getPredictedNegatives();

    mi = (tp / n) * log2((n * tp) / (gPos * fPos))
       + (fp / n) * log2((n * fp) / (gNeg * fPos))
       + (fn / n) * log2((n * fn) / (gPos * fNeg))
       + (tn / n) * log2((n * tn) / (gNeg * fNeg));
  }

  /**
   * Calculates the chi-square score given the counts from tp to tn.  To
   * assess the statistical significance of the score, you can refer to the
   * following relationship between p values and chi-square scores
   * (Manning et al., 2008).
   *
   * p value   chi-square
   * 0.1       2.71
   * 0.05      3.84
   * 0.01      6.63
   * 0.005     7.88
   * 0.001     10.83
   */
  private void calculateChiSquare() {
    if (tp + fp == 0 || tp + fn == 0 || fn + tn == 0 || fp + tn == 0) {
      // Boundary cases.
      chiSquare = null;
      return;
    }

    calculateSum();
    // An arithmetically simpler way of computing chi-square
    chiSquare = (n * Math.pow((tp * tn - fn * fp), 2))
                / ((tp + fp) * (tp + fn) * (fn + tn) * (fp + tn));
  }

  /**
   * Calculates the information gain score given the counts from tp to tn.
   * For more information, see (Forman, 2003).
   */
  private void calculateInformationGain() {
    if (tp == 0 || fn == 0 || fp == 0 || tn == 0) {
      // Boundary cases
      ig = null;
      return;
    }

    calculateSum();
    double gPos = getGoldStandardPositives();
    double gNeg = getGoldStandardNegatives();
    double fPos = getPredictedPositives();
    double fNeg = getPredictedNegatives();

    // Information gain = (entropy of the class)
    //                  - (entropy of the class conditioned on the feature)
    ig = - (gPos / n) * log2(gPos / n) - (gNeg / n) * log2(gNeg / n)
         + (tp / n) * log2(tp / fPos) + (fp / n) * log2(fp / fPos)
         + (fn / n) * log2(fn / fNeg) + (tn / n) * log2(tn / fNeg);
  }

  private double log2(double value) {
    return (Math.log(value) / Math.log(2));
  }

  /**
   * A simple tester with a couple of examples.
   *
   * @param args
   */
  public static void main(String[] args) {
    FeatureSelectionMetrics fsm1 = new FeatureSelectionMetrics(49, 141, 27652, 774106);
    Double mi1 = fsm1.getMI();
    Double chiSquare1 = fsm1.getChiSquare();
    Double ig1 = fsm1.getIG();

    FeatureSelectionMetrics fsm2 = new FeatureSelectionMetrics(0, 4, 0, 164);
    Double mi2 = fsm2.getMI();
    Double chiSquare2 = fsm2.getChiSquare();
    Double ig2 = fsm2.getIG();

    System.out.println("mi1: " + mi1);  // Should be approximately 0.0001105
    System.out.println("chiSquare1: " + chiSquare1);  // Should be approximately 284
    System.out.println("ig1: " + ig1);

    // The scores below should be undefined (null) due to boundary cases.
    System.out.println("mi2: " + mi2);
    System.out.println("chiSquare2: " + chiSquare2);
    System.out.println("ig2: " + ig2);
  }

}

My Life in Pittsburgh

March 10th, 2012

Today I will write a little about my life here in Pittsburgh, PA, with several pictures. Pittsburgh is the third place I have lived in the U.S. I just remembered that I previously wrote a post about my life in the first place (Berkeley, CA) and one about my life in the second place (Palo Alto, CA). I have been in Pittsburgh for approximately seven months, since I moved from California last August. This period has been really hectic as I got used to everything, but even so, I have found a couple of things that I like about Pittsburgh.

The first is that Pittsburgh has a good public transportation system, particularly the bus service provided by the Port Authority of Allegheny County. There are so many bus lines that I can use them for various purposes, including my commute to Carnegie Mellon University (CMU). Although Stanford University provides a shuttle service comparable to CMU's, I did not find the same convenience in the public bus service around Stanford.

The weather in Pittsburgh is also milder than I expected. It is colder in Pittsburgh than in California, of course, and we do get some snow in winter, but by and large this winter has been exceptionally warm. I will show you some pictures of Pittsburgh below.

These days I have been gearing up for the Ph.D. program at the Language Technologies Institute at CMU, which I will enroll in this August. While preparing for the program, I would like to discover more things to like and enjoy here in Pittsburgh for a change.


The Gates Hillman Complex (GHC) where I am currently working


The GHC building that you can see from the other side of the Randy Pausch Memorial Footbridge


The Cathedral of Learning at the University of Pittsburgh


The Heinz Memorial Chapel close to the Cathedral of Learning


A spacious and gentle slope on campus at Chatham University


A scene in the Pittsburgh downtown area

Libtree 0.1 released

June 20th, 2011

A tree is one of the most important and fundamental data structures in computer science. It is so fundamental that I have come across it quite frequently in my research and in various classes. As such, I have decided to release my own Java implementation of an abstract tree data structure, along with some associated operations that I think might be useful in different situations. You can find more information about the implementation and download the code on the following web page: http://junaraki.net/software/libtree/ . I made the initial release as simple as possible; the essential code involves just five Java files. I am planning to enhance the software, as well as its documentation, little by little.

The commencement at Stanford

June 16th, 2011

Last Sunday I attended the 120th Stanford Commencement at Stanford Stadium. This was the first time I had attended a commencement in the U.S. The stadium was filled with students' families and friends, and I felt great standing on the field. The commencement had a hopeful atmosphere, along with California's typically sunny weather.

On the same day, I also attended the Computer Science Department's diploma ceremony on the lawn of the Gates Building, where I received a diploma for a Master of Science in computer science. The moment I received it was the most memorable part of the day. I got a little tanned in the sunny weather, by the way.

Writing something into a text file with Java

October 27th, 2010

In relation to a previous post about code for reading a text file with Java, I will attach another (trivial) piece of code for writing something to a text file. This post is also just for future reference. I have confirmed that it works with Java 1.6.0_18.

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.FileNotFoundException;
import java.io.IOException;

class SomeClass {
    public static void writeFile(String filename) {
        BufferedWriter fout = null;

        try {
            fout = new BufferedWriter(new FileWriter(filename));

            // The original loop condition is elided here; a simple counted
            // loop stands in for it so that the example compiles.
            for (int i = 0; i < 3; i++) {
                // Write something.
                StringBuffer lineBuffer = new StringBuffer();
                lineBuffer.append("abc");
                lineBuffer.append("def");

                fout.write(lineBuffer.toString());
                fout.newLine();
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (fout != null) fout.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String args[]) {
        writeFile("sample.txt");
    }
}

Point Reyes National Seashore

September 19th, 2010

I took a one-day trip to Point Reyes National Seashore yesterday for a break, on the Saturday right before the autumn quarter begins. For some reason I wanted to see the Pacific Ocean; I occasionally feel like looking at some wide-open scenery for a change. Point Reyes National Seashore is one of the nature preserves that are not so far from Stanford University.

Unfortunately the weather wasn't good; it was cloudy and foggy. But I still enjoyed taking a walk around Point Reyes Station and the Lighthouse. Here are several pictures:


There are several shops in Point Reyes Station. A clerk at the bookstore in the middle of this picture was kind enough to give me a free map of the seashore and tell me the way to my destination, the Lighthouse.


This is Point Reyes Beach, which runs along the seashore. I wish it had been sunny.


You can come across wild animals like this young deer.


These are trees that I think were probably bent by the wind from the ocean.


The Lighthouse was shrouded in fog. On a sunny day, you should be able to see the curvature of the earth along the sea horizon.


I saw some water birds pursuing fish in a marshland on my way back home from the Lighthouse.

English tokenizer

July 17th, 2010

When I build statistical language models (e.g., bigram and trigram models) trained on a particular corpus or set of documents, my first step is usually to look at some statistical properties of the set, such as the total number of tokens, the average number of tokens per sentence or per utterance, and so on. Any set of documents, even one consisting of a tremendous number of newspaper articles, is statistically biased in some manner, mainly because people collect the data in a particular domain or domains.

Suppose that I need a list of unique words from English sentences stored in a text file sentences.txt. Then my initial step is often to use a crude shell command like this:

$ cat sentences.txt | tr ' ' '\n' | sed '/^$/ d' | sort | uniq > unique_words.txt

The list I obtain with this command is neither tokenized nor lemmatized, but it could be sufficient for a quick analysis where I try to get a handle on approximately how many unique words occur in the target document.
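
In the same spirit, a crude frequency count can be obtained even before tokenization by appending uniq -c; this is just a sketch using standard coreutils, and the output file name is illustrative:

$ cat sentences.txt | tr ' ' '\n' | sed '/^$/ d' | sort | uniq -c | sort -rn > word_frequencies.txt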

If I need a more careful analysis, like extracting a list of properly tokenized unique words with their frequencies, the next step is likely to involve tokenization. A range of tokenization algorithms have been proposed for different natural languages. As for English, it seems that the simplest way is to build a tokenizer with regular expressions, as mentioned in Jurafsky and Martin (2000). For future reference, I will attach my Java code for English tokenization to this post.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

/**
 * Tokenizes strings described in English.
 *
 * @author Jun Araki
 */
public class EnglishTokenizer {
    /** A string to be tokenized. */
    private String str;

    /** Tokens. */
    private ArrayList<String> tokenList;

    /** A regular expression for letters and numbers. */
    private static final String regexLetterNumber = "[a-zA-Z0-9]";

    /** A regular expression for non-letters and non-numbers. */
    private static final String regexNotLetterNumber = "[^a-zA-Z0-9]";

    /** A regular expression for separators. */
    private static final String regexSeparator = "[\\?!()\";/\\|`]";

    /** A regular expression for clitics. */
    private static final String regexClitics =
        "'|:|-|'S|'D|'M|'LL|'RE|'VE|N'T|'s|'d|'m|'ll|'re|'ve|n't";

    /** Abbreviations. */
    private static final List<String> abbrList =
        Arrays.asList("Co.", "Corp.", "vs.", "e.g.", "etc.", "ex.", "cf.",
            "eg.", "Jan.", "Feb.", "Mar.", "Apr.", "Jun.", "Jul.", "Aug.",
            "Sept.", "Oct.", "Nov.", "Dec.", "jan.", "feb.", "mar.",
            "apr.", "jun.", "jul.", "aug.", "sept.", "oct.", "nov.",
            "dec.", "ed.", "eds.", "repr.", "trans.", "vol.", "vols.",
            "rev.", "est.", "b.", "m.", "bur.", "d.", "r.", "M.", "Dept.",
            "MM.", "U.", "Mr.", "Jr.", "Ms.", "Mme.", "Mrs.", "Dr.",
            "Ph.D.");

    /**
     * Constructs a string to be tokenized and an empty list for tokens.
     *
     * @param  str  a string to be tokenized
     */
    public EnglishTokenizer(String str) {
        this.str = str;
        tokenList = new ArrayList<String>();
    }

    /**
     * Tokenizes a string using the algorithms by Grefenstette (1999) and
     * Palmer (2000).
     */
    public void tokenize() {
        // Changes tabs into spaces.
        str = str.replaceAll("\\t", " ");

        // Puts blanks around unambiguous separators.
        str = str.replaceAll("(" + regexSeparator + ")", " $1 ");

        // Puts blanks around commas that are not inside numbers.
        str = str.replaceAll("([^0-9]),", "$1 , ");
        str = str.replaceAll(",([^0-9])", " , $1");

        // Distinguishes single quotes from apostrophes by segmenting off
        // single quotes not preceded by letters.
        str = str.replaceAll("^(')", "$1 ");
        str = str.replaceAll("(" + regexNotLetterNumber + ")'", "$1 '");

        // Segments off unambiguous word-final clitics and punctuation.
        str = str.replaceAll("(" + regexClitics + ")$", " $1");
        str = str.replaceAll(
                "(" + regexClitics + ")(" + regexNotLetterNumber + ")",
                " $1 $2");

        // Deals with periods.
        String[] words = str.trim().split("\\s+");
        Pattern p1 = Pattern.compile(".*" + regexLetterNumber + "\\.");
        Pattern p2 = Pattern.compile(
            "^([A-Za-z]\\.([A-Za-z]\\.)+|[A-Z][bcdfghj-nptvxz]+\\.)$");
        for (String word : words) {
            Matcher m1 = p1.matcher(word);
            Matcher m2 = p2.matcher(word);
            if (m1.matches() && !abbrList.contains(word) && !m2.matches()) {
                // Segments off the period.
                tokenList.add(word.substring(0, word.length() - 1));
                tokenList.add(word.substring(word.length() - 1));
            } else {
                tokenList.add(word);
            }
        }
    }

    /**
     * Returns tokenized strings.
     *
     * @return  a list of tokenized strings
     */
    public String[] getTokens() {
        String[] tokens = new String[tokenList.size()];
        tokenList.toArray(tokens);
        return tokens;
    }
}
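
For completeness, here is a minimal usage sketch. The class above ships without a main method, so the demo class name and the sample sentence below are just illustrative:

public class EnglishTokenizerDemo {
    public static void main(String[] args) {
        // A sentence mixing an abbreviation, a clitic, commas, and a
        // sentence-final period.
        EnglishTokenizer tokenizer = new EnglishTokenizer(
            "Mr. Smith doesn't like commas, quotes, or long sentences.");
        tokenizer.tokenize();

        // Prints one token per line; "Mr." keeps its period as an
        // abbreviation, while "doesn't" is split into "does" and "n't".
        for (String token : tokenizer.getTokens()) {
            System.out.println(token);
        }
    }
}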

References:

Daniel Jurafsky and James H. Martin. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition (Prentice Hall Series in Artificial Intelligence). Prentice Hall.

Gregory Grefenstette. 1999. Tokenization. In van Halteren, H. (Ed.), Syntactic Wordclass Tagging. Kluwer.

David D. Palmer. 2000. Tokenisation and sentence segmentation. In Dale, R., Moisl, H., and Somers, H. L. (Eds.), Handbook of Natural Language Processing. Marcel Dekker.

Japan Day

May 15th, 2010

Japan Day is an event that was held at the Bechtel International Center at Stanford University on Saturday, May 8, by the Stanford Japanese Association (SJA) and Stanford University Nikkei (SUN). Since I was interested in what was going on at the event but had little time to enjoy it that day, I just dropped by after lunch and walked around to look at some of the demonstrations.

There were several corners presenting Japanese culture. Among them, the tea ceremony seemed to be the most popular: two Japanese women wearing traditional garments, kimonos, demonstrated how to make and drink Japanese tea in the formal way. In addition, some Japanese people taught how to write Japanese calligraphy, shodo, and how to do Japanese paper folding, origami. Just being there for about ten minutes, I was struck by a sense of nostalgia.


Two Japanese women demonstrating how to make and drink Japanese tea.


Some ornaments for the Boys' Festival in Japan.

Running in the U.S.

March 23rd, 2010

We finished final exams for the winter quarter last week, and now we have a one-week spring break. Yesterday I started running again. I ran around campus just for a while, and felt great afterwards. A few months ago, my father thankfully sent me the sportswear and the iPod shuffle armband that I had been using when running in Tokyo, so I can now enjoy running here at Stanford just as I did there.

As I wrote in a previous post, some people around Stanford University are very active. I always see several people exercising on campus. Some of them are such avid runners that they push a baby carriage with their baby or babies while running, though I think this is a little dangerous.

Once the spring quarter begins, I will probably be very busy again. But I'd like to enjoy exercising as much as possible. Incidentally, I am somewhat interested in the general relationship between physical activity and the brain. A suggestive (but informal) article on this topic is from PhDs.org: PE for grad students.