Archive for the ‘Software’ Category

Libtree 0.1 released

Monday, June 20th, 2011

A tree is one of the most important and fundamental data structures in computer science. It is so fundamental that I have so far come across that data structure in my research and various classes quite frequently. As such, I have decided to release my own Java implementation of an abstract tree data structure and some associated operations that I think might be useful in different situations. You can see more information about the implementation, and download the code in the following web page: http://junaraki.net/software/libtree/ . I made the initial release as simple as possible, and the essential code just involves five Java files; I’m planning to enhance the software little by little as well as its documentation.

Writing something into a text file with Java

Wednesday, October 27th, 2010

In relation to a previous post about the code for reading a text file with Java, I will attach another (trivial) code for writing something into a text file. This post is also just for future reference. I have confirmed that it works with Java 1.6.0_18.

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.FileNotFoundException;
import java.io.IOException;

class SomeClass {
    public static void writeFile(String filename) {
        BufferedWriter fout = null;

        try {
            fout = new BufferedWriter(new FileWriter(filename));

            while (/* Some condition */) {
                // Write something.
                StringBuffer lineBuffer = new StringBuffer();
                lineBuffer.append("abc");
                lineBuffer.append("def");

                fout.write(lineBuffer.toString());
                fout.newLine();
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (fout != null) fout.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String args[]) {
        writeFile("sample.txt");
    }
}

English tokenizer

Saturday, July 17th, 2010

When I build statistical language models (e.g., bigrams and trigrams) trained on a particular corpus or some set of documents, firstly I feel like taking a look at some statistical properties of the set, such as the total number of tokens, the average number of tokens per sentence or per utterance, and so on. Any set of documents, even one consisting of a tremendous number of newspaper articles, is biased in some manner from a statistical perspective, mainly because people collect the data in a particular domain or domains.

Suppose that I need a list of unique words from English sentences described in a text file sentences.txt. Then my initial step is often to use a crude shell command like this:

$ cat sentences.txt | tr ' ' '\n' | sed '/^$/ d' | sort | uniq > unique_words.txt

The list I obtain with this command is neither tokenized nor lemmatized, but it could be sufficient for a quick analysis where I try to get a handle on approximately how many unique words occur in the target document.

If I need a more complicated analysis like extracting a list of unique words with their frequencies, the next step is likely to involve tokenization. A range of tokenization algorithms have so far been proposed according to respective natural languages. As for English, it seems that the simplest way is to build a tokenizer with regular expressions, as mentioned in Jurafsky and Martin 2000. For future reference, I will attach my Java code for English tokenization to this post.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

/**
 * Tokenizes strings described in English.
 *
 * @author Jun Araki
 */
public class EnglishTokenizer {
    /** A string to be tokenized. */
    private String str;

    /** Tokens. */
    private ArrayList<String> tokenList;

    /** A regular expression for letters and numbers. */
    private static final String regexLetterNumber = "[a-zA-Z0-9]";

    /** A regular expression for non-letters and non-numbers. */
    private static final String regexNotLetterNumber = "[^a-zA-Z0-9]";

    /** A regular expression for separators. */
    private static final String regexSeparator = "[\\?!()\";/\\|`]";

    /** A regular expression for separators. */
    private static final String regexClitics =
        "'|:|-|'S|'D|'M|'LL|'RE|'VE|N'T|'s|'d|'m|'ll|'re|'ve|n't";

    /** Abbreviations. */
    private static final List<String> abbrList =
        Arrays.asList("Co.", "Corp.", "vs.", "e.g.", "etc.", "ex.", "cf.",
            "eg.", "Jan.", "Feb.", "Mar.", "Apr.", "Jun.", "Jul.", "Aug.",
            "Sept.", "Oct.", "Nov.", "Dec.", "jan.", "feb.", "mar.",
            "apr.", "jun.", "jul.", "aug.", "sept.", "oct.", "nov.",
            "dec.", "ed.", "eds.", "repr.", "trans.", "vol.", "vols.",
            "rev.", "est.", "b.", "m.", "bur.", "d.", "r.", "M.", "Dept.",
            "MM.", "U.", "Mr.", "Jr.", "Ms.", "Mme.", "Mrs.", "Dr.",
            "Ph.D.");

    /**
     * Constructs a string to be tokenized and an empty list for tokens.
     *
     * @param  str  a string to be tokenized
     */
    public EnglishTokenizer(String str) {
        this.str = str;
        tokenList = new ArrayList<String>();
    }

    /**
     * Tokenizes a string using the algorithms by Grefenstette (1999) and
     * Palmer (2000).
     */
    public void tokenize() {
        // Changes tabs into spaces.
        str = str.replaceAll("\\t", " ");

        // Puts blanks around unambiguous separators.
        str = str.replaceAll("(" + regexSeparator + ")", " $1 ");

        // Puts blanks around commas that are not inside numbers.
        str = str.replaceAll("([^0-9]),", "$1 , ");
        str = str.replaceAll(",([^0-9])", " , $1");

        // Distinguishes single quotes from apstrophes by segmenting off
        // single quotes not preceded by letters.
        str = str.replaceAll("^(')", "$1 ");
        str = str.replaceAll("(" + regexNotLetterNumber + ")'", "$1 '");

        // Segments off unambiguous word-final clitics and punctuations.
        str = str.replaceAll("(" + regexClitics + ")$", " $1");
        str = str.replaceAll(
                "(" + regexClitics + ")(" + regexNotLetterNumber + ")",
                " $1 $2");

        // Deals with periods.
        String[] words = str.trim().split("\\s+");
        Pattern p1 = Pattern.compile(".*" + regexLetterNumber + "\\.");
        Pattern p2 = Pattern.compile(
            "^([A-Za-z]\\.([A-Za-z]\\.)+|[A-Z][bcdfghj-nptvxz]+\\.)$");
        for (String word : words) {
            Matcher m1 = p1.matcher(word);
            Matcher m2 = p2.matcher(word);
            if (m1.matches() && !abbrList.contains(word) && !m2.matches()) {
                // Segments off the period.
                tokenList.add(word.substring(0, word.length() - 1));
                tokenList.add(word.substring(word.length() - 1));
            } else {
                tokenList.add(word);
            }
        }
    }

    /**
     * Returns tokenized strings.
     *
     * @return  a list of tokenized strings
     */
    public String[] getTokens() {
        String[] tokens = new String[tokenList.size()];
        tokenList.toArray(tokens);
        return tokens;
    }
}

References:

Daniel Jurafsky and James H. Martin. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition (Prentice Hall Series in Artificial Intelligence). Prentice Hall.

Gregory Grefenstette. 1999. Tokenization. In van Halteren, H. (Ed.), Syntactic Wordclass Tagging. Kluwer.

David D. Palmer. 2000. Tokenisation and sentence segmentation. In Dale, R., Moisl, H., and Somers, H. L. (Eds.), Handbook of Natural Language Processing. Marcel Dekker.

Reading a text file with Java

Saturday, February 20th, 2010

Recently I regularly use Java in some classes and my research. In particular, I often implement a similar code to read a text file for the purpose of some text processing. So I will attach the trivial code to this post for future reference. I confirmed that it works with Java 1.6.0_16.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileNotFoundException;
import java.io.IOException;

class SomeClass {
    public static void readFromFile(String filename) {
        BufferedReader fin = null;

        try {
            fin = new BufferedReader(new FileReader(filename));
            String line = null;
            while ((line = fin.readLine()) != null) {
                // Do something.
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (fin != null) fin.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String args[]) {
        readFromFile("sample.txt");
    }
}

Incidentally I have used BufferedReader rather than BufferedInputStream simply because BufferedReader is more suitable to the text processing that I need to do right now.

CSV to HTML converter

Saturday, February 13th, 2010

I have spent some time implementing a small script with Python 2.6.2 to help my trivial work: concerting CSV to HTML (more precisely a CSV file to an HTML table). The CSV format for its input depends on what the csv module in Python specifies. The code is pretty straightforward:

#!/usr/bin/python

# csv2html.py
# CSV to HTML Converter

import csv
import sys

table_indent_num = 2
tr_indent_num = 4
td_indent_num = 6
white_space = " "

def main():
    csv_reader = csv.reader(open(sys.argv[1]))
    table_indent = white_space * table_indent_num
    tr_indent = white_space * tr_indent_num
    td_indent = white_space * td_indent_num

    print table_indent + "<table>"
    for i, row in enumerate(csv_reader):
        print tr_indent + "<tr>"

        # Uncomment the following two lines if you don't
        # want a column for indexes
        if i == 0: print td_indent + "<th>#</th>"
        else: print td_indent + "<td>" + str(i) + "</td>"

        # Assume that the first line is a header
        for column in row:
            if i == 0: print td_indent + "<th>" + column + "</th>"
            else: print td_indent + "<td>" + column + "</td>"

        print tr_indent + "</tr>"

    print table_indent + "</table>"

if __name__ == "__main__":
    argn = len(sys.argv)
    if argn != 2:
        print "Usage: python csv2html.py <CSV file>"
        exit(1)

    main()

An example of usage is as follows:

$ more sample.csv
title1,title2,title3
"test11",test12,test13
test21,"test,22",test23
$ python csv2html.py sample.csv > sample.html
$ more sample.html
  <table>
    <tr>
      <th>#</th>
      <th>title1</th>
      <th>title2</th>
      <th>title3</th>
    </tr>
    <tr>
      <td>1</td>
      <td>test11</td>
      <td>test12</td>
      <td>test13</td>
    </tr>
    <tr>
      <td>2</td>
      <td>test21</td>
      <td>test,22</td>
      <td>test23</td>
    </tr>
  </table>