February, 2012

Testing out the NLTK sentence tokenizer

I’m currently taking a text mining class with Dr. Catherine Blake at GSLIS. Our first assignment was to pre-process 50,000 .txt files containing scientific abstracts and information about them, all formatted with whitespace and line breaks. We had to extract the award organization, the abstract ID number, and the abstract itself from each file. In addition, we had to split [...]
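To make the extraction step concrete, here is a minimal sketch of pulling those three fields out of one record. The excerpt does not show the real file layout, so the field labels (`Organization`, `Abstract ID`, `Abstract`), the sample text, and the `parse_record` helper are all hypothetical; treat this as one possible approach rather than the method used for the assignment.

```python
import re

# Hypothetical record layout: the post does not show the real files,
# so these labels and this sample are assumptions for illustration only.
SAMPLE = """\
Organization : Example Science Foundation
Abstract ID  : 12345

Abstract     :
    This project studies sentence
    boundaries in scientific text.
"""

# A field starts at the beginning of a line with one of the assumed labels.
FIELD = re.compile(r"^(Organization|Abstract ID|Abstract)\s*:\s*(.*)$")


def parse_record(text):
    """Return a dict of the three fields, joining wrapped lines with spaces."""
    fields = {}
    current = None
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            # New field: remember its label and any text on the same line.
            current = match.group(1)
            fields[current] = match.group(2).strip()
        elif current and line.strip():
            # Continuation line: append it to the field being read.
            fields[current] = (fields[current] + " " + line.strip()).strip()
    return fields


record = parse_record(SAMPLE)
print(record["Abstract"])
```

Collapsing the wrapped lines into a single string first makes any later step that works on running text easier, since hard line breaks inside the abstract no longer look like boundaries.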