Changes to the tm package!

Recently R’s most beloved text mining package, tm, was updated. One of the notable changes is the removal of readTabular. The package authors made it much easier to build a corpus of documents with metadata. Instead of manually declaring metadata with a mapping scheme, the corpus now inherits its metadata from the columns of a DataframeSource. While this is good for the text mining world, of course, it breaks some code in my book!

So here is some rewritten code showing what changed from page 43 of the book. You should notice that VCorpus no longer needs a metadata reader object passed in through readerControl. Further, using the brackets you can see both the original document (element 1) and its metadata (element 2), which here tells us only that it is document #103.

#DEPRECATED:
#tweets<-data.frame(ID=seq(1:nrow(text.df)),text=text.df$text)
tweets <- data.frame(doc_id=seq_len(nrow(text.df)), text=text.df$text)

#DEPRECATED:
#meta.data.reader <- readTabular(mapping=list(content="text", id="ID"))
#corpus <- VCorpus(DataframeSource(tweets), readerControl=list(reader=meta.data.reader))

corpus <- VCorpus(DataframeSource(tweets))
corpus<-clean.corpus(corpus)
corpus[[103]][1]
corpus[[103]][2]
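
The bracket indexing above works because each document is a list with a content element and a meta element. As a small sketch, tm's content() and meta() accessors read the same two pieces by name. The two-row data frame and its doc_id values are made up for illustration, and the sketch assumes a version of tm recent enough to include the change described in this post.

```r
library(tm)

# Hypothetical stand-in for the book's tweets data frame
tweets <- data.frame(
  doc_id = c("1", "2"),
  text = c("first example tweet", "second example tweet"),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(tweets))

doc <- corpus[[2]]
doc_text <- content(doc)    # the document text, the same data as doc[1]
doc_id <- meta(doc, "id")   # the id taken from the doc_id column
doc_meta <- meta(doc)       # the full metadata record, the same data as doc[2]
```

The accessors return the values themselves rather than a one-element list, which is usually handier when passing the text on to other functions.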

Keep in mind:
* If using DataframeSource, the first column MUST be named doc_id and the second column MUST be named text. Any other columns are treated as metadata associated row-wise with each document.
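
To illustrate the point about extra columns, here is a minimal sketch with one additional column. The airline column and the tweet text are invented for the example and are not from the book. Calling meta() on the whole corpus should return the extra columns as a data frame with one row per document, which can then be used to subset the corpus.

```r
library(tm)

# Hypothetical data: doc_id and text are required, airline is an extra column
tweets <- data.frame(
  doc_id = c("1", "2", "3"),
  text = c("great flight", "lost my bag", "on time today"),
  airline = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(tweets))

# Document-level metadata inherited from the extra column(s)
doc_level <- meta(corpus)

# Keep only the documents whose airline metadata equals "A"
airline_a <- corpus[doc_level$airline == "A"]
```

Because the metadata travels with the corpus, there is no need to keep the original data frame around just to filter documents later.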

As changes to packages occur I will try to update the book's readme file here.

tedkwartler.com
