Introduction

Most tutorials on NLTK assume you are working locally rather than in a cloud environment such as Google Colab.

In this blog post, I explain how to load text files for use as an NLTK corpus in Google Colab from:

  1. within Google Colab; and
  2. Google Drive.

When I looked on the web, I did not see any articles addressing this directly. Hence this brief post.

Loading from within Google Colab

First, you must upload the text files into your Colab environment.

Click on the file icon located on the left side of the screen. Navigate the file structure to where you wish to store the files.

By default, your Colab environment will have a /content subfolder. For this post, I created a subfolder called /content/textfiles. This is where I then uploaded the text files for the corpus. To upload, right click on the folder where you wish the files to be placed.
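If you prefer to script this step instead of clicking through the file pane, you can create the folder and a few sample files from a code cell. Below is a minimal sketch using only the standard library; the relative folder name textfiles is used so it runs anywhere, and in Colab (where the working directory is /content) it resolves to the same /content/textfiles folder described above. The file contents are placeholders, not the actual "test" files.

```python
import os

# Create the corpus folder programmatically rather than via the file pane.
# In Colab this relative path resolves to /content/textfiles.
corpus_root = 'textfiles'
os.makedirs(corpus_root, exist_ok=True)

# Write a few sample "test" files so the corpus has something to read.
for i in range(1, 4):
    with open(os.path.join(corpus_root, f'test_data_{i}.txt'), 'w') as f:
        f.write(f'This is test file number {i}.\n')

print(sorted(os.listdir(corpus_root)))
```

This produces the same folder layout you would get from the upload dialog, so the rest of the tutorial proceeds identically.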

In the screenshot below, you can see the file structure and the "test" text files I uploaded.

Now we are ready to load the text files as a corpus. From here on, the process is essentially the same as if you were working locally.

For the limited purposes of this tutorial, the below dependencies are required.

import nltk
from nltk.corpus import PlaintextCorpusReader

NLTK provides a class called PlaintextCorpusReader for creating a corpus from text files.

In the below example, we assign the directory where the files are located to a variable (corpus_root).

We then create an instance of PlaintextCorpusReader and assign it to the variable corpus. The two arguments indicate where to find the text files and which files to include — here, the regular expression '.*' matches all of them.

Finally, to confirm the corpus has been created, we call the fileids() method to list the files it contains.

corpus_root = '/content/textfiles'
corpus = PlaintextCorpusReader(corpus_root, '.*')
corpus.fileids()
['test_data_1.txt', 'test_data_2.txt', 'test_data_3.txt']

If you are using a free Colab account, your files will be deleted each time you disconnect from the runtime environment. To avoid this, either upgrade your Colab account or use Google Drive.

Loading from Google Drive

The process for creating a corpus from text files located on your Google Drive is similar to the above. These instructions assume you are using the same Google account for both Colab and Google Drive.

First, upload the text files into your Google Drive. Take note of the directory.

In addition to the dependencies listed above, one more is required in order to mount your Google Drive in your Colab environment. Import it, then call the mount() method and follow the instructions that appear.

from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Now your Google Drive is mounted. The rest of the process is the same as above, i.e., as if you were working with the files directly in your Colab environment. Make sure to revise the file path as necessary.
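Because a mistyped Drive path only fails when you try to read the files, a quick sanity check can save some debugging time. Here is a small helper sketch; the Datasets path in the comment is just an example, so substitute your own folder.

```python
import os

def check_corpus_root(path):
    """Return True if the folder exists and contains at least one file."""
    return os.path.isdir(path) and len(os.listdir(path)) > 0

# After drive.mount, you might check your (example) Drive folder like so:
# check_corpus_root('/content/drive/MyDrive/Datasets')
```

If this returns False, either the Drive is not mounted or the path does not match the folder where you uploaded the files.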

corpus_root = '/content/drive/MyDrive/Datasets'
corpus = PlaintextCorpusReader(corpus_root, '.*')
corpus.fileids()
['test_data_1.txt', 'test_data_2.txt', 'test_data_3.txt']

Now you are ready to start processing your corpus!

If you have any questions, please feel free to reach out.