3 Ways to Apply Latent Semantic Analysis on Large-Corpus Text on macOS Terminal, JupyterLab, and Colab

Latent semantic analysis (LSA) works on large text datasets, building representations that surface insights through natural language processing. It can be applied at several levels: document, sentence, and phrase. Semantic analysis broadly divides into lexical semantics, which classifies and decomposes individual lexical items, and the study of how individual words combine into sentences and paragraphs.

Lexical semantic relations help identify the differences and similarities between words. A hypernym is a generic term, and a hyponym is a more specific instance of it; hyponymy is the relationship between the two. Homonyms share the same spelling or form but have different, unrelated meanings. "Book" is an example: it can refer to something a person reads or to the act of making a reservation. The spelling, form, and syntax are the same, but the definition is different. Polysemy is a related phenomenon in which a single word carries multiple related senses. The word polysemy comes from Greek and means "many signs".

Python's NLTK library performs tokenization, which chops larger chunks of text into words, phrases, or other meaningful strings called tokens. Lemmatization then converts each word from its inflected form into its base form.

Figure 1. Code snippet for word lemmatization.
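To make tokenization and lemmatization concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation: the TOY_LEMMAS table and the tokenize and lemmatize helpers are hypothetical stand-ins for NLTK's word_tokenize and WordNetLemmatizer, and the lookup table only covers the words in this one sentence.

```python
import re

# Hypothetical, hand-written lookup table standing in for a real lexical
# database such as WordNet. It only covers the words used in this example.
TOY_LEMMAS = {
    "books": "book",
    "booked": "book",
    "reading": "read",
    "reads": "read",
    "tables": "table",
    "was": "be",
    "were": "be",
}


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def lemmatize(token):
    """Map an inflected token to its base form, or return it unchanged."""
    return TOY_LEMMAS.get(token, token)


sentence = "She booked two tables and was reading her books."
tokens = tokenize(sentence)
lemmas = [lemmatize(t) for t in tokens]

print(tokens)
print(lemmas)
```

Both "booked" and "books" collapse to the same base form "book", which is why lemmatization is usually applied before counting terms.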


Figure 2. Different data sources for natural language processing with Python.

Latent semantic analysis

Latent semantic analysis extracts and represents the contextual meaning of words by applying mathematical and statistical computation to a large corpus of text. On standard vocabulary and subject-matter tests, LSA scores have been reported to overlap with those of humans (Landauer, Foltz, & Laham, 1998). Its accuracy benefits from the volume of machine-readable documents and text available at web scale. LSA relies on singular value decomposition (SVD), which is closely related to principal component analysis (PCA). A collection of documents can be represented by a Z x Y term-document matrix A, whose rows represent the terms and whose columns represent the documents in the collection. For a typical large corpus, A can have hundreds of thousands of rows and columns. SVD belongs to a class of operations known as matrix decompositions. LSA uses it to build a low-rank approximation to the term-document matrix; in a Python workflow, NLTK typically handles the text preprocessing that produces the terms. The low-rank approximation then supports indexing and retrieving documents, a technique known as latent semantic indexing, by grouping words and documents that occur in similar contexts.
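The low-rank approximation can be sketched with NumPy on a toy corpus. This is an illustration under stated assumptions, not the program shown in the figures: the four documents are made up, rows are taken to be terms and columns documents, raw counts are used without any weighting, and k = 2 is an arbitrary choice.

```python
import numpy as np

# Made-up toy corpus: two documents about interfaces, two about graphs.
docs = [
    "human machine interface",
    "user interface system",
    "graph of trees",
    "graph minors trees",
]

# Vocabulary: one row per distinct term, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})
row = {term: i for i, term in enumerate(vocab)}

# A[i, j] = number of times term i occurs in document j.
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        A[row[word], j] += 1

# Thin SVD: A = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent dimensions to keep (arbitrary for this sketch)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# One k-dimensional coordinate vector per document in the latent space.
doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


print("approximation error:", np.linalg.norm(A - A_k))
print("sim(doc 0, doc 1):", cosine(doc_coords[0], doc_coords[1]))
print("sim(doc 0, doc 2):", cosine(doc_coords[0], doc_coords[2]))
```

In the reduced space the two interface documents end up close together and far from the graph documents, even though they share only one term. Because k = 2, the rows of doc_coords can also be used directly as x and y coordinates on a scatter plot.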

Brief overview of linear algebra

The Z x Y term-document matrix A contains real-valued, non-negative entries. The rank of a matrix is the number of linearly independent rows or columns in it, so rank(A) ≤ min{Z, Y}. A square c x c matrix whose off-diagonal entries are all zero is a diagonal matrix. If all c diagonal entries of such a matrix are one, it is the identity matrix of dimension c, written I_c. For a square Z x Z matrix A and a vector k that is not all zeros, the values of λ that satisfy Ak = λk are the eigenvalues of A, and k is a corresponding eigenvector. Matrix decomposition factors a square matrix into a product of matrices derived from its eigenvectors. This makes it possible to reduce word representations from many dimensions to two so they can be viewed on a plot. Dimensionality reduction with principal component analysis and singular value decomposition is therefore highly relevant to natural language processing.

The Zipfian distribution of word frequencies in a document makes it difficult to judge the similarity of words from raw counts alone. Because the term-document matrix is generally not square or symmetric, eigendecomposition cannot be applied to it directly. Singular value decomposition extends the idea to rectangular matrices, and the eigendecompositions of AA^T and A^TA follow from it as a by-product. Latent semantic analysis is a semantic-space technique that parses the documents and can identify polysemous words, with the NLTK library handling the text processing. NLTK resources such as punkt and wordnet have to be downloaded first, for example with nltk.download('punkt') and nltk.download('wordnet').
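The linear algebra facts above can be checked numerically. The following is a small sketch with NumPy; the 4 x 3 matrix is invented for illustration and the variable names are assumptions of this example. It verifies the rank bound, the eigenvalue equation for the square matrix AA^T, and the link between eigenvalues and singular values.

```python
import numpy as np

# Invented non-negative 4 x 3 matrix standing in for a term-document matrix.
# The third column equals the sum of the first two, so the rank is 2.
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 2.0],
    [2.0, 0.0, 2.0],
])
Z, Y = A.shape

# Rank: number of linearly independent rows (or columns).
rank = np.linalg.matrix_rank(A)
print("rank:", rank, "upper bound:", min(Z, Y))

# A is not square, but S = A A^T is square and symmetric, so it has a
# real eigendecomposition. eigh returns eigenvalues in ascending order.
S = A @ A.T
eigenvalues, eigenvectors = np.linalg.eigh(S)

# Check S k = lambda k for the largest eigenvalue and its eigenvector.
lam = eigenvalues[-1]
k_vec = eigenvectors[:, -1]
residual = np.linalg.norm(S @ k_vec - lam * k_vec)
print("eigen equation residual:", residual)

# The singular values of A (descending) are the square roots of the
# largest eigenvalues of A A^T.
singular_values = np.linalg.svd(A, compute_uv=False)
print("singular values squared:", singular_values ** 2)
print("top eigenvalues of A A^T:", eigenvalues[::-1][: len(singular_values)])
```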

Deep Learning at scale with Google Colab notebooks

Figure 3. NVIDIA Deep Learning stack with GPUs.

Training machine learning or deep learning models on CPUs can take hours and is expensive in both time and the energy consumed by the computing resources. Google built the Colab notebook environment for research and development. It runs entirely in the cloud and requires no additional hardware or software setup on each machine. It is essentially a hosted Jupyter notebook, and data scientists can share Colab notebooks by storing them on Google Drive, just like Google Sheets or Docs, in a collaborative environment. Enabling a GPU runtime for acceleration carries no additional cost. Getting data into Colab takes some extra work, unlike a local Jupyter notebook, which can read data directly from the machine's local directory. Colab offers several options: files can be uploaded from the local file system, or a drive can be mounted through a Drive FUSE wrapper to load the data.

Figure 4. Installing a drive FUSE wrapper.

Once this step is complete, it shows the following log without errors:

Figure 5. Installation log on macOS.

The next step is to generate the authentication tokens that authorize the Google credentials for Drive and Colab.

Figure 6. Authenticate the credentials.

If the access token is retrieved successfully, Colab is all set.

Figure 7. Access token verification.

At this stage the drive is not yet mounted, so checking for the text file returns False.

Figure 8. Verifying access to files uploaded to Google Drive from the Colab notebook.

Once the drive is mounted, Colab has access to the datasets on Google Drive.

Figure 9. Accessing the datasets after the drive is mounted.
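The before-and-after check can be sketched as follows. Note that this uses Colab's built-in drive.mount helper from the google.colab package rather than the FUSE wrapper described above, so it is an alternative route and not the steps in the figures. The mount_drive helper, the mount point, and the file name corpus.txt are assumptions made for illustration.

```python
import os


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab.

    Returns True if the mount point exists afterwards, and False
    otherwise, including when the code is not running in Colab.
    """
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)  # prompts for Google authorization
    return os.path.isdir(mount_point)


# Hypothetical file location, used only for illustration.
data_path = "/content/drive/MyDrive/corpus.txt"

print("before mount:", os.path.exists(data_path))
mounted = mount_drive()
print("mounted:", mounted)
print("after mount:", os.path.exists(data_path))
```

Outside Colab the helper simply returns False, so the same notebook cell can be run locally without raising an import error.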

Once the files are accessible, the Python program can be executed just as in a Jupyter environment. The Colab notebook also displays the results the same way a Jupyter notebook does.

Figure 10. Results from the program.

PyCharm IDE

The program can be run inside the PyCharm IDE or executed from the macOS Terminal.

Figure 11. Latent semantic analysis in Python in the PyCharm IDE.

Results from macOS Terminal

Figure 12. Results from macOS Terminal.

Jupyter Notebook on a standalone machine

Jupyter Notebook gives a similar output running the latent semantic analysis on the local machine:

Figure 13. Running the latent semantic analysis on Jupyter notebook.

Figure 14. Results.


Gorrell, G. (2006). Generalized Hebbian Algorithm for Incremental Singular Value Decomposition in Natural Language Processing. Retrieved from https://www.aclweb.org/anthology/E06-1013

Hardeniya, N. (2016). Natural Language Processing: Python and NLTK. Birmingham, England: Packt Publishing.

Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An Introduction to Latent Semantic Analysis. Retrieved from http://lsa.colorado.edu/papers/dp1.LSAintro.pdf

Stack Overflow (2018). Mounting Google Drive on Google Colab. Retrieved from https://stackoverflow.com/questions/50168315/mounting-google-drive-on-google-colab

Stanford University (2009). Matrix decompositions and latent semantic indexing. Retrieved from https://nlp.stanford.edu/IR-book/html/htmledition/matrix-decompositions-and-latent-semantic-indexing-1.html