Discover Hidden Trip Themes from GPS Data with Topic Modeling

Using Latent Dirichlet Allocation to Extract Underlying Trip Topics from GPS Trace Data

Tina Tang
Towards Data Science


Visualizing LDA extracted trip topics, figure by author

Every text is produced by an author whose utterances, units of discourse expressed with intent, are preserved in writing. As such, large collections of texts can be used to study pieces of culture. Unsupervised machine learning methods like topic modeling allow us to extract underlying cultural themes from large volumes of text data. With big data growing in the transportation field, a new analysis opportunity arises: researchers can seek to extract meaning and culture from traces of mobility.

When evaluating new transportation modes in cities, local government leaders and city planners may wonder what types of trips users generally take with new services. For instance, in the past few years, e-scooters have gained popularity so quickly that city managers are still searching for the best ways to monitor and optimize operations. The pandemic has significantly disrupted current micro-mobility operations, but that is a subject for another day. In this article, I detail how topic modeling can be used to extract the latent trip themes from an otherwise unstructured collection of estimated trips. As topic modeling is typically used for text analysis, I’ll begin with some background on Latent Dirichlet Allocation and why it is a fitting method for revealing the underlying trip patterns that characterize a large set of GPS trace data.

Latent Dirichlet Allocation for Trip Analysis

Latent Dirichlet Allocation (LDA) is a generative mixture model for collections of discrete data developed by David M. Blei, Andrew Y. Ng, and Michael I. Jordan. As the model was created in the context of text analysis, it assumes that each document in an unstructured collection of documents is made up of a mixture of topics. Each document's topic mixture is modeled as a hidden Dirichlet random variable, and the output of the model is a probabilistic distribution over a latent, lower-dimensional topic space. In other words, from an unstructured collection of documents, the model extracts relevant topics. LDA is a form of topic modeling: a family of methods for discovering the hidden topics present in a collection of documents. Beyond text data, topic modeling has been used to find patterns in image data, social network data, and even genetic data.

In the context of trips, an extracted e-scooter trip contains a set of discrete GPS points. Although it is not readily obvious, when tokenized, an ordered set of GPS coordinate pairs looks just like a tokenized document. We can treat each GPS coordinate pair like a unique word in a vocabulary, and a trip can then be viewed as a type of document containing many GPS point words. Viewed this way, a large, otherwise unstructured collection of trips is a natural fit for topic modeling, which lets us extract the hidden route patterns within the collection. This works because popular GPS points co-occur across many different trips in the dataset, and similar trips share many of the same GPS points.

Preparing GPS trip data for LDA

In order to perform topic modeling, the e-scooter trip data must be prepared the same way text data is prepared as input to LDA. Recall that we treat each unique GPS longitude-latitude pair as a word; our vocabulary is then the complete set of unique GPS points in the dataset. In text analysis, data scientists typically perform a few pre-processing steps that scale down the vocabulary, such as removing stop words and stemming. Stemming is especially relevant here, and we will carry out a step that accomplishes a similar feat. In stemming, words are reduced to their root so that variants like “work”, “worked”, and “working” are grouped under “work”, which better accounts for word frequency. For my GPS data, the raw trace coordinates are recorded to 4 decimal places. Without any pre-processing to group close points together, I have a starting vocabulary of 52,879 unique GPS points from a collection of 42,301 trips. In text analysis, it is typical to trim the vocabulary down to roughly the top 5,000 words, through processes such as removing stop words, stemming, and dropping rare words, for efficient model performance. For this case, I round the GPS points to 3 decimal places, effectively grouping points that describe the same general route location together. After rounding, the vocabulary is reduced to a manageable 3,312 GPS points.
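
As a concrete illustration, here is a minimal sketch of this rounding and tokenization step. It assumes the raw traces live in a pandas DataFrame with hypothetical columns trip_id, lat, and lon (not the original code); at 3 decimal places, one step in latitude is roughly 100 meters, so nearby points collapse into the same token.

import pandas as pd

# Hypothetical raw traces: one row per GPS point, tagged with its trip
traces = pd.DataFrame({
    "trip_id": [1, 1, 2],
    "lat": [38.0312, 38.0316, 38.0299],
    "lon": [-78.4801, -78.4805, -78.4772],
})

def trip_to_doc(points: pd.DataFrame) -> str:
    """Round each GPS pair to 3 decimal places and join the pairs into
    one space-separated 'document' of coordinate tokens."""
    tokens = (
        points["lat"].round(3).astype(str)
        + ","
        + points["lon"].round(3).astype(str)
    )
    return " ".join(tokens)

# One tokenized "document" string per trip
trip_docs = traces.groupby("trip_id").apply(trip_to_doc)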

Document-term matrix in terms of GPS data: a trip-point matrix, figure by author

To implement topic modeling on the e-scooter trip data, we leverage scikit-learn, a popular machine learning library. The required input is a document-term matrix, illustrated in the figure above. Recall that we treat each trip as if it were a document. For 42,301 extracted trips, we build a trip-point matrix of 42,301 rows by 3,312 columns (number of trips by GPS vocabulary size), where each value counts how many times a GPS point from the vocabulary occurs in a given trip. Next, we feed the trip-point matrix as input to the LDA model, setting it to extract fifteen topics. The output is then the fifteen topics that best characterize the collection of trips, each described by a probabilistic distribution over GPS points. Additionally, we can retrieve the probabilistic distribution of topics that makes up each trip. For this dataset, the model run time was only about 3 minutes.
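
Before the model fit shown below, the trip-point matrix itself can be built with scikit-learn's CountVectorizer. This is a sketch reusing the trip_docs variable from the snippet above, not the author's original code; the custom token_pattern is needed because the default tokenizer would split our numeric “lat,lon” tokens apart.

from sklearn.feature_extraction.text import CountVectorizer

# \S+ keeps each "lat,lon" token whole; the default pattern would break
# on the commas, periods, and minus signs inside the numeric tokens
vectorizer = CountVectorizer(token_pattern=r"\S+")
doc_term_matrix = vectorizer.fit_transform(trip_docs)

# Sparse matrix of shape (number of trips, GPS vocabulary size)
print(doc_term_matrix.shape)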

from sklearn.decomposition import LatentDirichletAllocation

lda_model = LatentDirichletAllocation(n_components=15)
lda_output = lda_model.fit_transform(doc_term_matrix)

Visualizing Topics

Finally, we visualize our topics and the topic distributions. In topic modeling for text, topics are typically visualized by displaying the top k most frequent words per topic. An analyst with domain knowledge can then determine the latent topics in a corpus by examining these groups of top words. While top words are easy to read and interpret, there is little value in simply looking at the top k GPS coordinate pairs. Instead, we need to process the top k GPS pairs back into latitude and longitude fields and then plot them on a map. For this study, the top 50 GPS points per topic were plotted over a map of Charlottesville to visualize the extracted trip topics. To plot the GPS points, each top word was mapped back to the rounded latitude and longitude values from which its token was built.
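
A minimal sketch of this step, reusing the vectorizer and lda_model from the snippets above: it parses each topic's top 50 tokens back into coordinates and scatter-plots them with matplotlib. The basemap overlay used in the article's figures is omitted here.

import matplotlib.pyplot as plt
import numpy as np

vocab = np.array(vectorizer.get_feature_names_out())

fig, ax = plt.subplots(figsize=(8, 8))
for topic_idx, weights in enumerate(lda_model.components_):
    # Highest-weighted 50 GPS tokens for this topic
    top_tokens = vocab[np.argsort(weights)[::-1][:50]]
    # Parse the "lat,lon" tokens back into numeric coordinates
    lats, lons = zip(*(map(float, token.split(",")) for token in top_tokens))
    ax.scatter(lons, lats, s=10, label=f"Topic {topic_idx}")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.legend(fontsize="x-small", ncol=3)
plt.show()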

15 Extracted Trip Topics — Spatial Visualization, figure by author

Looking at the figure above, we can see that the LDA model revealed a diverse set of latent trip types characterizing a large set of e-scooter GPS trace data, separating the data into discrete groupings. These topics, categorized by color in the plot above, can be interpreted by urban planners, city managers, or local researchers with domain knowledge of the Charlottesville landscape. Lastly, the figure below shows how the fifteen extracted trip topics are distributed.
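
One simple way to compute such a distribution (a sketch, not necessarily the exact method behind the figure) is to assign each trip to its highest-probability topic from lda_output and count trips per topic:

import numpy as np

# lda_output has shape (number of trips, 15): one topic mixture per trip
dominant_topic = lda_output.argmax(axis=1)
topics, counts = np.unique(dominant_topic, return_counts=True)
for topic, count in zip(topics, counts):
    print(f"Topic {topic}: {count} trips")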

Trip Topic Distribution, figure by author

I’m a fan of Latent Dirichlet Allocation because it provides a means to discover cultural meaning hidden within vast amounts of data. While topic modeling is usually applied to discover trends in collective texts, here I explored its potential to help us understand trends in collective movement. As the dataset does not contain any user information, this study does not pose any direct user privacy concerns. It has been interesting to explore LDA’s applicability beyond text data. Thanks for reading.
