Multi-view feature engineering for day-to-day joint clustering of multiple traffic datasets

, , & (2024) Multi-view feature engineering for day-to-day joint clustering of multiple traffic datasets. Transportation Research Part C: Emerging Technologies, 162, Article number: 104607.

Open access copy at publisher website

Description

A common task in traffic data analysis and management is categorizing different days based on similarities in their network-wide traffic states. Given the multifaceted nature of traffic, it is essential to consider multiple attributes for a comprehensive quantification. However, challenges arise when combining these attributes to achieve consistent day-to-day classification across datasets. While various data-driven classification algorithms have been proposed in traffic literature, challenges persist. These include a) applicability limited to univariate datasets, b) incompatibility with datasets containing missing values, c) distance concentration problem in high-dimensional clustering, d) inability to classify outliers, and e) computationally expensive hyperparameter optimization. This research introduces the MCMD (Multi-view Classification based on Consensus Matrix Decomposition) framework, a novel approach for the joint classification of multi-view traffic data. MCMD treats multiple traffic datasets with varying geographical coverage as complementary views of the entire network's traffic state. It then extracts shared hidden features across these datasets and assigns each day classification labels that are consistent across views. MCMD consists of three key modules: the novel Multi-view Uni-orthogonal Non-negative Matrix Factorization (MUNMF) algorithm, an outlier removal module, and the Ordering Points to Identify the Clustering Structure (OPTICS) algorithm. A logical integration of the above-stated modules enables MCMD to a) output scale-invariant (SI) and scale-variant (SV) classifications and b) identify outlier days based on the shape and scale of multi-view traffic-state profiles. 
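The distinction between scale-invariant (SI) and scale-variant (SV) comparison can be illustrated with a minimal sketch. This is not the MCMD or MUNMF procedure from the paper: it assumes, purely for illustration, that "shape" is compared after rescaling each daily profile to unit length and that "shape plus scale" is compared on the raw profile. The profile values and the function names `si_distance` and `sv_distance` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def unit_norm(profile):
    """Rescale a profile to unit length so that only its shape remains."""
    norm = math.sqrt(sum(x * x for x in profile))
    return [x / norm for x in profile]


def sv_distance(a, b):
    """Scale-variant comparison (assumed form): raw profiles, shape and magnitude."""
    return euclidean(a, b)


def si_distance(a, b):
    """Scale-invariant comparison (assumed form): unit-length profiles, shape only."""
    return euclidean(unit_norm(a), unit_norm(b))


# Hypothetical flow profiles (vehicles/hour) over six time slots of a day.
weekday = [200, 900, 600, 700, 1000, 300]
low_demand_weekday = [100, 450, 300, 350, 500, 150]  # same shape, half the volume
weekend = [150, 300, 700, 750, 500, 350]             # different shape

print("SI weekday vs low-demand weekday:", si_distance(weekday, low_demand_weekday))
print("SV weekday vs low-demand weekday:", sv_distance(weekday, low_demand_weekday))
print("SI weekday vs weekend:", si_distance(weekday, weekend))
```

Under these assumptions, the two weekday profiles are indistinguishable in the SI sense but clearly separated in the SV sense, while the weekend profile differs in shape and is therefore separated in both.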
Compared to existing clustering methods, the design of the MCMD algorithm offers greater versatility in handling both single- and multi-view datasets for SI and SV clustering, computational robustness to missing data, and resilience to the “distance concentration problem” associated with the curse of dimensionality. These advantages stem from its ability to extract relevant cross-dataset features, reduce dimensionality, and eliminate redundancy. Although the framework is primarily motivated by the need to develop a traffic pattern repository to support reliable prior Origin-Destination (OD) selection for online dynamic OD demand adjustment, the paper demonstrates the effectiveness of MCMD from several generic standpoints through extensive experiments on real-world and synthetic traffic datasets. These include a) demonstrating the meaningfulness of SI and SV labels, b) assessing the robustness toward missing information, c) evaluating its effectiveness in classifying days with special events, d) benchmarking its properties against alternative joint day-to-day clustering algorithms, and e) demonstrating the efficacy of the proposed hyperparameter selection method for efficient joint classification of multiple large-scale traffic datasets.
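The "distance concentration problem" mentioned above can be reproduced with a small experiment. The sketch below is not taken from the paper: it assumes points drawn uniformly at random from a unit hypercube and uses a hypothetical helper, `relative_contrast`, to measure how much the farthest neighbour of a query point differs from the nearest one.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for one random query point.

    Points are uniform in the unit hypercube of dimension `dim`
    (an illustrative assumption, not real traffic data).
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point))))
    return (max(dists) - min(dists)) / min(dists)


low_dim = relative_contrast(dim=2)
high_dim = relative_contrast(dim=1000)

print("relative contrast in 2 dimensions:", low_dim)
print("relative contrast in 1000 dimensions:", high_dim)
```

As the dimension grows, the nearest and farthest points end up at almost the same distance from the query, so distance-based clustering loses discriminating power. This is the general reason why reducing dimensionality before clustering, as the abstract describes, can help.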

Impact and interest:

0 citations in Scopus

ID Code: 248196
Item Type: Contribution to Journal (Journal Article)
Refereed: Yes
ORCID iD:
Nayak, Richi: orcid.org/0000-0002-9954-0159
Bhaskar, Ashish: orcid.org/0000-0001-9679-5706
Additional Information: Acknowledgment: We express our sincere gratitude to the Queensland Department of Transport and Main Roads (TMR) and iMOVE CRC project (#1-027 titled Advanced data analytics: Real-time demand calibration/prediction) investigators for their invaluable support and collaboration throughout this study. Their contributions have played a crucial role in enhancing the quality and outcomes of our research. TMR provides the Logan City data for research and training activities at QUT. Partial funding support from QUT Centre for Data Science (CDS) is also acknowledged. During the preparation of this work, we used ChatGPT (GPT-3.5) to proofread the manuscript to improve readability. After using this tool/service, we reviewed and edited the content as needed and take full responsibility for the content of the publication.
Measurements or Duration: 43 pages
Keywords: Coupled Matrix Factorization, Day-to-day Classification, MCMD, Missing Data, Multi-view Feature Engineering, Multivariate Traffic Data
DOI: 10.1016/j.trc.2024.104607
ISSN: 0968-090X
Pure ID: 167517203
Divisions: Current > Research Centres > Centre for Data Science
Current > Research Centres > Centre for Future Mobility/CARRSQ
Current > QUT Faculties and Divisions > Faculty of Science
Current > Schools > School of Computer Science
Current > QUT Faculties and Divisions > Faculty of Engineering
Current > Schools > School of Civil & Environmental Engineering
Current > QUT Faculties and Divisions > Faculty of Health
Copyright Owner: 2024 The Author(s)
Copyright Statement: This work is covered by copyright. Unless the document is being made available under a Creative Commons Licence, you must assume that re-use is limited to personal use and that permission from the copyright owner must be obtained for all other uses. If the document is available under a Creative Commons Licence (or other specified licence) then refer to the Licence for details of permitted re-use. It is a condition of access that users recognise and abide by the legal requirements associated with these rights. If you believe that this work infringes copyright, please provide details by email to qut.copyright@qut.edu.au.
Deposited On: 24 Apr 2024 00:15
Last Modified: 02 Aug 2024 02:49