Building connected vehicle data infrastructure at scale: what it takes to do it right

What does it take to go from collecting to actually using connected vehicle data at scale? In this blog post, we break down the key considerations for teams looking to build a scalable infrastructure for analytics and insights on top of that data.

Over the past few years, the automotive industry has invested vast resources into collecting and centralizing connected vehicle data – and other heavy equipment industries are following suit. The market for agricultural equipment telematics is predicted to grow by 2x, and for construction equipment telematics by 4x, over the coming years.

With data lakes now holding vast amounts of IoT asset data (whether it’s for connected vehicles, tractors, or mobile cranes), the question becomes: what does it take to build a robust data management infrastructure on top of this data?

Moving beyond simple data collection requires tackling a series of technical challenges – from building a scalable data model to creating actionable insights through AI enrichment. In this post, we’ll explore these challenges in detail, with a focus on how to design a flexible, performant system that can handle the demands of telematics data.

Here’s how we’ve approached building a connected vehicle data infrastructure at Viaduct, and what we recommend for other companies looking to do the same. 

The right data model: balancing flexibility with performance

Ingesting data into a lake is a key milestone, but the real challenge is designing a data model that can support both flexibility and high-performance querying. Telematics data is complex, combining time-series sensor data, diagnostic logs, and vehicle-specific records.

  • Example problem: Take the task of predicting fuel injector failures. Sensor data from multiple vehicle models comes in at irregular intervals, with some sensors sending data continuously and others only during specific conditions. Without the right data model, querying this data can become cumbersome and slow.
  • Solution: At the most basic level, any data model for connected asset data needs to include the concept of an Event (e.g., raw or featurized telematics data, warranty claims, service events) and an Asset (key metadata on the asset itself – e.g., model, manufacturing date, supplier BOM). The data model should preserve the distinction between raw and enriched/derived inputs, and account for the wide variety of input formats (e.g., telematics vs. unstructured text fields from dealer notes). Finally, a well-structured data model will reduce sparsity, eliminating needless data storage and enabling efficient indexing, searching, and aggregation (see the sketch below).

The key consideration here is designing a model that accommodates a wide range of query patterns while minimizing the performance trade-offs that come from large, sparse datasets.
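
To make this concrete, here is a minimal sketch of the Event/Asset split as Python dataclasses. The field names are illustrative assumptions, not Viaduct's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

@dataclass
class Asset:
    """Key metadata on the asset itself."""
    asset_id: str
    model: str
    manufacturing_date: datetime
    supplier_bom: dict[str, str] = field(default_factory=dict)

@dataclass
class Event:
    """Anything observed about an asset: telematics, claims, service records."""
    asset_id: str             # links back to an Asset
    timestamp: datetime
    event_type: str           # e.g., "sensor_reading", "warranty_claim"
    provenance: str           # "raw" vs. "derived" preserves the distinction above
    payload: dict[str, Any]   # type-specific fields live here rather than as
                              # top-level columns, avoiding a wide, sparse table
```

Keeping type-specific fields out of the core schema (here, in a payload; in a warehouse, in per-type tables) is one way to reduce sparsity while still indexing every event on asset, time, and type.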

Feature creation: extracting actionable signals from raw data

One of the most valuable uses of telematics data lies in its ability to predict and prevent failures. However, raw data rarely provides the features needed for these kinds of advanced insights. This is where feature engineering becomes critical.

  • Example problem: Suppose you're working to predict transmission failures. Simply looking at raw transmission temperature data won’t yield actionable insights. The real power lies in creating features that capture more complex patterns—such as the rate of temperature increase during specific driving conditions or correlations between temperature spikes and changes in fluid pressure.
  • Solution: Provide users with the tooling to create composite features by combining signals across time and sensor types. For example, users will likely need to build features that aggregate data over specific windows of time (e.g., the cumulative increase in temperature during a 10-minute driving cycle), and then correlate those features with operating conditions like RPM or vehicle load. Advanced tools like Viaduct’s Event Studio can automate some of this feature engineering by transforming raw sensor data into synthesized events that are ready for downstream use. A simple sketch of one such windowed feature follows below.

Developing tooling that allows users (whether they’re data scientists or less technical stakeholders) to easily create, test, and deploy such features can significantly accelerate the development of predictive models.
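
As a hedged illustration of the windowed-feature idea, here is a small pandas sketch; the column names, sampling rate, and 10-minute window are assumptions made for the example:

```python
import pandas as pd

# Per-vehicle time series of transmission temperature and engine RPM.
df = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=6, freq="2min"),
    "trans_temp_c": [80.0, 82.5, 86.0, 91.0, 90.5, 93.0],
    "rpm": [1500, 1800, 2400, 2600, 2500, 2700],
}).set_index("ts")

features = pd.DataFrame({
    # Cumulative temperature increase over a rolling 10-minute window.
    "temp_rise_10min": df["trans_temp_c"].rolling("10min").apply(
        lambda w: w.iloc[-1] - w.iloc[0], raw=False
    ),
    # Operating condition over the same window.
    "mean_rpm_10min": df["rpm"].rolling("10min").mean(),
})

# Correlate the derived feature with operating conditions.
print(features["temp_rise_10min"].corr(features["mean_rpm_10min"]))
```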

Data infrastructure: optimizing for scale and query speed

Even with a well-structured data model and valuable features, the underlying architecture must be designed for scale. Handling massive volumes of telematics data, especially in real-time, requires an architecture that supports both high throughput and low-latency queries.

  • Example problem: Imagine that a global OEM is monitoring thousands of vehicles across different models and regions, each transmitting sensor data every second. As the dataset grows, running analytics or identifying patterns of brake wear across models can become prohibitively slow if your architecture isn’t optimized for scale.

  • Solution: Use a column-oriented OLAP (Online Analytical Processing) database for rapid querying of large datasets. The storage layout should be optimized for query access patterns (e.g., sensor data ordered so that aggregations can run as multi-threaded, sequential disk scans). Pre-partitioning the data by vehicle or event type allows for efficient parallel processing, reducing the time needed to run even complex queries. Implementing a caching layer for frequently accessed data can further improve performance. A simplified sketch of this layout follows below.

The architecture should also be optimized for typical query patterns, enabling high-speed aggregation of data in real-time for critical use cases like early issue identification or predictive failure analysis.
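
As one concrete (and deliberately simplified) version of this pattern, the sketch below writes events as partitioned, column-oriented Parquet files with pyarrow, so reads can prune partitions and scan only the columns a query touches. Paths, column names, and values are illustrative:

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# A small, illustrative batch of sensor events.
events = pa.table({
    "vehicle_id": ["v1", "v1", "v2"],
    "event_type": ["brake_temp", "brake_temp", "brake_temp"],
    "ts": [1_700_000_000, 1_700_000_060, 1_700_000_000],
    "value": [310.5, 312.1, 298.7],
})

# Partition on the columns that dominate query predicates, so scans can
# skip irrelevant files entirely.
pq.write_to_dataset(events, root_path="events",
                    partition_cols=["event_type", "vehicle_id"])

# Reading back touches only the matching partitions and requested columns.
dataset = ds.dataset("events", partitioning="hive")
brake_temps = dataset.to_table(
    filter=ds.field("event_type") == "brake_temp",
    columns=["vehicle_id", "ts", "value"],
)
```

A production OLAP store layers caching, clustering, and parallel execution on top, but the core ideas – columnar layout and partition pruning – are the same.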

Data access layer: making complex queries accessible

With the data infrastructure in place, the next step is ensuring that all team members – whether they are data scientists, engineers, or fleet managers – can interact with the data easily. This means designing an interaction layer that abstracts away the complexity of querying large datasets, while still providing the power needed for advanced analytics.

  • Example problem: Your quality engineering team needs to investigate brake wear trends, but they’re not database experts. Writing complex SQL queries or managing OLAP systems isn’t part of their skill set.
  • Solution: A simplified query interface, powered by a user-friendly language or even natural language processing (NLP), can make it easier for non-technical users to pull insights from the data. For instance, a quality engineer might ask, “What are the top 5 contributing factors to brake wear in our heavy-duty trucks over the past six months?”—without needing to write SQL.

Additionally, providing API access for more advanced users allows teams to build custom integrations and applications on top of the existing system. This flexibility ensures that different users can access the data in a way that meets their specific needs without adding unnecessary complexity to the workflow.
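
For the API side, access might look something like the sketch below. The endpoint, payload shape, and response format are hypothetical, purely to illustrate building custom integrations on top of the query layer:

```python
import requests

API_URL = "https://api.example.com/v1/query"  # hypothetical endpoint

def top_brake_wear_factors(fleet: str, months: int = 6) -> list[dict]:
    """Ask the query service for ranked contributing factors to brake wear."""
    resp = requests.post(
        API_URL,
        json={"question": (
            f"What are the top 5 contributing factors to brake wear "
            f"in our {fleet} over the past {months} months?"
        )},
        headers={"Authorization": "Bearer <token>"},  # placeholder credential
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["factors"]  # hypothetical response field
```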

AI enrichment: detecting, searching, and predicting

AI is what unlocks your connected asset data – turning large volumes of raw telemetry into insights that drive business value.

Broadly, AI helps teams with three key tasks: 1) detecting hidden patterns within telemetry data that suggest either issues or business opportunities; 2) efficiently searching to investigate what those patterns mean and what’s causing them; and 3) predicting asset health with high-precision risk models.

  • Example problem: An automotive OEM is trying to predict battery failures for a specific vehicle model based on a known failure mode. But off-the-shelf machine learning tools don’t work well for predicting these kinds of failures. Part of the challenge is the relative infrequency of failures and the complexity/noisiness of the telematics data used for forecasting. There are also more nuanced considerations: data censoring (i.e., we only know about failures that have actually occurred to date), predicting failures early enough to minimize risk but late enough to have high confidence, and complex cost functions (e.g., should a vehicle be recalled for immediate repair, or repaired at the next regular check-in? More on the cost tradeoffs associated with different precision/recall levels here).
  • Solution: Instead of building custom AI models from scratch, pre-built models designed for telematics data can significantly speed up the process. Viaduct provides out-of-the-box capabilities to identify anomalies, investigate root causes, and predict future failures. These models are fine-tuned for automotive applications and integrate seamlessly with your existing architecture.

Moreover, AI models should be designed to adapt based on feedback loops – so that predictions become more accurate over time as new data is ingested and processed.
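
To illustrate just the censoring point from the example above (and not Viaduct's actual models), here is a minimal survival-analysis sketch using the lifelines library, where vehicles that haven't failed yet enter the model as censored observations. All columns and values are made up:

```python
import pandas as pd
from lifelines import CoxPHFitter

# One row per vehicle: covariates, observed lifetime so far, and whether a
# battery failure has actually been observed (1) or the record is censored (0)
# because the vehicle is still running.
fleet = pd.DataFrame({
    "avg_cell_temp":         [34.1, 29.8, 41.2, 33.0, 38.6, 30.4],
    "deep_discharge_cycles": [12, 3, 27, 9, 21, 5],
    "days_in_service":       [410, 520, 180, 365, 240, 480],
    "failed":                [1, 0, 1, 0, 1, 0],
})

# A small penalizer keeps the fit stable on tiny, nearly separable samples.
cph = CoxPHFitter(penalizer=0.1)
cph.fit(fleet, duration_col="days_in_service", event_col="failed")

# Rank the fleet by relative failure risk; choosing a threshold on this score
# is where the precision/recall and repair-cost tradeoffs come in.
risk_scores = cph.predict_partial_hazard(fleet)
```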

Presentation layer: dashboards tailored to every role

Finally, ensuring that your data insights are easily accessible across the organization is critical to driving real value. A unified access layer that offers custom dashboards and reporting is essential for different teams to interact with the data based on their specific needs.

  • Example problem: Quality managers need an overview of production issues, while engineering teams need detailed predictions about component failures in specific vehicle lines. Without a unified interface, these teams are left juggling multiple systems.
  • Solution: A single access layer that integrates live telemetry data, historical insights, and AI-powered predictions allows each user to have a personalized experience. Custom dashboards tailored to specific roles—such as high-level fleet performance metrics for managers or granular failure mode analysis for engineers—ensure that everyone has access to the data they need, when they need it.

The ability to build and customize dashboards also promotes a more data-driven culture, empowering users to interact with the system directly without relying on external teams for every query or report.

The bottom line: unlocking value from your data investment

Centralizing your data into a lake is a major step forward, but the real value comes from what you build on top of it. By addressing key challenges around data modeling, feature creation, architecture, AI enrichment, and user accessibility, you can transform raw data into actionable insights that improve operations and reduce downtime.

Building this infrastructure isn’t easy, but with the right tools and architecture, you can maximize the potential of your connected vehicle data.

Learn more about how Viaduct can help you build and scale your asset management infrastructure today.
