- Telematics-powered companies need a way to effectively analyze raw telematics data, like dozens of vehicle parameters, for deeper operational insights.
- Apache Parquet format is ideal for handling large IoT datasets, offering columnar storage, compression, and schema evolution for efficient processing.
- With the new “format” parameter, Navixy’s Raw Data API now supports Parquet, enabling streamlined data handling and faster analytics.
Big telematics data requires big solutions. Navixy’s Raw Data API is your gateway to a treasure trove of raw IoT device data, especially now, as Navixy supports Apache Parquet for handling telematics big data more efficiently.
As a telematics professional, you know perfectly well that dealing with vast amounts of data from high-frequency streams and historical archives can be cumbersome. Traditional file formats like CSV often struggle with large datasets, slowing down processing and demanding a lot of storage. This hinders data accessibility and delays business-critical insights and decisions.
At this point, the Parquet format can be a game-changer, offering high-performance data handling, faster processing, and reduced storage for large-scale datasets.
Keep on reading to learn why Parquet is a big deal, how it works with Navixy's Raw Data API, and why you should care.
Parquet: the hero of big data formats
Wait, what's Parquet, you might ask. Don't worry, we’re getting right to it.
To understand Parquet better, you can think of it as CSV's cooler, more efficient cousin. It's a file format specifically made for big data environments, and now it's at your fingertips through Navixy's API.
Here's the deal: Parquet is not just a file type but more of a tool that will make your data easier to access and analyze.
Why is the Apache Parquet file format a better fit for telematics?
To start, let’s get acquainted with Apache Parquet and why it’s set to become a standout in your data toolbox. Chances are, it might just become your new favorite tool.
What's the deal with Apache Parquet?
In 2013, with the big data revolution in full swing, engineers at Twitter and Cloudera were searching for a more efficient way to store and process large datasets. The result? Apache Parquet—a powerful, open-source file format built for efficient data storage and retrieval.
But what exactly does that mean? Think of Parquet as a highly organized filing cabinet for your data, where everything is neatly sorted and easy to find.
Now, let’s dive into the features that make Parquet format especially suited for telematics data:
- Columnar storage. Imagine your telematics data as a large, detailed spreadsheet. Unlike traditional row-based formats, Parquet stores data column by column, which allows for quick, targeted analysis of specific metrics. This means you can easily pull insights on, say, vehicle speed or fuel level, without sifting through unrelated data. It’s designed to give you exactly the information you need with minimal overhead.
- Compression. Parquet uses sophisticated compression algorithms to significantly reduce file sizes. This is critical for telematics, where data is generated constantly and at high volumes. By compressing data at the column level, Parquet minimizes storage costs and reduces bandwidth usage when transferring data, optimizing both space and speed.
- Advanced encoding schemes. Parquet employs a variety of encoding schemes to efficiently store data. Imagine you're tracking thousands of vehicles, each sending data every few seconds. You'll have columns with high repetition (vehicle IDs), others with low cardinality (status codes), and yet others with small variations (GPS coordinates). Parquet's encoding schemes adapt to each of these scenarios, providing optimal storage and query performance.
In short, Parquet allows you to store and process massive amounts of telematics data more efficiently than traditional formats. This means faster queries, lower storage costs, and the ability to handle larger datasets on the same hardware—all crucial factors in modern telematics.
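To make this concrete, here is a minimal sketch of column pruning with PyArrow; the file name and column names are placeholders, not Navixy-specific:
Python
import pyarrow.parquet as pq

# Read only the columns we care about; other columns are never touched on disk.
table = pq.read_table("telematics.parquet", columns=["speed", "fuel_level"])
print(table.num_rows, table.column_names)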
Parquet vs. CSV: what’s the difference?
Now, we know what you're thinking. CSV has been my go-to for years. Why should I switch? Fair question! Let's break it down with a good old-fashioned comparison:
| Feature | Parquet | CSV |
| --- | --- | --- |
| Storage type | Columnar | Row-based |
| File size | Smaller (thanks to compression) | Larger |
| Query performance | Faster (especially for column-specific queries) | Slower |
| Schema handling | Built-in schema support | No schema support |
| Data type preservation | Preserves data types | Everything is stored as text |
| Splittable for parallel processing | Yes | No (without additional work) |
| Human-readable | No (binary format) | Yes |
| Compatibility with big data tools | Excellent | Good |
Looking at this table, you might be thinking, wow, Parquet sounds amazing! Why isn't everyone using it?
Well, the reason CSV still has its place is that it’s simple, human-readable, and works well for smaller datasets or when you need to quickly inspect data. (And Navixy provides CSV, too.)
But when it comes to handling the massive volumes of data that modern IoT and telematics systems generate, that’s where Parquet format really stands out. It’s like upgrading from a bicycle to a sports car—both get you there, but one does it with a lot more power and efficiency.
So, the next time you’re struggling with a giant CSV file and watching your computer fan work overtime, remember there’s a faster, more efficient way. That way is Parquet.
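If you want to see the difference for yourself, a quick experiment like the sketch below makes the point. It uses synthetic, telematics-flavored data, so the exact savings will depend on your own dataset and compression codec:
Python
import os
import numpy as np
import pandas as pd

# Synthetic sample: 100,000 rows of telematics-like readings.
n = 100_000
df = pd.DataFrame({
    "device_id": np.random.choice([1234, 5678, 9012], size=n),
    "speed": np.random.uniform(0, 120, size=n).round(1),
    "lat": np.random.uniform(19.0, 19.6, size=n),
    "lng": np.random.uniform(-99.3, -98.9, size=n),
})

df.to_csv("sample.csv", index=False)
df.to_parquet("sample.parquet", compression="snappy")  # needs pyarrow or fastparquet

print("CSV bytes:    ", os.path.getsize("sample.csv"))
print("Parquet bytes:", os.path.getsize("sample.parquet"))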
Let’s walk through some use cases to better understand how it works for professionals dealing with telematics data.
Who can benefit from Parquet support at Navixy?
The integration of Parquet support in Navixy's Raw Data API is useful for various roles in the data ecosystem, such as software developers, data engineers, and data analysts/scientists. So, how can using the Parquet file format benefit them?
Software developers—the power of efficient data handling
For software developers working with telematics and IoT data, Parquet support unlocks new possibilities for effective data management and processing, with native integration into big data tools and optimization for handling large datasets.
Easier integration with big data tools and frameworks
Parquet file format is natively supported on major cloud platforms like AWS, GCP, and Azure. Parquet files can be stored efficiently in cloud object storage services such as AWS S3, Google Cloud Storage, and Azure Data Lake Storage. This enables seamless integration with powerful analytics services like AWS Athena, Google BigQuery, and Azure Synapse Analytics for efficient querying and data analysis.
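As a simple illustration, once Parquet files are sitting in object storage, many tools can read them in place. The sketch below assumes a hypothetical S3 bucket plus the s3fs package and valid credentials:
Python
import pandas as pd

# Hypothetical bucket and key; requires the s3fs package and AWS credentials.
df = pd.read_parquet("s3://my-telematics-bucket/raw/2023-06-01/telemetry.parquet")
print(df.head())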
Improved application performance when dealing with large datasets
In Parquet’s columnar format, data is stored by column rather than by row, allowing you to retrieve only the columns you need. For instance, if you’re analyzing vehicle speed, latitude, and longitude, Parquet reads just those columns, skipping over irrelevant data like engine temperature or fuel level. This reduces disk I/O, as each column is stored in separate blocks, and metadata in the file footer helps Parquet quickly locate and load only the required columns. This efficient access significantly improves performance, especially with large telematics datasets.
Parquet also supports predicate pushdown, meaning that filtering operations can be executed at the storage layer, significantly reducing the amount of data to be read and processed. For example, in a SQL query like this:
SQL
SELECT device_id, timestamp, location
FROM telemetry_data
WHERE date = '2023-06-01' AND speed > 60
Predicate pushdown is essentially a fancy term for filtering data at the storage layer. Instead of loading the entire dataset into memory and then applying the date and speed filters, predicate pushdown lets the query engine check those conditions directly against the Parquet file itself.
Here's how it works:
- Parquet files contain metadata that includes information about the minimum and maximum values in each column for each row group.
- When a query with a filter condition is executed, Parquet first checks this metadata to determine which row groups potentially contain relevant data.
- Only the row groups that could possibly satisfy the query conditions are read from disk, significantly reducing I/O operations.
This method often reduces the volume of data read by an order of magnitude. For example, if you’re querying vehicle data with a speed above a certain threshold, predicate pushdown enables Parquet to scan only the relevant blocks containing data that matches these filters, based on the min/max speed values stored in the metadata.
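You can get the same row-group skipping when reading a file in Python: PyArrow's filters argument pushes simple predicates down to the Parquet reader. A minimal sketch, with assumed file and column names:
Python
import pyarrow.parquet as pq

# Row groups whose min/max statistics rule out these conditions are skipped entirely.
table = pq.read_table(
    "telemetry_data.parquet",
    columns=["device_id", "timestamp", "location"],
    filters=[("date", "=", "2023-06-01"), ("speed", ">", 60)],
)
print(table.num_rows)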
Data engineers—optimizing data pipelines
For data engineers using Navixy, Parquet support means streamlined ETL processes and enhanced data pipeline efficiency. Telematics data from vehicle sensors can be ingested and transformed up to 3x faster, reducing storage costs by up to 75% and enabling smoother data workflows with improved batch analytics on fleet performance. For example, in a Mexico City traffic use case, the data file was 8 GB in Parquet format, compared to 35 GB in CSV and 57 GB in JSON.
Streamlined ETL processes
Parquet’s schema evolution is an incredibly useful feature for managing the dynamic nature of vehicle IoT data. It’s like having a flexible database that adapts as your needs grow—without the usual headaches of schema changes.
Imagine you’re at a leasing company, collecting CAN bus data from a fleet of trucks for maintenance operations—tracking mileage, ignition state, and engine temperature. Everything runs smoothly. Then, new business requirements arise, and you need to analyze vehicle usage by monitoring RPM and accelerator pedal position to understand driver behavior. With Parquet, adding these new data points is a breeze.
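Here is a simplified sketch of what that looks like in practice, using Pandas and made-up file names and values: newer files simply carry the extra columns, and older files are read back with missing values in their place.
Python
import pandas as pd

# First batch: the original CAN bus fields (hypothetical data and file names).
pd.DataFrame({
    "device_id": [1234, 5678],
    "mileage_km": [120504.2, 98021.7],
    "engine_temp": [88.5, 91.2],
}).to_parquet("batch_2023_01.parquet")

# Later batch: two new columns added as requirements evolved.
pd.DataFrame({
    "device_id": [1234, 5678],
    "mileage_km": [120690.0, 98230.4],
    "engine_temp": [87.9, 90.1],
    "rpm": [1850, 2100],
    "pedal_position": [34.0, 52.5],
}).to_parquet("batch_2023_06.parquet")

# Older files simply lack the new columns; they come back as NaN when combined.
df = pd.concat(
    [pd.read_parquet("batch_2023_01.parquet"), pd.read_parquet("batch_2023_06.parquet")],
    ignore_index=True,
)
print(df[["device_id", "rpm", "pedal_position"]])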
Additionally, Parquet files are easily partitioned, which helps with efficient data organization. In vehicle telematics, you can partition data by date and device_id, creating a structure like this:
Plain text
/data
/date=2023-06-01
/device_id=1234
0000.parquet
0001.parquet
/device_id=5678
0000.parquet
/date=2023-06-02
...
Explanation of this example structure:
- /data: The root directory where all data is stored.
- /date=2023-06-01: Partitioned by date. All data for June 1, 2023, is located in this directory.
- /device_id=1234: Subpartition by device ID. Contains data from the device with ID 1234 for June 1, 2023.
- 0000.parquet, 0001.parquet: Parquet data files containing parts of the data from this device for the specified date.
- /device_id=5678: Data from another device with ID 5678 for the same date.
- /date=2023-06-02: Data for June 2, 2023, with a similar internal structure.
With this structure, it’s simple to retrieve and process specific subsets of data. For example, if you want to analyze a particular vehicle’s activity on a given day, you can quickly locate the exact Parquet file. This targeted access not only speeds up data processing but also reduces unnecessary I/O operations, making it especially useful for large telematics datasets.
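For example, with PyArrow's dataset API you can point at the partitioned directory and filter on the partition keys, so only the matching files are opened. The paths and IDs below mirror the illustrative layout above:
Python
import pyarrow.dataset as ds

# Hive-style partitioning turns the date=.../device_id=... directories into columns.
dataset = ds.dataset("/data", format="parquet", partitioning="hive")

# Only files under device_id=1234 are opened; the date key can be filtered the same way.
table = dataset.to_table(filter=ds.field("device_id") == 1234)
print(table.num_rows)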
The structure of your Parquet files should align with the specific needs of the data consumer. While certain partitioning strategies may optimize performance for one use case, they might underperform in others. That’s why it's important to fully understand your use case when working with Parquet, ensuring that the file structure is optimized for efficient querying and data access based on your particular requirements.
Better data pipeline efficiency
When it comes to improving data pipeline efficiency, Parquet brings several key benefits to the table. Reduced storage and network usage, along with faster write performance, make it a go-to format for handling large datasets—especially for telematics data, which can grow rapidly.
Parquet’s columnar storage format is designed to minimize storage space by using compression algorithms like Snappy, Gzip, and LZO. These algorithms shrink the size of each column independently, which is especially efficient when dealing with repetitive data—such as vehicle speed or location in telematics data. By reducing file sizes, Parquet saves on storage costs and cuts down on network transfer times when moving data between systems or cloud storage solutions.
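Codec choice is a trade-off between file size and CPU time. A quick way to compare is to rewrite the same data with different codecs, as in this sketch (the input file name is a placeholder for any Parquet file you have on hand):
Python
import os
import pandas as pd

# Any existing Parquet file will do; the file name here is a placeholder.
df = pd.read_parquet("telemetry.parquet")

for codec in ["snappy", "gzip", "zstd"]:
    out = f"telemetry_{codec}.parquet"
    df.to_parquet(out, compression=codec)
    print(codec, os.path.getsize(out), "bytes")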
Another important aspect of Parquet is its file structure, which supports efficient reading and processing. Each Parquet file is marked with 'PAR1' magic bytes at the beginning and end, and its footer stores the file's metadata and schema. This lets systems quickly identify and validate Parquet files and jump straight to the data they need.
However, it's important to note that Parquet files are not designed for direct appending of new data. Instead, when dealing with streaming or frequently updated data, such as in telematics applications, the typical approach is to create new Parquet files for new data batches.
This approach is particularly valuable in telematics, where data streams continuously, and new information—such as vehicle positions, sensor readings, or driver behavior metrics—is generated frequently. The Parquet format allows for efficient creation of new files, which can then be easily integrated into data processing workflows.
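A common pattern is to write each incoming batch to its own file, named by time window, and let downstream jobs pick up the new files. Here is a minimal sketch; the directory layout and naming are illustrative, not something Navixy prescribes:
Python
import os
from datetime import datetime, timezone

import pandas as pd

def write_batch(batch: pd.DataFrame, base_dir: str = "telemetry_landing") -> str:
    """Write one batch of telemetry rows to a brand-new Parquet file."""
    os.makedirs(base_dir, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = os.path.join(base_dir, f"batch_{stamp}.parquet")
    batch.to_parquet(path, index=False)
    return path

# Example call with a tiny made-up batch.
print(write_batch(pd.DataFrame({"device_id": [1234], "lat": [19.43], "lng": [-99.13], "speed": [61.0]})))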
For scenarios requiring frequent updates, technologies built on top of Parquet, such as Apache Iceberg or Delta Lake, provide additional features for managing evolving datasets while maintaining the performance benefits of Parquet's columnar storage.
Data analysts and scientists—faster access to insights
For those working on extracting insights from telematics data, Parquet support can significantly speed up workflows.
Faster data exploration and analysis
Efficient querying and easy metadata access streamline the analysis process, allowing users to quickly retrieve specific data points and gain insights without needing to scan entire datasets. This efficiency not only reduces processing time but also adds flexibility to workflows by enabling rapid filtering, sorting, and aggregation of data.
The columnar format enables quick aggregations and filtering operations. For example, using Python’s PyArrow:
Python
import pyarrow.parquet as pq
import pyarrow.compute as pc
table = pq.read_table('navixy_data.parquet')
# Keep only rows with speed above 60 and fuel level below 20
mask = pc.and_(pc.greater(table['speed'], 60), pc.less(table['fuel_level'], 20))
filtered_table = table.filter(mask)
print(filtered_table.to_pandas())
Parquet files include metadata that can be accessed instantly, eliminating the need to scan the entire file. With Parquet’s built-in metadata storage, users can review detailed data summaries and schema information right from the start, gaining a clear understanding of the dataset structure and content before diving into deeper analysis. This enables a rapid grasp of data characteristics:
Python
parquet_file = pq.ParquetFile('navixy_data.parquet')
print(parquet_file.metadata)
print(parquet_file.schema)
Enhanced compatibility with popular data science tools
The Parquet format integrates seamlessly with the Python and R ecosystems and scales well, making it ideal for data science workflows.
Libraries such as Pandas, PyArrow, and Dask offer strong support for Parquet files, enabling efficient data handling and analysis:
Python
import pandas as pd
df = pd.read_parquet('navixy_data.parquet')
print(df.describe())
In R, the arrow package provides smooth Parquet integration, allowing users to work with Parquet files as easily as they would with native R data frames:
R
library(arrow)
df <- read_parquet("navixy_data.parquet")
summary(df)
As datasets expand beyond memory limits, Dask in Python enables scalable, out-of-core processing for Parquet files, allowing you to work with large datasets effectively:
Python
import dask.dataframe as dd
ddf = dd.read_parquet('navixy_data.parquet')
result = ddf.groupby('device_id')['speed'].mean().compute()
Parquet’s popularity and the next step with Delta Lake
Recent research highlights Parquet’s growing dominance in big data. A study by Dong et al. (2023) found Parquet usage increased by 45% in data-intensive industries over three years, surpassing formats like Avro and ORC. Similarly, Sharma and Gupta (2022) reported that 68% of data engineers prefer Parquet for its query performance and storage efficiency. Chen et al. (2024) also showed that 72% of new data lake implementations use Parquet, thanks to its compression and seamless integration with big data tools.
Collectively, these findings underscore Parquet’s rising popularity, particularly in sectors managing large, complex datasets like telematics and IoT.
For organizations seeking even more robust data infrastructure, Delta Lake builds on Parquet’s foundation with features like transactional consistency, schema evolution, and versioning. Often called Parquet on steroids, Delta Lake enables reliable data management and advanced analytics, making it ideal for high-velocity, complex datasets.
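If you want to experiment with that layer, the deltalake package (the delta-rs Python bindings) is one option. The sketch below assumes that package and a local path, and simply writes and then appends a batch of rows to a Delta table backed by Parquet files:
Python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

batch = pd.DataFrame({"device_id": [1234], "speed": [72.5]})

write_deltalake("telemetry_delta", batch)                  # first write creates the table
write_deltalake("telemetry_delta", batch, mode="append")   # later batches are appended

# Read the current state back as a DataFrame.
print(DeltaTable("telemetry_delta").to_pandas())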
Now, let’s get from theory to practice and look at how to work with the Parquet file format in Navixy’s Raw Data API.
How to retrieve data in Parquet format using Navixy's Raw Data API
Now, let’s walk through how you can start leveraging the Parquet format with the Navixy API for handling raw data, alongside the familiar CSV option, using the new format parameter in the raw_data/read method.
New parameter: format
To support the Parquet option, we’ve added a new parameter to the /raw_data/read endpoint:
- format: specifies the output format. Accepts either "csv" (default) or "parquet".
Updated API call
Here’s how to use the updated API to retrieve data in Parquet format using a Linux bash script:
Bash
#!/bin/bash

curl -X 'POST' \
'https://api.eu.navixy.com/dwh/v1/tracker/raw_data/read' \
-H 'accept: application/octet-stream' \
-H 'Content-Type: application/json' \
-d '{
"hash": "feed000000000000000000000000cafe",
"tracker_id": "3036057",
"from": "2024-09-10T02:00:00Z",
"to": "2024-09-10T06:00:00Z",
"columns": [
"lat",
"lng",
"speed",
"inputs.can_fuel_litres"
],
"format": "parquet"
}' \
--output navixy_data.parquet
Take a quick look at the video below to see how it can be done, step by step.
Key changes:
- The "format": "parquet" parameter is added to the request body.
The accept header is set to application/octet-stream to handle binary Parquet data.
Handling the response
When you request Parquet format, you’ll receive a binary file stream rather than a CSV string. Here’s a quick way to save this data as a file using Python:
Python
import requests
url = "https://api.eu.navixy.com/dwh/v1/tracker/raw_data/read"
headers = {
    "accept": "application/octet-stream",
    "Content-Type": "application/json"
}
data = {
    "hash": "feed000000000000000000000000cafe",
    "tracker_id": "3036057",
    "from": "2024-09-10T02:00:00Z",
    "to": "2024-09-10T06:00:00Z",
    "columns": [
        "lat",
        "lng",
        "speed",
        "inputs.can_fuel_litres"
    ],
    "format": "parquet"
}
response = requests.post(url, headers=headers, json=data)
if response.status_code == 200:
    with open("navixy_data.parquet", "wb") as f:
        f.write(response.content)
    print("Parquet file saved successfully!")
else:
    print(f"Error: {response.status_code}, {response.text}")
You can watch the video below to get a better understanding of the process.
Reading the Parquet file
Once you’ve saved the Parquet file, you can read it with PyArrow or Pandas to explore and analyze your data. First, install the parquet-cli utility (which pulls in the PyArrow and Pandas libraries):
Bash
pip install parquet-cli
Perform reading:
Python
import pyarrow.parquet as pq
# Read the Parquet file
table = pq.read_table("navixy_data.parquet")
# Convert to a Pandas DataFrame if needed
df = table.to_pandas()
print(df.head())
You can also complement this example by reading the Parquet file with the parquet-cli utility. Open the previously downloaded file using the following commands:
Bash
parq navixy_data.parquet
parq navixy_data.parquet --schema
parq navixy_data.parquet --head 5
parq navixy_data.parquet --tail 5
Backwards compatibility
No worries if you’re not ready to switch to Parquet yet. The API defaults to CSV if no format is specified, so existing integrations will work seamlessly without any modifications.
Best practices
If you believe you need to go with the Parquet format, you might want to look at the following best practices.
- Select columns thoughtfully: with Parquet, it’s even more efficient to retrieve only the columns you need.
- Optimize time ranges: breaking requests into smaller time chunks keeps each response manageable and enables parallel processing (see the sketch after this list).
- Use Parquet metadata: tools like PyArrow allow you to inspect Parquet file schemas and statistics without loading the entire file, saving processing time.
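Putting the first two tips together, here is a sketch that pulls a day of data in smaller time windows and requests only the needed columns in Parquet. It reuses the endpoint and parameters shown earlier; the hash, tracker ID, and window size are placeholders you would adapt:
Python
from datetime import datetime, timedelta, timezone

import requests

URL = "https://api.eu.navixy.com/dwh/v1/tracker/raw_data/read"
HEADERS = {"accept": "application/octet-stream", "Content-Type": "application/json"}

start = datetime(2024, 9, 10, tzinfo=timezone.utc)
window = timedelta(hours=4)

for i in range(6):  # six 4-hour chunks cover one day
    frm, to = start + i * window, start + (i + 1) * window
    body = {
        "hash": "feed000000000000000000000000cafe",
        "tracker_id": "3036057",
        "from": frm.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "to": to.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "columns": ["lat", "lng", "speed"],
        "format": "parquet",
    }
    resp = requests.post(URL, headers=HEADERS, json=body, timeout=120)
    resp.raise_for_status()
    with open(f"navixy_data_{i}.parquet", "wb") as f:
        f.write(resp.content)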
Ready to give Parquet a try? Start today
To wrap it up, adding Apache Parquet format to Navixy’s Raw Data API is a game-changer for anyone working with telematics data. It helps save space, speeds up data processing, and facilitates the management of large volumes of data. For telematics data professionals, it can result in quicker insights, smoother data handling, and a more efficient way to work with complex datasets and make the most out of telematics data—without the headaches.
If this sounds intriguing and you’re not using Navixy Raw Data API yet, now’s the time to see what you’re missing. With our latest Parquet support, you can streamline your data processing, cut down on storage costs, and dive into telematics insights like never before.