Pentaho Data Integration: A Comprehensive ETL Tutorial

Pentaho is an open source business intelligence (BI) suite that includes robust extract, transform, load (ETL) capabilities through its Pentaho Data Integration (PDI) platform, also known as Kettle. In this comprehensive tutorial, we will explore PDI and how it can be used for a wide range of ETL tasks.

The Rise of Pentaho as Leading Open Source ETL Tool

With over 10 million downloads and thousands of customers worldwide, Pentaho has emerged as one of the most widely used open source ETL tools. Low cost of ownership is driving adoption, with analyst firm Gartner reporting that over 60% of Pentaho deployments replace traditional commercial tools like Informatica or Oracle Data Integrator.

Leading organizations like Lyft, NASDAQ and ABN AMRO use Pentaho for their mission-critical data integration needs because of its enterprise-grade capabilities. Beyond ETL, Pentaho supports building complete data pipelines, including data quality, metadata management and data mining functionality.

“With Pentaho we automated 98% of our retail chain's data integration from 200+ systems, reducing costs by 80% vs. legacy tools.” – Fortune 500 Retail Customer

So whether you need to consolidate data from hundreds of source systems or wish to leverage big data platforms like Hadoop, Pentaho provides a feature-rich toolset for building scalable data factories on open source.

Estimated Global Market Share of Leading ETL Tools

ETL Tool | 2021 Market Share
Informatica | 26%
SSIS | 22%
Pentaho PDI | 21%
Oracle Data Integrator | 12%
Talend | 11%
Other open source tools | 8%

With over one-fifth share globally, Pentaho stands tall as the open source leader for data integration today. Next, let's look at its key capabilities.

Key Features of Pentaho Data Integration

PDI offers a rich set of features to build data pipelines:

  • Graphical drag-and-drop interface for building transformations
  • In-memory execution engine for fast data flows
  • Connectivity to a wide range of data sources and targets
  • Over 150 built-in transformation steps like filtering, joining, aggregating
  • Metadata injection for lineage and impact analysis
  • Version control friendly – treat ETL like code
  • Scalability through partitioning and clustered execution
  • Integrated with Hadoop and modern big data ecosystems

These features make PDI flexible, efficient and enterprise-ready for all data integration scenarios – whether batch or real-time.

Sample Data Sources Supported by PDI

[Image: data sources supported by PDI]

PDI allows connecting to practically any traditional RDBMS or contemporary NoSQL database.

You can also integrate flat files, web services, Kafka streams and more. With over 200 out-of-the-box connectors, PDI reduces developer effort by providing pre-built data access components.

Comparing Pentaho Data Integration with Leading Open Source ETL Tools

How does PDI compare to ETL alternatives like Airflow and CloverETL?

ETL Tool | Design Paradigm | Performance | Big Data Suitability | Learning Curve | AMI Availability
Pentaho PDI | Graphical data transformation designer | Fast in-memory execution | Excellent, via native Hadoop integration | Low; drag-and-drop IDE | Marketplace AMIs available
Apache Airflow | Workflows configured as Python code | Depends on the underlying execution engine | Integrates via hooks and plugins | Steep; requires Python coding skills | No AMI; requires manual installation
CloverETL | Visual data mapping designer | Optimized code compilation | Dedicated big data components | Low; easy drag-and-drop widgets | Bundled Amazon EMR offering

While Airflow provides greater coding flexibility, PDI strikes a balance with visual programming for simpler use cases. Its ability to handle diverse data and to scale out through clustering makes Pentaho a preferred ETL foundation.

Having covered the overview, let's now jump into hands-on examples.

Step-by-Step Guide to Using PDI

Now that we understand PDI's key capabilities, let's walk through building an end-to-end ETL job. We will:

  1. Extract data from a MySQL source database
  2. Transform the data using filtering, aggregating and other rules
  3. Load the final dataset into a Hive data warehouse

This end-to-end demonstration covers connecting to traditional and big data sources.

1. Set up Database Connections

The first task is to establish connections to the source MySQL and target Hive databases.

Download the Pentaho Data Integration tool from the Pentaho community downloads page. I recommend getting the latest Pentaho 9.3 release bundle.

Follow the steps below to create the connections:

  1. Launch Spoon, the visual designer for Kettle ETL
  2. Click the New button in the database connections panel
  3. Select the database type (MySQL for the source, Hive for the target)
  4. Fill in the host, database name, username and password
  5. Test the connection to confirm the setup

These database connections can now be reused across multiple ETL jobs.
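
To keep connection details out of individual jobs, the connection dialog fields also accept PDI variables, which can be defined centrally in the kettle.properties file under the .kettle directory of your home folder. A minimal sketch, using hypothetical host and credential values:

```properties
# ~/.kettle/kettle.properties -- values below are hypothetical, for illustration only
MYSQL_HOST=mysql.example.com
MYSQL_PORT=3306
MYSQL_DB=sales
MYSQL_USER=etl_user
MYSQL_PASSWORD=change_me
```

The connection dialog can then reference these as ${MYSQL_HOST}, ${MYSQL_PORT} and so on, so the same transformation runs unchanged across environments.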

2. Read data from MySQL source

Next, data needs to be extracted from the MySQL source system.

  1. Drag the Table Input step to canvas
  2. Select the MySQL connection
  3. Choose the orders table

This will read the entire orders table, feeding rows into the subsequent transformation steps.
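
The Table Input step runs a SQL query against the selected connection. A minimal example query is shown below; the column names are illustrative, since the actual orders schema is not part of this tutorial:

```sql
-- Hypothetical column names; adjust to match the actual orders schema
SELECT
    order_id,
    cust_id,
    order_date,
    order_amount
FROM orders;
```

Limiting the select list to the columns you actually need keeps the row stream lean for the downstream steps.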

3. Transform the data

With the raw data extracted, we will enrich the orders data before loading it into Hive.

Some sample transformations:

  1. Add a new field order_year extracted from order_date:

    • Use the Select values step
    • Apply the DATE_FORMAT function
  2. Subtotal orders by customer and year:

    • Use the Group by step on cust_id, order_year
    • Aggregate to compute the revenue sum
  3. Filter records from before 2017:

    • Add a Filter rows step
    • Set the condition order_year < 2017

Multiple such steps can be added to shape data.
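
For readers who think in SQL, the combined effect of the three steps above is roughly equivalent to the following query. The column names are illustrative, and the filter keeps the rows matching the order_year < 2017 condition as written:

```sql
-- Rough SQL equivalent of the Select values, Group by and Filter rows steps
SELECT
    cust_id,
    YEAR(order_date)  AS order_year,
    SUM(order_amount) AS revenue
FROM orders
GROUP BY cust_id, YEAR(order_date)
HAVING order_year < 2017;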

4. Load data into Hive

The final task is loading processed data into the Hive data warehouse.

  1. Drag the Hive output step
  2. Select target Hive connection
  3. Map fields from the preceding transformation steps
  4. Configure loading method

Key load parameters:

  • Table handling strategy: create table, truncate before insert, etc.
  • Field mapping: Automatically map and resolve data types
  • Performance: Batch size, parallel load options

Execute the job to orchestrate the end-to-end ETL process!
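
In production you would typically run the saved transformation or job from the command line rather than from Spoon, using the Pan and Kitchen utilities that ship with PDI. A sketch, assuming the work was saved under the hypothetical names orders_to_hive.ktr and orders_to_hive.kjb:

```sh
# Run a single transformation with Pan (file path and name are hypothetical)
./pan.sh -file=/opt/etl/orders_to_hive.ktr -level=Basic

# Run a job that wraps the transformation with Kitchen
./kitchen.sh -file=/opt/etl/orders_to_hive.kjb -level=Basic
```

Both utilities return a non-zero exit code on failure, which makes them easy to wrap in cron or an external scheduler.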

Pentaho Marketplace for Plugins & Extensions

Pentaho's out-of-the-box functionality can be extended with over 150 plugins available from the Pentaho Marketplace.

Some popular plugins include:

  • Data quality: Assess data health and completeness via Dixon & Trillium
  • Machine learning: Integrate Python/R code or PMML models for scoring
  • Barcodes: Transform numeric data to industry standard barcodes
  • Geo IP data: Enrich customer data with geographic attributes

Plugins can be downloaded directly from within PDI, providing enterprise-grade extensions with little effort.

[Image: plugins available in the Pentaho Marketplace]

Consult component vendor documentation for installation steps and usage instructions.

Scaling Out ETL on a Hadoop Cluster

Pentaho shines brightest when it comes to big data integration. Using native connectivity, you can execute ETL jobs that leverage the underlying Hadoop cluster for scale.

Common scale-out techniques are:

  • Partition data stream across multiple nodes
  • Allocate each transformation step to different servers
  • Insert data into Hive dynamically across partitions
  • Leverage MapReduce algorithms for distributed processing

This allows you to handle gigantic data volumes in an efficient pipeline architecture.
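
Clustered execution in PDI is coordinated through Carte, a lightweight web server that ships with the tool and acts as a slave (worker) server on each node. A minimal sketch of starting two workers locally; the host and ports here are illustrative:

```sh
# Start two Carte slave servers that a clustered transformation can run on
./carte.sh 127.0.0.1 8081
./carte.sh 127.0.0.1 8082
```

Once the slave servers are registered in Spoon, a cluster schema can be defined and individual steps marked to run clustered across them.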

Here is a sample multi-node deployment on Hadoop:

[Image: sample multi-node PDI deployment on a Hadoop cluster]

Pentaho provides built-in partitioning logic and integration with YARN for clustered execution.

Comparing Community vs Enterprise Edition

Pentaho is available in two editions: the free open source community edition and the paid enterprise edition.

Capability | Community Edition | Enterprise Edition
Core ETL functionality | Complete | Additional advanced features
Big data integration | Supported | Enhanced components
Cloud deployment | Manual installation | Pre-configured AMIs
Support | Community forums | 24/7 technical support
Security | Basic | Enhanced encryption, Kerberos, etc.
License | Apache 2.0 | Commercial license

While the community edition is feature-rich enough for most needs, the enterprise edition is better suited to mission-critical deployments that require premium support.

The enterprise edition can be purchased through the Hitachi Vantara site, which offers on-premises installation, cloud deployment and managed services depending on the use case.

Active User & Developer Community

Pentaho fosters an active community for collaboration and support.

Some popular channels are:

  • Community forums: Discuss challenges faced during implementation
  • JIRA tickets: Log bugs and enhancement requests
  • Conferences: Meet other users at Pentaho World event
  • Mailing lists: Subscribe for the latest updates and newsletters

An engaged community results in quicker issue resolution. You can also directly influence product roadmap by sharing feedback.

Conclusion

We have only scratched the surface of Pentaho's extensive capabilities in this tutorial. From simple CSV parsing to complex big data integration, PDI provides a flexible and high-performance solution.

With its stunning growth, rich features and vibrant community, Pentaho stands tall as the de facto open source leader for data integration today.

If you have any other questions on using Pentaho for your projects, let us know in comments!
