Scheduling Same-Day Data Fetch from MongoDB to Hadoop with Apache NiFi
- Duong Hoang
- Jun 22, 2024
- 2 min read
Introduction to Apache NiFi
What is Apache NiFi?
Apache NiFi is an open-source tool designed to automate data flow between various systems. It offers a range of features that make it a robust solution for data management and integration.
Key Features of Apache NiFi
Browser-Based User Interface: Manage and monitor data flows through an intuitive web interface.
Data Provenance Tracking: Keep track of data origins and transformations.
Extensive Configuration: Customize and configure workflows to fit your needs.
Extensible Design: Extend functionalities with custom processors and connectors.
Secure Communication: Ensure data security with encrypted communication.

NiFi Architecture
Apache NiFi operates within a Java Virtual Machine (JVM) on the host operating system. The primary components of NiFi include:
Web Server: Hosts NiFi's HTTP-based command and control API.
Flow Controller: Manages the execution of data flows and schedules resources.
Extensions: Various types of extensions execute within the JVM to handle specific tasks.
Flow File Repository: Tracks the state of each FlowFile as it moves through the flow, backed by a persistent write-ahead log.
Content Repository: Stores the actual content bytes of FlowFiles, with a pluggable implementation.
Provenance Repository: Records all provenance event data, allowing for indexing and searching.
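The three repositories above map to on-disk directories configured in nifi.properties. A minimal sketch of the relevant entries, using NiFi's illustrative default paths:

nifi.flowfile.repository.directory=./flowfile_repository
nifi.content.repository.directory.default=./content_repository
nifi.provenance.repository.directory.default=./provenance_repository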
NiFi Cluster Architecture
NiFi can also run in a clustered environment, using a zero-master (zero-leader) clustering model:
Cluster Coordinator: An elected node that manages cluster membership and provides the current flow to nodes joining the cluster.
Primary Node: An elected node on which processors configured for "primary node only" execution (isolated processors) run.
Every node in the cluster performs the same tasks, each on a different portion of the data.
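For reference, clustering is also enabled through nifi.properties on each node. A minimal sketch, assuming an external ZooKeeper at zk-host:2181 (hostnames and ports are illustrative):

nifi.cluster.is.node=true
nifi.cluster.node.address=nifi-node1
nifi.cluster.node.protocol.port=11443
nifi.zookeeper.connect.string=zk-host:2181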
Scheduling Same-Day Data Fetch from MongoDB to Hadoop
Setting Up the Environment
To fetch data from MongoDB and transfer it to Hadoop using Apache NiFi, follow these steps:
Install Apache Hadoop
Install MongoDB and MongoDB Compass
Install Apache NiFi
Processing Flow in NiFi
We will use two processors in NiFi:
GetMongo
PutHDFS
Configuring the GetMongo Processor
Setup Scheduling Tab:
Scheduling Strategy: Use CRON Driven
Run Schedule: Set to 59 59 23 ? * MON-SUN so the flow fires at 23:59:59 every day, Monday through Sunday, as broken down below.
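Quartz cron expressions are read left to right as: seconds, minutes, hours, day-of-month, month, day-of-week. The schedule above therefore breaks down as:

59       - at second 59
59       - at minute 59
23       - at hour 23
?        - any day of the month
*        - every month
MON-SUN  - every day of the week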
Setup Properties Tab:
Mongo URI: The MongoDB connection string, e.g. mongodb://host1[:port1]
Mongo Database Name: The database to query.
Mongo Collection Name: The collection to query.
Query: A query in JSON format. We assume the collection has a time_created field, which we use to select each day's data (example values for these properties appear after the query below):
{
  "$where": "find_data_by_date_range(this.time_created)"
}
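Putting the tab together, the GetMongo property values might look like the sketch below (the database and collection names, sales_db and orders, are hypothetical):

Mongo URI:             mongodb://localhost:27017
Mongo Database Name:   sales_db
Mongo Collection Name: orders
Query:                 { "$where": "find_data_by_date_range(this.time_created)" }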
Create the Stored Function: Define the find_data_by_date_range function on the MongoDB server so the $where clause above can call it:
db.system.js.insertOne({
  _id: "find_data_by_date_range",
  value: function (created_time) {
    // Today's boundaries: 00:00:00.000 through 23:59:59.999
    var fromDate = new Date(new Date().setHours(0, 0, 0, 0));
    var toDate = new Date(new Date().setHours(23, 59, 59, 999));
    // Keep the document if its timestamp falls within today
    return (created_time >= fromDate && created_time <= toDate);
  }
});
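Before wiring the query into NiFi, you can sanity-check the stored function directly in the mongo shell. A quick test against the hypothetical orders collection (note that $where requires server-side JavaScript to be enabled):

// Should return only the documents whose time_created falls within today
db.orders.find({ "$where": "find_data_by_date_range(this.time_created)" })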
Configuring the PutHDFS Processor
Hadoop Configuration Resources: A comma-separated list of Hadoop configuration files, typically core-site.xml and hdfs-site.xml.
Directory: The HDFS directory where the fetched data will be written.
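A filled-in sketch of the PutHDFS properties, with illustrative paths that depend on your Hadoop installation:

Hadoop Configuration Resources: /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
Directory: /user/nifi/mongo/daily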
Summary
This guide walked through scheduling a same-day data fetch from MongoDB to Hadoop with Apache NiFi. By following these steps, you can automate the daily fetch-and-transfer process. If you have alternative solutions or improvements, please share your ideas in the comments.