Comparison: Data Preparation vs. Inline Data Wrangling in Machine Learning and Deep Learning Projects

I want to highlight a new presentation about Data Preparation in Data Science projects:

“Comparison of Programming Languages, Frameworks and Tools for Data Preprocessing and (Inline) Data Wrangling in Machine Learning / Deep Learning Projects”

Data Preparation as Key for Success in Data Science Projects

A key task in building appropriate analytic models for machine learning or deep learning is integrating and preparing data sets from various sources such as files, databases, big data stores, sensors or social networks. This step can take up to 80% of the whole project.

This session compares alternative techniques for preparing data, including extract-transform-load (ETL) batch processing (like Talend, Pentaho), streaming analytics ingestion (like Apache Storm, Flink, Apex, TIBCO StreamBase, IBM Streams, Software AG Apama), and data wrangling (DataWrangler, Trifacta) within visual analytics. The options and their trade-offs are shown in live demos using different advanced analytics technologies and open source frameworks such as R, Python, Apache Hadoop, Spark, KNIME or RapidMiner. The session also discusses how data preparation relates to visual analytics tools (like TIBCO Spotfire), and shows best practices for how the data scientist and the business analyst should work together to build good analytic models.
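To make the contrast between batch-style preparation and streaming ingestion a bit more concrete, here is a minimal Python sketch. It is not taken from the session or from any of the tools named above. The record fields (sensor, temp_c), the sample data, and the function names clean, prepare_batch and prepare_stream are all hypothetical and chosen only for illustration. The same cleaning rule is applied twice: once over a complete data set, as an ETL job would, and once record by record, as a streaming pipeline would.

```python
from typing import Dict, Iterable, Iterator, List, Optional


def clean(record: Dict[str, str]) -> Optional[dict]:
    """Normalize one raw record; return None if it cannot be repaired."""
    name = (record.get("sensor") or "").strip().lower()
    raw_value = (record.get("temp_c") or "").strip()
    if not name or not raw_value:
        return None  # missing key field: drop the record
    try:
        value = float(raw_value)
    except ValueError:
        return None  # non-numeric reading such as "n/a": drop the record
    return {"sensor": name, "temp_c": value}


def prepare_batch(records: Iterable[Dict[str, str]]) -> List[dict]:
    """Batch style: consume the whole input, then return a finished data set."""
    return [row for row in (clean(r) for r in records) if row is not None]


def prepare_stream(records: Iterable[Dict[str, str]]) -> Iterator[dict]:
    """Streaming style: clean each record as it arrives and emit it at once."""
    for record in records:
        row = clean(record)
        if row is not None:
            yield row


# Hypothetical raw input with typical quality problems.
raw = [
    {"sensor": " A1 ", "temp_c": "21.5"},
    {"sensor": "a2", "temp_c": "n/a"},
    {"sensor": "", "temp_c": "19.0"},
    {"sensor": "A3", "temp_c": " 18 "},
]

batch_result = prepare_batch(raw)
stream_result = list(prepare_stream(iter(raw)))
```

Both paths produce the same cleaned records here. The practical difference is when results become available: the batch function returns only after all input has been read, while the generator yields each cleaned record immediately, which is the property streaming engines build on.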

Key Takeaway: Inline Data Wrangling Within Visual Analytics Tooling

Key takeaways of this session:

–    Learn various options for preparing data sets to build analytic models
–    Understand the pros and cons and the targeted persona for each option
–    See different technologies and open source frameworks for data preparation
–    Understand the relation to visual analytics and streaming analytics, and how these concepts are actually leveraged to build the analytic model after data preparation

Slide Deck

The following shows the slide deck:


Video Recording: Data Preparation vs. (Inline) Data Wrangling

Here is the video recording:

Kai Waehner

bridging the gap between technical innovation and business value for real-time data streaming and applied AI.
