Implementation details
Multi-Tenant, Scalable, and Highly Available On-Prem Data Platform
Overview
Design and deployment of a robust on-premises data platform supporting multi-tenant architecture, scalability, and high availability.
Key components
Talend Data Management Platform – for data integration and quality
Talend Data Catalog Platform – for metadata management, data discovery and lineage
VMware Greenplum MPP Database – for massively parallel data processing
Outcome
Seamless integration into a complex government IT environment, followed by extensive performance and functional validation across diverse usage scenarios and data projects.
Implementation details – Data Integration and Quality
The client opted for Talend Data Management Platform, which includes the following capabilities:
Design and Productivity Tools (Studio)
Talend Studio is software that you download and install to visually create and test Jobs. Studio features include:
- Control and orchestrate data flows and data integrations with master Jobs
- Map, aggregate, sort, enrich and merge data
- Team collaboration with shared repository
- Continuous integration
- Audit, Job compare, impact analysis, testing, debugging and tuning
- Metadata bridge for metadata import/export and centralized metadata management
- Distant run and parallelization
- Dynamic schema, re-usable Joblets and reference projects
- Wizards and interactive data viewer
- Versioning
- Export and execute standalone Jobs in runtime environments
- Automatic documentation
- Controlled patch management
Studio Connectors
Talend Studio includes the following connectors for Job creation:
- RDBMS, Streaming Message Queues, Cloud DB, Cloud Storage, SaaS / Business, Big Data, DB for Analytics
Full list of components:
https://www.talendforge.org/components/index.php?version=255&edition=8&showAll=1
Management and Monitoring for Jobs
Talend Administration Center is software for managing Talend applications and components, as well as the administrative features and configurations that surround them:
- Ability to manage or view users, permissions, projects, execution engines
- Real-time statistics to track down rejected records and failed executions
- Design and schedule plans to chain or parallelize tasks including error recovery
- Time and event-based scheduler for tasks and plans
- Job execution logs are collected and can be viewed
- Audit logs are stored in files for reference and compliance
- High availability, load balancing and failover for task and plan executions
- Engine clusters for Jobs
- Single Sign-On (SSO) integration with several SSO providers
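The idea behind plans that chain or parallelize tasks with error recovery can be sketched in a few lines. This is a conceptual illustration only, not the Talend Administration Center API: `run_with_retry`, `run_plan` and the stage layout are hypothetical names, and simple retry is assumed as the recovery strategy.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List

Task = Callable[[], Any]


def run_with_retry(task: Task, retries: int = 1) -> Any:
    """Run a task, retrying up to `retries` extra times if it raises."""
    last_error: Exception = RuntimeError("task was never attempted")
    for _attempt in range(retries + 1):
        try:
            return task()
        except Exception as exc:  # assumption: any exception is retryable
            last_error = exc
    raise last_error


def run_plan(stages: List[List[Task]], retries: int = 1) -> List[List[Any]]:
    """Run stages one after another; tasks inside a stage run in parallel.

    If a task still fails after its retries, the exception propagates and
    later stages are not started.
    """
    results: List[List[Any]] = []
    with ThreadPoolExecutor() as pool:
        for stage in stages:
            futures = [pool.submit(run_with_retry, task, retries) for task in stage]
            # result() blocks until the task finishes and re-raises its error.
            results.append([future.result() for future in futures])
    return results
```

In this sketch a plan such as `run_plan([[extract_a, extract_b], [load]])` would run the two extract tasks side by side and start `load` only after both succeed.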
Data Quality
Talend Data Management Platform includes data quality features to profile, cleanse and mask data. Data quality features include:
- Data profiling and analytics with graphical charts and drilldown data
- Data privacy with masking and encryption
- Automated data standardization, cleansing and rules enforcement
- Data quality data mart containing the analyses and reports executed in Talend Studio
- Semantic discovery with automatic detection of patterns
- Data sampling
- Enrichment, harmonization, fuzzy matching and de-duplication
- Pattern library
- Advanced data profiling:
  - Fraud pattern detection using Benford's Law
  - Advanced statistics with indicator thresholds
  - Column set analysis
  - Advanced matching analysis
  - Time column correlation analysis
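The Benford's Law check listed above can be illustrated with a short first-digit test. This is a minimal sketch in plain Python, not Talend's implementation: the function names are hypothetical, and the chi-square goodness-of-fit statistic is assumed here as one common way to compare observed and expected digit frequencies. Benford's Law expects a leading digit d with probability log10(1 + 1/d).

```python
import math
from collections import Counter
from typing import Iterable, Optional

# 5% critical value of the chi-square distribution with 8 degrees of freedom.
CHI2_CRITICAL_5PCT_DF8 = 15.507


def benford_expected(digit: int) -> float:
    """Expected share of values whose leading digit is `digit` (1 to 9)."""
    return math.log10(1 + 1 / digit)


def leading_digit(value: float) -> Optional[int]:
    """Return the first non-zero digit of a number, or None if it has none."""
    if value == 0 or not math.isfinite(value):
        return None
    # Scientific notation always starts with the leading significant digit.
    return int(f"{abs(value):.15e}"[0])


def benford_chi_square(values: Iterable[float]) -> float:
    """Chi-square statistic of observed leading digits against Benford's Law."""
    digits = [d for d in map(leading_digit, values) if d is not None]
    n = len(digits)
    if n == 0:
        raise ValueError("no non-zero finite values to analyse")
    observed = Counter(digits)
    chi2 = 0.0
    for digit in range(1, 10):
        expected = n * benford_expected(digit)
        chi2 += (observed.get(digit, 0) - expected) ** 2 / expected
    return chi2
```

A statistic above `CHI2_CRITICAL_5PCT_DF8` suggests that the first-digit distribution departs from Benford's Law and that the column may merit review. It does not by itself prove fraud.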
Implementation details – Data Catalog
The client opted for Qlik Talend Data Catalog Advanced Edition, which includes the following capabilities:
- Faceted search, data sampling, semantic discovery, categorizing and auto-profiling
- Social curation with data tagging, comments, review, promotion, certification
- Data relationship discovery and certification
- Automatic discovery of the data lake and other data stores
Design and Productivity Tools
- Metadata search & analysis
- Business glossary
- Metadata documentation & enrichment
Bridges
- Crawling and harvesting from most supported RDBMS
- Harvesting from Talend Data Integration and Talend Data Preparation
- Harvesting from Tableau, Qlik Sense, Salesforce.com
- HiveQL Parsing
- Harvesting from most supported tools for data modeling, business intelligence, data integration
- Harvesting from most supported DM/DI, SQL, BI and MM tools
- Spark with Python or Scala parsing
Management and Monitoring
- Metadata documentation and end-to-end data lineage
- Impact analysis and change alerts
- Active/passive failover switching
- Customizable UI and REST API
- Multi-version and configuration control system
- Approval workflows for business glossary authoring
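Impact analysis over end-to-end lineage amounts to a graph traversal: starting from a changed asset, follow the lineage edges downstream and collect everything reachable. The sketch below shows that idea with a breadth-first search. It does not use the Data Catalog REST API; the function name, the dictionary layout and the asset names in `EXAMPLE_LINEAGE` are all hypothetical.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage: each asset maps to the assets that read from it.
EXAMPLE_LINEAGE: Dict[str, List[str]] = {
    "src.customers": ["stg.customers"],
    "stg.customers": ["dwh.dim_customer", "dwh.fact_orders"],
    "dwh.dim_customer": ["report.sales"],
    "dwh.fact_orders": ["report.sales"],
}


def downstream_impact(lineage: Dict[str, List[str]], changed: str) -> List[str]:
    """Return every asset reachable downstream of `changed`, nearest first."""
    impacted: List[str] = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:  # guards against cycles and duplicates
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted
```

Calling `downstream_impact(EXAMPLE_LINEAGE, "stg.customers")` lists the warehouse tables and the report that would need review after a change to the staging table.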