Aurimas Griciūnas (@Aurimas_Gr)

2025-01-15 | ❤️ 311 | 🔁 71


𝗗𝗮𝘁𝗮 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲𝘀 𝗶𝗻 𝗠𝗮𝗰𝗵𝗶𝗻𝗲 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝗦𝘆𝘀𝘁𝗲𝗺𝘀 can become complex, and for a good reason 👇

It is critical to ensure Data Quality and Integrity upstream of ML Training and Inference Pipelines; trying to do that in downstream systems will inevitably fail when working at scale.

It is a good idea to start thinking about the quality of your data at the point of creation (the producers). This is also where you can start to utilise Data Contracts.

Example architecture for a production grade end-to-end data flow:

𝟭: Schema changes are implemented in version control; once approved, they are pushed to the Applications generating the Data, the Databases holding the Data, and a central Data Contract Registry.

[𝗜𝗺𝗽𝗼𝗿𝘁𝗮𝗻𝘁]: Ideally you should be enforcing a Data Contract at this stage, when the Data is produced. Data Validation steps further downstream are Detection and Prevention mechanisms that stop low-quality data from reaching downstream systems, but there might be a significant delay before those checks run, by which point data may already be irreversibly corrupted or lost.
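As an illustration, producer-side enforcement can be as simple as refusing to emit an event that violates the contract. The event shape, field names, and rules below are hypothetical; a real setup would validate against an Avro or JSON Schema pulled from the Contract Registry:

```python
# Minimal sketch of producer-side contract enforcement.
# The contract below is a made-up example, not a real registry schema.

ORDER_CONTRACT = {
    "order_id": str,
    "amount": float,
    "currency": str,
}

def violations(event: dict) -> list[str]:
    """Return a list of contract violations; empty means the event conforms."""
    errors = [
        f"{field}: expected {typ.__name__}"
        for field, typ in ORDER_CONTRACT.items()
        if not isinstance(event.get(field), typ)
    ]
    if isinstance(event.get("amount"), float) and event["amount"] < 0:
        errors.append("amount: must be non-negative")
    return errors

def produce(event: dict) -> bool:
    """Emit only conforming events; reject bad data at the source."""
    errs = violations(event)
    if errs:
        print("rejected:", errs)   # in production: alert and fix the producer
        return False
    print("emitted:", event)       # in production: a Kafka producer call
    return True

produce({"order_id": "o-1", "amount": 42.0, "currency": "USD"})  # conforms
produce({"order_id": "o-2", "amount": -5.0})                     # rejected
```

The point is that the reject path runs inside the producer, before the event ever reaches a topic, so bad data never has to be clawed back downstream.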

Applications push generated Data to Kafka Topics:

𝟮: Events emitted directly by the Application Services.

👉 This also includes IoT Fleets and Website Activity Tracking.

๐Ÿฎ.๐Ÿญ: Raw Data Topics for CDC streams.

𝟯: One or more Flink Applications consume Data from the Raw Data streams and validate it against schemas in the Contract Registry.

𝟰: Data that does not meet the contract is pushed to a Dead Letter Topic.

𝟱: Data that meets the contract is pushed to a Validated Data Topic.

𝟲: Data from the Validated Data Topic is pushed to object storage for additional Validation.

𝟳: On a schedule, Data in the Object Storage is validated against additional SLAs in the Data Contracts and pushed to the Data Warehouse to be Transformed and Modeled for Analytical purposes.

𝟴: Modeled and Curated Data is pushed to the Feature Store System for further Feature Engineering.

𝟴.𝟭: Real-Time Features are ingested into the Feature Store directly from the Validated Data Topic (5).
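The validate-and-route step (3 to 5) can be sketched roughly like this. The topic names and the in-memory "registry" are stand-ins for real Kafka topics and the Data Contract Registry, and plain Python stands in for the Flink job:

```python
# Rough sketch of contract validation and routing (steps 3-5).
# Lists stand in for Kafka topics; the dict stands in for the registry.

CONTRACT_REGISTRY = {
    "clicks.raw": {"required": {"user_id", "url", "ts"}},
}

validated_topic: list[dict] = []
dead_letter_topic: list[dict] = []

def route(topic: str, event: dict) -> None:
    """Validate an event against its topic's contract and route it."""
    contract = CONTRACT_REGISTRY[topic]
    missing = contract["required"] - event.keys()
    if missing:
        # Step 4: violations go to the Dead Letter Topic, with context
        # so the producer team can diagnose and replay them later.
        dead_letter_topic.append({"event": event, "missing": sorted(missing)})
    else:
        # Step 5: conforming events go to the Validated Data Topic.
        validated_topic.append(event)

route("clicks.raw", {"user_id": "u1", "url": "/home", "ts": 1})  # validated
route("clicks.raw", {"user_id": "u2"})                           # dead-lettered
```

Keeping the rejected events (rather than dropping them) is what makes the Dead Letter Topic useful: nothing is lost, and fixed data can be replayed.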

👉 Ensuring Data Quality here is complicated, since checks against SLAs are hard to perform.

𝟵: High Quality Data is used in Machine Learning Training Pipelines.

𝟭𝟬: The same Data is used for Feature Serving at Inference time.

Note: ML Systems are plagued by other Data-related issues like Data and Concept Drift. These are silent failures; while they can be monitored, they are not covered by the Data Contract.
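For example, one common way to monitor such drift is the Population Stability Index (PSI), which compares the serving-time distribution of a feature against its training-time distribution. The feature values, bin edges, and threshold below are made up for illustration:

```python
# Sketch of data-drift monitoring with the Population Stability Index.
# Bin edges and the 0.2 threshold are illustrative conventions, not fixed rules.
import math

def psi(expected: list[float], actual: list[float], edges: list[float]) -> float:
    """PSI between two samples, binned by the given edges (0 = identical)."""
    def fractions(values: list[float]) -> list[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > e for e in edges)] += 1  # bin index for v
        # Floor at a small epsilon so log() never sees a zero fraction.
        return [max(c / len(values), 1e-6) for c in counts]
    exp, act = fractions(expected), fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp, act))

train = [0.1 * i for i in range(100)]        # training-time distribution
serve = [0.1 * i + 5.0 for i in range(100)]  # shifted serving distribution
score = psi(train, serve, edges=[2.5, 5.0, 7.5])
print(f"PSI = {score:.2f}")  # a common rule of thumb: PSI > 0.2 signals drift
```

Because drift is a property of the data's statistics rather than its schema, it belongs in a monitoring system, not in the contract itself, which is exactly the distinction the note above draws.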

Let me know your thoughts! 👇

#MachineLearning #DataEngineering #AI

Want to learn first principles of Agentic systems from scratch? Follow my journey here: https://www.newsletter.swirlai.com/p/building-ai-agents-from-scratch-part
