A data lake is a central repository where you can store, access, and analyze data of all types in one location, whether structured or unstructured. Unlike traditional systems, a data lake doesn’t require you to organize your data into a specific format before storing it. You can save your data as it is, often as files or binary large objects (blobs).
Azure Data Lake Storage is a cloud-based solution designed for storing huge volumes of data in any format. It supports big data analytics and lets you collect data of any type, at any ingestion speed, in one place. This makes it easy to access and analyze the data using different tools and frameworks.
Organizing Storage
Organizing storage in a data lake is crucial for efficient data management and processing. A well-structured hierarchical namespace can optimize performance and simplify data access.
A typical hierarchical structure includes:
Raw (Bronze) layer: raw data as ingested
Enriched (Silver) layer: cleaned and enriched data
Curated (Gold) layer: aggregated or analytics-ready data
For optimal data retrieval, implement a time-based directory structure:
/project-name/raw-files/year/month/day/hour/minute
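The time-based layout above can be generated programmatically so every ingestion job partitions files the same way. Below is a minimal sketch; the project name and timestamp are caller-supplied examples, not values from the original text:

```python
from datetime import datetime, timezone

def raw_path(project: str, ts: datetime) -> str:
    """Build a time-partitioned directory path for landing raw files,
    mirroring the /project-name/raw-files/year/month/day/hour/minute layout."""
    return (
        f"/{project}/raw-files/"
        f"{ts:%Y}/{ts:%m}/{ts:%d}/{ts:%H}/{ts:%M}"
    )

# Example: a file ingested at 2024-03-05 09:07 UTC lands under
path = raw_path("sales", datetime(2024, 3, 5, 9, 7, tzinfo=timezone.utc))
# → /sales/raw-files/2024/03/05/09/07
```

Zero-padding the month, day, hour, and minute keeps paths lexicographically sortable, which makes range scans over time windows cheap.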
Data Retention and Lifecycle Management
Azure Data Lake Storage offers sophisticated lifecycle management capabilities to optimize storage costs while maintaining data accessibility. Based on policies you configure, data transitions automatically between storage tiers according to its age or access patterns.
Storage Tiers and Retention Periods
Hot Tier
The hot tier is designed for frequently accessed data, offering immediate availability and the lowest access costs (at the highest storage cost). This tier is ideal for data that's actively being processed or analyzed.
Cool Tier
Data that's accessed less frequently can be moved to the cool tier, which requires a minimum 30-day retention period. This tier offers lower storage costs but slightly higher access costs.
Cold Tier
For infrequently accessed data, the cold tier requires a 90-day minimum retention period. It provides even lower storage costs but higher retrieval costs compared to the cool tier.
Archive Tier
The archive tier is optimal for rarely accessed data, with a minimum retention period of 180 days. While offering the lowest storage costs, it has the highest data retrieval costs, and archived data can take hours to rehydrate before it is readable.
Policy Implementation
Lifecycle management policies can be configured to:
Transition blobs between tiers based on last access time
Move data to cooler tiers after specified periods
Delete expired data automatically
Apply rules to specific containers or blob subsets
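The rules above are expressed as a JSON policy document attached to the storage account. The sketch below builds one in Python; the rule name, prefix, and day thresholds are illustrative choices, not prescribed values:

```python
import json

# Hypothetical lifecycle policy: cool after 30 days, archive after 180,
# delete after a year, scoped to block blobs under a "raw-files/" prefix.
policy = {
    "rules": [
        {
            "enabled": True,
            "name": "age-out-raw-data",  # illustrative rule name
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"],
                    "prefixMatch": ["raw-files/"],  # illustrative prefix
                },
                "actions": {
                    "baseBlob": {
                        "tierToCool": {"daysAfterModificationGreaterThan": 30},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 180},
                        "delete": {"daysAfterModificationGreaterThan": 365},
                    }
                },
            },
        }
    ]
}

print(json.dumps(policy, indent=2))
```

Note that the chosen thresholds respect the minimum retention periods of the cool (30 days) and archive (180 days) tiers described earlier, avoiding early-deletion charges.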
Access Control Mechanisms
Securing your data lake properly requires understanding the various access control methods available in Azure Data Lake Storage. Each mechanism serves different purposes and provides varying levels of granularity for controlling access to your data assets.
ACLs (Access Control Lists)
Access Control Lists provide fine-grained permissions at the file and directory level within your data lake. Azure Data Lake Storage implements a POSIX-like ACL system that should be familiar to users of Linux file systems.
Permission Types
ACLs in ADLS support three permission types:
Read (r): Allows viewing file contents and directory listings
Write (w): Enables creating, modifying, or deleting files and directories
Execute (x): Permits traversing directories (required to access child items)
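These three flags combine into the familiar POSIX-style triad (for example, an ADLS ACL entry such as `user:<object-id>:r-x`). A small sketch of decoding such a triad, assuming nothing beyond standard POSIX notation:

```python
def parse_permissions(triad: str) -> dict:
    """Decode a three-character POSIX-style permission triad (e.g. 'r-x')
    into read/write/execute flags, as used in ADLS ACL entries."""
    if len(triad) != 3:
        raise ValueError("expected exactly three characters, e.g. 'rwx'")
    return {
        "read":    triad[0] == "r",
        "write":   triad[1] == "w",
        "execute": triad[2] == "x",
    }

# 'r-x' on a directory lets a principal list its contents and traverse
# into children, but not create or delete anything inside it.
print(parse_permissions("r-x"))
# → {'read': True, 'write': False, 'execute': True}
```

Remember that reaching a file requires execute (x) on every directory along its path, so granting read on a deep file without traverse rights on its ancestors is not sufficient.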
RBAC (Role-Based Access Control)
Azure RBAC provides coarse-grained access control at the storage account or container level. This approach is well-suited for managing access for large groups of users or applications.
Built-in Azure Storage Roles
Azure offers several built-in roles specifically for data lake access:
Storage Blob Data Owner: Full access to blob storage containers and data, including ACL modification
Storage Blob Data Contributor: Read, write, and delete access to blob containers and data
Storage Blob Data Reader: Read-only access to blob containers and data
Implementing RBAC with Azure Portal
1. Navigate to your storage account in the Azure Portal
2. Select "Access Control (IAM)"
3. Click "Add role assignment"
4. Choose the appropriate role (e.g., Storage Blob Data Reader)
5. Select the users, groups, or service principals
6. Review and assign
SAS Tokens (Shared Access Signatures)
SAS tokens provide time-limited, scoped access to storage resources without requiring Microsoft Entra ID credentials. This mechanism is ideal for providing temporary access to external clients or applications.
SAS Token Types
Azure supports several types of SAS tokens:
Service SAS: Grants access to a specific service within a storage account
Account SAS: Grants access to one or more services in a storage account
User delegation SAS: Signed with a user delegation key obtained through Microsoft Entra ID, for enhanced security
SAS Token Parameters
A typical SAS token includes parameters like:
srt: Resource type (service, container, object)
sp: Permissions (read, write, delete, list)
st: Start time
se: Expiry time
Security Best Practices
When using SAS tokens:
Set the shortest possible validity period
Use HTTPS only (spr=https)
Consider using user-delegated SAS when possible
Regenerate account keys periodically if you're using account or service SAS
API Access with Blob APIs and SAS
Azure Data Lake Storage is accessible via the Blob Storage API, enabling various programming languages and tools to interact with your data lake. Understanding the API fundamentals and integration with SAS tokens is essential for creating robust data pipelines.
A typical SAS-authenticated request URL takes the form:
https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{sas_token}
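Assembling the URL pattern above is straightforward string composition. In the sketch below, the account, container, blob, and token values are all placeholders:

```python
def blob_url(account: str, container: str, blob: str, sas: str) -> str:
    """Assemble a SAS-authenticated Blob endpoint URL; all inputs
    here are placeholder values for illustration."""
    return f"https://{account}.blob.core.windows.net/{container}/{blob}?{sas}"

url = blob_url(
    "mydatalake",                      # hypothetical account name
    "raw-files",                       # hypothetical container
    "2024/03/05/events.json",          # hypothetical blob path
    "sv=2022-11-02&sp=r&sig=REDACTED"  # non-functional token
)
```

Note that the blob name may itself contain `/` separators; with a hierarchical namespace those map onto real directories, which is what makes the time-partitioned layout from earlier addressable over the same API.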
Blob REST API Fundamentals
The Azure Blob Storage REST API provides a comprehensive set of operations for interacting with your data lake.
Core Operations
PUT: Upload blobs to your storage account
GET: Download blob content
LIST: Enumerate containers and blobs
DELETE: Remove blobs and containers
Authentication Mechanisms
The Blob API supports multiple authentication methods:
Shared Key: Using the storage account access key
SAS Token: Using a Shared Access Signature
Microsoft Entra ID: Using OAuth tokens
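With Shared Key authorization, each request carries an `Authorization: SharedKey {account}:{signature}` header, where the signature is an HMAC-SHA256 over a canonicalized request string, keyed with the base64-decoded account key. The sketch below shows only that signing step; constructing the real string-to-sign (verb, headers, canonicalized resource) is omitted, and the key is a dummy value:

```python
import base64
import hashlib
import hmac

def sign(string_to_sign: str, account_key_b64: str) -> str:
    """HMAC-SHA256 the canonicalized request string with the
    base64-decoded account key, returning a base64 signature."""
    key = base64.b64decode(account_key_b64)
    digest = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode()

# Dummy key for illustration only; never hard-code real account keys.
dummy_key = base64.b64encode(b"not-a-real-key").decode()

# The "..." stands for the canonicalized headers omitted from this sketch.
signature = sign("GET\n...\n/account/container/blob", dummy_key)
# The request would then carry: Authorization: SharedKey {account}:{signature}
```

In practice the SDKs and `azcopy` perform this signing for you; seeing the mechanism mainly clarifies why leaking an account key is equivalent to granting full access, and why SAS or Entra ID tokens are usually preferable.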
Security Best Practices
Encryption Options
Azure Data Lake Storage offers multiple encryption mechanisms to secure your data:
Storage Service Encryption (SSE): Automatically encrypts data at rest using Microsoft-managed keys
Customer-managed keys (CMK): Store encryption keys in Azure Key Vault for enhanced control
Encryption in transit: All data is encrypted using TLS 1.2 or later during transfer
When implementing a robust security strategy:
Consider using customer-managed keys for sensitive data
Implement a key rotation schedule
Enable encryption for temp disks and data flows between compute and storage resources
Network Isolation
Protect your data lake from unauthorized access through network isolation:
Private Endpoints: Create a private IP address for your storage account within your VNet
Service Endpoints: Limit storage account access to specific virtual networks
Network Rules: Configure IP-based access restrictions
Data Governance
Metadata Management
Enhance data discoverability and management with a robust metadata strategy:
Implement a consistent tagging system for containers and directories
Use metadata to track data lineage, ownership, and sensitivity
Consider integrating with Microsoft Purview (formerly Azure Purview) for comprehensive data cataloging
Audit Logging and Monitoring
Maintain visibility into data lake activities:
Enable diagnostic logs for storage accounts
Track all authenticated REST API requests
Monitor operations on data including creation, deletion, and access
Configure log retention based on compliance requirements