MODERN DATA WAREHOUSE & REVERSE ETL

An extension to the Modern Data Warehouse (MDW) that I have heard a bit about lately is called "Reverse ETL". Before I describe what that is, first I wanted to give a quick review of a typical MDW, which consists of five stages.

DATA MESH DEFINED

The two latest trends in emerging data platform architectures are the Data Lakehouse (the subject of my last blog, Data Lakehouse defined), and the Data Mesh, the subject of this blog. Data Mesh was first introduced by ThoughtWorks via the blog How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. From that blog is the graphic (Data mesh architecture from 30,000 foot view).

DATABRICKS DELTA LAKE

Posted on October 15, 2019 by James Serra. November 16, 2020. Introduced in April 2019, Databricks Delta Lake is, in short, a transactional storage layer that runs on top of cloud storage such as Azure Data Lake Storage (ADLS) Gen2 and adds a layer of reliability to organizational data lakes by enabling many features such as...

THE MICROSOFT POWER PLATFORM

The Microsoft Power Platform consists of three products: Power BI, PowerApps, and Microsoft Flow (now called Power Automate). I find customers are confused on the use cases of these products and how they compare to other products (Azure Functions, Logic Apps, and Azure Data Factory). So I wanted to write a blog and describe at a high level what these products are and their primary...

TOP SQL SERVER MIGRATION QUESTIONS

Posted on July 29, 2019 by James Serra. Below are the top 12 questions I am seeing from customers looking to migrate on-prem SQL Server to Azure, and the blogs that I have written that try to answer each question (I have updated most of these blogs): How does PaaS database high availability work? Azure SQL Database high availability.

DATA GATEWAY CONFUSION

Microsoft has created data gateways as a way that cloud products such as Power BI and Azure Machine Learning can access on-prem data, often called a hybrid model. This is most useful for companies that do not want to put their data in the cloud, or there is simply too much data to upload to the cloud. The confusion is...

IMPROVING CUBE PROCESSING TIME

Partitioning the cube can help to reduce the processing time. So can using a different processing strategy than a Process Full. The biggest benefit of partitioning is that it allows you to process multiple partitions in parallel on a server that has multiple processors. This can greatly reduce the total cube processing time.

THINKING ABOUT TAKING A CONTRACT POSITION? QUESTIONS TO...

Many of these questions will also apply if you are thinking about working as a salaried employee for a consulting firm: What is the job description? What is the length of the contract? What is the pay rate? What is the company name of your client that I would be working at? Where is the work location?

HOW TO RUN PROGRAMS AS A DOMAIN USER

The two "desirable" solutions are: 1) Use the RUNAS command with an option that most people have not noticed: /netonly. This makes the command run as your local user, but uses the supplied domain credentials only when accessing the network. Without the /netonly switch, the RUNAS command will fail because it tries to run the command locally as...
JAMES SERRA'S BLOG

Posted on April 27, 2021 by James Serra. April 26, 2021. I created a short YouTube video (20 minutes) that is a whiteboarding session that describes the five stages (ingest, store, transform, model, visualize) that make up a modern data warehouse (MDW) and the Azure products that you can use for... Continue reading.
DEVOPS FOR DATABASES: "DATAOPS"

DevOps, a set of practices that combines software development (Dev) and information-technology operations (Ops), has become a very popular way to shorten the systems development life cycle and provide continuous delivery of applications ("software"). The application of continuous delivery and DevOps to data analytics has been termed DataOps, which is the topic of this blog.

SQL ON-DEMAND IN AZURE SYNAPSE ANALYTICS

The new SQL on-demand (to be renamed SQL serverless) offering within Azure Synapse Analytics opens up a whole new world of getting insights into data immediately, at low cost, using familiar tools/languages. There is no infrastructure to set up or clusters to maintain. It uses a pay-per-query model (you are only charged per query for the data processed by the query; see pricing).

DELTA LAKE | JAMES SERRA'S BLOG

Microsoft Ignite has always announced many new products and new product features, and this year was no exception. Many exciting announcements, and below I list the major data platform related announcements: Usage-based optimization with Azure Synapse and Power BI: A new...

SQL DATABASE MI BENEFITS

As I see a huge number of customers migrating their on-prem databases to the Azure cloud, the main question they ask is whether they should go with an IaaS solution (SQL Server in a VM) or a PaaS solution (SQL Database), because SQL Database MI (Managed Instance) has near 100% compatibility with on-premises SQL...

ACCESSING MANAGED INSTANCE VIA SSMS

This info will also be used to log in via SSMS. So, to log in via SSMS, gather the server name and admin login/password mentioned above, choose to connect to a database engine in SSMS, and then enter that info into the connection screen similar to the one below, using SQL Server Authentication. Again, note the "public" in the server...

OPERATIONAL DATA STORE (ODS) DEFINED

An ODS is meant for operational reporting and supports current or near real-time reporting requirements, whereas a data warehouse is meant for historical and trend analysis reporting, usually on a large volume of data. An ODS contains only a short window of data, while a data warehouse contains the entire history of data.

DATA VIRTUALIZATION VS DATA WAREHOUSE

Data virtualization goes by a lot of different names: logical data warehouse, data federation, virtual database, and decentralized data warehouse. Data virtualization allows you to integrate data from various sources, keeping the data in place, so that you can generate reports and dashboards...

DATA WAREHOUSE FAST TRACK REFERENCE GUIDE FOR SQL SERVER 2016

I had previously blogged about the Data Warehouse Fast Track for SQL Server 2016, a joint effort between Microsoft and its hardware partners to deliver validated, pre-configured solutions that reduce the complexity of implementing a data warehouse on SQL Server Enterprise Edition.

METHODS FOR POPULATING A DATA WAREHOUSE

Source tables change over time. A data mart or data warehouse that is based on those tables needs to reflect these changes. This blog post will explain different solutions for solving this problem. When moving data into a data warehouse, taking it from a source system is the first step in the ETL process...
JAMES SERRA'S BLOG
Big Data and Data Warehousing
* Home
* FAQ
* Presentations
* Consulting
* About Me
* Contact Me
AZURE STACK AND AZURE ARC FOR DATA SERVICES

Posted on November 18, 2020 by James Serra
For those companies that can't yet move to the cloud, have certain workloads that can't move to the cloud, or have limited to no internet access, Microsoft has options to build your own private on-prem cloud via Azure Stack and Azure Arc. I'll focus this blog on using these products to host your databases. Azure Stack is an extension of Azure that provides a way to run apps and databases in an on-premises environment and deliver Azure services via three options:
AZURE STACK HUB: Run your own private, autonomous cloud, connected or disconnected, with cloud-native apps using consistent Azure services on-premises. Azure Stack Hub integrated systems come in racks of 4-16 servers built by trusted hardware partners and delivered straight to your datacenter. Azure Stack Hub is built on industry-standard hardware and is managed using the same tools you already use for managing Azure subscriptions. As a result, you can apply consistent DevOps processes whether you're connected to Azure or not. The Azure Stack Hub architecture lets you provide Azure services for remote locations with intermittent connectivity or disconnected from the internet. You can also create hybrid solutions that process data locally in Azure Stack Hub and then aggregate it in Azure for additional processing and analytics. Finally, because Azure Stack Hub is installed on-premises, you can meet specific regulatory or policy requirements with the flexibility of deploying cloud apps on-premises without changing any code. See Azure Stack Hub overview.

AZURE STACK EDGE: Get rapid insights with an Azure-managed appliance using compute and hardware-accelerated machine learning at edge locations for your Internet of Things (IoT) and AI workloads. Think of it as a much smaller version of Azure Stack Hub that uses purpose-built hardware-as-a-service, such as the Pro GPU, Pro FPGA, Pro R, and Mini R. The Mini is designed to work in the harshest environment conditions, supporting scenarios such as tactical edge, humanitarian, and emergency response efforts. See Azure Stack Edge documentation.

AZURE STACK HCI (preview): A hyperconverged infrastructure (HCI) cluster solution that hosts virtualized Windows and Linux workloads and their storage in a hybrid on-premises environment. Think of it as a virtualization fabric for VM or Kubernetes hosting, software only, to put on your certified hardware. See Azure Stack HCI solution overview.
These Azure Stack options are almost all VMs/IaaS, with no PaaS options for data services such as SQL Database (the only data service available is SQL Server in a VM). It is integrated, certified hardware and software run by Microsoft: just plug in and go. For support, there is "one throat to choke," as the saying goes. It is a great option if you are disconnected from Azure. It extends Azure management and security to any infrastructure and provides flexibility in deployment of applications, making management more consistent (a single view for on-prem, clouds, and edge). It brings the Azure fabric to your own data center but allows you to use your own security requirements. Microsoft orchestrates the upgrades of hardware, firmware, and software, but you control when those updates happen.

AZURE ARC is a software-only solution that can be deployed on any hardware, including Azure Stack, AWS, or your own hardware. With Azure Arc and Azure Arc-enabled data services (preview) you can deploy Azure SQL Managed Instance (SQL MI) and Azure Database for PostgreSQL Hyperscale to any of these environments, which requires Kubernetes. It can also manage SQL Server in a VM by just installing an agent on the SQL Server (see Preview of Azure Arc enabled SQL Server is now available).
Any of these databases can then be easily moved from your hardware to Azure down the road. It allows you to extend Azure management across your environments, adopt cloud practices on-premises, and implement Azure security anywhere you choose. This allows for many options to use Azure Arc on Azure Stack or on other platforms (click to expand).

Some features of Azure Arc:

* It can be used to solve for data residency requirements (data sovereignty)
* It is supported in disconnected and intermittently connected scenarios such as air-gapped private data centers, cruise ships that are off the grid for multiple weeks, factory floors that have occasional disconnects due to power outages, etc.
* Customers can use Azure Data Studio (instead of the Azure Portal) to manage their data estate when operating in a disconnected/intermittently connected mode
* It could eventually support other products like Azure Synapse Analytics
* You can use larger hardware solutions and more hardware tiers than what is available in Azure, but you have to do your own HA/DR
* You are not charged if you shut down SQL MI, unlike in Azure, as it's your hardware; in Azure the hardware is dedicated to you even if you are not using it
* With Arc you are managing the hardware, but with Stack Microsoft is managing the hardware
* You can use modern cloud billing models on-premises for better cost efficiency
* With Azure Arc enabled SQL Server, you can use the Azure Portal to register and track the inventory of your SQL Server instances across on-premises, edge sites, and multi-cloud in a single view. You can also take advantage of Azure security services, such as Azure Security Center and Azure Sentinel, as well as use the SQL Assessment service
* Azure Stack Hub provides consistent hardware, but if you use your own hardware you have more flexibility and possibly cheaper hardware costs
These slides cover the major benefits of Azure Arc and what the architecture looks like. Looking at the differences when you are connected directly vs. connected indirectly (i.e., an Arc server is not connected to the internet so must coordinate with a server that is connected): Here is what an Azure Arc data services architecture looks like. Some of the top use cases we see with customers using Azure Stack and/or Azure Arc:

* Cloud-to-cloud failover
* On-prem databases with failover to cloud
* Easier migration: deploy locally, then flip a switch to go to the cloud

This slide provides details on the differences with SQL databases (click to expand):
More info:

Understanding Azure Arc Enabled SQL Server
What is Azure Arc Enabled SQL Managed Instance

Posted in Azure Arc, Azure SQL Database, Azure Stack, SQLServerPedia Syndication
EXTERNAL TABLES VS T-SQL VIEWS ON FILES IN A DATA LAKE

Posted on November 3, 2020 by James Serra
A question that I have been hearing recently from customers using Azure Synapse Analytics (the public preview version) is: what is the difference between using an external table versus a T-SQL view on a file in a data lake? Note that a T-SQL view and an external table pointing to a file in a data lake can be created in both a SQL Provisioned pool as well as a SQL On-demand pool.
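To make the comparison concrete, here is a minimal sketch of both approaches in a SQL On-demand (serverless) pool. The storage account, container, and column names are made up for illustration, and dedicated (Provisioned) pools use slightly different Polybase external table options:

```sql
-- Hypothetical external data source and file format (names are illustrative)
CREATE EXTERNAL DATA SOURCE SalesLake
WITH (LOCATION = 'https://mystorageaccount.dfs.core.windows.net/datalake');

CREATE EXTERNAL FILE FORMAT ParquetFF
WITH (FORMAT_TYPE = PARQUET);

-- External table: an explicitly defined schema is required, and the
-- metadata can be shared with other computes such as Spark
CREATE EXTERNAL TABLE dbo.SalesExt
(
    SaleId   INT,
    SaleDate DATE,
    Amount   DECIMAL(10, 2)
)
WITH (
    LOCATION    = '/sales/**',     -- read all files under /sales recursively
    DATA_SOURCE = SalesLake,
    FILE_FORMAT = ParquetFF
);

-- View over OPENROWSET: automatic schema inference, plus the filepath()
-- virtual function so queries can do partition elimination on the year folder
CREATE VIEW dbo.vSales AS
SELECT r.filepath(1) AS sale_year, r.*
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/datalake/sales/year=*/*.parquet',
    FORMAT = 'PARQUET'
) AS r;
```

With this layout, a query like `SELECT SUM(Amount) FROM dbo.vSales WHERE sale_year = '2020'` can skip the folders for other years, which the external table cannot do.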
Here are the differences that I have found:

* Overall summary: views are generally faster and have more features, such as OPENROWSET
* Virtual functions (filepath and filename) are not supported with external tables, which means users cannot do partition elimination based on FILEPATH or complex wildcard expressions via OPENROWSET (which can be done with views)
* External tables can be shared with other computes, since their metadata can be mapped to and from Spark and other compute experiences, while views are SQL queries and thus can only be used by a SQL On-demand or SQL Provisioned pool
* External tables can use indexes to improve performance, while views would require indexed views for that
* SQL On-demand automatically creates statistics both for an external table and for views using OPENROWSET. You can also explicitly create/update statistics on files using OPENROWSET. Note that automatic creation of statistics is turned on for Parquet files. For CSV files, you need to create statistics manually until automatic creation of CSV file statistics is supported
* Views give you more flexibility in the data layout (external tables expect the OSS Hive partitioning layout, for example) and allow more query expressions to be added
* External tables require an explicitly defined schema, while views can use OPENROWSET to provide automatic schema inference, allowing for more flexibility (but note that an explicitly defined schema can provide faster performance)
* If you reference the same external table in your query twice, the query optimizer will know that you are referencing the same object twice, while two identical OPENROWSETs will not be recognized as the same object. For this reason, in such cases better execution plans can be generated when using external tables instead of views using OPENROWSETs
* Row-level security (Polybase external tables for Azure Synapse only) and Dynamic Data Masking will work on external tables. Row-level security is not supported with views using OPENROWSET
* You can use both external tables and views to write data to the data lake via CETAS (this is the only way either option can write data to the data lake)
* If using SQL On-demand, make sure to read Best practices for SQL on-demand (preview) in Azure Synapse Analytics

I often get asked what the difference in performance is between querying a file in ADLS Gen2 via an external table or view vs. querying a highly compressed table in a SQL Provisioned pool (i.e. a managed table). It's hard to quantify without understanding more about each customer's scenario, but you will roughly see a 5x performance difference between queries over external tables and views vs. managed tables (obviously, depending on the query, that will vary, but that's a rough number; it could be more than 5x in some scenarios). A few things contribute to that: in-memory caching, SSD-based caches, result-set caching, and the ability to design and align data and tables when they are stored as managed tables. You can also create materialized views for managed tables, which typically bring large performance improvements as well. If you are querying Parquet data, it is in a columnstore file format with compression, so that gives you similar data/column elimination to what a managed SQL clustered columnstore index (CCI) would give, but if you are querying non-Parquet files you do not get this functionality. Note that for managed tables, on top of performance, you also get a granular security model, workload management capabilities, and so on (see Data Lakehouse & Synapse).
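To make the partition-elimination difference concrete, here is a minimal T-SQL sketch (the storage account, container, and partition layout are hypothetical) of a view over OPENROWSET that exposes the filepath() virtual function, so queries filtering on the partition folder only read the matching files; an external table over the same folder has no equivalent:

```sql
-- A view over OPENROWSET can surface the partition folder via filepath(1),
-- which returns the value matched by the first wildcard in the BULK path.
CREATE VIEW dbo.vSalesByYear AS
SELECT r.*, r.filepath(1) AS sale_year
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/sales/year=*/*.parquet',
    FORMAT = 'PARQUET'
) AS r;

-- A query like this only reads the year=2020 folder:
-- SELECT COUNT(*) FROM dbo.vSalesByYear WHERE sale_year = '2020';
```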
Posted in Azure Data Lake, Azure Synapse Analytics/SQL DW, SQLServerPedia Syndication | 4 Comments
TOP AZURE SYNAPSE ANALYTICS AND POWER BI QUESTIONS

Posted on October 22, 2020 by James Serra
I have listed below the top questions I see from customers about Azure Synapse Analytics and Power BI, and the blogs that I have written that try to answer each question. I have also added these questions to my blog's FAQ page, which also includes "Modern data warehouse" and "SQL Server migration to Azure" questions and which I have updated.

AZURE SYNAPSE ANALYTICS

Below are the top questions I am seeing from customers about Azure Synapse Analytics (formerly called Azure SQL Data Warehouse), and the blogs that I have written that try to answer each question:

* What is the difference between the GA version of Synapse and the one in public preview? Azure Synapse Analytics confusion
* What are the newest features for the SQL Pool? Azure Synapse Analytics new features, Microsoft Ignite Announcements 2020
* What are the performance options when using Synapse and Power BI? Azure Synapse Analytics & Power BI performance
* Can I run Power BI dashboards against Synapse? Azure Synapse Analytics & Power BI concurrency
* Can you explain the new SQL on-demand in Synapse? SQL on-demand in Azure Synapse Analytics
* What is a "Data Lakehouse"? Data Lakehouse & Synapse
* What are all the ways to query data in Synapse? Query options in Azure Synapse Analytics
* Are there samples to use in Synapse? Azure Synapse Analytics overlooked features
* How can I create federated queries in Synapse? Synapse and federated queries
POWER BI

Below are the top questions I am seeing from customers about Power BI, and the blogs that I have written that try to answer each question:

* What are Composite models and Aggregations? Power BI new feature: Composite models
* What are Dataflows? Can I clean data with Power BI? Power BI: Dataflows, Where should I clean my data?
* What is the Microsoft Power Platform? The Microsoft Power Platform
* Can Power BI access data in ADLS Gen2? Ways to access data in ADLS Gen2
* What are performance options when using Power BI against Synapse? Azure Synapse Analytics & Power BI performance
* Can I use Power BI dashboards against Synapse? Azure Synapse Analytics & Power BI concurrency
* How can I stream data into Power BI and do real-time updates? Power BI Real-time Streaming
* What are the various types of security options in Power BI? Power BI Security
* What are all the ways to share dashboards and reports in Power BI? Power BI Sharing
* What are the ways to monitor performance and usage in Power BI? Monitoring Power BI
* What were the latest announcements about Power BI at Ignite? Microsoft Ignite Announcements 2020
* What are other questions you hear customers asking? Power BI FAQ

Posted in Azure Synapse Analytics/SQL DW, Power BI, SQLServerPedia Syndication | 4 Comments
SYNAPSE AND FEDERATED QUERIES

Posted on October 15, 2020 by James Serra
As I was finishing up a demo script for my presentation at the SQL PASS Virtual Summit on 11/13 (details on my session here), I wanted to blog about part of the demo that shows a feature in the public preview of Synapse that is, frankly, very cool. It is the ability to query data as it sits in ADLS Gen2, a Spark table, and Cosmos DB, and join the data together with one T-SQL statement using SQL On-demand (also called SQL serverless), hence making it a federated query (also known as data virtualization or a logical data warehouse). The beauty of this is that you don't have to first write ETL to collect all the data into a relational database in order to be able to query it all together, and you don't have to provision a SQL pool, saving costs. Further, you are using T-SQL to query all of those data sources, so you are able to use a reporting tool like Power BI to see the results (or any tool that supports T-SQL). Synapse is the tool to easily create that virtual layer on top of the data.

The queries are also fast, as SQL On-demand will push down queries from the front end to the back-end nodes (which contain a SQL Server engine), with those back-end nodes sitting next to the storage (this is done via POLARIS: The Distributed SQL Engine in Azure Synapse, which is being used for SQL On-demand pools and eventually for SQL Provisioned pools). Note there is no cache yet in SQL On-demand, so the queries won't run faster after the first run.

My demo queries tweets in parquet files in ADLS Gen2. Users or cities can be banned, and the query has to return the tweets only from non-banned entities. The list of banned users is in a Spark table and the banned cities are in a Cosmos DB table. Now on to my demo: Working inside Azure Synapse Studio, first I wrote a T-SQL view that queried all the parquet files in an ADLS Gen2 folder (the key is the OPENROWSET function):
CREATE VIEW vFQ_ADLSTable AS
SELECT
    ADLSTable.*
FROM
    OPENROWSET(
        BULK 'https://asaexpdatalakejs1234.dfs.core.windows.net/twitterdata/*.parquet',
        FORMAT = 'PARQUET'
    ) AS ADLSTable
Notice how I am using a wildcard in the file path to query all parquet files in the folder instead of just one.

Then I wrote a T-SQL view that queried a Spark table called bannedusers in the Spark database users. Even though it is a Spark table and the Spark cluster is not running, I am able to query this table using SQL On-demand because of a feature in Synapse that copies the metadata for the Spark table to SQL On-demand so you can use it outside of Spark (see Azure Synapse Analytics shared metadata). Plus, when a Spark table is created it actually stores the data in the Synapse primary storage area (in my case it is in the folder asaexpworkspacejs1234/tempdata/synapse/workspaces/serrademoworkspace/warehouse/users.db/bannedusers), so there is no need to use the Spark engine, as SQL On-demand can directly query that storage location:
CREATE VIEW vFQ_SparkTable AS
SELECT
    SparkTable.*
FROM users.dbo.bannedusers AS SparkTable

In both views above the data is read directly from storage, in parallel, and filtering is done at the SQL engine nodes themselves. Also, in some cases the data being read is filtered early on, for example by eliminating some files based on partition elimination and, in the case of Parquet, by skipping the reading of some column segments where possible, and in the future by using technologies such as Query Acceleration for ADLS.
Then I wrote a T-SQL view that queries a database in Cosmos DB called CosmosIOTDemo and the container bannedcities using Azure Synapse Link for Cosmos DB (see demo here):

CREATE VIEW vFQ_CosmosTable AS
SELECT
    CosmosTable.*
FROM OPENROWSET(
    'CosmosDB',
    N'account=synapselinkdemoserra;database=CosmosIOTDemo;region=eastus;key=xxx',
    bannedcities
) WITH (
    id varchar(36),
    city varchar(1),
    cityname varchar(30),
    banned varchar(1)
) AS CosmosTable

When using Azure Synapse Link for Cosmos DB, the SQL On-demand compute engine pushes down predicates to the remote, columnar store of Cosmos DB (aka the Analytical Store) and returns the results to the Synapse compute layer.

Finally, I created a view using all three views above to join the data together. The parquet files in ADLS Gen2 are twitter feeds, and the Spark table and Cosmos DB table are reference tables that list the users and cities in the twitter feeds that should be excluded from the query (they have to be in the table with the banned field set to Y to not show in the query). While the twitter feed was only about 4,000 rows, this query only took 2-3 seconds (you may have to wait an additional 2-3 seconds for the SQL On-demand pool to warm up the first time):
CREATE VIEW vFQ_Combined AS
SELECT
    t1.*,
    t2.banned AS UserBanned,
    t3.banned AS CityBanned
FROM vFQ_ADLSTable t1
LEFT JOIN vFQ_SparkTable t2
    ON (t1.UserName COLLATE SQL_Latin1_General_CP1_CI_AS = t2.username)
LEFT JOIN vFQ_CosmosTable t3
    ON (t1.City = t3.cityname)
WHERE (t2.banned IS NULL OR t2.banned = 'N')
  AND (t3.banned IS NULL OR t3.banned = 'N')

Then I went to the Develop hub and, under Power BI, chose "Power BI datasets" (first you have to link to a Power BI workspace), chose "New Power BI dataset", chose my SQL On-demand pool, and downloaded the Power BI Desktop Source (.pbids) file that has the connection info to the SQL On-demand database. Once I clicked on that downloaded file, it opened Power BI Desktop and allowed me to choose the combined query vFQ_Combined. You can choose to import the data or use DirectQuery to get the results in real time, as any file that is dropped into the ADLS Gen2 folder along with the other parquet files will be read. It appears to the end user that the data is stored in a relational database because the view is creating an abstraction layer with the metadata (although, because of automatic schema inference, you don't even need to specify the field layouts for a file in ADLS Gen2 or a Spark table). Data in ADLS Gen2 can be moved to different access tiers to save costs, and eventually, when SQL On-demand supports querying the relational SQL Provisioned pool, you will be able to archive older relational data out to ADLS Gen2 and combine a query on that storage with a query on the relational database, to both save costs and improve performance within the relational database. As I said, very cool!

More info:
Direct Query your Data Lake Files using Azure Synapse Analytics and Power BI
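As an aside, the .pbids file mentioned above is just a small JSON document that tells Power BI Desktop where to connect. A sketch of its shape for a SQL On-demand endpoint (the server and database names here are hypothetical):

```json
{
  "version": "0.1",
  "connections": [
    {
      "details": {
        "protocol": "tds",
        "address": {
          "server": "myworkspace-ondemand.sql.azuresynapse.net",
          "database": "mydatabase"
        }
      },
      "mode": "DirectQuery"
    }
  ]
}
```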
Posted in Azure Cosmos DB, Azure Data Lake, Azure Synapse Analytics/SQL DW, SQLServerPedia Syndication | 6 Comments
MICROSOFT IGNITE ANNOUNCEMENTS 2020

Posted on September 30, 2020 by James Serra
Microsoft Ignite has always announced many new products and new product features, and this year was no exception. There were many exciting announcements, and below I list the major data platform related ones:

USAGE-BASED OPTIMIZATION WITH AZURE SYNAPSE AND POWER BI: A new feature that analyzes usage patterns in Power BI and shares the information with Synapse is now available to improve query performance. Synapse automatically creates a materialized view optimized for Power BI users, greatly accelerating query performance. More info

APACHE SPARK FOR AZURE SYNAPSE IN-CLUSTER CACHING AND SHUFFLE SERVICE (PREVIEW): Caching and shuffle are two of the components of Apache Spark infrastructure that have the greatest impact on performance. These new services, which are written from scratch, allow the optimization of performance for these components on modern hardware and operating systems. The service is enabled for Apache Spark pools in Azure Synapse today.

HIGH THROUGHPUT OUTPUT FROM STREAM ANALYTICS TO AZURE SYNAPSE ANALYTICS (GENERAL AVAILABILITY): Azure Stream Analytics jobs now have the ability to output to an existing SQL pool table in Azure Synapse Analytics, and can process throughput rates even higher than 200 MB/s. This will support the most demanding real-time analytics and hot-path data processing needs of Azure customers who depend on their data warehouse for workloads such as reporting and dashboarding.

AZURE AD AUTHENTICATION FEATURES FOR AZURE SQL DB, AZURE SYNAPSE ANALYTICS, AND AZURE SQL MANAGED INSTANCE: Three new features using Azure Active Directory (Azure AD) authentication are currently in preview for Azure SQL Database, Azure Synapse Analytics, and Azure SQL Managed Instance. These features help automate user creation using Azure AD applications and allow individual Azure AD guest users to be created in SQL Database, Azure Synapse Analytics, or SQL Managed Instance. First, support is being added for Azure AD user creation in SQL Database and Azure Synapse Analytics on behalf of Azure AD applications. This functionality is useful in Azure AD application automation processes where Azure AD objects are created and maintained in SQL Database and Azure Synapse Analytics without human interaction. Second, users can use cloud groups to manage Directory Readers role assignment in Azure AD. This change will promote ease of use when setting up SQL Database, Azure Synapse Analytics, and SQL Managed Instance to create users. Finally, Azure AD guest users can now be created as database users and set as Azure AD admin without the need to first add them as members of a group created in Azure AD. These features extend existing functionality, remove user limitations, and provide customers with greater ease of use when setting up SQL Database, Azure Synapse Analytics, or SQL Managed Instance.

NEW AZURE COGNITIVE SERVICES AND CAPABILITIES IN PREVIEW: In the Decision category, we have the new Metrics Advisor service that proactively monitors metrics and diagnoses issues so organizations can protect their growth engine, from sales revenue to manufacturing. Built on top of Anomaly Detector, Metrics Advisor can be applied to business metric monitoring, AI Ops, and predictive maintenance scenarios. See an example in action. In the Vision category, the new spatial analysis feature in Computer Vision helps organizations maximize the value of their physical spaces by understanding people's movements in near-real time, which is especially critical these days in helping organizations reopen safely by supporting social distancing and other health compliance measures. It comes available in a container to run at the edge. Here's a walkthrough of how customers are applying it. Additionally for containers, the following are available in preview: for Speech there are automatic language detection and neural text-to-speech containers, and for Computer Vision, the Read 3.0 and 3.1 containers for text extraction.

ANNOUNCING AZURE SQL EDGE NOW IN GENERAL AVAILABILITY: Azure SQL Edge, which brings the most secure Microsoft SQL data engine to Internet of Things (IoT) gateways and edge devices, is now available. Optimized for IoT workloads, Azure SQL Edge supports built-in data streaming, storage, and artificial intelligence packed into a small-footprint container that works in connected, disconnected, and semi-connected environments where local storage and analytics are important. You can reduce cost, drive innovation, and improve operational efficiencies by using Azure SQL Edge to stream, store, and analyze IoT data. You simply need to develop your application once and deploy it anywhere across the edge, your datacenter, and Azure.

HOSTING CATALOG DATABASES FOR ALL SUPPORTED VERSIONS OF SSRS IN AZURE SQL MANAGED INSTANCE: Now in general availability, Azure SQL Managed Instance can host catalog databases for all supported versions of SQL Server Reporting Services (SSRS). Last year, SQL Server 2019 Reporting Services introduced native support for hosting catalog databases in SQL Managed Instance. Now you can also use SQL Managed Instance to host catalog databases for earlier supported versions of SSRS. This is especially useful for fast migration of existing solutions to SQL Managed Instance, without the need to test and adopt the latest version of SSRS. It also helps you get quick business benefits and then modernize further at your own pace. To learn how to configure your SQL Managed Instance to host catalog databases for earlier supported versions of SSRS, visit the Tech Community blog.
MAJOR PERFORMANCE IMPROVEMENTS FOR AZURE SQL MANAGED INSTANCE: We are announcing a set of major performance improvements for Azure SQL Managed Instance, which enable you to migrate your more performance-hungry database workloads to Azure SQL Managed Instance. These improvements include better transaction log write throughput for general purpose and business critical instances and superior data/log IOPS for business critical instances. Increased performance predictability and stability for the general purpose service tier, through better integration and enhanced tempdb performance, are also included. These improvements are automatically enabled for all existing and future Azure SQL Managed Instances at no extra charge, making Azure SQL Managed Instance the best fully managed database platform for your mission-critical workloads. Learn more about major performance improvements for Azure SQL Managed Instances

MACHINE LEARNING ON AZURE SQL MANAGED INSTANCE IN PREVIEW: Machine Learning Services, with support for the R and Python languages, now include preview support on Azure SQL Managed Instance. When using Machine Learning Services with R or Python support in Azure SQL Managed Instance, you can run R and Python scripts to do data preparation and general purpose data processing, train machine learning models in the database, and deploy your models and scripts into production in stored procedures.

CONFIGURABLE BACKUP STORAGE REDUNDANCY OPTIONS FOR AZURE SQL DATABASE COMING SOON: By default, all backups taken on Azure SQL Database are stored on read-access geo-redundant storage (RA-GRS) blobs that are replicated in paired regions, ensuring data is protected in case of planned and unplanned events. A preview of configurable backup storage redundancy options for Azure SQL DB is coming soon to Southeast Asia, with additional regions expected in the coming months. This feature will provide you with more flexibility and choice to pick between locally redundant storage (LRS), zone-redundant storage (ZRS), or geo-redundant (RA-GRS) storage blobs for your backups. As additional regions are launched in preview, there will be more backup storage redundancy options available to meet your data residency requirements or minimize backup costs. You will be able to easily configure backup storage redundancy in the Azure portal. For more information, visit our Tech Community blog.
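The Machine Learning Services preview mentioned above uses the same sp_execute_external_script interface as SQL Server; a minimal sketch (assuming the feature is enabled on the instance, and with a trivial pass-through script) looks like:

```sql
-- Runs a Python script in-database and returns the input rows unchanged.
EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'OutputDataSet = InputDataSet',
    @input_data_1 = N'SELECT 1 AS x, 2 AS y';
```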
AZURE SYNAPSE LINK NOW SUPPORTS AZURE COSMOS DB API FOR MONGODB (PREVIEW): Azure Synapse Link for Azure Cosmos DB now supports the API for MongoDB in preview. You can analyze the data in Azure Cosmos DB API for MongoDB collections, using Synapse Apache Spark or Synapse SQL serverless, without impacting the performance of your transactional workloads. Azure Synapse Link, announced earlier this year, creates a tight integration between Azure Cosmos DB and Azure Synapse Analytics. You can directly access your Azure Cosmos DB data from Azure Synapse Analytics and run near real-time business intelligence, analytics, and machine learning pipelines. This integration enables you to build cloud-native HTAP (hybrid transactional/analytical processing) solutions to perform no-ETL (extract, transform, and load) analytical processing on operational data in Azure Cosmos DB.

AZURE SYNAPSE LINK FOR AZURE COSMOS DB: SQL SERVERLESS RUNTIME SUPPORT COMING SOON: Azure Synapse Link will soon have preview support for querying Azure Cosmos DB data with Synapse SQL serverless. In the coming months, you will be able to run analytical T-SQL queries over data in Azure Cosmos DB in place within seconds. These queries can be used to build rich near real-time dashboards using Power BI integration with Azure Synapse Analytics. This new feature enhances the capability of Azure Synapse Link and will give you a wide range of business intelligence and ad-hoc querying tools via the T-SQL interface, without impacting the performance of your transactional workloads. Azure Synapse Analytics will soon support Azure Synapse Link with both Synapse Apache Spark and Synapse SQL serverless.

SERVERLESS OFFER FOR AZURE COSMOS DB COMING SOON ON ALL APIS: Azure Cosmos DB serverless is in preview for the Core (SQL) API and will soon be available in preview for all Azure Cosmos DB APIs, including MongoDB, Cassandra, Gremlin, and Table. This offers NoSQL developers a database operations option with zero capacity planning or management required. As a consumption-based option with no minimum, serverless could also significantly lower the entry price for applications. It is ideally suited for small and medium-sized workloads that do not require steady throughput, receive requests sporadically or in short bursts, and have moderate performance requirements. Learn more.
ENHANCED MANAGEMENT EXPERIENCE FOR AZURE SQL MANAGED INSTANCE: Management operations are an essential part of Azure SQL Managed Instance, covering create, update, and delete scenarios. Most of the management operations in SQL Managed Instance are long-running but did not have operation progress visible. Through the introduction of a new CRUD API version, the SQL Managed Instance resource is now visible from when the create request is submitted. In addition, the new OPERATIONS API adds the ability to monitor management operations, see operation steps, and take dependent actions based on operation progress. Learn more

TRANSACTIONALLY CONSISTENT DATABASE COPY FOR AZURE SQL DATABASE HYPERSCALE TIER: Now in preview, the Hyperscale service tier in Azure SQL Database supports generating a transactionally consistent database copy to either the same or a different logical server or region, similar to the database copy already supported in other service tiers. This enables scenarios where Hyperscale databases have to be copied for various reasons, such as development and testing. If a database copy is created within the same region, it is a fast operation regardless of data size, similar to a point-in-time restore in Hyperscale. A cross-region database copy is a slower size-of-data operation. If you require support for database copy in your workflow, you can now take advantage of unique Hyperscale capabilities such as support for very large databases, fast scaling, and multiple replicas. Learn how to create a database copy and visit our Hyperscale database copy blog for more details.
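Creating the copy uses the same T-SQL pattern as database copy in the other service tiers; a minimal sketch (the server and database names here are hypothetical), run in the master database of the target logical server:

```sql
-- Copies the Hyperscale database "SalesDb" from the server "source-srv"
-- into a new database "SalesDbCopy" on the current server.
CREATE DATABASE SalesDbCopy AS COPY OF [source-srv].SalesDb;
```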
HYPERSPACE, AN INDEXING SUBSYSTEM FOR APACHE SPARK™, IS NOW OPEN SOURCE: For Microsoft's internal teams and external customers, we store datasets that span from a few GBs to hundreds of PBs in our data lake. The scope of analytics on these datasets ranges from traditional batch-style queries (e.g., OLAP) to explorative "finding the needle in a haystack" type queries (e.g., point lookups, summarization). Resorting to linear scans of these large datasets with huge clusters for every simple query is prohibitively expensive and not the top choice for many of our customers, who are constantly exploring ways to reduce their operational costs; incurring unchecked expenses is their worst nightmare. Over the years, we have seen huge demand for bringing the indexing capabilities that come de facto in the traditional database systems world into Apache Spark™. Today, we are making this possible by releasing an indexing subsystem for Apache Spark called Hyperspace, the same technology that powers indexing within Azure Synapse Analytics.
NEW MERGE COMMAND FOR AZURE SYNAPSE ANALYTICS: The new MERGE command in Azure Synapse Analytics allows customers to insert, update, or delete data in one table based on the differences discovered in another table. With the MERGE command, the previous multi-step process to synchronize two tables in a dedicated SQL pool is now streamlined into a single step, thus reducing the data operation cost and improving the user experience. Learn more

COPY COMMAND NOW GENERALLY AVAILABLE IN AZURE SYNAPSE ANALYTICS: The COPY command feature in Azure Synapse Analytics provides users a simple, flexible, and fast interface for high-throughput data ingestion for SQL workloads. With this announcement, we've added performance improvements with file splits, ease of use with Parquet auto-schema discovery and complex data type support, and support for the COPY command within our data integration partner ecosystem, including Databricks, Informatica, StreamSets, Panoply, and Talend. Learn more
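To illustrate the single-step synchronization the MERGE announcement describes, here is a minimal sketch (the table and column names are hypothetical) that upserts a dimension table from a staging table and deletes rows that no longer exist in the source:

```sql
MERGE dbo.DimCustomer AS tgt
USING stg.Customer AS src
    ON tgt.CustomerId = src.CustomerId
WHEN MATCHED THEN
    UPDATE SET tgt.City = src.City
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerId, City) VALUES (src.CustomerId, src.City)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;
```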
COLUMN-LEVEL ENCRYPTION FOR AZURE SYNAPSE ANALYTICS: Column-level encryption (CLE) helps customers implement fine-grained protection of sensitive data within a table (server-side encryption). With CLE, customers gain the ability to use different protection keys for columns in tables, with each key having its own access permissions. The data in CLE-enforced columns is encrypted on disk (and remains encrypted in memory) until the DECRYPTBYKEY function is used to decrypt it. This feature applies to dedicated SQL pools in Azure Synapse Analytics. Learn more

AZURE DATABRICKS – DELTA ENGINE WITH PHOTON: Announced the preview of the Photon-powered Delta Engine on Azure Databricks, a fast, easy, and collaborative analytics and AI service. Built from scratch in C++ and fully compatible with Spark APIs, Photon is a vectorized query engine that leverages modern CPU architecture along with Delta Lake to enhance Apache Spark 3.0's performance by up to 20x. As organizations worldwide embrace data-driven decision-making, it has become imperative for them to invest in a platform that can quickly analyze massive amounts and types of data. However, this has been a challenge. While storage and network performance have increased 10x, CPU processing speeds have only increased marginally. This leads to the question: if CPUs have become the bottleneck, how can we achieve the next level of performance? The answer with Photon lies in greater parallelism of CPU processing at both the data level and the instruction level. The Photon-powered Delta Engine is a 100% Apache Spark-compatible vectorized query engine designed to take advantage of modern CPU architecture for extremely fast parallel processing of data. Read the blog to learn more.
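The ENCRYPTBYKEY/DECRYPTBYKEY flow from the column-level encryption announcement above follows the long-standing SQL Server key hierarchy; a compressed sketch (the key, certificate, table, and column names are hypothetical, and the password is a placeholder):

```sql
-- One-time setup: master key -> certificate -> symmetric key.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';
CREATE CERTIFICATE CleCert WITH SUBJECT = 'CLE demo';
CREATE SYMMETRIC KEY CleKey WITH ALGORITHM = AES_256
    ENCRYPTION BY CERTIFICATE CleCert;

-- Encrypt on write, decrypt on read:
OPEN SYMMETRIC KEY CleKey DECRYPTION BY CERTIFICATE CleCert;
UPDATE dbo.Customers SET SSN_enc = ENCRYPTBYKEY(KEY_GUID('CleKey'), SSN);
SELECT CONVERT(varchar(11), DECRYPTBYKEY(SSN_enc)) AS SSN
FROM dbo.Customers;
CLOSE SYMMETRIC KEY CleKey;
```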
AZURE ARC ENABLED DATA SERVICES: Announcing Azure data services anywhere, a new capability enabled by Azure Arc that allows customers to run Azure data services on-premises, multi-cloud, and at the edge, using Kubernetes on the hardware of their choice. Azure Arc enabled SQL Managed Instance and Azure Arc enabled PostgreSQL Hyperscale are available in preview now, with other services to come over time. Customers can now deploy a variety of data services, with access to the latest innovation, cloud benefits like automation and elastic scale, unified management across all data assets in hybrid environments, and unique Azure security and governance capabilities like Azure Security Center, Policies, and Role-based Access Control for on-premises data workloads. Best of all, customers can use modern cloud billing models on-premises for better cost efficiency. Learn more about Azure Arc enabled data services
AZURE ML IGNITE 2020 ANNOUNCEMENTS: Azure Machine Learning Enterprise Edition (preview) capabilities are merging into Basic Edition on September 22, 2020. With these enhanced capabilities included in Basic Edition, the Enterprise Edition of Azure Machine Learning will retire on January 1, 2021.

Designer
* No-code drag-and-drop experience for building and deploying ML models

AutoML UI
* Automatically build and deploy predictive models using the no-code UI or through a code-first notebooks experience.

ML Labeling Assist
* The machine learning assisted labeling feature lets users trigger automatic machine learning models to accelerate the labeling task.

Advanced RBAC (preview)
* Azure ML operation-level RBAC will allow customers to set custom roles, or re-use pre-built roles, to control specific operations for individual users in a workspace.

MLflow updates (preview)
* MLflow Projects support for simplified job submission from local to cloud
* MLflow Model Registry support
* Model deployment support (take a model from the MLflow Model Registry and deploy it to ACI or AKS)
* Expanded support for the MLflow experimentation UI

Workspace Private Link (preview)
* A network isolation feature that enables you to access Azure ML over a private IP in your virtual network (VNet).

AZURE SQL DATABASE – CROSS-SQL MI DISTRIBUTED TX – PREVIEW: Distributed database transactions spanning multiple Azure SQL Managed Instances will be added soon to enable frictionless migration of existing applications, as well as development of modern multi-tenant applications relying on vertically or horizontally partitioned database architectures. By utilizing distributed transactions once the public preview is announced, customers will be able to save time when migrating existing applications that require this capability, as it eliminates the need to change application code and to perform extensive testing. If customers develop new applications, they will be able to benefit from partitioning data into multiple databases to overcome current sizing limitations of Azure SQL Managed Instance, while utilizing distributed transactions to keep partitioned data in a consistent state. Once the preview is announced, two methods of running distributed transactions will be supported: using the BEGIN DISTRIBUTED TRANSACTION statement from Transact-SQL code, and using the TransactionScope class from .NET code.
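For context, the T-SQL method mentioned above follows the long-standing SQL Server distributed transaction shape; a sketch (the instance, database, and table names are hypothetical, and this assumes the preview behaves like the classic syntax):

```sql
BEGIN DISTRIBUTED TRANSACTION;
    -- Work on the local instance:
    UPDATE dbo.Orders SET Status = 'Shipped' WHERE OrderId = 42;
    -- Work on a second managed instance via a four-part name:
    UPDATE [RemoteInstance].[InventoryDb].dbo.Stock
        SET Quantity = Quantity - 1 WHERE ProductId = 7;
-- Both updates commit or roll back atomically:
COMMIT TRANSACTION;
```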
POWER BI ANNOUNCEMENTS FROM IGNITE: _(many of these features are talked about at Driving a data culture in a world of remote everything)_
* Power BI Premium
will soon be
available on a per-user basis. Individuals will be able to upsize their per-seat Professional licenses to Premium. During the preview period, the upgrade carries no additional cost. See Answering your questions around the new Power BI Premium per user license * Power BI Premium is getting faster, and is gaining an autoscale capability that, when enabled, allows Microsoft to provision additional “V-cores” (virtual CPU cores) to the customer’s cloud tenant for periods of 24 hours, when overloads are detected. The v-cores are automatically removed during idle periods. See Announcing the Upcoming Evolution of Power BI Premium to enterprisemarkets and beyond
* Preview of “smart narratives,” an augmented analytics feature that provides plain-English summarizations of the data in a report, either on a per-visualization or page-wide basis. The narratives automatically update when data is filtered or drilled down upon, and are editable, both in terms of formatting and for insertion of arbitrary or calculation-driven text. See Smart Narrative (preview)
* Dedicated Power BI add-in application for Microsoft’s Teams collaboration platform, released as a preview. The Teams integration includes the ability to browse reports, dashboards and workspaces and directly embed links to them in Teams channel chats. It’s not just about linking, though: Teams users can also browse Power BI datasets, either through an alphabetical listing or by reviewing a palette of recommended ones. In both cases, datasets previously marked as Certified or Promoted will be identified as such, and Teams users will have the ability to view their lineage, generate template-based reports on them, or just analyze their data in Excel. See Announcing: New Power BI experiences in Microsoft Teams * Through a new preview feature, Power BI asset sensitivity levels, applied as Microsoft Information Protection (MIP) labels, can be applied from within Power BI Desktop, making it unnecessary to apply them to the report and its underlying dataset in the cloud service’s user interface. More info * Microsoft is enhancing its Power Query technology to feature a visual interface rather than relying on the data grid view that has been its hallmark. Essentially, Power Query gets a new Diagram View where all queries in the model, and each of the transformation steps within them, appear together in a flowchart-like representation. This visual presentation indicates the dependencies between queries and the sequence of individual transformations within each. New transforms can be easily added in the Diagram View as well. Visual data preparation is coming soon. It will be available initially in Power BI dataflows only (i.e. online), and not in Power BI Desktop; the capability will come to Desktop within the next calendar year. More info * A visual anomaly detection feature (coming soon). More info * A new visual that can trigger Power Automate processes in a data-driven fashion (also coming soon) * A performance accelerator for Azure Synapse Analytics that automatically creates materialized views to accelerate frequent queries from Power BI. More info * GA of deployment pipelines (including availability on government clouds) that can move Power BI assets between development, test and production environments. See Announcing deployment pipelines General Availability (GA)
* A preview of a data impact analysis feature that can notify users of datasets that may be affected by a change to another dataset upstream. More info
More info:
Driving a data culture in a world of remote everything Microsoft Ignite 2020 Book of News What’s new in Azure SQL Managed Instance at Ignite 2020
Posted in Azure Arc, Azure SQL Database, Azure Synapse Analytics/SQL DW, Delta Lake, Power BI, SQLServerPedia Syndication | 3 Comments
AZURE SYNAPSE ANALYTICS OVERLOOKED FEATURES Posted on September 23, 2020 by James Serra
There are some options in Azure Synapse Analytics that are not obvious that I wanted to point out. While in Azure Synapse Studio, going to the Data hub and clicking the “+” at the top to the right of Data brings up: Under “Workspace”, choosing “SYNAPSE SQL DATABASE” gives you the ability to create a database in a SQL on-demand pool. A SQL on-demand pool (which you access via an endpoint) can have multiple databases. Note that SQL on-demand has no local storage, so only metadata objects are stored in the databases. After you create a database, if you then go to the Develop hub and create a new SQL script, you can choose to connect to “SQL on-demand”, and then the newly created database will be available under “Use database”. You can then create views or external tables in that database. When you go to the Data hub, under Databases you will see the newly created database (you might need to hit refresh first).
Under “Linked”, choosing “CONNECT TO EXTERNAL DATA” will create a _linked service_ to ADLS Gen2 or Cosmos DB that you can use in the Data hub (under “Linked”). This is the same method as going to the Manage hub, choosing “Linked services” and clicking “New” (but that option has more sources to connect to). These linked services can also be used in Azure Data Factory (see Linked Services in Azure Data Factory).
Under “Linked”, choosing “INTEGRATION DATASET” will allow you to create a _dataset_ that can be used in pipeline activities and in data flows. When creating the dataset you will need to specify a linked service to use for the dataset (which you could have done via “Connect to external data”), or you can create a new linked service (see Datasets in Azure Data Factory).
Under “Linked”, choosing “BROWSE SAMPLES” will give you a ton of samples to choose from, broken into four sets: * Datasets: A bunch of datasets that when added will show up under Data -> Linked -> Azure Blob Storage. You can then choose an action (via “…” next to any of the containers in the dataset) and choose New SQL script -> Select TOP 100 rows to examine the data, as well as choose “New notebook” to load the data into a Spark dataframe. Any dataset you add is a linked service to files in a blob storage container using SAS authentication. You can also create an external table in a SQL on-demand pool or SQL provisioned pool for each dataset via an action (via “…” next to “External tables” under the database, then New SQL script -> New external table) and then query it or insert the data into a SQL provisioned database * Notebooks: A bunch of sample Apache Spark notebooks in various languages (PySpark/Scala/Spark.NET C#/SparkSQL) that will show up under Develop -> Notebooks * SQL scripts: A bunch of sample SQL scripts that will show up under Develop -> SQL scripts * Pipelines: This is similar to choosing “Create pipeline from template” on the home page of an Azure Data Factory workspace, and will create a pipeline that will show up under Orchestrate -> Pipelines
These samples offer a great way to shortcut the process of building a solution in Synapse. More info:
Quickly get started with samples in Azure Synapse Analytics
Posted in Azure Synapse Analytics/SQL DW, SQLServerPedia Syndication | 2 Comments
QUERY OPTIONS IN AZURE SYNAPSE ANALYTICS Posted on September 16, 2020 by James Serra
_(updated 11/12/20)_ The public preview version of Azure Synapse Analytics has three compute options and four types of storage that it can access (mentioned in my blog at SQL on-demand in Azure Synapse Analytics).
This gives twelve possible combinations for querying data. Not all of these combinations are currently supported, and some have a few quirks, which I list below. _(NOTE: I’ll demo these features at my sessions at European Digital Week on 9/25 (session info), SQL Bits on 10/3 (session info), PASS Summit on 11/13 (session info), and Big Data Conference Europe on 11/25 (session info). I hope you can join!)_

| Storage > / Compute | RELATIONAL DATABASE | ADLS GEN2 | SPARK TABLE | COSMOS DB |
|---|---|---|---|---|
| DEDICATED SQL POOL | Y | Y _(*1)_ | N _(*2)_ | N _(*6)_ |
| SERVERLESS SQL POOL | N _(*6)_ | Y | Y _(*3)_ | N _(*4)_ |
| APACHE SPARK POOL | Y | Y | Y | Y _(*5)_ |
*1: This is available via an external table, which uses Polybase technology and does not use push-down queries, so it can be slow. A feature that will be available after Azure Synapse Analytics goes GA, called fast parquet, will speed up queries over external tables mapped to parquet files (the technology underneath is the same as that being used for SQL on-demand) *2: This feature will be available soon, but will only support Spark tables created using Parquet as the input format *3: This feature only supports Spark tables created using Parquet as the input format (explained here)
*4: This feature will soon be available via Azure Synapse Link *5: This feature is available via Azure Synapse Link *6: This feature will be available after Azure Synapse Analytics goes GA
If you have data in ADLS Gen2 that is stored in Delta Lake format (via methods such as a Spark table or Azure Data Factory), that data can be read via the Apache Spark pool (no matter whether the Delta Lake is built via open source or Databricks), soon via serverless SQL (a workaround for now is here), and not via a dedicated SQL pool. Note that Power BI can read the Delta Lake format but requires you to use the Power BI Databricks connector. If you have data in ADLS Gen2 that is in the Common Data Model (CDM) format (via methods such as Dynamics 365 CDS export to ADLS Gen2, stored directly, or via Azure Data Factory), serverless SQL is not yet able to read it, but you can via the Apache Spark pool (see Spark CDM connector).
Note that Power BI can read CDM if using Power BI dataflows. A huge benefit of using the public preview version of Azure Synapse Analytics is the ability to query a file in the data lake by simply right-clicking the file. When you do that, a menu will pop up giving you three choices for how to query the data. Currently this works on three different file types: parquet, CSV, and JSON (note that you can always query additional file types using a Spark notebook). The three query choices are listed below, with all but one currently supported:

| Right-click object type | PREVIEW | NEW SQL SCRIPT -> SELECT TOP 100 ROWS | NEW NOTEBOOK -> LOAD TO DATAFRAME |
|---|---|---|---|
| PARQUET | N _(*3)_ | Y _(*5)_ | Y |
| CSV | Y | Y _(*1)_ | Y |
| JSON | Y | Y _(*2)_ | Y |
| RELATIONAL TBL | N | Y | Y |
| EXTERNAL TBL | N | Y | Y |
| SPARK TBL | N | Y | Y |
| COSMOS DB | N | Y _(*4)_ | Y |
“Preview” opens a pop-up window with the contents of the file, “Select TOP 100 rows” opens a tab with a T-SQL SELECT statement using SQL on-demand that you can run to show the results, and “New notebook” opens a Spark notebook with PySpark code that you can run to load the data into a Spark in-memory DataFrame and display the rows. *1: This feature was just added and has a limitation in that the column headers do not show correctly (a feature is being worked on to enable the query to infer the column names if the file has a header row)
*2: Replace “varchar(8000)” with “varchar(MAX)” in the SELECT statement if you receive an error when running the query *3: A work-around is to right-click the file, choose New SQL script -> Bulk load, click the Continue button and you will see a “Preview data” option *4: Need to replace “key” with the Cosmos DB key in the generated SQL *5: Because of automatic schema inference you don’t need to specify the field layouts for a file in a query
As far as writing to ADLS Gen2 via a serverless SQL pool, that can be done using CETAS. Note that dropping an external table does not delete the data. You can also use the copy activity in ADF to read data via views in SQL on-demand and write it to the data lake, but the write uses ADF compute.
Posted in Azure Synapse Analytics/SQL DW, Delta Lake, SQLServerPedia Syndication | 1 Comment
DATA LAKEHOUSE & SYNAPSE Posted on September 10, 2020 by James Serra
I am starting to see this relatively new phrase, “Data Lakehouse”, being used in the data platform world. It’s the combination of “Data Lake” and “Data Warehouse”. In this post I’ll give my thoughts on it, and how the next version of Azure Synapse Analytics that is in public preview fits right in with the Data Lakehouse. First, I want to clear up a bit of confusion regarding Azure Synapse Analytics. If you log in to the Azure portal and do a search, you will see:
Notice that Azure SQL Data Warehouse (SQL DW) is now called Azure Synapse Analytics. This product has been generally available (GA) for a while and is simply a renaming. It contains new features and soon-to-be new features that I blogged about here, but this is separate from other features in a _public preview_ version that I will call “Synapse PP” (see the new docs). I’ll call the current GA version “Synapse GA”. I blogged about this here.
I like to think of a major feature of a Data Lakehouse as the ability to use T-SQL to query data the same way no matter whether the data is in a data lake (storage such as ADLS Gen2) or a data warehouse (a relational database such as Synapse GA). Here are some of the major new Synapse PP features that support a Data Lakehouse:
* Azure Synapse Studio * Collaborative workspaces * Apache Spark * On-demand T-SQL, see my blog on this at SQL on-demand in Azure Synapse Analytics * ADF integration
* Power BI integration * Distributed T-SQL (over ADLS Gen2)
There are some great benefits to having a data lake that I blogged about here. And there are many reasons to also have a relational data warehouse along with your data lake that I blogged about here. Up until now, the data lake and the relational data warehouse were sort of on their own islands, having to jump to a different product to interface with each. For example, your data lake could be in ADLS Gen2 and you can use Azure Databricks to query it with SparkSQL (first setting up a connection to ADLS Gen2), while your relational data warehouse could be in Synapse GA and you use SSMS to query it with T-SQL. Having to use Azure Data Factory or Power BI means opening up another tab in your browser, logging into those products, and connecting to ADLS Gen2. So it can be time consuming and costly to create an environment to query both, and also be confusing to query using two different versions of SQL. You can now get the best of both worlds all under one roof via Azure Synapse Studio: using the relational data warehouse when you need fast query performance, high user concurrency, enhanced security, or just prefer working in a relational environment; or using the data lake when you have huge data volumes, need instant access to the data via schema-on-read, or need to deal with semi-structured data. You never have to leave your workspace no matter which tool you want to use, and can use T-SQL no matter whether the data is in the data lake or in the relational data warehouse. And querying a file in the data lake is simply a matter of right-clicking the file (more on this in my next blog).
So at a high level you can view Azure Synapse Studio as supporting business intelligence, data science, T-SQL plus other languages (Python, Scala, C#, SparkSQL), schema flexibility, ACID compliance, any data type, petabytes of data, streaming and batch, with high performance and user concurrency. So as you can see, Synapse PP combines the use of a data lake and a relational database to make it a Data Lakehouse. It is not really a “new” technology, but rather a merging of existing technologies to make it easier to gain insights from data in order to make better business decisions.
With these new features, especially SQL on-demand and T-SQL against ADLS, I am starting to see a few use cases where you may not need a relational database anymore in your modern data warehouse, which goes against my previous way of thinking (see Is the traditional data warehouse dead?).
I still feel you will want to have a relational database in your modern data warehouse architecture a large majority of the time, but there will be exceptions. For example, I can land a bunch of parquet files into the data lake and create a T-SQL view on top of that data, where that view is stored in SQL on-demand. Then I call that view from Power BI, making it appear to the end user that they are using a relational database, and at the same time they are only paying when the query is being run. That leads to compute cost savings as well as not having to copy the data from the data lake to a relational database. This means less time is needed to build a solution and there is a reduction in the complexity of the solution, and therefore additional cost savings. Below are some of the concerns that arise out of skipping the relational database and just using a data lake, and the new data/delta lake features that “counter” some of those concerns: * Speed: Queries against relational storage will always be faster than against a data lake (roughly 5X) because of missing features in the data lake such as the lack of statistics, query plans, result-set caching, materialized views, in-memory caching, SSD-based caches, indexes, and the ability to design and align data and tables. Counter: DirectParquet, CSV 2.0, query acceleration, predicate pushdown, and SQL on-demand auto-scaling are some of the features that can make queries against ADLS nearly as fast as a relational database. Then there are features like Delta Lake and the ability to use statistics for external tables that can add even more performance. Plus you can also import the data into Power BI, use Power BI aggregation tables, or import the data into Azure Analysis Services to get even faster performance. Another thing to keep in mind affecting query performance is that Synapse is a massively parallel processing (MPP) technology that has features such as replicated tables for smaller tables (i.e. dimension tables) and distributed tables for large tables (i.e. fact tables), with the ability to control how they are distributed across storage (hash, round-robin). This could provide much greater performance compared to a data lake that uses HDFS, where large files are chunked across the storage * Security: Row-level security (RLS), column-level security, dynamic data masking, and data discovery & classification are security-related features that are not available in a data lake. Counter: Use RLS in Power BI or RLS on external tables instead of RLS on a database table, which then allows you to use result-set caching in Synapse * Complexity: Schema-on-read (ADLS) is more complex to query than schema-on-write (relational database). Schema-on-read means the end user must define the metadata, whereas with schema-on-write the metadata is stored along with the data. Then there is the difficulty of querying in a file-based world compared to a relational database world. Counter: Create a SQL relational view on top of files in the data lake so the end user does not have to create the metadata, which will make it appear to the end user that the data is in a relational database. Or you could import the data from the data lake into Power BI, creating a star schema model in a Power BI dataset. _But I still see it being very difficult to manage a solution with just a data lake when you have data from many sources_ * Missing features: Auditing, referential integrity, ACID compliance, updating/deleting rows of data, data caching, Transparent Data Encryption (TDE), workload management, full support of T-SQL – all are not available in a data lake. Counter: Some of these features can be accomplished when using Delta Lake, Apache Hudi or Apache Iceberg (see A Thorough Comparison of Delta Lake, Iceberg and Hudi), but will not be as easy to implement as in a relational database and you will be locked into using Spark. Also, features being added to Blob Storage (see More Azure Blob Storage enhancements) can be used instead of resorting to Delta Lake, such as blob versioning as a replacement for time travel in Delta Lake
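The schema-on-read vs. schema-on-write difference in the Complexity bullet can be shown in a few lines of Python (an illustration only; the file contents and column names are made up):

```python
import csv
import io

raw = "101,Widget,9.99\n102,Gadget,24.50\n"  # a headerless extract in the lake

# Schema-on-read: the reader must supply the metadata at query time.
fieldnames = ["product_id", "name", "price"]  # defined by the end user
rows = list(csv.DictReader(io.StringIO(raw), fieldnames=fieldnames))
print(rows[0]["name"])  # Widget

# Schema-on-write: the metadata travels with the data, so no
# caller-supplied schema is needed (akin to a relational table, or a view
# defined once on top of the files so end users never see raw files).
with_header = "product_id,name,price\n" + raw
rows2 = list(csv.DictReader(io.StringIO(with_header)))
print(rows2[1]["price"])  # 24.50
```

The "Counter" in the bullet is essentially the second half: define the schema once (a view) so every consumer gets the schema-on-write experience.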
Some additional reasons for using a relational database include helping to break up a large problem into smaller pieces (think of the relational database as an additional step in the data journey from a data lake to the end user), and the need to handle slowly changing dimensions (SCD) (see Create Generic SCD Pattern in ADF Mapping Data Flows). Using time travel in the Delta Lake or creating daily folders for each table in the data lake are options for handling SCD, but much more complex than in a relational database world (see Slowly changing data (SCD) Type 2 operation into Delta tables). I also see some customers who have source systems that are relational databases, and as they are in the early stages of building a modern data warehouse in Azure they will bypass the data lake and copy the relational data right into a relational data warehouse (especially if they have many existing SSIS packages that they are using for their on-prem data warehouse and want to continue using those for the short term). They do this for quick wins and eventually will land the data in the data lake. A strong reason for skipping a relational database is that, for the short term, SQL on-demand is not able to access a relational database, so you can only use this feature against a data lake. So if you want to save costs by using SQL on-demand it would have to be against data in the data lake. This could be especially true if you have a “small” data warehouse. There can even be a reason not to copy a large amount of data to a relational database because of the large cost savings in storage: a SQL provisioned pool costs at least $135 per TB per month (some regions cost more); for ADLS Gen2, the hot access tier is at least $19 per TB per month, the cool access tier is at least $10 per TB per month, and the archive access tier is at least $2 per TB per month (some regions cost more), with an additional cost per transaction (which varies greatly, so check the Azure calculator). But keep in mind the tradeoffs listed above. I’m still thinking through this, but these are my initial thoughts and it will be interesting to have discussions with customers on this topic when Synapse PP goes GA. Please comment below on your thoughts! _I’ll be presenting and doing a demo of the public preview of Azure Synapse Analytics at my sessions at European Digital Week on 9/25 (session info), SQL Bits on 10/3 (session info), PASS Summit on 11/13 (session info), and Big Data Conference Europe on 11/25 (session info)._
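The per-TB prices quoted above make the storage savings easy to quantify; here's a quick sketch using those at-least prices (ignoring transaction charges and regional differences, so treat the numbers as floors, not quotes):

```python
# Monthly storage cost floors quoted in the post, in USD per TB per month.
PRICE_PER_TB = {
    "sql_provisioned": 135,
    "adls_hot": 19,
    "adls_cool": 10,
    "adls_archive": 2,
}

def monthly_cost(tier: str, terabytes: float) -> float:
    """Minimum monthly storage cost; excludes ADLS transaction charges."""
    return PRICE_PER_TB[tier] * terabytes

# Keeping 10 TB in the lake's hot tier instead of a SQL provisioned pool:
savings = monthly_cost("sql_provisioned", 10) - monthly_cost("adls_hot", 10)
print(savings)  # 1160
```

At 10 TB the lake's hot tier is already about 7x cheaper per month, before considering the cool and archive tiers for colder data.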
More info:
What is a Lakehouse? The Data Lakehouse – Dismantling the Hype THE DATA LAKEHOUSE (updated) Data LakeHouse — Paradigm of the Decade Eckerson Group Webinar – Shop Talk – the Data Lakehouse – April 17, 2020
All Hail, the Data Lakehouse! (If Built on a Modern Data Warehouse) An Architect’s View of the Data Lakehouse: Perplexity and Perspective
Data Lakehouses Hold Water (thanks to the Cloud Data Lake) The Rise of Data Lakehouse with Data Virtualization Meet the Data Lakehouse The Future of Analytics: Leveraging Data Lakes and Data Warehouses
Posted in Azure Data Lake, Azure Synapse Analytics/SQL DW, Delta Lake, SQLServerPedia Syndication | 9 Comments
COSMOS DB FOR THE SQL PROFESSIONAL – REFERENCING TABLES Posted on August 24, 2020 by James Serra
I had a previous blog comparing Cosmos DB to a relational database (see Understanding Cosmos DB coming from a relational world), and one topic it did not address, which I want to cover now, is how to handle the reference tables that are common in the relational database world. A big difference with Cosmos DB compared to a relational database is that you will create a _denormalized data model_. Take a person record, for example. You will embed all the information related to a person, such as their contact details and addresses, into a single JSON document. Retrieving a complete person record from the database is now a single read operation against a single container and for a single item. Updating a person record, with their contact details and addresses, is also a single write operation against a single item. By denormalizing data, your application will typically have better read and write performance and allow for a scale-out architecture, since you don’t need to join tables. _(Side note: “container” is the generic term. Depending on the API, a specific term is used, such as “collection” for the SQL API. Think of a container as one or more tables in the relational world. Going a little deeper, think of a container as a group of one or more “entities” which share the same partition key. A relational table shares a schema, but containers are not bound in that way.)_ Embedding data works nicely for many cases, but there are scenarios when denormalizing your data will cause more problems than it is worth. In a document database, you can have information in one document that relates to data in other documents. While there may be some use cases that are better suited for a relational database than for Cosmos DB (see below), in most cases you can handle relationships in Cosmos DB by creating a _normalized data model_ for them, with the tradeoff that it can require more round trips to the server to read data (but improves the efficiency of write operations, since less data is written). In general, use normalized data models to represent one-to-many relationships or many-to-many relationships when related data changes frequently. _The key is knowing whether the cost of the updates is greater than the cost of the queries_.
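Here is a minimal sketch of the two models, using plain Python dicts in place of Cosmos DB JSON documents (the field names are made up for illustration):

```python
# Denormalized: one document, so one point read returns everything.
person_embedded = {
    "id": "p1",
    "name": "Ada",
    "addresses": [{"city": "London", "type": "home"}],
    "contacts": [{"kind": "email", "value": "ada@example.com"}],
}

# Normalized: the person references address documents stored separately,
# so reading the full record takes extra lookups (extra round trips),
# but updating a shared address touches only one small document.
addresses = {"a1": {"id": "a1", "city": "London", "type": "home"}}
person_ref = {"id": "p1", "name": "Ada", "addressIds": ["a1"]}

def read_full(person, address_store):
    # The application does the "join" client-side.
    return {**person,
            "addresses": [address_store[a] for a in person["addressIds"]]}

full = read_full(person_ref, addresses)
print(full["addresses"][0]["city"])  # London
```

The update-cost-vs-query-cost question maps directly onto these shapes: frequent address changes favor the second model, frequent full-person reads favor the first.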
When using a normalized data model, your application will need to handle creating the reference document. One way would be to use a change feed that triggers on the creation of a new document – the change feed essentially triggers an Azure function that creates the relationship record. When using a normalized data model, your application will need to query the multiple documents that need to be joined (costing more money because it will use more request units), and do the joining within the application (i.e. join a main document with documents that contain the reference data), as you cannot do a “join” between documents in different containers in Cosmos DB (joins between documents within the same container can be done via self-joins). Since every time you display a document it needs to search the entire container for the name, it would be best to put the other document type (the reference data) in a different container so you can have different partition keys for each document type (read up on how partitioning can make a big impact on performance and cost). Note that “partitioning” in an RDBMS and in Cosmos DB are different things: partitioning in Cosmos DB refers to “sharding” or “_horizontal partitioning_”, where replica sets which contain both the data and copies of compute (database) resources operate in a “shared nothing” architecture (i.e. scaled “horizontally”, where each compute resource (server node) operates independently of every other node, but with a programming model transparent to developers). Conversely, what is often referred to as “partitioning” in an RDBMS is purely a separation of data into separate file groups within a shared compute (database) environment. This is also often called “_vertical partitioning_”. Another option that is a common pattern for NoSQL databases is to create a separate container to satisfy specific queries. For example, having a container for products based on category and another container for products based on geography. Both of those containers for my query/app are being sourced from one that is my “main” or “source” container that is being updated (front end, or another app), and the change feed attached to that pushes out to my other containers that I use for my queries. This means duplicating data, but storage is cheap and you save costs to retrieve data (think of those extra containers as covering indexes in the relational database world).
Since joining data can involve multiple ways of reading the data, it’s important to understand the two ways to read data using the Azure Cosmos DB SQL API: * Point reads – You can do a key/value lookup on a single _item ID_ and partition key. The _item ID_ and partition key combination is the key and the item itself is the value. For a 1 KB document, point reads typically cost 1 request unit with a latency under 10ms. Point reads return a single item. See Understanding the difference between point reads and queries in Azure Cosmos DB
* SQL queries – You can query data by writing queries using the Structured Query Language (SQL) as a JSON query language. Queries always cost at least 2.3 request units and, in general, will have a higher and more variable latency than point reads. Queries can return many items. See Getting started with SQL queries
_The key in deciding when to use a normalized data model is how frequently the data will change_. If the data only changes once a year it may not be worthwhile to create a reference document, and instead just do an update to all the documents. But be aware that the update has to be done from the client side, spread over the affected documents, in batches, as one big UPDATE statement does not exist in Cosmos DB. You will need to retrieve the entire document from Cosmos DB, update the property/properties in your application, and then call the ‘Replace’ method in the Cosmos DB SDK to replace the document in question (see CosmosDb – Updating a Document (Partially)).
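The read–modify–replace loop described above looks like this in outline (an in-memory dict stands in for the container, and `replace_item` is a made-up stand-in for the SDK's Replace call, not real Cosmos DB SDK code):

```python
import copy

# Stand-in "container": item id -> document.
container = {
    "p1": {"id": "p1", "name": "Ada", "city": "London"},
    "p2": {"id": "p2", "name": "Grace", "city": "London"},
}

def replace_item(store, doc):
    """Stand-in for the SDK's Replace: overwrite the whole document."""
    store[doc["id"]] = doc

# No bulk UPDATE exists, so patch each affected document client-side.
for item_id in list(container):
    doc = copy.deepcopy(container[item_id])  # 1. read the full document
    if doc["city"] == "London":
        doc["city"] = "Cambridge"            # 2. modify in the application
        replace_item(container, doc)         # 3. replace the document

print(container["p2"]["city"])  # Cambridge
```

Against a real container, each iteration is a read plus a write charged in request units, which is why infrequently changing reference data may not justify normalization.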
If you are using the SQL API and .NET or Java, you can consider using bulk support (.NET) or the bulk executor (Java). Other ideas would involve using the change feed, or, if you really need a level of ACID consistency, you can achieve this using stored procedures, with snapshot isolation scoped to a single partition (these are not the same as stored procedures in SQL – rather they are designed specifically to support multi-doc transactions). Also be aware that because there is currently no concept of a constraint, foreign-key or otherwise, any inter-document relationships that you have in documents are effectively “weak links” and will not be verified by the database itself. If you want to ensure that the data a document is referring to actually exists, then you need to do this in your application, or through the use of server-side triggers or stored procedures on Azure Cosmos DB. WHAT ARE OLTP SCENARIOS WHERE A RELATIONAL DATABASE IS ESSENTIAL? Avoiding the broader topic of when to use a relational database over a non-relational database, there are a few use cases where a relational database is essential: * The customer experience and comfort zone is with relational databases. It is a reality that relational databases are ahead in the maturity curve with respect to tooling (an example would be foreign-key constraint behavior). However, it should be noted that this is not the same as saying that “more use cases are technically better suited to the relational model”. Rather, the barrier to entry in new customer projects tends to be lower because mindshare is greater in the relational space. In these cases, it often isn’t worth the effort for companies to upskill * The system really needs strict ACID semantics across the entire dataset. Sharded/partitioned databases like Cosmos DB will not provide ACID guarantees across the entire set of physical partitions (and likely never will). In reality, however, the set of use cases where this is necessary is quite small. Things like transaction management and other SDK-level things that go along with these aspects come easier in the RDBMS space, but this is really the same as the above point – the RDBMS is ahead on the maturity curve for user-level tooling to help abstract paradigm-specific concepts – but this does not make the paradigm better suited to a greater number of use cases * Having a single data store that services both operational and analytical needs with equal utility, including tabular models – this is probably the most powerful argument, and NoSQL engines are likely never going to serve a data structure that coalesces as well into tabular models that produce reports, charts, graphs, etc. But again, history has proven that, at scale, the “one size fits all” approach can have some non-trivial drawbacks. And the new Analytical Store in Cosmos DB is addressing the need to service both operational and analytical needs
You can create complex hierarchical “relationships” in Cosmos DB which would have to be modelled in separate tables in an RDBMS. Cosmos DB can’t handle them using joins – but again, this is a paradigmatic/semantic difference, not a fundamental flaw in the database model itself. In order to do the equivalent of what you may be trying to achieve in a relational database, you may have to “unlearn what you have learned”, but this comes back to your comfort level with an RDBMS, which is not a trivial thing and can be the main and very valid reason for staying with an RDBMS. In summary, in a NoSQL database like Cosmos DB, most use cases are covered. Some things are a little harder (due to lack of maturity in tooling), but most things are easier, many things can only be done in NoSQL (i.e. handling millions of transactions per second), and very few things cannot be done in a NoSQL database. Most NoSQL engines are characterized by having a lot more configurability, tunability, and flexibility than an RDBMS. And in many ways, that is the hardest challenge for newcomers. More info:
Data modeling in Azure Cosmos DB
Video: Data modelling and partitioning in Azure Cosmos DB: What every relational database user needs to know
Video: A tour of Azure Cosmos DB database operations models

Posted in Azure Cosmos DB, SQLServerPedia Syndication | Comments Off on Cosmos DB for the SQL Professional – Referencing Tables
SQL ON-DEMAND IN AZURE SYNAPSE ANALYTICS

Posted on August 10, 2020 by James Serra
The new SQL on-demand (to be renamed SQL serverless) offering within Azure Synapse Analytics opens up a whole new world of getting insights into data immediately, at low cost, using familiar tools and languages. There is no infrastructure to set up or clusters to maintain. It uses a pay-per-query model (you are charged only for the data processed by each query – see pricing) and auto-scales depending on the amount of data being queried. See Quickstart: Use SQL on-demand.

A SQL on-demand pool can currently access data from ADLS Gen2, Spark tables, and Cosmos DB using T-SQL. Querying data in ADLS Gen2 storage with T-SQL is made easy by the OPENROWSET function and its additional capabilities (check out the T-SQL that is supported).
The currently supported file types in ADLS Gen2 that SQL-on-demand can use are Parquet, CSV, and JSON. ParquetDirect and CSV 2.0 add performance improvements (see Benchmarking Azure Synapse Analytics – SQL Serverless, using .NET Interactive).
You can also query folders and multiple files and use file metadata in queries.
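Folder queries and the file-metadata functions can be combined in a single statement. A hedged sketch – the storage account, container, and folder layout below are hypothetical:

```sql
-- Query every Parquet file under a partitioned folder and surface which
-- file each row came from, using the filename() metadata function.
-- filepath(1) returns the value matched by the first wildcard (year).
SELECT r.filename() AS source_file,
       *
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/sales/year=*/month=*/*.parquet',
        FORMAT = 'PARQUET'
     ) AS r
WHERE r.filepath(1) = '2020';
```

Filtering on filepath() this way lets SQL on-demand skip folders that can't match, so you only pay for the data it actually reads.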
Here are the three main scenarios that SQL on-demand is great for:

* Basic discovery and exploration – Quickly view data in various formats (Parquet, CSV, JSON) in your data lake so you can plan how to extract insights from it
* Logical data warehouse – Provide a relational abstraction on top of raw or disparate data without having to relocate or transform it, giving an always up-to-date view of your data. Putting T-SQL views on top of files in the data lake makes it appear to end users that they are querying a relational database, blurring the line between a relational database and a data lake
* Data transformation – A simple, scalable, and performant way to transform data in the lake using T-SQL so it can be fed to BI and other tools, or loaded into a relational data store (Synapse SQL databases, Azure SQL Database, etc.). For example, using the Copy activity in Azure Data Factory you can convert CSV files in the data lake (via T-SQL views in SQL on-demand) to Parquet files in the data lake. See Azure Synapse Analytics: How serverless is replacing the data warehouse
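The logical data warehouse scenario boils down to putting T-SQL views over files. A minimal sketch – the `curated` schema and the storage path are hypothetical:

```sql
-- A view makes the CSV files in the lake look like a relational table
-- to any tool that can speak T-SQL.
CREATE VIEW curated.Sales
AS
SELECT *
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/sales/*.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',   -- the faster CSV 2.0 parser mentioned above
        HEADER_ROW = TRUE
     ) AS rows;
```

End users then query curated.Sales like any other view; Power BI or SSMS never needs to know the data lives in CSV files.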
Different professional roles can benefit from SQL on-demand:

* Data Engineers can explore the lake, transform and prepare data, and simplify their data transformation pipelines. For more information, check out this tutorial. You could even create a view over data in the data lake and use that to refresh the data in a tabular model
* Data Scientists can quickly reason about the contents and structure of the data in the lake, thanks to features such as OPENROWSET and automatic schema inference
* Data Analysts can explore data and Spark external tables created by Data Scientists or Data Engineers using the familiar T-SQL language or their favorite tools, which can connect to SQL on-demand. If you create a Spark table, it will be created as an external table in SQL on-demand, so you can query it without having to keep a Spark cluster up and running. Currently this is only supported for Spark tables created using Parquet as the input format
* BI Professionals can quickly create Power BI reports on top of data in the lake and Spark tables. You are connecting to a "database" that has no data, but rather views or external tables over data in the data lake. Not only can you create reports, but you can use Power BI to explore the data and do basic discovery

Here are the top FAQs about SQL on-demand that I have heard from customers:
* _Will SQL on-demand be able to query SQL Pool tables?_ It will be able to in the future
* _Is SQL serverless MPP?_ SQL serverless uses nodes to scale out processing, similar to MPP, although it is a completely different architecture from existing MPP data warehouses. It uses customized SQL engines
* _What are the warmup times when submitting a query?_ At the 95th percentile it is less than 2s. There is a timeout after which resources scale down, so if you don't execute any query targeting storage during that timeout, your resources will be "taken from you". Once you connect again and execute a query that targets storage, resources are granted back to your endpoint. A pool of warm clusters is maintained to keep warmup times quick
* _Is there a predicate pushdown concept for SQL on-demand in Azure Synapse?_ Yes, there is filter pushdown, where SQL on-demand pushes down queries from the front-end to back-end nodes. For example, if you query Parquet files, the Parquet metadata is used to target only the column groups that contain the values you are looking for. Microsoft is expanding the range of cases in which filter pushdown is supported
* _What are the best practices for SQL on-demand?_ Check out https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-sql-on-demand
* _Why should I choose serverless and take a penalty on the first (warmup) query, instead of using provisioned?_ It depends on your workload. For constant or high workloads provisioned might be a better choice, while for sporadic, ad-hoc, or exploratory workloads serverless might be a better fit. There is also a difference in charges: provisioned means you are paying for resources, while with serverless you are paying only for what you use
* _So when would you use provisioned SQL (aka SQL Pool, which is exactly what was in SQL Data Warehouse)?_ When you need consistent performance, high performance, or have a large number of queries being run consistently. Provisioned SQL may give you better and more predictable performance due to resource reservation. The good thing is that because both use T-SQL, it is easy to transition back and forth between SQL serverless and a SQL pool
* _Most customers are starting to use Delta Lake to store their data. When will SQL on-demand support this file format?_ Microsoft is working on support for Delta Lake; more details on what it includes and when it will be available will come in the following months
* _What is the ADLS Gen2 security?_ There are different methods to access storage account files using SQL serverless. You can use the Synapse workspace MSI, SAS, or control ACLs at the storage level for a particular AAD account
* _Can we read files stored in Common Data Model (CDM) format within the data lake (ADLS Gen2)?_ Yes, at this moment there is a way to read CDM data using a SQL script or provisioned script, which gets metadata from the model.json and creates views over the CDM data (native support in the query currently doesn't exist)
* _Why use external tables over views?_ Views are faster and have more features, so avoid external tables if possible. In particular, the virtual functions (filepath and filename) are not supported in external tables at this moment, which means users cannot do partition elimination on external tables
* _Can you use PBI and AAS with SQL on-demand?_ Yes, you can create views on top of your data lake, using wildcards and the filepath/filename functions to expose partitioning column values in a view. That way, for PBI and AAS it is just another object (make sure you do proper data type casting of the virtual function results, as stated in the best practices document)
* _Why use a logical data warehouse instead of just using AAS or PBI?_ Hot analytics is one reason, and a logical data warehouse can be used to speed up user request-to-delivery time. Ideally there would be no ETL to create (that might not be feasible in all cases, so this is just another perspective)
* _Is there any cost control?_ Yes, cost control will be available per workspace and time period (daily/weekly/monthly), with cost estimation coming later
* _What about CI/CD?_ When Synapse goes GA there will be improvements overall in the CI/CD area, where customers can create groups/departments and scope the permissions of particular teams to artifacts and capabilities
* _Can I use other tools with SQL on-demand?_ Yes, there is a default endpoint (i.e. workspacejs1966-ondemand.sql.azuresynapse.net) for this service that is provided within every Azure Synapse workspace. Any tool capable of establishing a TDS connection to SQL offerings can connect to and query the Synapse SQL on-demand endpoint. You can connect with Azure Data Studio or SSMS and run ad-hoc queries, or connect with Power BI to gain insights. You can also use Private Link to bring your SQL on-demand endpoint into your managed workspace VNet

More info:
Talking about Azure Synapse on Microsoft Mechanics!
Azure Synapse Analytics – Getting Started with SQL On Demand
Azure Synapse Analytics – The essential SQL on-demand cheat sheet
Things to know about SQL on-demand (serverless query) in Azure Synapse Analytics
Power BI and SQL on-demand

Posted in Azure Synapse Analytics/SQL DW, SQLServerPedia Syndication | 2 Comments
ABOUT JAMES SERRA
I am a big data and data warehousing solution architect at Microsoft. Previously I was an independent consultant working as a Data Warehouse/Business Intelligence architect and developer. I am a prior SQL Server MVP with over 35 years of IT experience.