Get started with PolyBase in SQL Server 2016

Polybase: This is certainly one of the most exciting features of SQL Server 2016 , the latest version of Microsoft’s database server. This tool provides a transparent access layer that makes it easy to connect the database engine to external data sources, structured or unstructured. PolyBase is also optimized for warehouse data workloads and analytics. This makes it a simple tool to merge the Big Data world into the SQL Server universe .

However, PolyBase is not new. This tool was already part of the Microsoft Analytics Platform System, with the name Parallel Data Warehouse. PolyBase creates a gateway between SQL Server and external data sources, allowing you to use T-SQL to query data as you can with traditional tables.

PolyBase: how it works

With PolyBase, you can access data contained in Hadoop or Azure Blob Storage clusters, but with the same SQL Server tools used to create statistics, or rely on BI applications. You can even create queries on semi-structured data that has been associated with SQL Server datasets. Its strength is that it is integrated directly into SQL Server and does not require third-party tools to perform identical operations or skills in Java, Map / Reduce, Hive or other concepts associated with Hadoop. You rely on existing SQL Server tools and T-SQL to explore unstructured and semi-structured data, all in a SQL Server environment.

If you use PolyBase to query Hadoop, SQL Server performs the bulk of the processing on the Hadoop cluster , where the data is located. The Polybase engine determines when to generate MAP treatments as needed. In addition, this engine parallelizes operations when importing data from Hadoop or Azure for example.

Another advantage: the work of PolyBase remains transparent for the application that performs queries because the external data sources are integrated into the schema of the database. You can query and integrate data from multiple systems without having to transform applications.

Start with PolyBase

To install PolyBase, you must first select the PolyBase Query Service for External Data function in the SQL Server installation. The module can only be installed on a single instance per machine. The installation adds 3 databases (DWConfiguration, DWDiagnostics and DWQueue) and 2 services (SQL Server PolyBase Engine and SQL Server PolyBase Data Movement Service).

Currently, Cloudera Hadoop and Hortonworks are supported as well as Azure Blob Storage. Microsoft plans to add additional data sources with future versions of SQL Server.

After you install SQL Server with PolyBase, you must run the sp_configure stored procedure to enable Polybase. From there, you can follow the steps below to add the object of a schema to your database:

Create an external data source object to connect to a Hadoop file system or Azure Blob Storage. For such an object, use the CREATE EXTERNAL DATA SOURCE command.

Create an external file format object to specify the format of Hadoop or Azure data. Today, PolyBase supports 3 formats: delimited text, Hive RCFile, Hive ORC. To set such a format, use the CREATE EXTERNAL FILE FORMAT command.

Create one or more external tables based on data sources and formats. Tables provide a schema-based structure for referencing external data. They can be referenced as SQL Server relational tables. For a table, use the CREATE

EXTERNAL TABLE command.

By opening SQL Server Management Studio, you will find several new folders in Objects Explorer at the database level. The first is the External Resources subfolder. It contains all the data sources and file formats defined. The Folder folder will also include the External Tables subfolder that groups all the defined external tables.

After creating your external tables, you can use the CREATE STATISTICS command to create query optimization statistics for these same tables. When this is in place, the database engine imports the data into a temporary SQL Server table.

Microsoft has also included several dynamic views to SQL Server to handle PolyBase queries. Added to this are options for retrieving data from these external sources of data, files and tables that were then defined.

By placing Polybase directly in SQL Server, Microsoft has taken a key step: bringing unstructured and semi-structured data closer to structured data. This data can be exposed to the brand’s BI tools, such as Power Pivot and Power Query, as well as other third-party tools such as Tableau and Cognos.