Introduction to Databricks and PySpark for SAS Developers
This is a collaborative post between Databricks and WiseWithData. We thank Founder and President Ian J. Ghent, Head of Pre-Sales Solutions R&D Bryan Chuinkam, and Head of Migration Solutions R&D Ban (Mike) Sun of WiseWithData for their contributions.
Technology has come a long way since the days of SAS®-driven data and analytics workloads. The lakehouse architecture is enabling data teams to process all types of data (structured, semi-structured and unstructured) for a wide range of use cases (data science, machine learning, real-time analytics, or classic business intelligence and data warehousing), all from a single copy of data. Performance and capabilities are combined with elegance and simplicity, creating a platform that is unmatched in the world today. Open-source technologies such as Python and Apache Spark™ have become the #1 choice of data engineers and data scientists, largely because they are simple and accessible.
Many SAS users are boldly embarking on modernizing their skill sets. While Databricks and PySpark are designed to be simple to learn, there can be a learning curve for experienced practitioners focused on SAS. But the good news for the SAS developer community is that Databricks embodies the concept of an open and simple platform architecture, and it makes it easy for anyone who wants to build solutions in the modern data and AI cloud platform. This article surfaces some of the detailed mapping between the old and new worlds of analytics programming.
Finding common ground
For all their differences, SAS and Databricks have some remarkable similarities. Both are designed from the ground up to be unified, enterprise-grade platforms. They both let the developer mix and match between SQL and much more flexible programming paradigms. Both support built-in transformations and data summarization capabilities. Both support high-end analytical functions like linear and logistic regression, decision trees, random forests and clustering. They also both support a semantic data layer that abstracts away the details of the underlying data sources. Let's do a deeper dive into some of these shared concepts.
SAS DATA steps vs DataFrames
The SAS DATA step is arguably the most powerful feature in the SAS language. You have the ability to union, join, filter and add, remove and modify columns, as well as plainly express conditional and looping business logic. Proficient SAS developers leverage it to build massive DATA step pipelines to optimize their code and avoid I/O.
The PySpark DataFrame API has most of those same capabilities. For many use cases, DataFrame pipelines can express the same data processing pipeline in much the same way. Most importantly, DataFrames are super fast and scalable, running in parallel across your cluster (without you needing to manage the parallelism).
| Example SAS Code | Example PySpark Code |
|---|---|
| `data df1;`<br>`  set df2;`<br>`  x = 1;`<br>`run;` | `df1 = (`<br>`  df2`<br>`  .withColumn('x', lit(1))`<br>`)` |
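To make the comparison concrete, here is a minimal sketch of a multi-step DataFrame pipeline, the kind of filtering, column derivation and joining a single DATA step pipeline often performs. The table and column names (`orders`, `customers`, `gross_amt`, and so on) are illustrative assumptions, not part of the example above.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.read.table("orders")          # assumed input tables
customers = spark.read.table("customers")

df1 = (
    orders
    .filter(F.col("order_status") == "SHIPPED")                          # subsetting IF / WHERE
    .withColumn("net_amt", F.col("gross_amt") - F.col("discount_amt"))   # derived column
    .join(customers, on="customer_id", how="left")                       # MERGE-style join
    .select("customer_id", "order_id", "net_amt", "region")
)
```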
SAS PROC SQL vs SparkSQL
The industry standard SQL is the lowest common denominator in analytics languages. Almost all tools support it to some degree. In SAS, you have a distinct tool that can use SQL, called PROC SQL, which lets you interact with your SAS data sources in a way that is familiar to many who know nothing about SAS. It's a bridge, or a common language, that almost everyone understands.
PySpark has similar capabilities; by simply calling spark.sql(), you can enter the SQL world. But with Apache Spark™, you have the ability to leverage your SQL knowledge and can go much further. The SQL expression syntax is supported in many places within the DataFrame API, making it much easier to learn. Another friendly tool for SQL programmers is Databricks SQL, with a SQL programming editor to run SQL queries with blazing performance on the lakehouse.
| Example SAS Code | Example PySpark Code |
|---|---|
| `proc sql;`<br>`  create table sales_last_month as`<br>`  select customer_id,`<br>`         sum(trans_amt) as sales_amount`<br>`  from sales.pos_sales`<br>`  group by customer_id`<br>`  order by customer_id;`<br>`quit;` | `sales['sales'].createOrReplaceTempView('sales')`<br>`work['sales_last_month'] = spark.sql("""`<br>`  SELECT customer_id,`<br>`         sum(trans_amt) AS sales_amount`<br>`  FROM sales`<br>`  GROUP BY customer_id`<br>`  ORDER BY customer_id`<br>`""")` |
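Because SQL expression syntax is accepted throughout the DataFrame API, you rarely need a full spark.sql() round trip. A small sketch follows; the derived `est_tax` column is made up purely for illustration.

```python
from pyspark.sql import functions as F

df = spark.table("sales")                            # the temp view registered above

summary = (
    df
    .filter("trans_amt > 0")                         # SQL predicate as a string
    .selectExpr("customer_id",
                "trans_amt * 0.1 AS est_tax")        # SQL expressions inside select
    .groupBy("customer_id")
    .agg(F.expr("sum(est_tax) AS total_est_tax"))    # SQL aggregate expression
)
```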
Base SAS Procs vs PySpark DataFrame transformations
SAS packages up much of its pre-built functionality into procedures (PROCs). This includes transformations like data aggregation and summary statistics, as well as data reshaping, importing/exporting, etc. These PROCs represent distinct steps, or process boundaries, in a large job. In contrast, those same transformations in PySpark can be used anywhere, even within a DataFrame pipeline, giving the developer much more flexibility. Of course, you can still break them up into distinct steps.
| Example SAS Code | Example PySpark Code |
|---|---|
| `proc means data=df1 max min;`<br>`  var MSRP Invoice;`<br>`  where Make = 'Acura';`<br>`  output out = df2;`<br>`run;` | `df2 = (`<br>`  df1.filter("Make = 'Acura'")`<br>`     .select("MSRP", "Invoice")`<br>`     .summary('max', 'min')`<br>`)` |
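Because these transformations are just DataFrame methods, the same summary logic can also sit in the middle of a larger pipeline instead of forming a separate, PROC-like step. A sketch using the columns from the example above:

```python
from pyspark.sql import functions as F

stats = (
    df1
    .filter("Make = 'Acura'")
    .groupBy("Make")                                  # aggregation embedded in the pipeline
    .agg(F.min("MSRP").alias("min_msrp"),
         F.max("MSRP").alias("max_msrp"),
         F.min("Invoice").alias("min_invoice"),
         F.max("Invoice").alias("max_invoice"))
)
```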
Lazy execution – SAS “run” statement vs PySpark actions
The lazy execution model in Spark is the foundation of so many optimizations, and it is what allows PySpark to be so much faster than SAS. Believe it or not, SAS also has support for lazy execution! Impressive for a language designed over 50 years ago. You know all those “run” (and “quit”) statements you are forced to write in SAS? They are actually SAS's own version of PySpark actions.
In SAS, you can define multiple steps in a process, but they don't execute until “run” is called. The main difference between SAS and PySpark is not the lazy execution, but the optimizations that are enabled by it. In SAS, unfortunately, the execution engine is also “lazy,” ignoring all the potential optimizations. For this reason, lazy execution in SAS code is rarely used, because it doesn't help performance.
So the next time you are confused by the lazy execution model in PySpark, just remember that SAS is the same; it's just that nobody uses the feature. Your Actions in PySpark are like the run statements in SAS. In fact, if you want to trigger immediate execution in PySpark (and store intermediate results to disk), just like the run statement, there is an Action for that. Just call “.checkpoint()” on your DataFrame.
| Example SAS Code | Example PySpark Code |
|---|---|
| `data df1;`<br>`  set df2;`<br>`  x = 1;`<br>`run;` | `df1 = (`<br>`  df2`<br>`  .withColumn('x', lit(1))`<br>`).checkpoint()` |
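One practical note: checkpoint() writes to a checkpoint directory that has to be configured first. A small sketch of the lazy-then-eager flow (the directory path is illustrative):

```python
from pyspark.sql.functions import lit

spark.sparkContext.setCheckpointDir("/tmp/checkpoints")   # illustrative path

pipeline = df2.withColumn("x", lit(1))   # lazy: only builds a query plan
pipeline.count()                         # an action: the plan executes now
df1 = pipeline.checkpoint()              # materializes intermediate results to disk
```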
Advanced analytics and Spark ML
Over the past 45 years, the SAS language has amassed some significant capabilities for statistics and machine learning. The SAS/STAT procedures package up vast amounts of functionality within their odd and inconsistent syntax. SparkML, on the other hand, includes capabilities that cover most of the modern use cases for STAT, but in a more cohesive and consistent way.
One notable difference between these two packages is the overall approach to telemetry and diagnostics. With SAS, you get a complete dump of each and every statistical measure when you run a machine learning task. This can be confusing and inefficient for modern data scientists.
Typically, data scientists only need one, or a small set, of model diagnostics they prefer to use to assess their model. That's why SparkML takes a different and more modular approach, providing APIs that let you get those diagnostics on request. For large data sets, this difference in approach can have significant performance implications by avoiding the computation of statistics that have no use.
It's also worth noting that everything in the PySpark ML library is a parallelized algorithm, so it is much faster. For those purists out there, yes, we know a single-threaded logistic regression model may have a slightly better fit. We've all heard that argument, but you're totally missing the point here. Faster model development means more iterations and more experimentation, which leads to much better models.
| Example SAS Code | Example PySpark Code |
|---|---|
| `proc logistic data=ingots;`<br>`  model NotReady = Heat Soak;`<br>`run;` | `vector_assembler = VectorAssembler(inputCols=['Heat', 'Soak'], outputCol='features')`<br>`v_df = vector_assembler.transform(ingots).select(['features', 'NotReady'])`<br>`lr = LogisticRegression(featuresCol='features', labelCol='NotReady')`<br>`lr_model = lr.fit(v_df)`<br>`lr_predictions = lr_model.transform(v_df)`<br>`lr_evaluator = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction', labelCol='NotReady')`<br>`print('Area Under ROC', lr_evaluator.evaluate(lr_predictions))` |
In the PySpark example above, the input columns “Heat” and “Soak” are combined into a single feature vector using the VectorAssembler API. A logistic regression model is then trained on the transformed data frame using the LogisticRegression algorithm from the SparkML library. To print the AUC metric, the BinaryClassificationEvaluator is used, with the predictions from the trained model and the actual labels as inputs. This modular approach gives better control in calculating the model performance metrics of choice.
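As a sketch of the “diagnostics on request” idea, the fitted model also exposes a training summary you can query for just the measures you care about (attribute availability varies by Spark version):

```python
# lr_model is the fitted LogisticRegressionModel from the example above
training_summary = lr_model.summary                      # BinaryLogisticRegressionTrainingSummary
print("Area under ROC:", training_summary.areaUnderROC)  # pull only the metrics you need
print("Objective history:", training_summary.objectiveHistory)
```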
The differences
While there are many commonalities between SAS and PySpark, there are also a lot of differences. As a SAS expert learning PySpark, some of these differences can be very difficult to navigate. Let's break them into a few different categories. There are some SAS features that aren't available natively in PySpark, and then there are things that simply require a different tool or approach in PySpark.
Different ecosystem
The SAS platform is a whole collection of acquired and internally developed products, many of which work relatively well together. Databricks is built on open standards and, as such, you can easily integrate thousands of tools. Let's take a look at some SAS-based tools and capabilities available in Databricks for similar use cases.
Let's start with SAS® Data Integration Studio (DI Studio). Powered by a complicated metadata-driven model, DI Studio fills some important roles in the SAS ecosystem. It primarily provides production job flow orchestration for ETL workloads. In Databricks, data engineering pipelines are developed and deployed using Notebooks and Jobs. Data engineering tasks are powered by Apache Spark (the de facto industry standard for big data ETL).
Databricks' Delta Live Tables (DLT) and Job orchestration further simplify ETL pipeline development on the lakehouse architecture. DLT provides a reliable framework to declaratively create ETL pipelines, instead of a traditional, procedural sequence of transformations. That means the user describes the desired results of the pipeline without explicitly listing the ordered steps that must be performed to arrive at the result. The DLT engine intelligently figures out “how” the compute framework should carry out these processes.
The other key role that DI Studio plays is providing data lineage tracking. This feature, however, only works properly when you set everything up just right and manually enter metadata on all code nodes (a very painful process). In contrast, DLT ensures that the generated pipeline automatically captures the dependencies between datasets, which are used to determine the execution order when performing an update, and records lineage information in the pipeline's event log.
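As a rough sketch of what “declarative” looks like in practice, a DLT table is just a decorated function that returns a DataFrame; the engine works out ordering and lineage. The table and column names below are illustrative, and the code runs inside a notebook attached to a DLT pipeline rather than interactively:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Cleaned point-of-sale transactions")
def clean_pos_sales():
    return (
        dlt.read("raw_pos_sales")                       # dependency captured automatically
        .filter(F.col("trans_amt") > 0)                 # simple data quality rule
        .withColumn("trans_date", F.to_date("trans_ts"))
    )
```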
While most data scientists are very happy coders, some prefer point-and-click data mining tools. There is an emerging term for these folks, “citizen data scientists,” whose persona is analytical, but not deeply technical. In SAS, you have the very expensive SAS® Enterprise Miner to build models without coding. This tool, with its user interface from a bygone era, lets users sample, explore, modify, model and assess their SAS data all from the comfort of their mouse, no keyboard required. Another point-and-click tool in SAS, called SAS® Enterprise Guide, is the most popular interface to both SAS programming and point-and-click analysis. Because of SAS' complicated syntax, many people like to use the point-and-click tools to generate SAS code that they then modify to suit their needs.
With PySpark, the APIs are simpler and more consistent, so the need for helper tools is reduced. Of course, the modern way to do data science is via notebooks, and the Databricks notebook does a great job of taking away coding for tasks that should be point and click, like graphing your data. Exploratory data analysis and model development in Databricks are performed using the Databricks ML Runtime from Databricks Notebooks. With Databricks AutoML, users get a point-and-click option to quickly train and deploy a model. Databricks AutoML takes a “glass-box” approach by generating editable, shareable notebooks with baseline models that integrate with MLflow Tracking and best practices to provide a modifiable starting point for new projects.
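AutoML can also be driven from a notebook. A hedged sketch (the training DataFrame and target column are made up, and the exact API surface may differ by runtime version):

```python
from databricks import automl

# train_df and the target column name are illustrative
summary = automl.classify(train_df, target_col="NotReady", timeout_minutes=30)
print(summary.best_trial.notebook_url)   # link to the generated "glass-box" notebook
```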
With the recent acquisition of 8080 Labs, a new capability coming to Databricks notebooks and workspaces is data exploration and analytics using low-code/no-code. The bamboolib package from 8080 Labs automatically generates Python code for user actions performed via point-and-click.
Putting it all together, the lakehouse architecture, powered by open source Delta Lake in Databricks, simplifies data architectures and enables storing all your data once in a data lake and doing AI and BI on that data directly.
The diagram above shows a reference architecture of Databricks deployed on AWS (the architecture is similar on other cloud platforms) supporting different data sources, use cases and user personas, all through one unified platform. Data engineers get to easily use open file formats such as Apache Parquet and ORC, along with built-in performance optimization, transaction support, schema enforcement and governance.
Data engineers now have to do less plumbing work and can focus on core data transformations, using streaming data with built-in Structured Streaming and Delta Lake tables. ML is a first-class citizen in the lakehouse, which means data scientists do not waste time subsampling or moving data to share dashboards. Data and operational analysts can work off the same data layer as other data stakeholders and use their beloved SQL programming language to analyze data.
Different approaches
As with any change, there are some things you simply have to adapt to. While much of the functionality of SAS programming exists in PySpark, some features are meant to be used in an entirely different way. Here are a few examples of the types of differences you'll need to adapt to in order to be effective in PySpark.
Procedural SAS vs Object Oriented PySpark
In SAS, most of your code ends up as either a DATA step or a procedure. In both cases, you always need to explicitly declare the input and output datasets being used (i.e. data=dataset). In contrast, PySpark DataFrames use an object-oriented approach, where the DataFrame reference is attached to the methods that can be performed on it. In most cases, this approach is far more convenient and more compatible with modern programming techniques. But it can take some getting used to, especially for developers who have never done any object-oriented programming.
| Example SAS Code | Example PySpark Code |
|---|---|
| `proc sort data=df1 out=dedup nodupkey;`<br>`  by cid;`<br>`run;` | `dedup = df1.dropDuplicates(['cid']).orderBy(['cid'])` |
Data reshaping
Let's take, for example, the common task of data reshaping in SAS, notionally handled by “proc transpose.” Transpose, unfortunately, is severely limited because it is restricted to a single data series. That means for practical purposes you have to call it many times and glue the data back together. That may be an acceptable practice on a small SAS dataset, but it can cause hours of extra processing on a larger dataset. Because of this limitation, many SAS developers have created their own data reshaping techniques, many using some combination of DATA steps with retain, arrays and macro loops. This reshaping code often ends up being hundreds of lines of SAS code, but it is the most efficient way to execute the transformation in SAS.
Many of the low-level operations that can be performed in a DATA step are simply not available in PySpark. Instead, PySpark provides much simpler interfaces for common tasks like data reshaping, with the groupBy().pivot() transformation, which supports multiple data series simultaneously.
| Example SAS Code | Example PySpark Code |
|---|---|
| `proc transpose data=test out=xposed;`<br>`  by var1 var2;`<br>`  var x;`<br>`  id y;`<br>`run;` | `xposed = (test`<br>`  .groupBy('var1', 'var2')`<br>`  .pivot('y')`<br>`  .agg(last('x'))`<br>`  .withColumn('_name_', lit('y'))`<br>`)` |
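Because pivot() accepts multiple aggregations, several data series can be reshaped in one pass, where PROC TRANSPOSE would need one call per series. A sketch (the extra column `z` is invented for illustration):

```python
from pyspark.sql import functions as F

wide = (
    test
    .groupBy("var1", "var2")
    .pivot("y")
    .agg(F.last("x").alias("x"),          # one output column per pivot value and aggregate
         F.sum("z").alias("z_total"))
)
```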
Column oriented vs. business-logic oriented
In most data processing systems, including PySpark, you define business logic within the context of a single column. SAS, by contrast, has more flexibility. You can define a large block of business logic within a DATA step and define column values within that business-logic framing. While this approach is more expressive and flexible, it can also be problematic to debug.
Changing your mindset to be column oriented isn't that challenging, but it does take some time. If you are proficient in SQL, it should come quite easily. What's more problematic is adapting existing business-logic code into a column-oriented world. Some DATA steps contain thousands of lines of business-logic-oriented code, making manual translation a complete nightmare.
| Example SAS Code | Example PySpark Code |
|---|---|
| `data output_df;`<br>`  set input_df;`<br>`  if x = 5 then do;`<br>`    a = 5; b = 6; c = 7;`<br>`  end;`<br>`  else if x = 10 then do;`<br>`    a = 10; b = 11; c = 12;`<br>`  end;`<br>`  else do;`<br>`    a = 1; b = -1; c = 0;`<br>`  end;`<br>`run;` | `output_df = (`<br>`  input_df`<br>`  .withColumn('a', expr("""case when (x = 5) then 5 when (x = 10) then 10 else 1 end"""))`<br>`  .withColumn('b', expr("""case when (x = 5) then 6 when (x = 10) then 11 else -1 end"""))`<br>`  .withColumn('c', expr("""case when (x = 5) then 7 when (x = 10) then 12 else 0 end"""))`<br>`)` |
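The same column-oriented logic can also be written with the functional when()/otherwise() API instead of SQL CASE strings; which one you pick is largely a matter of taste:

```python
from pyspark.sql import functions as F

output_df = (
    input_df
    .withColumn("a", F.when(F.col("x") == 5, 5).when(F.col("x") == 10, 10).otherwise(1))
    .withColumn("b", F.when(F.col("x") == 5, 6).when(F.col("x") == 10, 11).otherwise(-1))
    .withColumn("c", F.when(F.col("x") == 5, 7).when(F.col("x") == 10, 12).otherwise(0))
)
```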
The missing features
There are a number of powerful and important features in SAS that simply don't exist in PySpark. When you have your favorite tool in the toolbox and suddenly it's missing, it doesn't matter how fancy or powerful the new tools are; that trusty Japanese Dozuki saw is still the only tool for some jobs. In modernizing with PySpark, you will indeed encounter these missing tools that you are used to, but don't fret, read on and we've got some good news. First let's talk about what they are and why they're important.
Advanced SAS DATA step features
Let's say you want to generate new rows conditionally, keep the results of previous rows' calculations, or create totals and subtotals with embedded conditional logic. These are all tasks that are relatively simple in the iterative SAS DATA step API, but that our trusty PySpark DataFrame is just not equipped to easily handle.
Some data processing tasks need complete, fine-grained control over the whole process, in a “row iterative” manner. Such tasks aren't compatible with PySpark's shared-nothing MPP architecture, which assumes rows can be processed completely independently of each other. There are only limited APIs, like the window function, to deal with inter-row dependencies. Finding solutions to these problems in PySpark can be very frustrating and time consuming.
| Example SAS Code |
|---|
| `data df2;`<br>`  set df;`<br>`  by customer_id seq_num;`<br>`  retain counter;`<br>`  label = " ";`<br>`  if first.customer_id then counter = 0;`<br>`  else counter = counter + 1;`<br>`  output;`<br>`  if last.customer_id then do;`<br>`    seq_num = .;`<br>`    label = "Total";`<br>`    output;`<br>`  end;`<br>`run;` |
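For some inter-row cases, a window function gets you part of the way. Here is a sketch that reproduces the per-customer counter from the DATA step above (assuming a DataFrame named `df` with the same columns); the conditional emission of an extra "Total" row per group has no equally direct equivalent:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("customer_id").orderBy("seq_num")
df_counter = df.withColumn("counter", F.row_number().over(w) - 1)   # 0-based counter per customer
```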
Custom formats and informats
SAS formats are remarkable in their simplicity and usefulness. They provide a mechanism to reformat, remap and represent your data in one tool. While the built-in formats are useful for handling common tasks such as outputting a date string, they are also useful in numeric and string contexts. There are similar tools available in PySpark for these use cases.
The concept of custom formats or informats is a different story. They support a simple mapping of key-value pairs, but also mapping by range, and they support default values. While some use cases can be worked around by using joins, the convenient and concise syntax that SAS formats provide isn't something that is available in PySpark.
| Example SAS Code |
|---|
| `proc format;`<br>`  value prodcd`<br>`    1='Shoes'`<br>`    2='Boots'`<br>`    3='Sandals'`<br>`  ;`<br>`run;`<br>`data sales_orders;`<br>`  set sales_orders;`<br>`  product_desc = put(product_code, prodcd.);`<br>`run;` |
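The join-based workaround mentioned above looks roughly like this for a simple value format (names follow the example):

```python
# Small lookup DataFrame standing in for PROC FORMAT's value mapping
prodcd = spark.createDataFrame(
    [(1, "Shoes"), (2, "Boots"), (3, "Sandals")],
    ["product_code", "product_desc"],
)
sales_orders = sales_orders.join(prodcd, on="product_code", how="left")
```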
The library concept & access engines
One of the most common complaints from SAS developers using PySpark is that it lacks a semantic data layer integrated directly into the core end-user API (i.e. the Python session). The SAS data library concept is so familiar and ingrained, it's hard to navigate without it. There is the relatively new Catalog API in PySpark, but that requires constantly calling back to the catalog to get access to what you want. There isn't a way to simply define a logical data store and get back DataFrame objects for each table. Most SAS developers switching to PySpark don't like having to call spark.read.jdbc to access each database table; they are used to the access engine library concept, where all the tables in a database are at your fingertips.
| Example SAS Code |
|---|
| `libname lib1 'path1';`<br>`libname lib2 'path2';`<br>`data lib2.dataset;`<br>`  set lib1.dataset;`<br>`run;` |
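There is no direct equivalent, but a small helper can approximate the feel of a library by loading every table in a database (or schema) into a dict of DataFrames. This is a home-grown sketch, not a Spark or SAS feature, and the database name is illustrative:

```python
def load_database_tables(database: str) -> dict:
    """Return a dict of DataFrames, one per table in the given database."""
    return {
        t.name: spark.table(f"{database}.{t.name}")
        for t in spark.catalog.listTables(database)
    }

lib1 = load_database_tables("lib1_db")
df = lib1["dataset"]
```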
Solving for the differences – the SPROCKET Runtime
While many SAS language concepts are no longer relevant or useful, the missing features we just discussed are indeed very useful, and in some cases nearly impossible to live without. That's why WiseWithData has developed a special plugin for Databricks and PySpark that brings those familiar and powerful features into the modern platform: the SPROCKET Runtime. It's a key part of how WiseWithData is able to automatically migrate SAS code into Databricks and PySpark at incredible speed, while providing a 1-to-1 code conversion experience.
SPROCKET libraries & database access engines
SPROCKET libraries let you fast-track your analytics with a simplified way to access your data sources, just like the SAS language library concept. This powerful SPROCKET Runtime feature means no more messing about with data paths and JDBC connectors, and access to all your data in a single line of code. Simply register a library and have all the relevant DataFrames ready to go.
| SAS | SPROCKET |
|---|---|
| `libname lib 'path';`<br>`lib.dataset` | `register_library('lib', 'path')`<br>`lib['dataset']` |
Custom formats / informats
With the SPROCKET Runtime, you can leverage the power and simplicity of custom formats and informats to transform your data. Transform your data inside PySpark DataFrames using custom formats, just like you did in your SAS environment.
| SAS | SPROCKET |
|---|---|
| `proc format;`<br>`  value prodcd`<br>`    1='Shoes'`<br>`    2='Boots'`<br>`    3='Sandals'`<br>`  ;`<br>`run;`<br>`data sales_orders;`<br>`  set sales_orders;`<br>`  product_desc = put(product_code, prodcd.);`<br>`run;` | `value_formats = [`<br>`  {'fmtname': 'prodcd', 'fmttype': 'N', 'fmtvalues': [`<br>`    {'start': 1, 'label': 'Shoes'},`<br>`    {'start': 2, 'label': 'Boots'},`<br>`    {'start': 3, 'label': 'Sandals'},`<br>`  ]}]`<br>`register_formats(spark, 'work', value_formats)`<br>`work['sales_orders'] = (`<br>`  work['sales_orders']`<br>`  .transform(put_custom_format(`<br>`    'product_desc', 'product_code', 'prodcd'))`<br>`)` |
Macro variables
Macro variables are a powerful concept in the SAS language. While there are some similar concepts in PySpark, it's just not the same thing. That's why we've brought this concept into our SPROCKET Runtime, making it easy to use these concepts in PySpark.
| SAS | SPROCKET |
|---|---|
| `%let x=1;`<br>`&x`<br>`"value_&x._1"` | `set_smv('x', 1)`<br>`get_smv('x')`<br>`"value_{x}_1".format(**get_smvs())` |
Advanced SAS DATA step and the Row Iterative Processing Language (RIPL API)
The flexibility of the SAS DATA step language is available as a PySpark API within the SPROCKET Runtime. Want to use by-group processing, retained columns, do loops, and arrays? The RIPL API is your best friend. The RIPL API brings back the familiar business-logic-oriented data processing view. Now you can express business logic in familiar if/else conditional blocks in Python. All the features you know and love, but with the ease of Python and the performance and scalability of PySpark.
| SAS | SPROCKET – RIPL API |
|---|---|
| `data df2;`<br>`  set df;`<br>`  by customer_id seq_num;`<br>`  retain counter;`<br>`  label = " ";`<br>`  if first.customer_id then counter = 0;`<br>`  else counter = counter + 1;`<br>`  output;`<br>`  if last.customer_id then do;`<br>`    seq_num = .;`<br>`    label = "Total";`<br>`    output;`<br>`  end;`<br>`run;` | `def ripl_logic():`<br>`    rdv['label'] = ' '`<br>`    if rdv['_first_customer_id'] > 0:`<br>`        rdv['counter'] = 0`<br>`    else:`<br>`        rdv['counter'] = rdv['counter'] + 1`<br>`    output()`<br>`    if rdv['_last_customer_id'] > 0:`<br>`        rdv['seq_num'] = ripl_missing_num`<br>`        rdv['label'] = 'Total'`<br>`        output()`<br>`work['df2'] = (`<br>`    work['df']`<br>`    .transform(ripl_transform(`<br>`        by_cols=['customer_id', 'seq_num'],`<br>`        retain_cols=['counter']))`<br>`)` |
Retraining is always hard, but we're here to help
This journey toward a successful migration can be confusing, even frustrating. But you're not alone; thousands of SAS-based professionals are joining this worthwhile journey with you. WiseWithData and Databricks are here to support you with tools, resources and helpful hints to make the process easier.
Try the course, Databricks for SAS Users, on Databricks Academy to get basic hands-on experience with PySpark programming for SAS programming language constructs, and contact us to learn more about how we can assist your SAS team in onboarding their ETL workloads to Databricks and enabling best practices.
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.