Apache Spark is the Taylor Swift of big data software. The open source technology has been around and popular for a few years. But 2015 was the year Spark went from an ascendant technology to a bona fide superstar.
In this post, I will introduce the key concepts and terminology used when working on data science projects with Databricks and Apache Spark.
Data Science Concepts and Terminology
Apache Spark: general-purpose distributed data engine for large-scale processing, with libraries for SQL (Spark SQL), streaming, machine learning, and graph processing
API: acronym for application programming interface, a software intermediary that allows two or more software components to connect and exchange data with each other
REST API: acronym for representational state transfer, a flexible API style that allows applications to transfer data in multiple formats, commonly JSON, over HTTP
SOAP API: acronym for simple object access protocol, a highly structured API that exchanges messages in XML format
CTE: acronym for common table expression, a temporary result set defined with a WITH statement in SQL that can be referenced one or more times within the query
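For example, here is a minimal CTE sketch run through PySpark's spark.sql, assuming a SparkSession named spark is already available (as in a Databricks notebook) and a hypothetical orders table:

```python
# Hypothetical table "orders"; the WITH clause defines a temporary
# result set (daily_totals) that the outer query references.
query = """
WITH daily_totals AS (
    SELECT order_date, SUM(amount) AS total
    FROM orders
    GROUP BY order_date
)
SELECT order_date, total
FROM daily_totals
WHERE total > 1000
"""
spark.sql(query).show()
```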
Data lake: large repository of raw data of various types, including structured data from relational databases, semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video)
Data lakehouse: platform, such as Databricks or Snowflake, that combines a data lake and a data warehouse to enable data analytics, machine learning, and storage of various data types
Data warehouse: organized set of structured data, designed for data analytics
Delta Lake: open-source table format for data storage, supporting ACID transactions, high-performance query optimizations, schema evolution, and data versioning
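As a rough sketch, writing and reading a Delta table from PySpark might look like the following, assuming a cluster with Delta Lake support (such as Databricks) and a hypothetical storage path:

```python
# Hypothetical path; Delta stores the data files plus a transaction log
# that provides ACID guarantees and data versioning.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.format("delta").mode("overwrite").save("/tmp/example_delta_table")

# Read the Delta table back
spark.read.format("delta").load("/tmp/example_delta_table").show()
```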
ETL: acronym for extract, transform, and load, the process of extracting data from multiple sources, transforming it, and loading it into a data warehouse
Hadoop: an open source framework based on Java that manages the storage and processing of large amounts of data for applications
JDBC connection: acronym for Java Database Connectivity, the Java API that manages connecting to a database, issuing queries and commands, and handling result sets
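A hedged sketch of reading a table over a JDBC connection with Spark; the URL, table name, and credentials below are placeholders:

```python
# All connection details are hypothetical; substitute your own
# driver URL, table, and credentials.
jdbc_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sales")
    .option("dbtable", "public.customers")
    .option("user", "analyst")
    .option("password", "<password>")
    .load()
)
jdbc_df.show()
```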
JSON: acronym for JavaScript object notation, a text-based format for storing and exchanging data in a way that's both human-readable and machine-readable
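A small illustration in plain Python of parsing and producing JSON:

```python
import json

# Parse a JSON string into a Python dictionary, then serialize it back to text.
record = json.loads('{"name": "Ada", "scores": [95, 87]}')
print(record["name"])                 # Ada
print(json.dumps(record, indent=2))   # pretty-printed JSON
```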
Jupyter Notebook: open document format based on JSON, used by data scientists to record code, equations, visualizations, and other computational outputs
MLflow: open-source platform for managing the machine learning lifecycle, including experiment tracking, model packaging, and deployment
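A minimal experiment-tracking sketch with the mlflow Python package; the parameter and metric values here are made up:

```python
import mlflow

# Each run records parameters, metrics, and artifacts for later comparison.
with mlflow.start_run():
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("rmse", 0.82)
```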
Looker: business intelligence platform that generates SQL queries and submits them against a database connection, for data analytics and reporting
Regular expression: a sequence of characters that specifies a match pattern in text, usually used for “find” or “find and replace” operations on strings
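For instance, using Python's re module for "find" and "find and replace" on dates in a string:

```python
import re

text = "Orders placed on 2023-01-15 and 2023-02-03 shipped late."

# "Find": all substrings matching the YYYY-MM-DD pattern
print(re.findall(r"\d{4}-\d{2}-\d{2}", text))   # ['2023-01-15', '2023-02-03']

# "Find and replace": mask the dates
print(re.sub(r"\d{4}-\d{2}-\d{2}", "<date>", text))
```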
PySpark: the Python API for Apache Spark, supporting all of Spark's features including Spark SQL, DataFrames, machine learning (MLlib), and Spark Core
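A small PySpark sketch with made-up data, building a DataFrame and running a filter and aggregation:

```python
from pyspark.sql import SparkSession, functions as F

# On Databricks a SparkSession named `spark` already exists; elsewhere, build one.
spark = SparkSession.builder.appName("glossary-example").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)], ["name", "age"]
)
df.filter(F.col("age") > 30).agg(F.avg("age").alias("avg_age")).show()
```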
ODBC: acronym for open database connectivity, a standard API for accessing database management systems
UDF: acronym for user-defined function, a function written by the user to encapsulate custom logic so it can be reused in queries and DataFrame operations
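As a sketch, a PySpark UDF wraps a plain Python function so it can be applied to a DataFrame column (this assumes a SparkSession named spark, as above, and made-up data):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def label_age(age):
    # Plain Python logic reused across rows
    return "senior" if age >= 40 else "junior"

# Register the function as a UDF with an explicit return type
label_age_udf = F.udf(label_age, StringType())

df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.withColumn("label", label_age_udf(F.col("age"))).show()
```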
XML: acronym for eXtensible Markup Language, a markup language for storing, transmitting, and reconstructing arbitrary data, defining a set of rules for encoding documents in a format that is both human-readable and machine-readable
Databricks Concepts and Terminology
ADF: acronym for Azure Data Factory, a cloud-based service that simplifies data integration and orchestration by moving data from diverse sources such as on-premises databases, cloud platforms, and SaaS applications
Cluster: set of computation resources and configurations on which you run notebooks and jobs
DBFS: acronym for Databricks File System, a distributed file system on Databricks that interacts with cloud-based storage
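A quick sketch of browsing DBFS from inside a Databricks notebook, where the dbutils helper is available:

```python
# `dbutils` exists only inside Databricks notebooks and jobs.
for entry in dbutils.fs.ls("/"):
    print(entry.path)
```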
DBU: acronym for Databricks unit, a unit of processing capacity billed on per-second usage
Overwatch: monitoring tool that collects logs and writes them to the Overwatch database nightly, providing visibility into the daily usage of each cluster
Photon: high-performance Azure Databricks-native vectorized query engine that runs your SQL workloads and DataFrame API calls faster