Apache Spark Cheat Sheet

appName()
Sets a name for your application, which is displayed in the cluster web UI.

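A minimal sketch; the application name "MyApp" is illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()
```
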
cache()
An Apache Spark transformation often used on a DataFrame, Dataset, or RDD when you want to perform multiple actions against the same data. cache() stores the DataFrame, Dataset, or RDD in the memory of your cluster's workers. Because cache() is lazy, the caching takes place only when a Spark action (for example, count(), show(), take(), or write()) runs on that same DataFrame, Dataset, or RDD.

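A minimal sketch, assuming an existing DataFrame `df`:

```python
df.cache()   # marks df for in-memory caching (lazy; nothing happens yet)
df.count()   # this action materializes the cache
df.show()    # subsequent actions reuse the cached data
```
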
count()
Returns the number of rows in a DataFrame, or the number of elements in an RDD.

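A minimal sketch, assuming an existing SparkContext `sc`:

```python
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.count())  # 4
```
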
createTempView()
Creates a temporary view that can later be used to query the data with SQL. The only required parameter is the name of the view.

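A minimal sketch, assuming an existing DataFrame `df` and SparkSession `spark`; the view name is illustrative:

```python
df.createTempView("people")
spark.sql("SELECT * FROM people").show()
```
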
filter()
Python's built-in filter() returns an iterator containing only the items for which the given function returns True.

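A minimal sketch using the built-in filter():

```python
numbers = [1, 2, 3, 4, 5]
evens = filter(lambda x: x % 2 == 0, numbers)
print(list(evens))  # [2, 4]
```
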
getOrCreate()
Gets an existing SparkSession or SparkContext, or instantiates a new one and registers it as a singleton object.

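A minimal sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuses an existing session if one exists
```
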
import
Used to make code from one module accessible in another. Effective use of imports lets you reuse code and keep your projects manageable.

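A minimal sketch:

```python
import math  # makes the math module's functions available here

print(math.sqrt(16))  # 4.0
```
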
len()
Returns the number of items in an object. When the object is a string, len() returns the number of characters in the string.

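A minimal sketch:

```python
print(len("hello"))    # 5
print(len([1, 2, 3]))  # 3
```
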
map()
Returns a map object (an iterator) of the results after applying the given function to each item of a given iterable (list, tuple, and so on).

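A minimal sketch:

```python
squares = map(lambda x: x * x, [1, 2, 3])
print(list(squares))  # [1, 4, 9]
```
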
pip
When you install a package, pip searches for it in the Python Package Index (PyPI), resolves any dependencies, and installs everything into your current Python environment.

pip install
The pip install <package> command looks for the latest version of the package and installs it.

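For example, installing a package from PyPI (the package name is illustrative):

```
pip install pandas
```
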
print()
Prints the specified message to the screen or other standard output device. The message can be a string or any other object; the object is converted into a string before being written.

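A minimal sketch:

```python
print("Hello, Spark!")  # writes the message to standard output
```
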
printSchema()
Prints the schema of the DataFrame or Dataset in tree format, showing each column's name and data type. If the DataFrame or Dataset has a nested structure, the schema is displayed as a nested tree.

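A minimal sketch, assuming an existing DataFrame `df`:

```python
df.printSchema()  # prints column names and types as a tree
```
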
sc.parallelize()
Creates a parallelized collection: distributes a local Python collection to form an RDD. For better performance, pass a range() object rather than a materialized list when the input represents a range of numbers.

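A minimal sketch, assuming an existing SparkContext `sc`:

```python
rdd = sc.parallelize(range(1, 1001))  # range() avoids building the full list locally
print(rdd.take(3))  # [1, 2, 3]
```
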
select()
Selects one or more columns from a DataFrame: single or multiple columns, nested columns, columns by index, all columns in a list, or columns matched by a regular expression. select() is a transformation and returns a new DataFrame containing the selected columns.

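A minimal sketch, assuming a DataFrame `df` with illustrative columns "name" and "age":

```python
df.select("name").show()
df.select(df["name"], df["age"] + 1).show()
```
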
show()
Displays the contents of a DataFrame in row-and-column table format. By default, it shows only the first twenty rows, and column values are truncated at twenty characters.

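A minimal sketch, assuming an existing DataFrame `df`:

```python
df.show()                   # first 20 rows, values truncated at 20 characters
df.show(5, truncate=False)  # first 5 rows, untruncated
```
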
spark.read.json()
Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. read.json() loads data from a directory of JSON files in which each line is a separate JSON object (JSON Lines format), not a typical multi-line JSON document.

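A minimal sketch; the file path is illustrative and the file is expected to hold one JSON object per line:

```python
df = spark.read.json("people.json")
df.show()
```
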
spark.sql()
To issue any SQL query, use the sql() method on the SparkSession instance. All spark.sql queries executed in this manner return a DataFrame, on which you may perform further Spark operations if required.

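A minimal sketch, assuming a temporary view named "people" has already been registered:

```python
result = spark.sql("SELECT name FROM people WHERE age > 21")
result.show()
```
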
time()
Returns the current time as the number of seconds since the Unix epoch.

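A minimal sketch for timing a piece of work:

```python
import time

start = time.time()
# ... some work ...
print(time.time() - start, "seconds elapsed")
```
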
DataFrames and Spark SQL

appName()
Sets a name for your application, which is displayed in the cluster web UI.

createDataFrame()
Used to load data into a Spark DataFrame.

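Creating a DataFrame, a minimal sketch; the data and column names are illustrative:

```python
data = [("Alice", 34), ("Bob", 45)]
df = spark.createDataFrame(data, ["name", "age"])
df.show()
```
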
createTempView()
Creates a temporary view that can later be used to query the data with SQL. The only required parameter is the name of the view.

fillna()
Used to replace NULL/None values in all or selected DataFrame columns with zero (0), an empty string, a space, or any constant literal value.

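Replacing NULL/None values in a DataFrame, a minimal sketch; the column names are illustrative:

```python
df.fillna(0).show()                       # replace with zero in all numeric columns
df.fillna({"name": "", "age": 0}).show()  # per-column replacement values
```
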
filter()
Python's built-in filter() returns an iterator containing only the items for which the given function returns True.

getOrCreate()
Gets an existing SparkSession or SparkContext, or instantiates a new one and registers it as a singleton object.

groupby()
Used to collect identical data into groups on a DataFrame and perform aggregate functions (count, sum, avg, min, max) on the grouped data.

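Grouping data and performing aggregation, a minimal sketch; the column names are illustrative:

```python
df.groupBy("department").count().show()
df.groupBy("department").avg("salary").show()
```
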
head()
Returns the first n rows of the object, based on position.

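Returning the first 5 rows, a minimal pandas sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"a": range(10)})
print(df.head(5))
```
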
import
Used to make code from one module accessible in another. Effective use of imports lets you reuse code and keep your projects manageable.

pd.read_csv()
Reads data from a CSV file into a pandas DataFrame.

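Reading data from a CSV file into a DataFrame, a minimal sketch; the file path is illustrative:

```python
import pandas as pd

df = pd.read_csv("data.csv")
print(df.head())
```
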
pip
When you install a package, pip searches for it in the Python Package Index (PyPI), resolves any dependencies, and installs everything into your current Python environment.

pip install
The pip install <package> command looks for the latest version of the package and installs it.

printSchema()
Prints the schema of the DataFrame or Dataset in tree format, showing each column's name and data type. If the DataFrame or Dataset has a nested structure, the schema is displayed as a nested tree.

rename()
Used to change row indexes and column labels.

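A minimal pandas sketch; the column names are illustrative:

```python
import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Rename columns
df = df.rename(columns={"A": "X", "B": "Y"})

# The columns 'A' and 'B' are now renamed to 'X' and 'Y'
print(df.columns.tolist())  # ['X', 'Y']
```
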
select()
Selects one or more columns from a DataFrame: single or multiple columns, nested columns, columns by index, all columns in a list, or columns matched by a regular expression. select() is a transformation and returns a new DataFrame containing the selected columns.

show()
Displays the contents of a DataFrame in row-and-column table format. By default, it shows only the first twenty rows, and column values are truncated at twenty characters.

sort()
Used to sort a DataFrame in ascending or descending order based on one or more columns.

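Sorting a DataFrame by a column in ascending order, and by multiple columns in descending order; the column names are illustrative:

```python
df.sort("age").show()                                # ascending by a single column
df.sort(df["age"].desc(), df["name"].desc()).show()  # descending by multiple columns
```
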
SparkContext()
An entry point to Spark, defined in the org.apache.spark package since version 1.x, used to programmatically create Spark RDDs, accumulators, and broadcast variables on the cluster.

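Creating a SparkContext, a minimal sketch; the master URL and application name are illustrative:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "MyApp")
```
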
SparkSession
An entry point to Spark. Creating a SparkSession instance is typically the first statement in a program that works with RDDs, DataFrames, and Datasets.

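Creating a SparkSession, a minimal sketch; the application name is illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()
```
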
spark.read.json()
Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. read.json() loads data from a directory of JSON files in which each line is a separate JSON object (JSON Lines format), not a typical multi-line JSON document.

spark.sql()
To issue any SQL query, use the sql() method on the SparkSession instance. All spark.sql queries executed in this manner return a DataFrame, on which you may perform further Spark operations if required.

spark.udf.register()
Registers a user-defined function (UDF) with Spark, making it accessible in Spark SQL queries. This lets you apply custom logic or operations to DataFrame columns using SQL expressions.

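Registering a UDF (user-defined function), a minimal sketch; the function name and the "people" view are illustrative:

```python
from pyspark.sql.types import StringType

def to_upper(s):
    return s.upper() if s is not None else None

spark.udf.register("TO_UPPER", to_upper, StringType())
spark.sql("SELECT TO_UPPER(name) FROM people").show()
```
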
where()
Used to filter rows of a DataFrame based on a given condition. filter() and where() serve the same purpose on a DataFrame.

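Filtering rows based on a condition, a minimal sketch; the column name is illustrative:

```python
df.where(df["age"] > 21).show()
df.filter(df["age"] > 21).show()  # equivalent
```
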
withColumn()
A DataFrame transformation used to change the value of an existing column, convert its data type, create a new column, and more.

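Adding a new column and converting a data type, a minimal sketch; the column names are illustrative:

```python
from pyspark.sql.functions import col

df = df.withColumn("age_plus_one", col("age") + 1)    # create a new column
df = df.withColumn("age", col("age").cast("double"))  # convert an existing column's type
```
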
withColumnRenamed()
Returns a new DataFrame with an existing column renamed.

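Renaming an existing column, a minimal sketch; the column names are illustrative:

```python
df = df.withColumnRenamed("dob", "date_of_birth")
```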