Here are my worked examples from the very useful LinkedIn Learning course PySpark by Example by Jonathan Fernandes: https://www.linkedin.com/learning/apache-pyspark-by-example
Over the past 12 months or so I have been learning and playing with Apache Spark. I went through the brilliant book by Bill Chambers and Matei Zaharia, Spark: The Definitive Guide, which covers Spark in depth and gives plenty of code snippets you can try out in the spark-shell. Whilst the book is indeed very detailed and provides great examples, the datasets included for you to get your hands on are only on the order of megabytes (with the exception of the activity-data dataset used for the Streaming examples).
Calling compiled Scala code inside the JVM from Python using PySpark
There is no doubt that Java and Scala are the de facto languages for data engineering, whilst Python is certainly the front runner as the language of choice for data scientists. Spark, a framework for distributed data analytics, is written in Scala but can also be used from Python, R and Java. Interoperability between Java and Scala is a no-brainer since Scala compiles down to Java bytecode. Calling Scala from Python is a little more involved, but the process is still simple.
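As a rough illustration of the idea, here is a minimal sketch of calling a compiled Scala method from PySpark through the Py4J gateway that the driver exposes. The Scala object com.example.Greeter and its greet method are hypothetical names used only for this example, and the sketch assumes the jar containing them is already on the driver classpath (for instance via the spark.jars setting or the --jars option of spark-submit).

```python
def call_scala_greeting(spark, name):
    """Call the (hypothetical) Scala method com.example.Greeter.greet.

    `spark` is an existing SparkSession whose driver JVM already has the
    jar with the compiled Scala code on its classpath (an assumption).
    """
    # _jvm is the Py4J view of the driver JVM that PySpark exposes.
    jvm = spark.sparkContext._jvm
    # Attribute access walks the package path; the call itself runs in the JVM
    # and Py4J converts the returned java.lang.String to a Python str.
    return jvm.com.example.Greeter.greet(name)
```

With a real session this would be used as call_scala_greeting(spark, "Spark"). Note that _jvm is an internal attribute rather than a public API, so this is a sketch of the approach and not a guaranteed interface.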
This post walks through the steps involved if you want to fork a public GitHub repository privately. It shows how to take an open public repository and mirror it in a private repository on GitHub.
These steps were inspired by the 'Mirroring a repository' guide in the GitHub documentation.
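The mirroring steps can be sketched as a small Python helper that wraps the two git commands involved: a bare clone of the public repository followed by a mirror push to the private one. This is only a sketch under assumptions: the function names, the working directory name and the URLs are placeholders for illustration, and the empty private repository is assumed to exist on GitHub already.

```python
import subprocess


def mirror_commands(public_url, private_url, workdir="public-mirror.git"):
    """Return the git commands that mirror public_url into private_url."""
    return [
        # A bare clone copies every branch and tag without a working tree.
        ["git", "clone", "--bare", public_url, workdir],
        # A mirror push sends all refs to the (already created) private repo.
        ["git", "-C", workdir, "push", "--mirror", private_url],
    ]


def run_mirror(public_url, private_url, workdir="public-mirror.git"):
    """Run the commands in order, stopping at the first failure."""
    for cmd in mirror_commands(public_url, private_url, workdir):
        subprocess.run(cmd, check=True)
```

Calling run_mirror with the public and private repository URLs performs the mirror. The temporary bare clone in workdir can be deleted afterwards.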