
When we learn new skills like programming, there’s a tendency to unknowingly pick up some bad habits. The problem is that unless someone tells you that you’re doing it wrong, those bad habits may stick with you in the long run. This post will cover the good coding habits that you can develop early, so that you don’t have to break bad ones later on. 


Practice Makes Perfect

Practice programming regularly - every day if possible, but it’s totally fine to skip a day or two. Try to get out of “tutorial hell” as soon as possible and instead work on a personal project in small daily chunks. It doesn’t matter what the project is, as long as it’s challenging and interesting enough to keep you motivated.


The first personal projects that I did automated daily tasks at work, such as producing weekly revenue reports and sending reminder emails to clients with overdue invoices. When my friend group did Secret Santa a few years ago, I wrote a quick Google Apps Script to randomly assign people and then send everyone an email notification letting them know who their Secret Santa was.


I’ve gradually challenged myself with larger projects like gohan_dousuru, an automation project to generate meal plans and grocery lists for daily cooking.


Here are some of the personal projects on my list, in order from easiest to most difficult.


  1. Create an email newsletter template in HTML.

  2. Add a cumulative visitor counter to my personal website.

  3. Build a web scraper that texts me when a new issue of my favourite manga comes out.

  4. Build an app to manage my kitchen pantry inventory.

  5. Build a computer adaptation of a card or board game like Quarto, Battle Line, or poker. ※ I am not an expert on copyright law, but be careful with fan games because they could cause legal trouble. Do not make them distributable, for example by putting the code in a public GitHub repo and sharing a link.


Even if you’re just working alone on a hobby project, it’s good practice to use a version control system like Git and do atomic commits. 



If possible, find someone to do pair programming, which often results in cleaner and more efficient code. Try to also get in the habit of using loops and dictionaries gracefully (rather than copy-pasted blocks or dynamically named variables) for more elegant code.
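For example, a minimal sketch (the revenue figures here are made up) of how a dictionary and a loop can replace a series of numbered variables and near-identical statements:

# Instead of report_jan, report_feb, report_mar ... keep related values in a dictionary
monthly_revenue = {
    "Jan": 12_500,
    "Feb": 11_800,
    "Mar": 14_200,
}

# A single loop replaces three near-identical print statements
for month, revenue in monthly_revenue.items():
    print(f"{month}: ${revenue:,}")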


Finally, well-documented code is a valuable asset to any team. If you publish on GitHub, include a README file that concisely explains how to set up the software. The documentation can also include use cases, debugging scenarios, and acknowledgements for any resources that you found helpful and would like to credit.

Forecasting with Facebook Prophet

Prophet is an open-source Python library for predictive data analysis, developed by Facebook's Core Data Science Team and released in 2017. Compared to other models like ARIMA, which assume a linear relationship between past and future values, Prophet's strength is that it automatically detects "changepoints" and adjusts the forecast accordingly, making it robust to outliers and flexible in handling seasonality and non-linear trends.

 

Prophet is a statistical model rather than a machine learning model. It uses an additive model to capture the key components of time series data: trend, seasonality, and holidays. The approach relies on curve fitting rather than iterative learning of parameters. Although Prophet doesn’t inherently use separate training and test datasets, you can manually split your data into a historical period for fitting the model and a held-out period for evaluating its performance.
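A rough sketch of such a manual split (the CSV filename and the 52-week holdout are arbitrary choices; Prophet expects 'ds' and 'y' columns, and the import below uses the current prophet package name rather than the older fbprophet):

import pandas as pd
from prophet import Prophet

# Prophet expects a dataframe with a 'ds' (date) column and a 'y' (value) column
df = pd.read_csv("weekly_revenue.csv", parse_dates=["ds"])

# Hold out the last 52 weeks for evaluation and fit on everything before that
train = df.iloc[:-52].copy()
test = df.iloc[-52:].copy()

model = Prophet()
model.fit(train)

# Forecast over the held-out period and compare 'yhat' against test['y']
future = model.make_future_dataframe(periods=52, freq="W")
forecast = model.predict(future)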

 

How to Improve Forecasting Accuracy

Oddly enough, one of the main draws of Prophet is also one of its core weaknesses: its approach to handling changepoints can result in either underfitting or overfitting.

 

Here are the methods that I previously used when implementing Prophet for weekly revenue forecasting.


Plot the Historical Data

This isn't specific to Prophet, but it's best practice to do a simple plot of your historical data before jumping into the forecast. The visual may help you to notice any unexpected trends or gaps that could skew the forecast.
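A minimal sketch with matplotlib, assuming the same ds/y dataframe as above:

import matplotlib.pyplot as plt

# Quick visual check for gaps, level shifts, and obvious outliers
plt.figure(figsize=(10, 4))
plt.plot(df["ds"], df["y"])
plt.title("Weekly revenue - historical data")
plt.xlabel("Week")
plt.ylabel("Revenue")
plt.show()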


Remove Known Noise Data

Although Prophet is inherently robust to outliers, you should still exclude data that is obviously "wrong" or unhelpful, such as data from internal test campaigns or an unwanted spike in reseller activity. It's usually easier to do this data cleansing before putting the data into Prophet.
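One convenient option, since Prophet tolerates missing values in the history, is to blank out the known-bad rows rather than drop them (the test-campaign date range below is made up):

import numpy as np

# Weeks affected by a hypothetical internal test campaign
bad_dates = (df["ds"] >= "2022-03-01") & (df["ds"] <= "2022-03-14")

# Prophet ignores NaN values in 'y' when fitting, but still produces forecasts for those dates
df.loc[bad_dates, "y"] = np.nan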


External Regressors

Prophet is designed for univariate time series analysis, but additional explanatory variables can be added as external regressors.


The business that I was looking at did sales at seemingly random times throughout the year, so I couldn’t just use holiday or yearly seasonality. Instead, I relied almost solely on variables (“flags”) to indicate whether a sale occurred that week, and whether a sale started or ended that week.

model.add_regressor('sale_y_n', prior_scale=40.0)
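The regressor column has to be present both in the training data and in the future dataframe. Expanding the snippet above into a rough end-to-end sketch (assuming the CSV from the earlier sketch also contains a sale_y_n column, and with a made-up planned-sale calendar standing in for however you know your future sale dates):

model = Prophet()
# Flag indicating whether any part of the week fell inside a sale period
model.add_regressor('sale_y_n', prior_scale=40.0)
model.fit(train)  # 'train' must already contain a 'sale_y_n' column

# Hypothetical list of future week-start dates with a planned sale
planned_sale_weeks = pd.to_datetime(["2024-11-24", "2024-12-01"])

# The future dataframe also needs the regressor filled in
future = model.make_future_dataframe(periods=52, freq="W")
# Historical weeks keep their observed flag; future weeks come from the planned sale calendar
future = future.merge(train[["ds", "sale_y_n"]], on="ds", how="left")
future["sale_y_n"] = future["sale_y_n"].fillna(
    future["ds"].isin(planned_sale_weeks).astype(int)
)

forecast = model.predict(future)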

For multiple related output variables, model each one separately as a univariate time series, or use the forecast values of one variable as inputs to the next. You can also consider more advanced models like VAR or LSTM.


Hyperparameter Tuning

In the previous code snippet, I set prior_scale=40.0, indicating that this regressor should be weighted heavily.


Prophet sets the hyperparameters below by default, so you may need to tune them to balance underfitting and overfitting.

changepoint_range=0.8
changepoint_prior_scale=0.05
seasonality_prior_scale=10
holidays_prior_scale=10
fourier_order=10
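A rough grid-search sketch using Prophet's built-in cross-validation utilities (the parameter grid, windows, and horizon below are illustrative, not recommendations):

import itertools
from prophet import Prophet
from prophet.diagnostics import cross_validation, performance_metrics

param_grid = {
    "changepoint_prior_scale": [0.01, 0.05, 0.5],
    "seasonality_prior_scale": [1.0, 10.0],
}

results = []
for cps, sps in itertools.product(*param_grid.values()):
    m = Prophet(changepoint_prior_scale=cps, seasonality_prior_scale=sps)
    m.add_regressor("sale_y_n", prior_scale=40.0)
    m.fit(train)
    # Rolling-origin cross-validation: 2 years of initial history, 52-week forecast horizon
    df_cv = cross_validation(m, initial="104 W", period="13 W", horizon="52 W")
    rmse = performance_metrics(df_cv)["rmse"].mean()
    results.append((cps, sps, rmse))

# Keep the combination with the lowest average RMSE
best_cps, best_sps, best_rmse = min(results, key=lambda r: r[2])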

Data Normalization

Since Prophet is a statistical model, it does not require the same level of feature scaling as machine learning models. However, some normalization may be beneficial depending on what you’re working with. In my original dataset, the sale_y_n variable was an integer between 0 and 7, indicating how many days of that week fell within a sale period. The raw scale of 0 to 7 may introduce bias into the forecast, so I first normalized the sale_y_n variable using scikit-learn’s MinMaxScaler.
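Roughly, continuing the sale_y_n example from above (and remembering to apply the same fitted scaler to the future dataframe):

from sklearn.preprocessing import MinMaxScaler

# Rescale the 0-7 "days on sale this week" count to the 0-1 range
scaler = MinMaxScaler()
train["sale_y_n"] = scaler.fit_transform(train[["sale_y_n"]]).ravel()
future["sale_y_n"] = scaler.transform(future[["sale_y_n"]]).ravel()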

 

How to Evaluate the Prediction Accuracy of Facebook Prophet Forecasts

By default, Prophet uses an 80% uncertainty interval (interval_width=0.8), meaning there’s an 80% chance that the actual value will fall between yhat_lower and yhat_upper. However, Prophet’s performance can be hit or miss depending on the use case. If the forecasts still aren’t accurate enough, try multiple approaches and keep the model that performs best on cross-validation, using the following metrics to evaluate accuracy (a small sketch follows the list).

  • Mean absolute error (MAE): measures the average magnitude of errors, using the same unit as the data

  • Root mean squared error (RMSE): measures the square root of the average of squared differences. Squaring puts more weight on larger errors, so RMSE is a useful metric when larger errors are especially costly. What constitutes a "good" RMSE depends on the scale of the data.

  • Mean absolute percentage error (MAPE): measures the average of the absolute percentage errors, expressed as a percentage. Generally, MAPE under 10% is considered very good, 10-20% is considered good, and even up to 50% can be acceptable in some use cases.
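A small sketch computing all three metrics against the 52-week holdout from the earlier example (the alignment assumes the last 52 rows of the forecast correspond to the test period):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Align forecast values with the held-out actuals
y_true = test["y"].to_numpy()
y_pred = forecast["yhat"].iloc[-52:].to_numpy()

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(f"MAE: {mae:,.0f}  RMSE: {rmse:,.0f}  MAPE: {mape:.1f}%")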


Follow the parsimony gradient: start with the simplest model and add complexity only as needed. The simplest option is a naive model, or a seasonal naive model if you have significant seasonality or repetitive sales patterns. Another straightforward method is a simple average or exponential smoothing with seasonal adjustments. For greater flexibility, look into machine learning models such as XGBoost, LightGBM, or a Random Forest Regressor.
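As a baseline sketch, assuming weekly data with a 52-week season as in the earlier examples:

import numpy as np

# Naive baseline: repeat the last observed value over the forecast horizon
naive_pred = np.repeat(train["y"].iloc[-1], 52)

# Seasonal naive baseline: repeat the same weeks from the previous year
seasonal_naive_pred = train["y"].iloc[-52:].to_numpy()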


Data Science with Databricks and Apache Spark

Apache Spark is the Taylor Swift of big data software. The open source technology has been around and popular for a few years. But 2015 was the year Spark went from an ascendant technology to a bona fide superstar.

In this post, I will introduce the key concepts and terminology used when working on data science projects with Databricks and Apache Spark.


Data Science Concepts and Terminology

  • Apache Spark: general-purpose distributed data engine for large-scale processing, with libraries for SQL, streaming, machine learning, and graph processing

  • API: acronym for application programming interface, a software intermediary that allows 2 or more different software components to connect and transfer data to each other

    • REST API: acronym for representational state transfer, a flexible API architectural style that allows applications to transfer data in multiple formats

    • SOAP API: acronym for simple object access protocol, a highly structured API protocol that uses the XML data format

  • CTE: acronym for common table expression; in SQL, a temporary named result set defined with a WITH statement that can be referenced, possibly multiple times, within a query

  • Data lake: large repository of raw data of various types, including structured data from relational databases, semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video)

  • Data lakehouse: platform, such as Databricks or Snowflake, that combines a data lake and a data warehouse to enable data analytics, machine learning, and storage of various data types

  • Data warehouse: organized set of structured data, designed for data analytics 

  • Delta Lake: open-source table format for data storage, supporting ACID transactions, high-performance query optimizations, schema evolution, and data versioning 

  • ETL: acronym for extract, transform, and load, the process of combining data from multiple sources into a data warehouse 

  • Hadoop: an open source framework based on Java that manages the storage and processing of large amounts of data for applications

  • JDBC connection: acronym for Java Database Connectivity, the Java API that manages connecting to a database, issuing queries and commands, and handling result sets

  • JSON: acronym for JavaScript Object Notation, a text-based format for storing and exchanging data in a way that's both human-readable and machine-readable

  • Jupyter Notebook: open document format based on JSON, used by data scientists to record code, equations, visualizations, and other computational outputs

  • MLflow: open-source platform for managing the machine learning lifecycle, including experiment tracking, model packaging, and deployment

  • Looker: business intelligence platform that generates SQL queries and submits them against a database connection, for data analytics and reporting

  • Regular expression: a sequence of characters that specifies a match pattern in text, usually used for “find” or “find and replace” operations on strings

  • PySpark: the Python API for Apache Spark, supporting all of Spark’s features including Spark SQL, DataFrames, machine learning (MLlib), and Spark Core

  • ODBC: acronym for open database connectivity, a standard API for accessing database management systems

  • UDF: acronym for user-defined function, which allows custom logic to be reused in the user environment (see the short PySpark sketch after this list)

  • XML: acronym for eXtensible Markup Language, a markup language for storing, transmitting, and reconstructing arbitrary data, which defines a set of rules for encoding documents in a format that is both human-readable and machine-readable
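To make a couple of the terms above concrete, here is a small PySpark sketch (the column values are made up) that registers a UDF and applies it to a DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf_example").getOrCreate()

df = spark.createDataFrame(
    [("alice@example.com",), ("bob@example.org",)], ["email"]
)

# User-defined function that extracts the domain from an email address
extract_domain = udf(lambda email: email.split("@")[-1], StringType())

df.withColumn("domain", extract_domain("email")).show()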


Databricks Concepts and Terminology

  • ADF: acronym for Azure Data Factory, a cloud-based service that simplifies data integration and orchestration by moving data from diverse sources such as on-premises databases, cloud platforms, and SaaS applications

  • Cluster: set of computation resources and configurations on which you run notebooks and jobs

  • DBFS: acronym for Databricks File System, a distributed file system on Databricks that interacts with cloud-based storage

  • DBU: acronym for Databricks Unit, a unit of processing capacity billed on per-second usage

  • Overwatch: monitoring tool that collects logs and writes to the Overwatch database nightly, so you can see the daily usage of each cluster

  • Photon: high-performance Azure Databricks-native vectorized query engine that runs your SQL workloads and DataFrame API calls faster 


