Pydata Chicago - Work Hard Once - Python Automation

Apr 2

I attended the Pydata meetup at Arity (222 W Merchandise Mart Plaza, Suite 875, Chicago, IL). Franklin Sarkett gave a great talk, and I also learned a lot from the attendees who asked insightful questions. Thanks to the organizers and to Arity for hosting a great event. Everyone enjoyed the food and the open, innovative environment (Arity is the startup Allstate spun out in 2016, focused on flagging risky drivers).

https://www.meetup.com/PyDataChi/events/249043774/

Abstract: Python is an incredible tool for data science, data engineering, and devops. I would like to present on how to create a tech stack that brings together the different pieces of data science with automation.

Bio: Franklin Sarkett is the cofounder of Audantic Real Estate Analytics. The company predicts home sales for its clients using machine learning algorithms. Previously, Franklin was a data scientist at Facebook and developed an algorithm for the Ads Payments team that increased revenue over $200 million and earned a patent. He has a CS degree from University of Illinois at Urbana Champaign, and a masters in Applied Statistics from DePaul University.

Link to the slides here
Video presentation below

Pydata Chicago - work hard once

Work Hard Once: Strategy and Automation applied to building machine learning models. Franklin Sarkett, April 2, 2018
About me: franklin.sarkett@gmail.com Audantic Real Estate Analytics, co-founder ● http://audantic.com/ ● Audantic provides customized data, analytics, and predictive analytics for residential real estate. Facebook ● Data scientist at Facebook; developed an algorithm for the Ads Payments team that increased revenue by over $200 million and earned a patent. Education ● CS degree from University of Illinois at Urbana Champaign ● MS in Applied Statistics from DePaul University.
Summary Building machine learning models from data ingestion to productionalization is challenging, with many steps. Of all the steps, feature engineering is the biggest differentiator between models that work and models that do not. Using automation and strategy we can remove some of the most challenging parts, and focus on the area of machine learning that generates the most value: feature engineering.
John Boyd and the OODA Loop The OODA loop is the decision cycle of observe, orient, decide, and act, developed by military strategist and United States Air Force Colonel John Boyd. Boyd applied the concept to the combat operations process. It is now also often applied to understand commercial operations and learning processes. The approach favors agility over raw power in dealing with human opponents in any endeavor. - Wikipedia
Orient (most important) "Orient" is the key to the OODA loop. Since one is conditioned by one's heritage, surrounding culture, existing knowledge and learnings, the mind combines fragments of ideas, information, conjectures, impressions, etc. to generate our orientation. How well your orientation matches the real world is largely a function of how well you observe.
Stages of Machine Learning: Get raw data (sql, csv, API) → Data cleaning → Feature engineering → Model training → Model evaluation → Deployment, laid out against the OODA stages Observe → Orient → Decide → Act
Two guiding thoughts A mentor of mine at FB was coaching me on our model building: building models requires domain knowledge, and you should put as much data into the model as you can. To improve the models, you need to improve: ● Data quality ● Data volume ○ Breadth ○ Depth Addressing these concerns takes Feature Engineering to the next level.
Automating the Observe stage Many of the tasks in the observe stage could be classified as DevOps and Data Engineering. My favorite tools to use for data science: ● Docker ● Jenkins ● Luigi
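One way to picture "work hard once" in the Observe stage is an ingestion step that does nothing if its output already exists. Luigi treats a task as complete once its output target exists, and the sketch below imitates that idea in plain Python. It is not Luigi's API and not the speaker's code. The function name, file name, and sample rows are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def ingest_raw_data(source_rows, output_path):
    """Write raw rows to output_path unless the file already exists.

    Returns True if the work was done, False if the step was skipped.
    """
    output_path = Path(output_path)
    if output_path.exists():
        # Output already present: treat the step as complete and do nothing.
        return False
    tmp_path = output_path.with_name(output_path.name + ".tmp")
    with tmp_path.open("w", newline="") as handle:
        csv.writer(handle).writerows(source_rows)
    # Rename only after a full write, so a crash never leaves a partial output.
    tmp_path.replace(output_path)
    return True


# Hypothetical raw rows standing in for a sql/csv/API pull.
rows = [["address", "price"], ["1 Main St", "250000"], ["2 Oak Ave", "310000"]]

with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "raw_sales.csv"
    first_run = ingest_raw_data(rows, target)   # does the work
    second_run = ingest_raw_data(rows, target)  # skipped: output exists
    line_count = len(target.read_text().splitlines())
```

Because reruns are cheap and safe, a scheduler such as Jenkins can trigger the whole chain repeatedly without redoing finished steps.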
Orient - Feature Engineering “Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering.” — Prof. Andrew Ng.
Orient - Feature Engineering “The algorithms we used are very standard for Kagglers. …We spent most of our efforts in feature engineering. … We were also very careful to discard features likely to expose us to the risk of over-fitting our model.” — Xavier Conort
Orient - Feature Engineering “Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.” — Dr. Jason Brownlee
Orient - Feature Engineering “At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used... It is often also one of the most interesting parts, where intuition, creativity and ‘black art’ are as important as the technical stuff.” - Pedro Domingos, Prof. of CS at University of Washington
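To make "transforming raw data into features" concrete, here is a minimal sketch using only the standard library. It is independent of the speaker's linked snippet. The records and field names (`sale_date`, `last_sale_price`, `sqft`, `year_built`) are hypothetical, chosen only to resemble residential real estate data.

```python
from datetime import date

# Hypothetical raw records, as they might arrive from a sql/csv/API pull.
raw_sales = [
    {"sale_date": "2017-06-15", "last_sale_price": 300000, "sqft": 1500, "year_built": 1987},
    {"sale_date": "2018-01-20", "last_sale_price": 450000, "sqft": 2000, "year_built": 2008},
]


def engineer_features(record):
    """Derive model-ready features from one raw record."""
    sale_date = date.fromisoformat(record["sale_date"])
    return {
        # A ratio often carries more signal than either raw column alone.
        "price_per_sqft": record["last_sale_price"] / record["sqft"],
        # Domain knowledge: the age of the home at sale time.
        "age_at_sale": sale_date.year - record["year_built"],
        # Seasonality: month of sale as a numeric feature.
        "sale_month": sale_date.month,
    }


features = [engineer_features(record) for record in raw_sales]
```

Each derived column encodes a piece of domain knowledge that the raw columns only imply, which is the point the quotes above are making.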
Code snippet http://bit.ly/PyDataChi-FeatureEngineering
How do we iterate feature engineering faster? ● Create a pipeline of transforms with a final estimator. ● Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. ● Benefits: ○ Convenience and encapsulation. You only have to call fit and predict once on your data to fit a whole sequence of estimators ○ Safety. Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
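The description above reads like scikit-learn's `Pipeline`, so the sketch below assumes that library. It chains the three example steps named on the slide (feature selection, normalization, classification). The iris dataset, the step names, and the specific estimators are illustrative choices, not taken from the talk.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Transforms first, final estimator last.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=2)),  # feature selection
    ("scale", StandardScaler()),                         # normalization
    ("clf", LogisticRegression(max_iter=1000)),          # classification
])

# Convenience: one fit and one predict drive the whole sequence.
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
test_accuracy = pipe.score(X_test, y_test)

# Safety: in cross-validation the selector and scaler are refit on each
# training fold only, so held-out fold statistics do not leak into the model.
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
```

Swapping a step, for example trying a different selector or adding a custom feature transformer, changes one entry in the list while the fit and evaluation code stays the same, which is what makes iterating on features faster.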
Feature extraction
The pipeline
Summary Building machine learning models from data ingestion to productionalization is hard. Using automation and strategy we can remove some of the most challenging parts, and focus on the area of machine learning that generates the most value: feature engineering. When we use automation and strategy to remove the most challenging parts of machine learning, we can run through more OODA loops faster, generate better models, learn more about our subject, and deliver more value.
franklin.sarkett@gmail.com

Web Master

https://www.technologyx2.com
