Data science and machine learning in Python

Course length: 3 days (24 hours)

Description: Python is well known as a programming language used in numerous domains, from system administration to Web development to test automation. In recent years, Python has become a leading language in data science and machine learning. Whether you’re looking through logfiles, calculating statistics, finding similarities between documents, or identifying buying patterns among your customers, Python provides numerous tools to solve these problems.

This course introduces the concepts of data science and machine learning, using widely adopted open-source tools built in Python. This is not a course in the Python language; participants must already have a good understanding of Python’s basics, including the built-in types, functions, list/set/dict comprehensions, and objects. The course concentrates on the practical use of data-science libraries to understand existing data and to make predictions with new data.

Participants in this course will learn about NumPy, SciPy, Pandas, and Matplotlib as the building blocks, and how to use them to import, transform, and filter data, all essential day-to-day aspects of a data scientist’s job.
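As a rough illustration of the import, transform, and filter steps mentioned above, here is a minimal sketch using Pandas. The data, column names, and threshold are hypothetical and are not taken from the course materials; the CSV text is inlined only so the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical sales records, inlined so the example needs no external file.
csv_text = """customer,region,amount
alice,north,120.0
bob,south,35.5
carol,north,80.0
dave,south,210.0
"""

# Import: read CSV text into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Transform: derive a new integer column from an existing one.
df["amount_cents"] = (df["amount"] * 100).round().astype(int)

# Filter: keep rows that satisfy both conditions (boolean indexing).
large_north = df[(df["region"] == "north") & (df["amount"] > 100)]

print(large_north)
```

In practice the same pattern applies when reading from a file path instead of an in-memory string.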

The course then turns to machine learning, using the powerful, popular, open-source “scikit-learn” library to create computer-based statistical models that allow us to make predictions. The course looks at both supervised and unsupervised learning, using real-world data from a variety of domains. Participants will learn how to work with numeric, categorical, and textual data.

An important part of developing machine-learning models is testing them, to ensure that they aren’t “overfit” to the training data. Participants will learn what tools scikit-learn provides to test their models, and to check which of their models are the most appropriate.
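One common way to check for overfitting, sketched here under assumptions, is to compare a model's score on the data it was trained on with its score on held-out data, and to cross-validate. The choice of the bundled iris data set, a decision-tree classifier, and a 70/30 split is illustrative only and is not prescribed by the course.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small public data set that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows so the model is scored on data it never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# A large gap between these two accuracies suggests overfitting.
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

# Cross-validation repeats the train/test idea over 5 different splits.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print(f"train accuracy: {train_score:.3f}")
print(f"test accuracy:  {test_score:.3f}")
print(f"mean CV accuracy: {cv_scores.mean():.3f}")
```

The same comparison can be repeated for several candidate models to see which generalizes best to unseen data.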

Participants in this course will be introduced to each of these libraries, how to use them, and (perhaps most importantly) when it is appropriate to use them. There will be numerous hands-on exercises during the course, in which participants will be asked to solve problems using the Python libraries that have just been introduced. Many of these exercises will use real-world data sets and logfiles.

By the end of the course, participants should understand which Python tools are available, and feel comfortable using them in various types of data analysis.

Audience: This course is aimed at programmers who have day-to-day practical experience working with Python. A basic understanding of statistics will be useful, but not mandatory.

Course syllabus

  • Overview of data science in Python
  • Jupyter notebook as an environment for data-science work (and collaboration)
  • NumPy
    • Data structures
    • Operations
    • Sorting, searching, and retrieving
    • Boolean indexing techniques
  • SciPy
  • Pandas
    • Series
    • DataFrame
    • Importing and exporting data
    • Filtering data by row and column
    • Data manipulation
    • Time series
  • Matplotlib
    • Chart types
    • Chart styles
    • Output to a file vs. the screen
    • Multiple plots
    • Standalone Matplotlib vs. integrated with Pandas
  • Machine learning
    • What is machine learning?
    • scikit-learn
    • Retrieving and using public data sets
  • Feature selection
    • What is it?
    • Why is it important?
    • What tools does scikit-learn provide to identify features?
  • Standardization of data
  • Machine-learning algorithms
    • How to choose
    • Why you shouldn’t be too confident
  • Supervised machine learning
    • Training
    • Predicting
    • Avoiding overfit models
    • Evaluating model success
    • Comparing models
    • Using the same classifier with different hyperparameters
  • Supervised classification problems
    • Simple classification
    • Textual classification
  • Supervised regression problems
  • Feature selection
  • Clustering with unsupervised learning
  • Future trends in machine learning