Duration: 1-3 days (8-24 hours)
Python is the top language for data science, and Pandas is the Python library that people turn to most often for working with data. It combines Python’s ease of use and flexibility with the speed and efficiency of C.
It isn’t hard to learn Pandas and to do some basic analysis with it. But getting the most out of Pandas can be hard: the library is huge and offers a wide variety of features, and it’s often not obvious which is the most efficient and readable way to solve a problem. This has become particularly true in the last few years, now that Pandas supports more than one storage back end, with Apache Arrow available alongside NumPy.
Most Pandas courses aim to teach you new techniques. By contrast, this course is designed to help you gain fluency and understanding in techniques you have probably learned, but which you haven’t necessarily had a chance to explore in depth. It assumes that participants have already taken a Pandas course.
The course consists of a series of hands-on exercises. Each exercise will require the use of one or more Pandas techniques. After each exercise, we will review and discuss each other’s solutions. The instructor’s role will be to help participants as they work on the exercises, to explain techniques that participants might not understand well, to walk through the solution to each exercise, and to facilitate the discussion session for each exercise.
The exercises all use real-world data sets, and generally have several parts. Topics covered include (but are not limited to):
- Choosing dtypes
- Reading data from CSV and Excel files
- Cleaning data
- Sorting
- Grouping
- Pivot tables
- Selecting rows and columns with .loc
- Multi-indexes
- Boolean indexing
- Dates and times
- String manipulation
- Plotting
- NumPy vs. Apache Arrow
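
To give a feel for how several of these topics combine in a single exercise, here is a minimal sketch. It is not taken from the course materials: the data frame, its column names, and its values are all invented for illustration, whereas the course exercises use real-world data sets. It touches on choosing dtypes, boolean indexing with .loc, grouping, and pivot tables.

```python
import pandas as pd

# Hypothetical sample data, invented for this sketch.
df = pd.DataFrame(
    {
        "city": ["Paris", "Paris", "Rome", "Rome", "Oslo"],
        "year": [2022, 2023, 2022, 2023, 2023],
        "visitors": [120, 150, 90, 110, 40],
    }
)

# Choosing dtypes: store the repeated strings as a category,
# and use smaller integer types than the default int64.
df = df.astype({"city": "category", "year": "int16", "visitors": "int32"})

# Boolean indexing with .loc: keep rows from 2023 and two columns only.
recent = df.loc[df["year"] == 2023, ["city", "visitors"]]

# Grouping: total visitors per city.
totals = df.groupby("city", observed=True)["visitors"].sum()

# Pivot table: cities as rows, years as columns.
pivot = df.pivot_table(
    index="city",
    columns="year",
    values="visitors",
    aggfunc="sum",
    observed=True,
)

print(recent)
print(totals)
print(pivot)
```

Each step here could be written in other ways, for example with query instead of a boolean mask, or with groupby and unstack instead of pivot_table. Comparing such alternatives for readability and efficiency is the kind of discussion that follows each exercise.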