Practical Machine Learning with Python
As part of my MSc I'm taking a short course in Practical Machine Learning via QA.com.
The first three days were just about basic stats visualisation using Python. It was great to have a refresher - but I would have expected that to be a prerequisite.
The tutor was excellent - very patient in explaining complex concepts. And the use of Jupyter Notebooks is a game-changer for taught courses like this.
Ultimately, it was a useful course - although I expected a lot more time to be spent on training machine learning models rather than on the underlying statistics.
These are mostly notes to myself to help consolidate my knowledge - and to provide some more information on the course itself if you are thinking of taking it.
Day 1
All done in Python, pretty standard.
Python
- Create: `python3 -m venv SomeName`
- Activate: `source path/to/SomeName/bin/activate`
- Install packages: `pip install -U whatever`
- Run: `python`
- To exit the venv: `deactivate`
Jupyter Notebooks
- In the venv: `pip install -U jupyterlab`
- Run it with: `jupyter notebook`
- Don't forget to "Close and Halt" to stop the notebooks running in the background.
- Don't `pip install` from within Jupyter.
Anaconda
Possibly the easiest way to do everything (debatable!)
- Install Anaconda for Linux
Why
The process of automatically extracting meaning from data.
Data can be raw and unstructured. Lots of modern data is structured - e.g. tabular, database. But unstructured data - mostly media - isn't easy to classify and extract information from.
Exponential growth of data - a commonly cited figure is that 90% of the world's data was created in the last two years. Rise of sensor data, etc.
Use of modern data science tools to invite others to reproduce your results.
Data Science = turn data into a valuable asset: gain insight, make decisions, take action. Data Analysts explain what happened; Data Scientists predict and visualise.
Python basics and Jupyter basics
Different data types. Tuples, Dicts.
Some reasons not to use notebooks.
Basic Python and Markdown syntax.
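A quick illustration of the data types the course touched on (nothing course-specific here, just standard Python):

```python
# Tuples are immutable, ordered sequences
point = (3, 4)
x, y = point            # tuple unpacking

# Dicts map keys to values
scores = {"alice": 10, "bob": 7}
scores["carol"] = 9     # add a new key

print(x, y)             # 3 4
print(scores["alice"])  # 10
```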
Stats
The usual intro to mean, median, mode, and basic statistical techniques. Basically enough to make sure everyone is on the same page.
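Python's standard library covers these basics without any third-party packages, so a sketch of the day-one material might look like:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]

print(statistics.mean(data))    # arithmetic mean: 5
print(statistics.median(data))  # middle value: 4 (average of 3 and 5)
print(statistics.mode(data))    # most common value: 3
print(statistics.stdev(data))   # sample standard deviation
```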
Day 2
NumPy. What it is, how it works, how fast it is. N-dimensional arrays.
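The headline point is vectorisation: operations on whole arrays run in compiled code rather than a Python-level loop. A minimal sketch:

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)  # an N-dimensional array (1-D here)

# Vectorised: one expression over the whole array, no Python loop
b = a * 2 + 1

print(a.shape, a.dtype)   # (1000000,) float64
print(b[:3])              # [1. 3. 5.]
```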
Pandas. Again, overview of the basics. Series and DataFrames.
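Series (1-D, labelled) and DataFrames (2-D, tabular) are the two core structures; a minimal example with made-up data:

```python
import pandas as pd

# A Series is a labelled 1-D array
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame is a table of named columns (each column is a Series)
df = pd.DataFrame({"name": ["alice", "bob"], "score": [10, 7]})

print(s["b"])              # 20
print(df["score"].mean())  # 8.5
```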
Day 3
Matplotlib and Seaborn. Again, basic interfaces for drawing graphs and getting data out of DataFrames.
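A minimal example of plotting straight from a DataFrame (the data and file name are my own, not from the course):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1, 4, 9, 16]})

fig, ax = plt.subplots()
ax.plot(df["x"], df["y"], marker="o")
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.savefig("squares.png")  # hypothetical output file
```

Seaborn builds on Matplotlib; `seaborn.scatterplot(data=df, x="x", y="y")` gives a similar plot with its default styling.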
Day 4
Linear regression and lines of best fit. K-Nearest-Neighbours - using random train/test splits to choose K.
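A sketch of both techniques using scikit-learn (my assumption about tooling - the data below is synthetic, not from the course):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# --- Linear regression: best-fit line through noisy y = 2x + 1 ---
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X.ravel() + 1 + rng.normal(0, 0.1, size=100)

reg = LinearRegression().fit(X, y)
print(reg.coef_[0], reg.intercept_)   # close to 2 and 1

# --- K-Nearest-Neighbours: random train/test split to compare values of K ---
Xc = rng.uniform(0, 1, size=(200, 2))
yc = (Xc[:, 0] + Xc[:, 1] > 1).astype(int)  # simple two-class labels

X_train, X_test, y_train, y_test = train_test_split(
    Xc, yc, test_size=0.25, random_state=0)

for k in (1, 3, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))  # held-out accuracy for each K
```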
Day 5
Voronoi Diagrams and basic clustering. Mixture models.
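Clustering and mixture models can be sketched with scikit-learn, and SciPy provides Voronoi diagrams (the data is synthetic and the library choices are mine, not necessarily the course's):

```python
import numpy as np
from scipy.spatial import Voronoi
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic blobs centred near (0, 0) and (3, 3)
pts = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
                 rng.normal(3, 0.5, size=(50, 2))])

# Hard clustering: each point gets exactly one cluster
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
print(km.cluster_centers_)        # one centre near each blob

# Mixture model: soft assignments - a probability per component
gm = GaussianMixture(n_components=2, random_state=0).fit(pts)
print(gm.predict_proba(pts[:1]))  # two probabilities summing to 1

# Voronoi diagram of the points: each cell is the region closest to one point
vor = Voronoi(pts)
print(len(vor.regions))
```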