Makeover Monday (#43)

This week’s Tableau Makeover Monday was a challenge to visualise a small amount of data: two data points – the total size of the US national debt versus the rest of the world. The original visualisation can be seen on the visualcapitalist website and also includes comparisons of the US$19.5 trillion debt to things like company sizes, oil exports, cash held, etc. The pie chart works well here and the comparisons give some idea of scale.

I’ve stuck to the challenge and just used the two data points, presented as a tree map but formatted to resemble a credit card – perhaps a good way to make the concept more real?

us-debt


[Edit: Was it a good idea to use credit card imagery in this visualisation? Maybe not! As touched on by Andy Cotgreave in his Makeover Monday blog post, national debt is different from household debt. A better analogy might be a business loan – borrowing money to build a business (aka grow the economy). If I were revisiting the visualisation I’d also change the title and caption to refer to “sovereign debt” (with an explanatory note), as the term “global” could be misconstrued as covering all debt, not just national or sovereign debt.]

Makeover Monday (#42)

This week’s #MakeoverMonday was a look at US presidential election forecasting data by Drew Linzer on Daily Kos Elections.

The original charts plot the average percentage polled by Clinton and Trump over time, along with the percentages undecided and other (independents). Personally I wasn’t sure I could improve on the existing charts or some of the community versions (loving the tile maps!), so instead I’ve focussed on a different angle: it wasn’t always easy to see at a glance who was predicted to win the election and why, particularly given the complexity of the electoral college voting system.

Showing the states each candidate was predicted to win, ordered by the scale of the candidate’s lead, with a running sum of electoral votes, seems to work well:

votamatic

 

Colour coding those states where the lead is smaller than the percentage undecided adds another potential indicator of confidence, allowing us to make our own judgements of the likelihood of a particular result – how many close states would have to swing Trump’s way for him to win at this stage?

I would have liked to have added the overall country wide line chart – like the original – and to have dug deeper into the two states where electoral votes are allocated differently. Animating the chart over time, using the pages shelf, was an interesting exercise that I would have liked to explore further.

The full viz can be found on Tableau Public.

Or check out other community efforts via the #MakeoverMonday website.

Makeover Monday (#41)

Having a go at Tableau #MakeoverMonday this week, with a reworking of an FT visualisation of European public transport satisfaction survey results in 2015. A good opportunity to look into ways to visualise Likert scale survey results, and to practise some table calculations in Tableau! Adding the ranking by country, along with an indicator of the number of places gained/lost, gives a quick idea of how satisfaction has changed.

Public transport satisfaction in Europe in 2015

View in Tableau Public. Areas for improvement: the table calculations for last year’s ranking (could I use an LOD expression here?); tooltips for the diverging stacked bars (some of the other community makeovers do a very nice job here); drill-down to city level; and the legend.

Visualising LCFCs 2014/15 Season

Leicester City FC defied the odds to avoid relegation from the English Premier League in May. Rock bottom at Christmas and seven points adrift by late March, a resurgence in form saw the Foxes to safety with a game to spare. In the following interactive visualisations I look at the club’s results, league position and points over the season, along with player performance data:

LCFC 2014/15 Season (mobile version)

LCFC 2014/15 Season (desktop/tablet version)

 

Static version showing performance for Schmeichel:

LCFC 2014-15 Season Visualisation


 

Can an AI algorithm win fantasy football?

If you’ve heard the term moneyball, then you’ll know that in 2002 the Oakland A’s Major League Baseball team began to use statistical analysis to identify and sign undervalued players, in order to compete against their richer competitors. The approach is credited with getting them to the playoffs in both 2002 and 2003 and has since been adopted more widely.

In the football world, Brentford FC are reportedly embarking on a similar journey using the data-driven approach proven at their Danish sister club.

So could an AI algorithm win fantasy football?

The short answer is probably yes (and maybe already has!), but here is how a less successful attempt went this season!

 

Fantasy Football 101

The English Premier League’s free fantasy football competition lets you pick a squad of 15 players from the various Premier League teams, with the aim of seeing how your squad performs compared to other people’s. There are restrictions on the make-up of your squad, including a maximum cost that prevents you loading up with too many top players.

Each week you select a team of 11 players from your squad to “play”, and then score points depending on how well those players perform in actual matches that week. If a player is not performing then you can transfer them out and another player in. There are limits on the number of free transfers that you can make and player costs change depending on demand (making popular players less affordable). There are two opportunities to make unlimited transfers during the season by playing a transfer wildcard.
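The squad rules above can be sketched as a simple validity check. The budget, position quotas and per-club limit below are illustrative assumptions of the kind of constraints involved, not the official rule set:

```python
# A minimal sketch of the squad rules described above; the numbers are
# assumptions for illustration, not the game's official values.
BUDGET = 100.0                                       # assumed cost cap
QUOTAS = {"GK": 2, "DEF": 5, "MID": 5, "FWD": 3}     # 15 players in total
MAX_PER_CLUB = 3                                     # assumed per-club limit

def is_valid_squad(squad):
    """squad: list of dicts with 'position', 'club' and 'cost' keys."""
    if sum(p["cost"] for p in squad) > BUDGET:
        return False
    for pos, quota in QUOTAS.items():
        if sum(1 for p in squad if p["position"] == pos) != quota:
            return False
    clubs = [p["club"] for p in squad]
    return all(clubs.count(c) <= MAX_PER_CLUB for c in set(clubs))
```

Any candidate squad, whether hand-picked or generated by a query, has to pass a check like this before it can be entered.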

 

The goal

My goal was to have a fully automated team entered into the competition. The algorithm would select a squad at the start of the season, collect data each week, make any transfers and select the best team to field based on that data. I didn’t achieve full automation this season, and so manually applied aspects of the algorithm instead.

 

The algorithm

Firstly, this algorithm isn’t really AI. It’s more of a first attempt to work with some of the data involved!

Initial squad selection: Step 1 was to select the best combination of players that met the cost and team structure rules. The query I developed attempted to “brute force” valid combinations of players and then order those by overall score last season. There were some obvious downsides to this approach!

  • The query ignored players that were not involved last season (new signings and those from promoted clubs)
  • The highest scoring combination of players wouldn’t necessarily have won last season (transfers and captain choices play a large part in overall score).
  • The number of combinations considered had to be limited in order for the query to run in a reasonable time
  • The best scoring combination I could identify was some way off both the actual top scores from last season and the theoretical maximum for a squad with no cost restrictions (the best 2 keepers, plus the best 5 defenders, etc.)
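The brute-force idea can be sketched in a few lines of Python. This is a toy reduction with made-up players – pick 3 under a small budget and rank by last season’s points – rather than the real 15-player problem with positional rules (which was a SQL query in my case):

```python
from itertools import combinations

def best_squad(players, size, budget):
    """Brute-force the highest-scoring affordable combination.
    players: list of (name, cost, last_season_points) tuples."""
    best, best_points = None, -1
    for combo in combinations(players, size):
        if sum(cost for _, cost, _ in combo) <= budget:
            points = sum(pts for _, _, pts in combo)
            if points > best_points:
                best, best_points = combo, points
    return best

# Made-up player data for illustration
players = [("A", 10, 200), ("B", 9, 180), ("C", 5, 120),
           ("D", 4, 90), ("E", 4, 80)]
squad = best_squad(players, 3, 20)
```

Even in this toy form the combinatorial explosion is clear: with hundreds of real players and a 15-player squad, an exhaustive search is infeasible, hence the limits mentioned above.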

On a weekly basis: Gather data, make transfers and pick team.

  • Get latest player performance data
  • Get performance data for my squad
  • Get betting odds for forthcoming games (which teams are most likely to win)
  • Transfer out the player with the worst form and/or least likely to play
  • Transfer in a player with better form from a team with good odds (and within cost rules, etc.)
  • Select a team based on home advantage and odds for the forthcoming games (e.g. favor Chelsea at home)
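The weekly transfer rule above can be sketched as follows. The player dictionaries and field names (“form”, “cost”, “team_odds”) are my own illustrative structures, not anything from the fantasy game itself; lower odds mean a team is more likely to win:

```python
def weekly_transfer(squad, candidates, budget_left):
    """Suggest one transfer following the rules above."""
    # Transfer out the squad player with the worst form
    out = min(squad, key=lambda p: p["form"])
    # Candidates must be affordable and an improvement in form
    affordable = [c for c in candidates
                  if c["cost"] <= out["cost"] + budget_left
                  and c["form"] > out["form"]]
    if not affordable:
        return None
    # Prefer the best form; break ties with shorter (better) odds
    incoming = max(affordable, key=lambda c: (c["form"], -c["team_odds"]))
    return out, incoming
```

In practice the real logic also has to respect the per-club limit and the cap on free transfers, which this sketch ignores.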

An initial issue with the transfer approach was that budget bench players would be transferred out until the point where the algorithm could no longer make a useful transfer and expensive players were sat on the bench tying up funds. Consequently I adapted the rule after the first half of the season to generally leave budget bench players alone and only transfer the regular players.

Periodic squad reselection: At two points during the season the squad was reset using the transfer wildcards. I tried a different approach to selection on each occasion, with variations including:

  • Pre-selecting the best-value budget bench players (playing regularly and picking up some points)
  • Considering form and value, as well as overall points, when selecting the main players

 

The results

The squad was entered from week two.

It finished third in a small private league of eight (second if you discard the points from the first week).

It had an overall rank of 455,222 out of 3,502,998 (i.e. in the top 10–15%).

And was ranked 5,412 out of those entering in game week two.

 

What would I do differently next time?

The Fantasy Premier League form calculation wasn’t wholly reliable. Their form figure is an average over a thirty-day period, so a player who scored very well three or four games ago but has since been poor can still show relatively high form. It might be better to recalculate form, factoring in how consistently the player is scoring points and/or whether their performance is trending up or down.
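One sketch of the idea is an exponentially weighted form score where recent games count more. The decay value here is an arbitrary tuning assumption, not anything from the official game:

```python
def weighted_form(points_history, decay=0.7):
    """Recency-weighted form score. points_history is ordered oldest
    to newest; decay < 1 is an assumed tuning value."""
    n = len(points_history)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * p for w, p in zip(weights, points_history)) / sum(weights)
```

A player with scores [15, 2, 1, 2] averages 5 points per game – the same as a steady [5, 5, 5, 5] – but their weighted form comes out well under 5, because the big haul is fading into the past.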

Too many selection factors were retrospective – just because a player has done well doesn’t mean that they will continue to do so. Instead it would be interesting to gather more predictive data (e.g. fan tips) and try to identify the factors that are most linked to future performance.

It would be nice to automate the processing and team management more fully. The tool-set used should allow this.

 

Tool-set used

Data was gathered using Python and HTQL and stored in a Microsoft Azure SQL Server database.

Analysis was performed using SQL queries.


Intro to HTQL with Python (2)

Following on from part 1, here is an example of using HTQL to pull data from a table on a webpage.

We’ll use the Wikipedia list of most expensive football transfers as our source web page. You can check out the list here. On viewing the page and its HTML source you’ll see that the first row of the table is a header row, and that the “player”, “from” and “to” columns contain quite a bit of HTML in order to provide a link to the player/team and a graphical link to their country. Our HTQL will need to cut through this to get just the data we want.

 

Using the HTQL browser object

In part 1 we just queried some HTML in a variable. In this case we’ll use the browser object to get the HTML from the actual web page and run a query against that.

  1. Create a new python file – e.g. htql-2-wiki.py
  2. Edit the file in IDLE and enter the following code:
    import htql;
    
    # init a browser and go to the wikipedia page
    b=htql.Browser(); 
    p=b.goUrl("http://en.wikipedia.org/wiki/List_of_most_expensive_association_football_transfers");
    
    # setup the HTQL pattern to query that page
    pattern="<table>1.<tr>2-0{"
    pattern+="player=<td>2.<a>2:TX;"
    pattern+="from=<td>3<a>2:TX;"
    pattern+="to=<td>4<a>2:TX;"
    pattern+="fee=<td>5:TX;"
    pattern+="year=<td>7.<a>1:TX}"
    
    # run the query against the page and loop thru the results
    for r in htql.HTQL(p[0], pattern):
        print(r);
  3. A brief run through of the code:
    • The first part of the code instantiates a browser object and uses the goUrl method to go to the page
    • The second part specifies the query that we’re going to run against the page. The query is built up over multiple lines and says: find me the first table on the page (<table>1); then for the second to the last row of that table (.<tr>2-0); create tuples of the data (indicated by the curly braces) where player = the text within the second hyperlink in the second column (player=<td>2.<a>2:TX); and so on.
    • More on the query syntax in a later post!
    • The final part of the code runs the query against the page, loops over the tuples returned and prints them to screen.
  4. If you run the code (F5) you should see the following output:
    Output of Python HTQL program

 

Outputting the results to a CSV

As well as outputting the results to screen, we can output to a CSV.

  1. Amend the final part of your Python program as follows:
    :
    
    # open a file to put the data in
    file = open("transfers.csv", "wb")
    
    # run the query against the page and loop thru the results
    for r in htql.HTQL(p[0], pattern):
        print(r);
        file.write("\"" + r[0] + "\",")
        file.write("\"" + r[1] + "\",")
        file.write("\"" + r[2] + "\",")
        file.write("\"" + r[3] + "\",")
        file.write("\"" + r[4] + "\"\n")
    
    #close the file    
    file.close()
  2. You’ll note the file open and close and the addition of file.write(…) lines within the loop.
  3. Run this version and see what you get! You should see a transfers.csv file in the same folder as the Python program.
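As an aside, the hand-rolled quoting above will break if a value itself contains a double quote. Python’s standard csv module handles escaping automatically; here is a minimal sketch, with a made-up row standing in for the htql results so it runs standalone:

```python
import csv

# A made-up row standing in for the tuples returned by htql.HTQL,
# so this sketch runs on its own. Note the comma inside the fee,
# which the csv module will quote for us.
rows = [("Neymar", "Barcelona", "Paris Saint-Germain", "222,000,000", "2017")]

# newline="" is the Python 3 idiom for csv files; on Python 2.7
# (as used in this series) open the file in "wb" mode instead.
with open("transfers.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerows(rows)
```

Dropping this in place of the manual file.write(…) lines keeps the loop body to a single writerow call per result.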

Intro to HTQL with Python (1)

HTQL – Hyper-Text Query Language – is a language for querying and extracting content from HTML pages. If SQL is a language to get data from tables within a database, then HTQL is a language to get data from webpages on the internet. It is useful when you need to pull data from the web and there is no web service available to use. An example might be to pull population statistics from Wikipedia.

Note that the example below uses Python 2.7.3 and HTQL for Python 2.7. I’ve done this because I’m ultimately deploying code to a Microsoft Azure website, which only supported Python 2.7.3 or 3.4.0 at the time of writing, and unfortunately there isn’t a version of HTQL for Python 3.4.0! You may want to consider developing with Python 3.3.5 and HTQL for Python 3.3.

 

Download and install Python

  1. Download Python (I downloaded the 32-bit MSI installer):
    https://www.python.org/downloads, or directly:
    https://www.python.org/ftp/python/2.7.3/python-2.7.3.msi
  2. Run the installer, letting it add Python to your path if prompted (if you are not prompted for this then you probably have an existing Python install and may want to consider whether or not to update your path manually).

 

Check Python

If you want to double check that Python is working then you can create and run a simple program to output the Python version as follows:

  1. Setup a folder for development
  2. Create a new file in that folder called test.py
  3. Right click that file and you should be able to select “Edit with IDLE” (IDLE is a basic IDE for Python development). You should see a blank window like this:
    Empty python file
  4. Enter the following code:
    import sys
    sys.stdout.write("hello from Python %s\n" % (sys.version,))
  5. And then hit F5 to run the program (or select Run > Run Module from the menu)
  6. A second window should open – the Python Shell – and your program should run:
    Output of Python program

 

Download and install HTQL

Next we’ll download and install HTQL for Python:

  1. Download the HTQL Python library:
    http://htql.net/, or directly:
    http://htql.net/Python27/htql.zip
  2. You’ll probably want to grab the HTQL manual and the HTQL Python manual whilst you’re there. The first of these describes HTQL itself; the second describes using HTQL within a Python program.
  3. There is only one file in the zip – htql.pyd – extract this file and place a copy in <install location>\Python27\Libs (where <install location> is the location in which you installed Python in the steps above).

 

Run a simple HTQL script

You can check that HTQL is working using the basic example from the HTQL Python manual, as described below. Note that this example doesn’t go out to a webpage; it just parses a chunk of HTML that is set up in a local variable.

  1. Create a new python file – e.g. htql-1.py
  2. Edit the file in IDLE and enter the following code:
    import htql;
    page="<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>";
    query="<a>:href,tx";
    for url, text in htql.HTQL(page, query):
        print(url, text);
  3. The code above does the following:
    • pulls in the HTQL library – import htql;
    • Sets up a dummy HTML page with three links / anchors.
    • Specifies a HTQL query which says “for all of the anchor <a> tags in the HTML, give me the url / href and the text within the anchor” (tx = the text between the start and end tags).
    • The code then loops through the results of running that query against the dummy page and prints out each url and text found.
  4. If you run the code (hit F5) you should see the following output:
    Output of Python HTQL program
  5. Success!