Assignment Briefing Sheet

This Assignment assesses the following module Learning Outcomes (from Definitive Module
Successful students will typically:

  1. be able to appreciate the strengths and limitations of various data mining models;
  2. be able to critically evaluate, articulate and utilise a range of techniques for designing data mining
  3. be able to critically evaluate different algorithms and models of data mining.
    Assignment Brief:
    A dataset of text is provided in the assignment area on Canvas. Analyse this data using the WEKA
    toolkit and tools introduced within this module, comparing two different forms of preprocessing. For
    example, you may investigate the impact of using stemming, the effect of reducing the number of
    features, the impact of using term frequency rather than a simple word count, etc.
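To make the last comparison concrete, the following is a minimal sketch in plain Python (not WEKA itself) of three ways to weight a word in a document: a raw count, a length-normalised frequency, and a log-scaled count. The function names are hypothetical and chosen for illustration only. The log(1 + count) form is what WEKA's StringToWordVector filter is understood to apply when its TF transform option is enabled; check the filter's own documentation before relying on this.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts(text):
    """Raw number of occurrences of each word in one document."""
    return Counter(tokenize(text))


def term_frequencies(text):
    """Counts divided by document length, so long documents do not dominate."""
    counts = word_counts(text)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def log_scaled_counts(text):
    """log(1 + count): dampens the effect of words repeated many times."""
    return {w: math.log(1 + c) for w, c in word_counts(text).items()}


if __name__ == "__main__":
    doc = "The cat sat on the mat"
    print(word_counts(doc))
    print(term_frequencies(doc))
    print(log_scaled_counts(doc))
```

The three weightings rank the words within a single document identically; they differ in how documents of different lengths, or with heavily repeated words, compare with one another, which is what a classifier sees.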
    Complete the following tasks:
  1. Describe which question you will be investigating (e.g. “is stemming beneficial to improving
    performance?”, “is the reduction of features beneficial to improving performance?”, etc.) and why
    you think your choice is an interesting question to investigate.
  2. Convert the text dataset into TWO different databases in ARFF format, based on your chosen
    question. Explain the conversion techniques and parameters that you have used, along with any
    other preprocessing you wish to do. (Do not include a screenshot of the attributes in WEKA;
    you need to describe them.)
  3. For each database, produce a table and a graph of classification performance against training
    set size for the following three classifiers: decision tree (J48), Naïve Bayes, Support Vector
    Machine. For the Support Vector Machine you must determine the kernel and its parameters.
  4. Write a conclusion. You should at least compare the performance of the different learning
    algorithms on your databases, and answer the question you posed in part (1).
    Remember to explain the steps you have taken to complete each task in your report. Screenshots
    are typically not required, and should be used sparingly if at all.
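As an illustration of what the two ARFF databases might look like for a feature-reduction question, here is a minimal sketch in plain Python that writes a dense ARFF file of word counts, optionally keeping only the most frequent words. It is not the WEKA conversion itself, which would normally be done with the toolkit's own filters. The function name `to_arff`, the `w_` attribute prefix, the `class_label` attribute name and the toy documents are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def to_arff(docs, labels, relation="text_counts", max_features=None):
    """Return a dense ARFF string of word counts, one row per document.

    Labels are assumed to be simple tokens without spaces or commas.
    """
    counts = [Counter(tokenize(d)) for d in docs]
    totals = Counter()
    for c in counts:
        totals.update(c)
    # Rank words by corpus frequency; ties are broken alphabetically.
    ranked = sorted(totals, key=lambda w: (-totals[w], w))
    vocab = sorted(ranked[:max_features])  # max_features=None keeps every word
    classes = sorted(set(labels))

    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute w_{w} numeric" for w in vocab]
    lines.append("@attribute class_label {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for c, label in zip(counts, labels):
        lines.append(",".join([str(c[w]) for w in vocab] + [label]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    docs = ["good good film", "bad film"]   # toy data, not the assignment dataset
    labels = ["pos", "neg"]
    print(to_arff(docs, labels))                    # database A: all words
    print(to_arff(docs, labels, max_features=2))    # database B: top 2 words
```

Calling the function twice with different settings yields two databases that differ in exactly one preprocessing choice, which keeps any difference in classifier performance attributable to that choice.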
    Submission Requirements:
    A single PDF document containing your report, to a maximum of 10 pages.
    Marks awarded for:
    Marks will be awarded out of 100 in the following proportions:
  1. Question (5 marks)
  2. Conversion (40 marks)
  3. Training/testing (40 marks)
  4. Conclusion (15 marks)
    A reminder that all work should be your own. Reports exceeding the maximum length may not
    be marked beyond the 10 pages.
    Type of Feedback to be given for this assignment:
    Along with the marks, each student will receive individual written feedback on the online platform
