#pkineto #masterofinformationscience

Trying UNISAL on lecture videos

  • Surprisingly, it’s working too well
  • The plan below didn’t work out
  • Now, what should we do?
  • How about using this to evaluate importance in lecture videos?
    • It would be interesting if we could extract and reuse part of a network like UNISAL.
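As a starting point, a minimal sketch of turning per-frame saliency maps (e.g. the output of a model like UNISAL) into a per-frame importance score for distinguishing timings. The input format and the top-k pooling here are assumptions for illustration, not UNISAL's actual API:

```python
import numpy as np

def frame_importance(saliency_maps, top_frac=0.05):
    """Score each frame by the mean of its most salient pixels.

    saliency_maps: array of shape (T, H, W) with values in [0, 1]
    (hypothetical input format -- adapt to the model's real output).
    Pooling only the top fraction keeps a large low-saliency
    background from dragging a plain mean down.
    """
    t, h, w = saliency_maps.shape
    k = max(1, int(top_frac * h * w))
    flat = saliency_maps.reshape(t, -1)
    # np.partition puts the k largest values in the last k slots.
    topk = np.partition(flat, -k, axis=1)[:, -k:]
    return topk.mean(axis=1)

# Toy example: 3 "frames", the second contains a salient blob.
maps = np.zeros((3, 8, 8))
maps[1, 2:4, 2:4] = 1.0
scores = frame_importance(maps)
```

Thresholding or smoothing the resulting score over time would then give candidate "important timings".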

Meeting on 20200319

  • What we want to do: Distinguish between important and unimportant timings in the videos.

  • Method

  • It seems more practical and interesting to evaluate the high-saliency areas of lecture videos rather than individual frames.

    • With Kineto, we can make student annotations lighter in high-saliency areas, for example
      • This is actually a good idea
    • And then, it would be interesting if we could predict saliency a few seconds later
  • Existing research

  • What to do

    • Submit a scene from a lecture video to predimportance.mit.edu
    • Submit lecture videos to existing saliency evaluation methods
  • What does it mean for a region of a lecture video to have high saliency (and how does that differ from general saliency)?

    • It’s likely to be different from sports, for example, in terms of saliency persistence
  • Q. Can’t we just use saliency on still lecture images?

    • A. It depends on the context, so we need videos
      • Or find a way to convey contextual information
    • I want to be more specific about “context”
    • For now, let’s run existing methods on lecture videos and find any issues
  • Things to consider

    • What about the teacher?
      • I want to try something like extracting parts of existing models in lecture videos
        • What would be the existing model in lecture videos?
    • How fine-grained should the annotation be?
      • Pixel-level or bounding box level, for example
      • Trade-off with computational cost
    • Should we deal with audio?
  • In the end, what is importance?

    • Let’s start by getting our hands dirty and thinking
    • Try the complexity-based approach
    • Want to focus more on saliency
  • Idea

  • Different people have different ways of using animations.

20210312

  • I haven’t made much progress on my image-processing research because I’ve been busy with final exams, my thesis, and development for the MITOU program.

  • The finals are over, and today I finished all the work for the MITOU program.

    • From now on, I think I can dedicate more time to my research on image processing.
  • In the presentation on March 25th, I plan to focus on presenting the developed product and also make as much progress as possible in my research in the remaining two weeks.

  • Product

  • Future of the research

    • Focus on lecture videos?
    • If it’s lecture videos, we can collect them by filming at school every day, right?
    • What I want to do: Distinguish between important and unimportant timing in videos
      • Distinguish between points where the teacher is providing information and points where there is no significant change
    • (I haven’t been able to think about the means or search for previous research yet 💦)
    • Voice, intonation
  • Previous research

20210121

  • I couldn’t work on image processing.
  • I wrote a description of /kineto/What is Kineto aimed at the general public.
  • Instead of “lecture videos,” I think of it as a “shared blackboard.”
    • Placing it as an extension of Jamboard and Miro.

20210115 Meeting

20210105 Meeting

  • What I did
    • For now, I want to evaluate the “size” of the changes on the slides/blackboard.
    • image
      • Detect the timing of slide changes, take the difference before and after the change, and then draw contours.
        • Checking for changes every 30 frames.
      • The detected parts come out scattered and fragmented.
        • Tried adding blur (Gaussian Blur)
    • Enclose the characters as text using OCR.
    • image
    • Regarding ↑, numbers and the like may not be that important.
  • The parts surrounded by yellow indicate the detected changes.
  • Things I understand:
    • Different responses are required for slide changes and animation changes.
    • In the case of slides, if the animation is slow, the difference between each frame may be below the threshold.
    • It is inconvenient when images suddenly appear mid-slide.
    • ⭐️It is difficult to measure the magnitude of the changes.
      • Simply taking the mean of the differences is problematic because of the variation.
      • It is heavily influenced by the number of pixels occupied by the object.
      • For example, even if a white figure appears on a white background, it cannot be detected because the number of changed pixels is small.
    • It is also difficult to determine the changes.
      • Animation changes are not a problem, but adding slides is difficult.
      • Detecting slide additions requires conditional branching.
  • Consult with a teacher.
    • Apparently it is better not to vary character size on the blackboard (common knowledge among teachers?).
      • It is said that using colors is more suitable for expressing importance.
      • Knowing this may make OCR easier as well.
  • Direction:
    • Classification of changes in “lecture videos” (live lectures, slide lectures, etc.).
      • Persistent differences: writing on the blackboard, slide animations, etc., things that remain afterwards.
      • Temporary differences: teacher’s movements (teacher on the blackboard in live lectures, wipe window in slide lectures), etc.
      • Reset differences: erasing the blackboard, moving slides, etc.
      • (Other noises)
    • How to classify:
      • Comparing frame differences can determine whether they are persistent.
      • People and moving objects can use Lucas-Kanade, etc.
    • Want to classify differences and evaluate the complexity of persistent differences.
      • Temporary differences are mostly human bodies with little semantic content, so their complexity may not be meaningful.
      • Image of complexity (comparison when occupying the same area):
        • Simple shapes < maps
        • Monochrome maps < maps with multiple colors
        • Shapes < text
      • Can use evaluation axes specific to slide/blackboard images?
    • Want to consider lectures on blackboards (fragmented temporary differences), slide lectures with many animations, and slide lectures with few animations, all within the same framework (linear structure).
    • How to actually implement?
    • Searching for “complexity” of images doesn’t yield relevant results.
      • If there are better terms, please let me know. 🙏
      • “Wide” and “send”
    • Histogram variance, etc.?
      • Outliers
    • Additionally, could try line drawing and counting the number of lines.
      • Counting the number of outline lines that intersect with straight lines, etc.
      • Or sweep the line through 360 degrees, etc.
    • Previous research:
    • Novelty:
  • Obtain prior knowledge by consulting with teachers.
    • Various things: what can be skipped (contrasting with speaking speed), etc.
    • Basis for changing speed.
  • As research:
    • Tip: If you increase the scale, you can avoid complete failure.
  • Aim for a form like this:

20201202 Meeting

  • There are several studies that select important frames.

    • For example, selecting from images on whiteboards, etc.
  • Is there no evaluation of the importance of each frame?

    • We want to change the speed, so we need this.
  • Also, there are few approaches that focus on “removing unnecessary parts” rather than “selecting necessary parts”.

  • Is the concept of an elastic timeline unusual? (It may seem obvious after thinking about it for a few months.)

    • I could only find research on elastic timelines for cows and content-awareness.
  • Hypothesis: Does the size of the written content determine its importance?

  • Discussion:

    • I want to manipulate the speed by combining it with other elements.
    • How should we combine them?
      • There might be a better method than simple addition or multiplication.
  • Also, for the summary when viewed later:

    • What should we do for someone who can’t catch up even with our best summary, whether because they joined late or didn’t participate?

20201118 Meeting:

  • Potential target videos:

    • Live-action classroom videos
    • Slide-based lectures
    • Instructional videos using various materials
    • (Should we narrow down the scope?)
      • Yes, we should narrow it down.
      • Real-life and lectures have different characteristics.
        • With slide-based lectures, text and charts are important.
        • Audio seems to be crucial.
          • Video: 3-dimensional
          • Audio: 1-dimensional (lightweight)
          • If we can vectorize them, we can use the same framework.
        • Real-time situations make deep learning challenging.
          • Especially with video processing.
  • Goal: Bend the lines of this graph.

    • Change the speed.
      • Find parts that can be changed in speed without affecting understanding of the content.
      • Evaluation criteria could be how humans perceive it.
    • Skip unnecessary parts.
      • Apply video summarization techniques (weaken them?).
    • Utilize information from previous studies.
    • There might be other methods that we haven’t thought of yet.
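The "bend the lines of this graph" goal can be sketched as a mapping from per-frame importance to playback speed: a minimal version of the elastic timeline that speeds time up instead of cutting frames outright. The min/max speeds are assumed parameters:

```python
import numpy as np

def speed_curve(importance, min_speed=1.0, max_speed=4.0):
    """Map per-frame importance in [0, 1] to a playback-speed curve.

    Low-importance stretches play fast (up to max_speed); fully
    important frames play at min_speed (normal speed).
    """
    imp = np.clip(np.asarray(importance, dtype=float), 0.0, 1.0)
    return max_speed - (max_speed - min_speed) * imp

def elastic_duration(importance, fps=30.0):
    """Watch time in seconds after applying the speed curve."""
    return float((1.0 / speed_curve(importance)).sum() / fps)
```

For example, 300 uniformly unimportant frames at 30 fps shrink from 10 s of viewing to 2.5 s at 4x, while fully important frames keep their original duration.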