#pkineto #masterofinformationscience

Trying UNISAL on lecture videos

  • Surprisingly, it’s working too well
  • The plan below didn’t work out
  • Now, what should we do?
  • How about using this to evaluate importance in lecture videos?
    • It would be interesting if we could extract and reuse part of a network like UNISAL.
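As a starting point, a minimal sketch of turning per-frame saliency maps (e.g. the output of a model like UNISAL) into a per-frame importance score for distinguishing timings. The input format and the top-k pooling here are assumptions for illustration, not UNISAL's actual API:

```python
import numpy as np

def frame_importance(saliency_maps, top_frac=0.05):
    """Score each frame by the mean of its most salient pixels.

    saliency_maps: array of shape (T, H, W) with values in [0, 1]
    (hypothetical input format -- adapt to the model's real output).
    Pooling only the top fraction keeps a large low-saliency
    background from dragging a plain mean down.
    """
    t, h, w = saliency_maps.shape
    k = max(1, int(top_frac * h * w))
    flat = saliency_maps.reshape(t, -1)
    # np.partition puts the k largest values in the last k slots.
    topk = np.partition(flat, -k, axis=1)[:, -k:]
    return topk.mean(axis=1)

# Toy example: 3 "frames", the second contains a salient blob.
maps = np.zeros((3, 8, 8))
maps[1, 2:4, 2:4] = 1.0
scores = frame_importance(maps)
```

Thresholding or smoothing the resulting score over time would then give candidate "important timings".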

Meeting on 20200319

  • What we want to do: Distinguish between important and unimportant timings in the videos.

  • Method

  • It seems more practical and interesting to evaluate the high-saliency areas of lecture videos rather than individual frames.

    • With Kineto, we can make student annotations lighter in high-saliency areas, for example
      • This is actually a good idea
    • And then, it would be interesting if we could predict saliency a few seconds later
  • Existing research

  • What to do

    • Submit a scene from a lecture video to predimportance.mit.edu
    • Submit lecture videos to existing saliency evaluation methods
  • What does it mean for a region of a lecture video to have high saliency (and how does that differ from general saliency)?

    • It’s likely to be different from sports, for example, in terms of saliency persistence
  • Q. Can’t we just use saliency on still lecture images?

    • A. It depends on the context, so we need videos
      • Or find a way to convey contextual information
    • I want to be more specific about “context”
    • For now, let’s run existing methods on lecture videos and find any issues
  • Things to consider

    • What about the teacher?
      • I want to try something like extracting parts of existing models in lecture videos
        • What would be the existing model in lecture videos?
    • How fine-grained should the annotation be?
      • Pixel-level or bounding box level, for example
      • Trade-off with computational cost
    • Should we deal with audio?
  • In the end, what is importance?

    • Let’s start by getting our hands dirty and thinking
    • Try the complexity-based approach
    • Want to focus more on saliency
  • Idea

  • Different people have different ways of using animations.

20210312

  • I haven’t made much progress on my image-processing research because I’ve been busy with final exams, my thesis, and development for the MITOU program.

  • The finals are over, and today I finished all the work for the MITOU program.

    • From now on, I think I can dedicate more time to my research on image processing.
  • In the presentation on March 25th, I plan to focus on presenting the developed product and also make as much progress as possible in my research in the remaining two weeks.

  • Product

  • Future of the research

    • Focus on lecture videos?
    • If it’s lecture videos, we can collect them by filming at school every day, right?
    • What I want to do: Distinguish between important and unimportant timing in videos
      • Distinguish between points where the teacher is providing information and points where there is no significant change
    • (I haven’t been able to think about the means or search for previous research yet 💦)
    • Voice, intonation
  • Previous research

20210121

  • I couldn’t work on image processing.
  • I wrote a description of /kineto/What is Kineto aimed at the general public.
  • Instead of “lecture videos,” I think of it as a “shared blackboard.”
    • Placing it as an extension of Jamboard and Miro.

20210115 Meeting

20210105 Meeting

  • What I did
    • For now, I want to evaluate the “size” of the changes on the slides/blackboard.
    • image
      • Detect the timing of slide changes, take the difference before and after the change, and then draw contours.
        • Checking for changes every 30 frames.
      • The detected parts come out scattered and fragmented.
        • Tried adding blur (Gaussian Blur)
    • Enclose the characters as text using OCR.
    • image
    • Regarding ↑, numbers and the like may not be that important.
  • The parts surrounded by yellow indicate the detected changes.
  • Things I understand:
    • Different responses are required for slide changes and animation changes.
    • In the case of slides, if the animation is slow, the difference between each frame may be below the threshold.
    • It is inconvenient when images suddenly appear mid-slide.
    • ⭐️It is difficult to measure the magnitude of the changes.
      • Simply taking the mean of the differences is problematic because of the variation.
      • It is heavily influenced by the number of pixels occupied by the object.
      • For example, even if a white figure appears on a white background, it cannot be detected because the number of changed pixels is small.
    • It is also difficult to determine the changes.
      • Animation changes are not a problem, but adding slides is difficult.
      • Detecting slide additions requires conditional branching.
  • Consult with a teacher.
    • Apparently it is better not to vary character size on the blackboard (common knowledge among teachers?).
      • It is said that using colors is more suitable for expressing importance.
      • Knowing this may make OCR easier as well.
  • Direction:
    • Classification of changes in “lecture videos” (live lectures, slide lectures, etc.).
      • Persistent differences: writing on the blackboard, slide animations, etc., things that remain afterwards.
      • Temporary differences: teacher’s movements (teacher on the blackboard in live lectures, wipe window in slide lectures), etc.
      • Reset differences: erasing the blackboard, moving slides, etc.
      • (Other noises)
    • How to classify:
      • Comparing frame differences can determine whether they are persistent.
      • People and moving objects can use Lucas-Kanade, etc.
    • Want to classify differences and evaluate the complexity of persistent differences.
      • Temporary differences are mostly human bodies with little semantic content, so their complexity may not be meaningful.
      • Image of complexity (comparison when occupying the same area):
        • Simple shapes < maps
        • Monochrome maps < maps with multiple colors
        • Shapes < text
      • Can use evaluation axes specific to slide/blackboard images?
    • Want to consider lectures on blackboards (fragmented temporary differences), slide lectures with many animations, and slide lectures with few animations, all within the same framework (linear structure).
    • How to actually implement?
    • Searching for “complexity” of images doesn’t yield relevant results.
      • If there are better terms, please let me know. 🙏
      • “Wide” and “send”
    • Histogram variance, etc.?
      • Outliers
    • Additionally, could try line drawing and counting the number of lines.
      • Counting the number of outline lines that intersect with straight lines, etc.
      • Or sweep the line through 360 degrees, etc.
    • Previous research:
    • Novelty:
  • Obtain prior knowledge by consulting with teachers.
    • Various things: what can be skipped (contrasting with speaking speed), etc.
    • Basis for changing speed.
  • As research:
    • Tip: If you increase the scale, you can avoid complete failure.
  • Aim for a form like this:

20201202 Meeting

  • There are several studies that select important frames.

    • For example, selecting from images on whiteboards, etc.
  • Is there no evaluation of the importance of each frame?

    • We want to change the speed, so we need this.
  • Also, there are few approaches that focus on “removing unnecessary parts” rather than “selecting necessary parts”.

  • Is the concept of an elastic timeline unusual? (It may seem obvious after thinking about it for a few months.)

    • I could only find research on elastic timelines for cows and content-awareness.
  • Hypothesis: Does the size of the written content determine its importance?

  • Discussion:

    • I want to manipulate the speed by combining it with other elements.
    • How should we combine them?
      • There might be a better method than simple addition or multiplication.
  • Also, for the summary when viewed later:

    • What should we do for someone who can’t catch up even with our best summary, whether because they joined late or didn’t participate?

20201118 Meeting:

  • Potential target videos:

    • Live-action classroom videos
    • Slide-based lectures
    • Instructional videos using various materials
    • (Should we narrow down the scope?)
      • Yes, we should narrow it down.
      • Real-life and lectures have different characteristics.
        • With slide-based lectures, text and charts are important.
        • Audio seems to be crucial.
          • Video: 3-dimensional
          • Audio: 1-dimensional (lightweight)
          • If we can vectorize them, we can use the same framework.
        • Real-time situations make deep learning challenging.
          • Especially with video processing.
  • Goal: Bend the lines of this graph.

    • Change the speed.
      • Find parts that can be changed in speed without affecting understanding of the content.
      • Evaluation criteria could be how humans perceive it.
    • Skip unnecessary parts.
      • Apply video summarization techniques (weaken them?).
    • Utilize information from previous studies.
    • There might be other methods that we haven’t thought of yet.
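The "bend the lines of this graph" goal can be sketched as a mapping from per-frame importance to playback speed: a minimal version of the elastic timeline that speeds time up instead of cutting frames outright. The min/max speeds are assumed parameters:

```python
import numpy as np

def speed_curve(importance, min_speed=1.0, max_speed=4.0):
    """Map per-frame importance in [0, 1] to a playback-speed curve.

    Low-importance stretches play fast (up to max_speed); fully
    important frames play at min_speed (normal speed).
    """
    imp = np.clip(np.asarray(importance, dtype=float), 0.0, 1.0)
    return max_speed - (max_speed - min_speed) * imp

def elastic_duration(importance, fps=30.0):
    """Watch time in seconds after applying the speed curve."""
    return float((1.0 / speed_curve(importance)).sum() / fps)
```

For example, 300 uniformly unimportant frames at 30 fps shrink from 10 s of viewing to 2.5 s at 4x, while fully important frames keep their original duration.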