Analysis of the YouTube social network

Aaryan Mittal
May 2, 2023

I recently completed a graph-based project using Neo4j and applied a machine learning model to the data. The project came out of one of my courses, “Social Network Analysis”. Neo4j is a database management system that stores and presents data as a graph rather than in a tabular or JSON format. All data is represented as nodes, and you can create relationships between those nodes. The whole database therefore looks like a graph, which is what sets Neo4j apart from other database management systems. Neo4j is an excellent choice when the data to be analyzed contains many relationships.

The aim of this project is to use machine learning techniques to analyze a graph of connected nodes on YouTube and gain insights into the structure and dynamics of the YouTube community. The graph represents the relationships between different channels and users on YouTube, with nodes representing videos or users and edges representing the connections between them. Because the graph is large (over a million nodes and nearly three million edges), a machine learning model such as SVC is a practical way to analyze it, and the results bear this out.

The project uses a Support Vector Classification (SVC) model to analyze the input graph and identify relationships between the nodes, such as which videos are most connected to others or which users are most influential in the YouTube community. The collected dataset of YouTube videos and users is large, and it is used to train and test the SVC model. Feature engineering techniques such as imputation and outlier handling, along with parameter tuning, are explored to optimize the model’s performance.
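To make the feature engineering step more concrete, here is a minimal sketch of median imputation and outlier clipping using only the Python standard library. The feature values, the function names, and the choice of the median and the 1.5 × IQR rule are all assumptions for illustration; the project does not state which imputation or outlier method it used, and parameter tuning is not shown here.

```python
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def clip_outliers_iqr(values, k=1.5):
    """Clip every value into the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


# Hypothetical per-node feature (for example, node degree) with one
# missing value and one extreme value. This is not the YouTube data.
raw_degrees = [3, 5, None, 4, 6, 5, 4, 250]

imputed = impute_median(raw_degrees)
cleaned = clip_outliers_iqr(imputed)

print(imputed)
print(cleaned)
```

Clipping keeps every row in the dataset, which matters when each row corresponds to a node or a node pair that should not be dropped.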

The dataset consists of 1,134,890 nodes and 2,987,624 edges. The average clustering coefficient is 0.0808, and the number of communities is 8,385, with an average community size of 13.50 and an average membership size of 0.10.
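For readers unfamiliar with these statistics, the sketch below computes an average clustering coefficient on a tiny hand-made graph. It then checks that the reported figures are mutually consistent if "average membership size" is read as community memberships per node, which is an assumption about the metric's definition. The toy graph and the function names are illustrative only.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


# Toy undirected graph: a triangle a-b-c plus a pendant node d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
toy_average = average_clustering(toy)
print(toy_average)

# Consistency check on the reported dataset statistics.
num_nodes = 1_134_890
num_communities = 8_385
avg_community_size = 13.50

total_memberships = num_communities * avg_community_size
memberships_per_node = total_memberships / num_nodes
print(round(memberships_per_node, 2))
```

Under that reading, a value near 0.10 means that most nodes belong to no labeled community at all.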

The project performs four tasks: (1) extracting the 10 most important nodes in the graph and a subgraph of 1,000 nodes sampled with some node probability p; (2) using Hadoop to implement a recent clustering algorithm, find labels, and render the grouping of communities in the graph; (3) performing online analytical processing with an appropriate distributed clustering package for Neo4j; and (4) applying a Support Vector Machine in a distributed environment for link prediction on the stored graph.
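The first task can be sketched in plain Python as follows. Two assumptions are made: "importance" is measured by degree (the project does not say which centrality measure it used), and the subgraph is built by keeping each node independently with probability p and then capping the result at a maximum size. The function names and the toy edge list are hypothetical.

```python
import random


def top_k_by_degree(edges, k=10):
    """Return the k nodes with the highest degree (ties broken by node id)."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sorted(degree, key=lambda n: (-degree[n], n))[:k]


def sample_subgraph(edges, p, max_nodes=1000, seed=0):
    """Keep each node with probability p, cap at max_nodes, return the induced subgraph."""
    rng = random.Random(seed)
    nodes = sorted({n for edge in edges for n in edge})
    kept = [n for n in nodes if rng.random() < p]
    if len(kept) > max_nodes:
        kept = rng.sample(kept, max_nodes)
    kept_set = set(kept)
    # Induced edges: both endpoints must survive the sampling.
    sub_edges = [(u, v) for u, v in edges if u in kept_set and v in kept_set]
    return kept_set, sub_edges


# Toy undirected edge list, not the YouTube graph.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]

print(top_k_by_degree(edges, k=2))
sub_nodes, sub_edges = sample_subgraph(edges, p=0.5, seed=42)
print(sub_nodes, sub_edges)
```

On the full graph, p would be chosen so that the expected number of kept nodes is close to the target of 1,000, that is, p ≈ 1000 / 1,134,890.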

To train the SVC model, the data is split into training and testing sets. The model is fitted on the training set and then evaluated on the testing set. The accuracy of the model is 93.4%.
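The split-train-test workflow described above might look like the following with scikit-learn. This is a sketch under several assumptions: the graph is a small synthetic one with two communities rather than the YouTube data, the pair features (common neighbors, Jaccard similarity, and product of degrees) are common link-prediction features rather than the ones the project necessarily used, and the kernel and `C` value are arbitrary. It runs on a single machine, not in a distributed environment, and it will not reproduce the 93.4% figure.

```python
import random
from itertools import combinations

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = random.Random(0)

# Synthetic graph: 40 nodes in two groups, dense inside a group, sparse across.
nodes = range(40)
adj = {n: set() for n in nodes}
for u, v in combinations(nodes, 2):
    same_group = (u < 20) == (v < 20)
    if rng.random() < (0.3 if same_group else 0.02):
        adj[u].add(v)
        adj[v].add(u)


def pair_features(u, v):
    """Features for a node pair, ignoring any direct edge between u and v."""
    nu, nv = adj[u] - {v}, adj[v] - {u}
    common = len(nu & nv)
    union = len(nu | nv)
    jaccard = common / union if union else 0.0
    return [common, jaccard, len(nu) * len(nv)]


# Positive examples are existing edges; negatives are sampled non-edges.
positives = [(u, v) for u, v in combinations(nodes, 2) if v in adj[u]]
non_edges = [(u, v) for u, v in combinations(nodes, 2) if v not in adj[u]]
negatives = rng.sample(non_edges, len(positives))

X = [pair_features(u, v) for u, v in positives + negatives]
y = [1] * len(positives) + [0] * len(negatives)

# Hold out 25% of the pairs for testing, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the features, then fit an RBF-kernel SVC on the training set only.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")
```

Sampling as many negatives as positives keeps the classes balanced, so accuracy is a meaningful metric here. In a real graph, non-edges vastly outnumber edges, so the sampling ratio is a design decision in its own right.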

Overall, this project demonstrates the usefulness of machine learning in analyzing complex social networks and provides insights into the structure and dynamics of the YouTube community. The results can benefit content creators, marketers, and YouTube analytics professionals: they shed light on the network structure of the platform and can help identify opportunities for collaborations and marketing strategies.
