Victoria Firsanova, Department of Mathematical Linguistics, Saint Petersburg State University, Saint Petersburg, Russia
In the field of inclusion, automated question answering (QA) could become an effective tool, allowing people, for example, to ask questions about the interaction between neurotypical and atypical people anonymously and get reliable information immediately. However, controlling such systems is challenging. Before QA is integrated into inclusion practice, research is required to prevent the generation of misleading and false answers and to verify that a system is safe and does not misrepresent or alter information. Although the problem of data misrepresentation is not new, the approach presented in the paper is novel because it highlights a particular NLP application in the field of social policy and healthcare. The study focuses on extractive and generative QA models based on the BERT and GPT-2 pre-trained Transformers, fine-tuned on a Russian dataset for the inclusion of people with autism spectrum disorder. The source code is available on GitHub: https://github.com/vifirsanova/ASD-QA.
Natural Language Processing, Question Answering, Information Extraction, BERT, GPT-2.
Zhenshan Bao, Yuezhang Wang and Wenbo Zhang, College of Computer Science, Beijing University of Technology, Beijing, China
Named entity recognition (NER), one of the most fundamental tasks in natural language processing (NLP), has received extensive attention. Most existing approaches to NER rely on a large amount of high-quality annotations or on fairly complete domain-specific entity lists. However, in practice, manually annotated data is very expensive to obtain, and the available entity lists are often not comprehensive. Using an entity list to annotate data automatically is a common method, but under low-resource conditions the automatically annotated data is usually imperfect: it contains incompletely annotated or unannotated samples. In this paper, we propose a NER system for complex data processing, which can use an entity list containing only a few entities to obtain incompletely annotated data and train the NER model without human annotation. Our system extracts semantic features from a small number of samples by introducing a pretrained language model. Based on the model trained on incomplete annotations, we relabel the data using a cross-iteration approach. We use a data filtering method to filter the training data used in the iteration process, and re-annotate the incomplete data through multiple iterations to obtain high-quality data. Each iteration groups and processes the data according to the different types of annotations, which improves the model performance faster and reduces the number of iterations. The experimental results demonstrate that our proposed system can effectively perform low-resource NER tasks without human annotation.
Named entity recognition, Low resource natural language processing, Complex annotated data, Cross-iteration.
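The automatic annotation step mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' system: the function name annotate_with_entity_list, the single tag type ENT, and whitespace tokenization are all assumptions made for illustration. The sketch labels tokens by longest-match lookup in a small entity list and shows how a short list leaves some entity mentions unlabelled, which is the kind of incomplete annotation the paper sets out to repair.

```python
def annotate_with_entity_list(tokens, entity_list):
    """Label tokens with BIO tags by longest-match lookup in an entity list.

    Hypothetical helper for illustration only; a single generic tag type
    ("ENT") and whitespace-tokenized entities are assumed.
    """
    # Tokenize entities, drop empty entries, and try longer entities first.
    entities = sorted(
        (e.split() for e in entity_list if e.split()), key=len, reverse=True
    )
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for ent in entities:
            n = len(ent)
            if tokens[i:i + n] == ent:
                tags[i] = "B-ENT"
                for k in range(i + 1, i + n):
                    tags[k] = "I-ENT"
                i += n  # skip past the matched entity
                break
        else:
            i += 1  # no entity starts here; leave the token as "O"
    return tags


tokens = "Beijing University of Technology is in Beijing".split()
# The list holds one entity only, so the final "Beijing" stays unlabelled.
tags = annotate_with_entity_list(tokens, ["Beijing University of Technology"])
print(list(zip(tokens, tags)))
```

In an iterative scheme such as the one the abstract describes, tokens left as "O" would be treated as unknown rather than as confirmed negatives; how that is done is specific to the paper and is not shown here.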
Renáta Nagy, Doctoral School of Health Sciences, Department of Languages for Biomedical Purposes and Communication Medical School, University of Pécs, Hungary
The presentation is about the online assessment of English for Specific Purposes. The focus is on the online format as a possible form of language testing. The topic is timely, and its main aim is to examine the validity of online testing. A positive outcome of the study would indicate a promising future in a number of respects, not only for language assessors but for future candidates as well: namely, a basic online setup that could be used worldwide for online tests. To achieve this, the research covers not only the theoretical but also the first-hand empirical side of testing, from the point of view of both examiners and examinees. Materials and methods include surveys, needs analysis and trial versions of online tests. In this context, the presentation focuses on the possible questions, techniques and approaches to online assessment, which can also be used in language lessons as a classroom technique.
Assessment, Online, ESP, Online assessment, validity, testing.
Yangjie Dan, Fan Xu*, Mingwen Wang, School of Computer Information Engineering, Jiangxi Normal University, Nanchang 330022, China
Dialect discrimination has important practical significance for protecting and passing on dialects. Traditional dialect discrimination methods pay much attention to the underlying acoustic features and ignore the meaning of the pronunciation itself, resulting in low performance. This paper systematically explores the validity of pronunciation features of dialect speech, composed of phoneme sequence information, for dialect discrimination, and designs an end-to-end dialect discrimination model based on the multi-head self-attention mechanism. Specifically, we first adopt a residual convolutional neural network and the multi-head self-attention mechanism to effectively extract the phoneme sequence features unique to different dialects and compose novel phonetic features. Then, we perform dialect discrimination based on the extracted phonetic features using the self-attention mechanism and bidirectional long short-term memory networks. The experimental results on the large-scale benchmark 10-way Chinese dialect corpus released by iFLYTEK show that our model outperforms the state-of-the-art alternatives by a large margin.
Dialect discrimination, Multi-head attention mechanism, Phonetic sequence, Connectionist temporal classification.
Siddhant Hosalikar1, Saikumar Iyer1, Ankit Limbasiya1 and Prof. Suvarna Chaure2, 1SIES Graduate School of Technology, Mumbai University, India, 2Department of Computer Engineering, Mumbai University, India
Phishing is a type of fraud in which two actors, an attacker and a victim, take part. The attacker creates a phishing webpage that mimics a legitimate one and embeds the website in a URL or other media. Detecting malicious URLs (Uniform Resource Locators) is a difficult yet interesting topic, because attackers mostly generate the URLs randomly and researchers have to detect them while considering the behaviours behind the generated malicious URLs. Among the various detection schemes in the anti-phishing area, the URL-based scheme is safer and more realistic for one important reason: it does not require access to the malicious webpage. In this paper, our aim is to provide a comprehensive investigation of the detection of malicious URLs using machine learning algorithms. Our proposed detection system consists of URL feature extraction, machine learning algorithms and big data technology.
URL, Malicious URL detection, Feature extraction, Machine learning.
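As an illustration of the URL feature extraction stage named in the abstract above, the following minimal sketch derives a few lexical features from the URL string alone, without fetching the page. The feature set and the function name url_features are assumptions chosen for illustration; the abstract does not list the features the authors actually use.

```python
from urllib.parse import urlparse


def url_features(url):
    """Extract simple lexical features from a URL string (illustrative set)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_digits": sum(c.isdigit() for c in url),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        # Crude check: a host made only of digits and dots looks like an IPv4 address.
        "host_is_ip": host != "" and host.replace(".", "").isdigit(),
    }


features = url_features("http://192.168.0.1/login.php?user=1")
print(features)
```

A feature dictionary of this kind could be turned into a numeric vector and passed to any standard classifier; which algorithms the paper compares is not stated in the abstract.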
TEMITOPE O AWODIJI, Computer Information Science Personnel, California Miramar University, California, USA
Owing to the fast advancement of Information and Communication Technologies (ICT) and the integration of advanced analytics into manufacturing, products, and services, several industries face new opportunities and, at the same time, challenges in maintaining their capabilities and meeting market demands. Such integration, termed Cyber-Physical Systems (CPS), is transforming industry to the next level. CPS facilitates the systematic transformation of large volumes of data into information, which makes invisible patterns of degradation and inefficiency visible and leads to better decision-making. This project focuses on existing trends in the development of industrial big data analytics and CPS. It then briefly discusses a system architecture for applying CPS in manufacturing, referred to as 5C. The 5C architecture comprises the steps necessary to fully integrate cyber-physical systems into the manufacturing industry.
Information and Communication Technologies (ICT), Big Data, Analytic, Data, Data Architecture.
Vignav Ramesh1,2 and Anton Kolonin2,3,4, 1Saratoga High School, Saratoga, California, USA, 2Singularity NET Foundation, Amsterdam, Netherlands, 3Aigents, Novosibirsk, Russian Federation, 4Novosibirsk State University, Russian Federation
Many current artificial general intelligence (AGI) and natural language processing (NLP) architectures do not possess general conversational intelligence—that is, they either do not deal with language or are unable to convey knowledge in a form similar to the human language without manual, labor-intensive methods such as template-based customization. In this paper, we propose a new technique to automatically generate grammatically valid sentences using the Link Grammar database. This natural language generation method far outperforms current state-of-the-art baselines and may serve as the final component in a proto-AGI question answering pipeline that understandably handles natural language material.
Interpretable Artificial Intelligence, Formal Grammar, Natural Language Generation, Natural Language Processing.
Olivia-Jade Tribert, Concordia University, Canada
Literature on artificial intelligence, algorithms, how they operate and their epistemic effects on humans has increased significantly in the last decades. While these topics were once discussed solely among researchers, AI specialists and scholars, they are slowly making their way into popular discourse. Traditional media outlets, social media threads and filmmakers are now taking part in the discussions, generating their own opinions and public debates. Although the epistemic effects of AI and algorithms on humans are still widely debated, the conclusion remains the same everywhere: digital literacy is indispensable in the 21st century. In a world where everything is run by algorithms, it is crucial for individuals to understand how they work and learn how to think critically about them, as information found online is not always correct and always needs to be verified. Machine translation (MT) systems are no exception to this rule, and yet very little is said about them. In fact, aside from language professionals and a small group of other people, very few people know how to use MT systems efficiently. If the translated sentence reads well, it is simply copied and pasted elsewhere, without a second thought or verification. This paper will touch on two consequences of poor machine translation literacy, namely the perpetuation of gender bias and artificial language impoverishment. I will begin by defining digital literacy and machine translation literacy. Then, I will present two recent studies that demonstrate gender bias in MT systems such as Google Translate, and the sociolinguistic effects of gender bias in their algorithms. Finally, I will offer some solutions to raise awareness of machine translation literacy that involve both language professionals and MT systems engineers.
Lixin Xia1 and Hui Hu2, 1Laboratory of Language Engineering and Computing of Guangdong University of Foreign Studies, China, 2Center for Lexicographical Studies of Guangdong University of Foreign Studies, China
This study aims to investigate the factors that determine the presence or absence of the preposition "in" in the construction "difficulties/difficulty (in) doing" in China English and American English respectively, and to explore the difference between the two varieties in prepositional use. The search strings "difficulty (in) *ing" and "[have] difficulties (in) *ing" were searched in the Corpus of Contemporary American English (COCA) and then in the Corpus of China English (CCE). An analysis of the statistics for the singular form from the two corpora leads to the conclusion that the construction "difficulty in *ing" is preferred in formal registers over informal ones in both English varieties. The difference in prepositional use between the two English varieties lies in the process by which the prepositional gerund is being replaced by a directly linked gerund. The results for the plural form indicate that the complexity principle is operative in determining prepositional use in the construction for the two English varieties. Importantly, the findings of this study have significant implications for the assessment of previous claims about the construction, and the study offers new insight into analyses of the construction.
China English, American English, Grammatical Construction, Corpus Linguistics, Natural Language Processing.
Farhan Uz Zaman, Tanvinur Rahman Siam and Zulker Nayen, Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
Deep learning has been very successful in research areas that involve prediction. This paper discusses one such prediction task, which can help to implement safe vaccination. Vaccination is very important for fighting viral diseases such as COVID-19. However, people sometimes experience unwanted side effects of vaccination, which might cause serious illness. Therefore, modern techniques should be utilised for the safe administration of vaccines. In this research, a Gated Recurrent Unit (GRU), a form of Recurrent Neural Network, is used to predict whether a particular vaccine will have any side effect on a particular patient. The resulting predictions might be used when deciding whether a vaccine should be administered to a particular person.
Deep Learning, Gated Recurrent Unit, Recurrent Neural Network.
Evgeniia Shchepina, Evgeniia Egorova, Pavel Fedotov and Anatoliy Surikov, ITMO University, St. Petersburg, Russia
This paper aims to build a model of users' interests in the multimodal space and obtain a comprehensive conclusion about users' interests. To do this, we build graphs based on the data of separate modalities, find communities in these graphs, and then consider these communities in a single space to highlight communities of users' interests. The constructed model showed better results in the analysis of user similarity than the baseline model. The scientific novelty of our approach lies in the proposed method of multimodal clustering of heterogeneous data on the interests and preferences of social network users. What distinguishes this method from traditional approaches, such as biclustering, is the possibility of flexibly scaling the number of initial modalities. As a result, the average performance of the model increased by 12% in accuracy and by 11% in F1-score compared to the baseline model.
Community Detection, Multimodal Space of Interests, Social Network Analysis.
Enshuai Hou, Jie Zhu, Tibet University Tibet Lhasa, China
Tibetan is a low-resource language. To alleviate the shortage of parallel corpora between Tibetan and Chinese, this paper uses two monolingual corpora and a small seed dictionary to study a semi-supervised method based on the seed dictionary and a self-supervised adversarial training method, both relying on similarity calculation between word clusters in different embedding spaces, and puts forward an improved self-supervised adversarial learning method that aligns Tibetan and Chinese using monolingual data only. The experimental results are as follows. First, the results for aligning Tibetan syllables with Chinese characters are poor, which reflects the weak semantic correlation between Tibetan syllables and Chinese characters; second, the semi-supervised method with a seed dictionary achieves a top-10 predicted word accuracy of 66.5 (Tibetan-Chinese) and 74.8 (Chinese-Tibetan), while the improved self-supervised method reaches an accuracy of 53.5 in both language directions.
Tibetan, Word alignment, Without supervision, adversarial training.
Arthur Yosef1, Eli Shnaider2 and Moti Schneider2, 1Tel Aviv-Yaffo Academic College, Israel, 2Netanya Academic College, Israel
This study presents a method to assign relative weights when constructing Fuzzy Cognitive Maps (FCMs). We introduce a method of computing relative weights of directed edges based on actual past behavior (historical data) of the relevant concepts. There is also a discussion addressing the role of experts in the process of constructing FCMs. The method presented here is intuitive, and does not require any restrictive assumptions. The weights are estimated during the design stage of FCM and before the recursive simulations are performed.
FCM, relative importance (weight), Fuzzy Logic, Soft Computing, Neural Networks.
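To show where the edge weights discussed in the abstract above enter the recursive simulations, here is a minimal sketch of one common FCM update rule: each concept's next activation is a sigmoid of the weighted sum of the activations that feed into it. This is a textbook variant chosen for illustration, not necessarily the formulation used by the authors, and the weights, initial state, and function names are invented for the example.

```python
import math


def fcm_step(state, weights):
    """One synchronous FCM update; weights[j][i] is the influence of concept j on concept i."""
    n = len(state)
    return [
        1.0 / (1.0 + math.exp(-sum(weights[j][i] * state[j] for j in range(n))))
        for i in range(n)
    ]


def simulate(state, weights, steps=20):
    """Apply the update rule repeatedly and return the final activations."""
    for _ in range(steps):
        state = fcm_step(state, weights)
    return state


# Illustrative three-concept map; in the paper's setting these weights would be
# estimated from historical data at the design stage, before simulation.
weights = [
    [0.0, 0.6, 0.0],
    [0.0, 0.0, -0.4],
    [0.5, 0.0, 0.0],
]
final_state = simulate([0.8, 0.1, 0.3], weights)
print(final_state)
```

Because the weights are fixed before the loop runs, any error in them propagates through every simulation step, which is why a principled way of estimating them matters.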
Abdul Musavvir Parappathiyil, Department of Mathematics, Pondicherry University, India
In this article, a linear pentagonal fuzzy number (PFN) is defined. The symmetrical and non-symmetrical PFNs pertaining to the linear PFN are also defined. Some basic arithmetic operations on linear PFNs, such as addition and multiplication, are presented. Moreover, the concept of classical two-dimensional (2-D) pentagonal fuzzy number matrices (PFMs) is also presented. In addition, the notion of multidimensional pentagonal fuzzy number matrices (MDPFMs) is discussed along with some of its rules and operations, such as multiplication. Finally, in the light of the rules relating to both 2-D PFMs and MDPFMs, we make use of the concept of MDPFMs to solve the fully fuzzy linear system equation (FFLSE) with pentagonal fuzzy numbers as inputs. Two methods, the singular value decomposition (SVD) method and the row reduced echelon (RRE) method, are also discussed for solving the FFLSE, with a numerical example.
MDPFMs, FFLSE for MDPFMs with RRE method, FFLSE for MDPFMs with SVD method.
Valerie Cross and Mike Zmuda, Computer Science and Software Engineering, Miami University, Oxford, OH USA
Current machine learning research is addressing the problem that occurs when the data set includes numerous features but the number of training data is small. Microarray data, for example, typically has a very large number of features, the genes, as compared to the number of training data examples, the patients. An important research problem is to develop techniques to effectively reduce the number of features by selecting the best set of features for use in a machine learning process, referred to as the feature selection problem. Another means of addressing high dimensional data is the use of an ensemble of base classifiers. Ensembles have been shown to improve on the predictive performance of a single model by training multiple models and combining their predictions. This paper examines combining an enhancement of the random subspace model of feature selection using fuzzy set similarity measures with different measures of evaluating feature subsets in the construction of an ensemble classifier. Experimental results show potentially useful combinations.
Feature selection, fuzzy set similarity measures, concordance correlation coefficient, feature subset evaluators, microarray data, ensemble learning.
Zhijun Chen, Department of Financial Engineering, SUSTech University, Shen Zhen, China
Sentiments are extracted from tweets with cryptocurrency hashtags to predict prices, and the sentiment prediction model generates the parameters for an optimization procedure that makes decisions and re-allocates the portfolio in a further step. After prediction, the evaluation, conducted with RMSE, MAE and R2, selects the KNN and CART models for predicting Bitcoin and Ethereum, respectively. During portfolio optimization, this project uses predictive prescription to guard against uncertainty while taking full advantage of auxiliary data such as sentiments. As for the outcome of the optimization, the portfolio allocation and returns fluctuate sharply, as illustrated in the figure.
Cryptocurrency Trading Portfolio, Sentiment Analysis, Machine Learning, Predictive Prescription, Robust Optimization Portfolio.
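The three evaluation measures named in the abstract above (RMSE, MAE and R2) have standard definitions, sketched below in plain Python. The function name regression_metrics and the toy numbers are assumptions for illustration and are not taken from the paper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and the coefficient of determination (R2)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                    # undefined if y_true is constant
    return {"rmse": rmse, "mae": mae, "r2": r2}


# Toy values: two exact predictions and one that is off by 1.
metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(metrics)
```

Lower RMSE and MAE and higher R2 indicate a better fit, so a model comparison like the one in the abstract would rank candidate predictors on these values for each asset.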
Lakmal Rupasinghe, Kanishka Yapa, Sanjeevan. S, Imaz. M.M.M, Swarnamyuran. T, Vijeethan. S, Department of Computer Systems Engineering, Sri Lanka Institute of Information Technology, Colombo, Sri Lanka
Inter-hospital and intra-hospital patient details and the medical records of patients with long-term illnesses are to be handled through Blockchain technology, which is far more secure and less vulnerable than standard encrypted cloud storage or a local database. This transfer maintains the continuity of medical care without the hassle of tedious paperwork, physical storage, and retrieval of data. Data as important as patients' medical data must be stored securely so that no intruder can view or modify it. The data must be accessible only to certain authorized personnel. Data analytics functions are integrated into a DApp named Medi-X for EHR management, taking into account data protection policies such as GDPR and HIPAA as well as scalability and interoperability limitations. Comprehensive information will be processed with machine learning in order to give a doctor or medical professional better insight into patients' forthcoming illnesses.
Interoperability, Scalability, EHR, GDPR, HIPAA, MediX, Blockchain, Machine learning.
Dipti Mahamuni, Ira A. Fulton Schools of Engineering – School Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, USA
The past five years have seen a significant increase in the popularity of Decentralized Ledgers, commonly referred to as Blockchains. Many new protocols have launched to cater to a variety of applications serving individual consumers as well as enterprises. While research is conducted on individual consensus mechanisms and comparison against popular protocols, decision making and selection between the protocols is still amorphous. This paper proposes a comprehensive comparative framework to evaluate various consensus algorithms. We hope that such a framework will help evaluate current as well as future consensus algorithms objectively for a given use case.
Consensus Algorithms, Blockchain, Comparative Framework, Decentralized Ledgers.
Su Liu and Jian Wang, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
Ethereum is a public blockchain platform with smart contracts. However, it has transaction privacy issues due to the openness of the underlying ledger. Decentralized mixing schemes have been presented to hide transaction relationships and transferred amounts, but they suffer from high transaction cost and long transaction latency. To overcome these two challenges, we propose the idea of batch accounting, adopting batch processing at the time of accounting. To realize it, we introduce payment channel technology into the decentralized mixer. Since intermediate transactions between two parties do not need network consensus, our scheme can reduce both transaction cost and transaction latency. Moreover, we provide informal definitions and proofs of our scheme's security. Finally, our scheme is implemented based on zk-SNARKs and Ganache. Experimental results show that our method is practicable, and theoretical and experimental analysis shows that our scheme performs better as the number of transactions in a batch increases.
Ethereum, transaction privacy, decentralized coin mixer, payment channel, zero-knowledge proof.
K. Abidi and K. Smaili, Loria University of Lorraine, France
In this article, we tackle the issue of sentiment analysis of three Maghrebi dialects used in social networks. More precisely, we are interested in analysing sentiments in Algerian, Moroccan and Tunisian corpora. To do this, we automatically built three lexicons of sentiments, one for each dialect. Each entry of these lexicons is composed of a word, written in Arabic script (Modern Standard Arabic or dialect) or Latin script (Arabizi, French or English), with its polarity. In these lexicons, the semantic orientation of a word represented by an embedding vector is determined automatically by calculating its distance to several embedding seed words. The embedding vectors are trained on three large corpora collected from YouTube. In the experimental section, the proposed approach is evaluated using a few existing annotated corpora for the Tunisian and Moroccan dialects. For the Algerian dialect, in addition to a small corpus we found in the literature, we collected and annotated a corpus of 10k comments extracted from YouTube. This corpus represents a valuable resource which will be made freely available to the community.
Maghrebi dialect, Word embedding, Orientation semantic.
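The idea of deriving a word's polarity from its distance to seed words, as described in the abstract above, can be sketched as follows. The abstract does not say which distance or decision rule is used, so cosine similarity, the averaging over seeds, the two-dimensional toy vectors and the function names are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def polarity(word_vec, pos_seeds, neg_seeds):
    """Assign the polarity whose seed embeddings are closer on average."""
    pos = sum(cosine(word_vec, s) for s in pos_seeds) / len(pos_seeds)
    neg = sum(cosine(word_vec, s) for s in neg_seeds) / len(neg_seeds)
    return "positive" if pos > neg else "negative"


# Toy 2-D embeddings standing in for vectors trained on a real corpus.
pos_seeds = [[1.0, 0.0], [0.9, 0.1]]
neg_seeds = [[0.0, 1.0], [0.1, 0.9]]
label_a = polarity([0.8, 0.2], pos_seeds, neg_seeds)
label_b = polarity([0.1, 0.7], pos_seeds, neg_seeds)
print(label_a, label_b)
```

With real embeddings the same comparison would be run for every vocabulary word to populate a lexicon automatically, one lexicon per dialect.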
Jessica Salinas, Carlos Flores, Hector Ceballos, and Francisco Cantu, School of Engineering and Sciences, Tecnologico de Monterrey, Monterrey, Mexico
The amount of information that social networks can provide on a given topic is vastly greater than what conventional methods offer. As new COVID-19 vaccines are approved by COFEPRIS in Mexico, society is reacting in different ways, showing approval or rejection of some of these vaccines on social networks. Data analytics has opened up the possibility of processing, exploring, and analysing the large amount of information that comes from social networks and of evaluating people's sentiments towards a specific topic. In this analysis, we present a Sentiment Analysis of tweets related to COVID-19 vaccines in Mexico. The study involves the exploration of Twitter data to evaluate whether there are preferences between the different vaccines available in Mexico and what patterns and behaviours can be observed in the community based on their reactions and opinions. This research will help to provide a first understanding of people's opinions about the available vaccines and how these opinions are formed, in order to identify and avoid possible sources of misinformation.
Twitter, Data Mining, Sentiment Analysis, Machine Learning, COVID-19.
Guangjie Li, Yi Tang, Biyi Yi, Xiang Zhang and Yan He, National Innovation Institute of Defense Technology, Beijing, China
Code completion is one of the most useful features provided by advanced IDEs and is widely used by software developers. However, recommending arguments for method calls, which is one kind of code completion, is less used. Most existing argument recommendation approaches provide a long list of syntactically correct candidate arguments, from which it is difficult for software engineers to select the correct ones. To this end, we propose a deep learning based approach to recommending arguments instantly when programmers type in the names of the methods they intend to invoke. First, we extract context information from a large corpus of open-source applications. Second, we preprocess the extracted dataset, which involves natural language processing and data embedding. Third, we feed the preprocessed dataset to a specially designed convolutional neural network to rank and recommend actual arguments. With the resulting CNN model trained on sample applications, we can sort the candidate arguments in a reasonable order and recommend the first one as the correct argument. We evaluate the proposed approach on 100 open-source Java applications. Results suggest that the proposed approach outperforms the state-of-the-art approaches in recommending arguments.
Argument recommendation, Code Completion, CNN, Deep Learning.
Fernando Lima, Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Brazil
Thanks to the increase in computational resources and data availability, deep learning-based object detection methods have achieved numerous successes in computer vision, and more recently in remote sensing. The "You Only Look Once" (or just YOLO) framework is a family of deep learning-based object detection methods that detect targets in a single stage. Single-stage detectors provide faster but less accurate detections than multi-stage detectors. However, the newest versions of YOLO are reported to achieve very accurate results on a range of datasets. This paper details the process of using the YOLO framework to detect ships in synthetic aperture radar (SAR) images using the recently introduced GAOFEN 3 Challenge Dataset. The performance of, and improvements in, the newest versions of YOLO are compared, and the results show YOLO v5 to be a framework with the potential to be a big leap towards improving ship detection performance in SAR images under limited data.
yolo, object detection, sar, gaofen.
Ouahiba Djama, Lire Laboratory, University of Abdelhamid Mehri Constantine 2, Constantine, Algeria
Search engines provide users with data and information according to their interests and specialty. Thus, it is necessary to exploit descriptions of the resources that take viewpoints into consideration. Generally, resource descriptions are available in RDF (e.g., DBpedia for Wikipedia content). However, these descriptions do not take viewpoints into consideration. In this paper, we propose a new approach that converts a classic RDF resource description into a resource description that takes viewpoints into consideration. To detect viewpoints in the document, a machine learning technique will be applied to an instantiated ontology. The latter represents the viewpoint in a given domain.
Resource Description, RDF, Viewpoint, Ontology & Machine Learning.
Steven Zhang1 and Yu Sun2, 1Crean Lutheran High School, Irvine, CA 92618, 2California State Polytechnic University, Pomona, CA, 91768
People love to fly drones, but unfortunately many end up crashing or losing them. As the technology of flying drones improves, more people are getting involved. With the number of users increasing, people find that flying drones with sensors is safer because it can automatically avoid problems, but such drones are expensive. This paper describes an inexpensive UAV (unmanned aerial vehicle) system that eliminates the need for sensors and uses only the camera to avoid collisions. This program helps avoid drone crashes and losses. We used the Tello Education drone as our testing drone, which is only outfitted with a camera. Using the camera feed and transmitting that data to the program, the program will then give commands to the drone to avoid collisions.
Machine Learning, Electrical Engineering, Computer Vision, Drone.
Jonathan Liu1 and Yu Sun2, 1Arcadia High School, Arcadia, CA, 91007, 2California State Polytechnic University, Pomona, CA, 91768
Oftentimes, people find themselves staring in the mirror mindlessly while brushing their teeth or putting on clothes. This time, which may seem unnoticeable at first, can accumulate to a significant amount of time wasted when looked at over the duration of a year, and can easily be repurposed to better suit one’s goals. In this paper, we describe the construction and implementation of the Smart Mirror, an intelligent mirror that boasts several features in order to improve an individual’s daily productivity. As the name suggests, it is a mirror, and so will not take anything away from the user when he or she is performing their daily teeth brushing. It also hosts facial recognition, and can recognize one’s emotions from one glimpse through the camera. The Mirror also comes with an app that is available on the Google Play Store, which helps input tasks and daily reminders that can be viewed on the Smart Mirror UI.
Thunkable, Google Firebase, Android, Raspberry Pi.
Copyright © CSITY 2021